A major challenge in Natural Language Processing is to teach machines to generate natural language and actions. However, existing deep learning approaches to language generation impose no structural constraints on the generation process and often produce low-quality results. In this context, I will introduce our work on imposing structural constraints for video captioning via hierarchical reinforcement learning. Moreover, we observe that most automated metrics for generation can be gamed; we therefore propose an adversarial reward learning method that learns the reward automatically via inverse reinforcement learning. Furthermore, I will discuss our recent efforts to connect language and vision to actions through a language grounding task for robot navigation, and introduce new algorithms for scheduled policy optimization and for combining model-free and model-based reinforcement learning. I will conclude with an overview of other exciting research projects at UCSB's NLP Group.