by: Alexander (Sasha) Vezhnevets, Volodymyr Mnih, John Agapiou, Simon Osindero, Alex Graves, Oriol Vinyals, Koray Kavukcuoglu
This paper develops a novel deep RNN architecture that builds an internal plan of high-level, temporally abstracted macro-actions in an end-to-end manner. The authors demonstrate that their model, the STRategic Attentive Writer (STRAW), improves performance on several Atari games that require temporally extended planning strategies, and they generalize the model to sequence data tasks, including text prediction.
Much work has been done on using reinforcement learning to train deep neural network controllers that learn abstract and general spatial representations of the world, but much less has been done to create a scalable and successful architecture for learning temporal abstractions. Learning macro-actions (actions extended over a period of time) should enable high-level behavior and more efficient exploration. Unlike most other reinforcement learning approaches, which output a single action per time step, STRAW maintains a multi-step action plan and periodically updates this plan based on observations. Other work on learning temporally extended actions and temporal abstractions has relied on pre-defined subgoals or pseudo-rewards that are provided explicitly.
The main strength of STRAW is that the model learns macro-actions implicitly in an end-to-end fashion, without requiring pseudo-rewards or hand-crafted subgoals. The model has two modules. The first module translates observations into an action-plan, an A x T grid (A possible actions, T planning steps) in which column t is the action distribution at time t. The second module creates a commitment plan C, a row vector of length T. At each time step, a binary variable is drawn from C: either the plan is updated, or the plan is committed, meaning that an action is executed from the action-plan without observing the environment and replanning. To generate the multi-step plan, the paper uses an attentive mechanism for reading and writing that operates over the temporal extent of the plan. The paper also introduces STRAW-explorer (STRAWe), which injects noise between the feature extractor and the planning module to encourage exploration. The model is trained with Asynchronous Advantage Actor-Critic (A3C), which directly optimizes the policy with a policy gradient that uses a value function estimate to reduce the gradient variance.
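To make the commit-or-replan loop concrete, here is a minimal Python sketch of how the two plans could drive action selection. It is not the authors' implementation: the names run_episode, toy_replan and shift_left are hypothetical, toy_replan is a hand-written stand-in for the attentive planning network, and advancing a committed plan by dropping its first column and padding with zeros is an assumption made for illustration.

```python
import numpy as np

NUM_ACTIONS = 3  # A: number of possible actions (toy value)
HORIZON = 4      # T: number of planned time steps (toy value)


def softmax(logits):
    """Turn one column of plan logits into an action distribution."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def shift_left(plan):
    """Advance a plan by one step: drop the first column, pad with zeros."""
    shifted = np.zeros_like(plan)
    shifted[..., :-1] = plan[..., 1:]
    return shifted


def toy_replan(t):
    """Hypothetical stand-in for the planning network.

    Returns an A x T grid of action logits and a length-T commitment plan
    whose entries are replanning probabilities.
    """
    action_plan = np.zeros((NUM_ACTIONS, HORIZON))
    for k in range(HORIZON):
        action_plan[(t + k) % NUM_ACTIONS, k] = 1000.0  # near-deterministic
    commitment_plan = np.zeros(HORIZON)
    commitment_plan[-1] = 1.0  # commit for T steps, then replan
    return action_plan, commitment_plan


def run_episode(replan, num_steps, seed=0):
    rng = np.random.default_rng(seed)
    action_plan, commitment_plan = replan(0)
    replanned_at, actions = [0], []
    for t in range(num_steps):
        if t > 0:
            # Binary variable drawn from the head of the commitment plan.
            if rng.random() < commitment_plan[0]:
                action_plan, commitment_plan = replan(t)
                replanned_at.append(t)
            else:
                # Committed: keep following the stored plan, no replanning.
                action_plan = shift_left(action_plan)
                commitment_plan = shift_left(commitment_plan)
        probs = softmax(action_plan[:, 0])
        actions.append(int(rng.choice(NUM_ACTIONS, p=probs)))
    return actions, replanned_at


actions, replanned_at = run_episode(toy_replan, num_steps=10)
print("actions:", actions)
print("replanned at steps:", replanned_at)
```

In this toy setup the commitment plan places all of its mass on the last position, so the agent follows each plan for T steps before replanning. In a learned model the commitment plan is produced by the network, so the length of each macro-action can vary with the observation.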
The experiments on Atari games show significant improvement using STRAWe on games requiring long-term planning, such as Frostbite and Ms. Pacman. However, STRAWe performs worse on more reactive games like Breakout, indicating that the architecture introduces an undesirable tradeoff between long-term and short-term decision making. This tradeoff would likely be even worse for more reactive games like Pong, but unfortunately the paper does not provide any comparison. The read and write patches are A x 10 dimensional, which restricts planning to 10 time steps into the future and still seems short-sighted. Figure 6 also shows that STRAWe is not much better than replanning at random, indicating that the algorithm might not be learning the best times to replan. It is unclear how complex the multi-step plans can be, for example whether they involve many different actions or merely a sequence of the same repeated action. The authors also test their model with only one reinforcement learning method, A3C, when it might work better with other current state-of-the-art reinforcement learning methods. For example, Dueling Network Architectures for Deep Reinforcement Learning (http://arxiv.org/abs/1511.06581) achieves comparable results to STRAW. Finally, it would be interesting to see whether the method can work on continuous control problems requiring long-term planning, such as manipulation tasks like stacking blocks with a gripper.
This paper is worth reading for anyone interested in hierarchical reinforcement learning or, more generally, sequence prediction tasks with long time horizons. Learning temporal abstractions over actions will most likely be required to build systems with complex behavior that can plan over temporally extended episodes.
In conclusion, the paper proposes a novel architecture for learning temporally abstracted macro-actions and demonstrates significant performance improvements on several Atari games requiring long-term planning. Unlike most other work, the model learns macro-actions in an end-to-end fashion without hand-crafted subgoals. It is unclear how far the architecture can be pushed toward longer-term plans and continuous domains, but it is a step in the right direction toward enabling more complex behavior.