Strategic Attentive Writer for Learning Macro-Actions

by: Alexander (Sasha) Vezhnevets, Volodymyr Mnih, John Agapiou, Simon Osindero, Alex Graves, Oriol Vinyals, Koray Kavukcuoglu

NIPS 2016

This paper develops a novel deep recurrent neural network architecture that builds an internal plan of high-level, temporally abstracted macro-actions in an end-to-end manner. The authors demonstrate that their model, the STRategic Attentive Writer (STRAW), improves performance on several Atari games requiring temporally extended planning strategies, and they generalize the model to sequence-prediction tasks such as text prediction.

Much work has been done on using reinforcement learning to train deep neural network controllers that learn abstract, general spatial representations of the world, but far less has been done to create a scalable and successful architecture for learning temporal abstractions. Learning macro-actions (actions extended over a period of time) should enable high-level behavior and more efficient exploration. Unlike most other reinforcement learning approaches, which output a single action per time step, STRAW maintains a multi-step action plan and periodically updates this plan based on observations. Prior work on learning temporally extended actions and temporal abstractions has typically relied on pre-defined subgoals or pseudo-rewards that are provided explicitly.

The main strength of STRAW is that the model learns macro-actions implicitly, in an end-to-end fashion, without requiring pseudo-rewards or hand-crafted subgoals. The model has two modules. The first translates observations into an action-plan A, an |A| x T grid whose column t is a distribution over actions at time step t. The second maintains a commitment plan c, a row vector of length T. At each time step a binary gate is drawn from c: the model either updates the plan from the current observation, or commits to it and executes the next action from A without observing the environment or replanning. To generate the multi-step plan, the paper uses an attentive read/write mechanism that operates over the temporal extent of the plan. The paper also introduces STRAW-explorer (STRAWe), which injects noise between the feature extractor and the planning module to encourage exploration. The model is trained with Asynchronous Advantage Actor-Critic (A3C), which directly optimizes the policy with a policy gradient that uses a value-function estimate to reduce gradient variance.
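To make the plan/commitment interaction concrete, here is a minimal sketch of the control loop in plain Python/NumPy, assuming a toy `replan` function in place of the paper's attentive read/write operations; the names and shapes (`num_actions`, `T`, `replan`) are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions, T = 4, 10                                        # |A| actions, 10-step plan horizon

action_plan = np.full((num_actions, T), 1.0 / num_actions)    # A: one action distribution per future step
commit_plan = np.full(T, 0.5)                                 # c: probability of replanning at each step

def replan(observation):
    """Stand-in for STRAW's attentive write: returns a fresh A and c from the observation."""
    new_A = rng.dirichlet(np.ones(num_actions), size=T).T     # shape (num_actions, T)
    new_c = np.clip(rng.random(T), 0.05, 0.95)
    return new_A, new_c

for t in range(50):
    g_t = rng.random() < commit_plan[0]                       # binary gate read from the commitment plan
    if g_t:
        observation = None                                    # placeholder for an environment frame
        action_plan, commit_plan = replan(observation)        # update step: rewrite both plans
    else:
        action_plan = np.roll(action_plan, -1, axis=1)        # commitment step: shift plans forward
        commit_plan = np.roll(commit_plan, -1)                # (wrap-around stands in for padding)
    action = rng.choice(num_actions, p=action_plan[:, 0])     # execute the first column of A
```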

The experiments on Atari games show significant improvements from STRAWe on games requiring long-term planning, such as Frostbite and Ms. Pacman. However, STRAWe performs worse on more reactive games like Breakout, indicating an undesirable tradeoff between long-term and short-term decision making introduced by the architecture. It seems this tradeoff could become even worse on still more reactive games like Pong, but unfortunately the paper does not provide such a comparison. The read and write patches are A x 10 dimensional, which restricts planning to 10 time steps into the future and still seems short-sighted. Figure 6 also shows that STRAWe is not much better than replanning at random, suggesting the algorithm may not be learning the best times to replan. It is unclear how complex the multi-step plans can be, for example plans involving many different actions versus a sequence of the same repeated action. The authors also test their model with only one reinforcement learning method, A3C, when it might work better with other current state-of-the-art methods; for example, Dueling Network Architectures for Deep Reinforcement Learning (http://arxiv.org/abs/1511.06581) achieves comparable results to STRAW. Finally, it would also be an interesting experiment to see whether the method works on continuous control problems requiring long-term planning, such as manipulation tasks like stacking blocks with a gripper.

This paper is worth reading for anyone interested in hierarchical reinforcement learning or, more generally, sequence prediction tasks with long time horizons. Learning temporal abstractions over actions will most likely be required to enable systems with complex behavior that can plan over temporally extended episodes.

In conclusion, the paper proposes a novel architecture for learning temporally abstracted macro-actions and demonstrates significant performance improvements on several Atari games requiring long-term planning. Unlike most other work, the model learns macro-actions in an end-to-end fashion without hand-crafted subgoals. It is unclear how far the architecture can be pushed toward longer-term plans and continuous domains, but it is a step in the right direction toward enabling more complex behavior.

“The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables” and “Categorical Reparameterization with Gumbel-Softmax”

The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
by Chris J. Maddison, Andriy Mnih, Yee Whye Teh
ArXiv, 2016

Categorical Reparameterization with Gumbel-Softmax
by Eric Jang, Shixiang Gu, Ben Poole
ArXiv, 2016

The two papers present a method for estimating gradients in a computation graph with discrete stochastic nodes. The method combines two established tricks: the reparametrization trick (in which a stochastic variable is rewritten as a deterministic function of some learned parameters and noise from a known, fixed distribution) and the Gumbel-Max trick (in which a sample from a discrete distribution is produced by adding Gumbel-distributed noise to the logits and taking the argmax).
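As a quick illustration of the Gumbel-Max trick, here is a minimal Python/NumPy sketch (not taken from either paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_sample(logits):
    """Draw one sample from Categorical(softmax(logits)) via the Gumbel-Max trick."""
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    return np.argmax(logits + gumbel_noise)                          # argmax of perturbed logits

logits = np.array([1.0, 2.0, 0.5])
samples = [gumbel_max_sample(logits) for _ in range(10000)]
# The empirical class frequencies approach softmax(logits) as the number of samples grows.
```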

In particular, using the Gumbel-Max trick, we can refactor the sampling of a discrete random variable into a deterministic function of its parameters and Gumbel noise. The argmax in this refactoring is still non-differentiable, so both papers relax it into a softmax with a temperature parameter, yielding the Concrete (Gumbel-Softmax) distribution. This gets around the fact that backpropagation cannot be applied directly to a stochastic computation graph with discrete random variables, since the function defined by the graph is not differentiable. The relaxation yields a low-variance but biased estimator of the gradient via Monte Carlo sampling.
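A relaxed (Concrete / Gumbel-Softmax) sample then replaces the argmax with a tempered softmax; the sketch below is again plain NumPy for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(logits, temperature):
    """Relaxed sample: softmax of Gumbel-perturbed logits, differentiable in `logits`."""
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel_noise) / temperature
    y = y - y.max()                          # subtract the max for numerical stability
    exp_y = np.exp(y)
    return exp_y / exp_y.sum()               # a point on the simplex, near a vertex for low temperature

logits = np.array([1.0, 2.0, 0.5])
print(gumbel_softmax_sample(logits, temperature=0.1))   # nearly one-hot
print(gumbel_softmax_sample(logits, temperature=5.0))   # closer to uniform
```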

The estimator performs on par with previously established methods on density-estimation tasks. However, neither paper presents results for datasets more complex than MNIST, and the reason for this is not made clear. An immediate extension of these works is therefore to consider more complex data involving discrete random variables; for example, the Gumbel-Softmax trick could be applied to sequences such as language data.

The differences between the papers are slight. Jang et al. use a vector of Gumbels passed through a softmax activation, instead of the Gumbel density itself, in defining their variational lower bound, which results in a looser bound. Maddison et al. compare against a stronger baseline and find the Concrete / Gumbel-Softmax estimator to perform on par with other estimators on average (whereas Jang et al. find it performs slightly better, in terms of both accuracy and efficiency, than a weaker baseline). Lastly, Jang et al. describe the Straight-Through (ST) Gumbel estimator, which produces discrete samples in the forward pass (useful for tasks such as robotics that require hard decisions) while keeping a continuous backward pass.
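The ST variant can be sketched in a few lines; here PyTorch is used only for its autodiff, and the code is an illustrative reconstruction rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def st_gumbel_softmax(logits, temperature=1.0):
    """Straight-Through Gumbel: one-hot sample in the forward pass, soft gradient in the backward pass."""
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits)))
    y_soft = F.softmax((logits + gumbel_noise) / temperature, dim=-1)
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    # Forward value equals y_hard; gradients flow through y_soft only.
    return y_hard + (y_soft - y_soft.detach())

logits = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
sample = st_gumbel_softmax(logits, temperature=0.5)        # exactly one-hot in the forward pass
loss = (sample * torch.tensor([1.0, 2.0, 3.0])).sum()
loss.backward()                                            # gradients reach `logits` via the relaxation
```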

A final point about both pieces of work is that there is a bias-variance tradeoff in the choice of the temperature parameter: with a small temperature, samples are close to Categorical but the variance of the gradient estimates is large, while with a large temperature, samples are smooth (non-Categorical) but the variance of the gradients is small. The papers keep the temperature fixed and thus do not fully investigate how performance varies as a function of this parameter. However, they do suggest an annealing schedule so that the Gumbel-Softmax nodes behave more and more like Categorical random variables as training progresses.
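Such a schedule amounts to decaying the temperature toward a small floor over training; the constants below are illustrative placeholders, not values taken from either paper:

```python
import numpy as np

tau_init, tau_min, decay_rate = 1.0, 0.1, 1e-4  # assumed, illustrative values

def temperature(step):
    """Exponentially anneal the Gumbel-Softmax temperature, clipped at a minimum."""
    return max(tau_min, tau_init * np.exp(-decay_rate * step))

for step in (0, 5000, 20000, 50000):
    print(step, round(temperature(step), 3))    # samples become increasingly one-hot as tau shrinks
```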