Strategic Attentive Writer for Learning Macro-Actions

by: Alexander (Sasha) Vezhnevets, Volodymyr Mnih, John Agapiou, Simon Osindero, Alex Graves, Oriol Vinyals, Koray Kavukcuoglu

NIPS 2016

This paper develops a novel deep RNN architecture that builds an internal plan of high-level, temporally abstracted macro-actions in an end-to-end manner. The authors demonstrate that their model, STRategic Attentive Writer (STRAW), improves performance on several Atari games requiring temporally extended planning strategies, and they generalize the model to sequence data tasks including text prediction.

Much work has been done on using reinforcement learning to train deep neural network controllers that can learn abstract and general spatial representations of the world, but not much has been done to create a scalable and successful architecture for learning temporal abstractions. Learning macro-actions (actions extended over a period of time) should enable high-level behavior and more efficient exploration. Unlike most other reinforcement learning approaches, which output a single action per time step, STRAW maintains a multi-step action plan and periodically updates this plan based on observations. Other work on learning temporally extended actions and temporal abstractions has used pre-defined subgoals or pseudo-rewards that are provided explicitly.

The main strength of STRAW is that the model learns macro-actions implicitly in an end-to-end fashion without requiring pseudo-rewards or hand-crafted subgoals. The model has two modules. The first module translates observations into an action-plan, which is an A x T grid where each column is an action distribution at time t. The second module creates a commitment plan C, which is a row vector of length T. At each time step, a binary variable is drawn from C, and either the plan is updated or the plan is committed, which means an action will be executed from A without observing the environment and replanning. To generate the multi-step plan, the paper uses an attentive mechanism for reading and writing which operates over the temporal extent of the plan. This paper also introduces STRAW-explorer (STRAWe), which injects noise between the feature extractor and the planning module to encourage exploration. To train the model, the authors use Asynchronous Advantage Actor-Critic (A3C), which directly optimizes the policy with a policy gradient that uses a value function estimate to reduce the gradient variance.
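To make the interplay of the action-plan and the commitment plan concrete, here is a minimal sketch of the control loop. It is not the authors' implementation: `random_replan` is a hypothetical stand-in for the attentive write, the padding values used when the plans are shifted are assumptions, and a draw of 1 is assumed to mean "re-plan".

```python
import numpy as np

def random_replan(rng, num_actions, horizon):
    """Hypothetical stand-in for STRAW's attentive write: returns a fresh plan."""
    logits = rng.normal(size=(num_actions, horizon))
    exp = np.exp(logits - logits.max(axis=0, keepdims=True))
    action_plan = exp / exp.sum(axis=0, keepdims=True)  # each column is an action distribution
    commit_plan = np.full(horizon, 0.1)                 # assumed re-plan probability per step
    return action_plan, commit_plan

def straw_rollout(num_actions, horizon, steps, replan=random_replan, seed=0):
    rng = np.random.default_rng(seed)
    action_plan, commit_plan = None, None
    actions, replanned = [], []
    for t in range(steps):
        # Draw the binary variable from the head of the commitment plan
        # (the very first step always plans).
        g = True if t == 0 else bool(rng.random() < commit_plan[0])
        if g:
            action_plan, commit_plan = replan(rng, num_actions, horizon)
        # Execute an action from the first column of the action-plan.
        actions.append(int(rng.choice(num_actions, p=action_plan[:, 0])))
        replanned.append(g)
        # Shift both plans one step forward in time. Padding with a uniform
        # column and a re-plan probability of 1 is an assumption of this sketch.
        pad = np.full((num_actions, 1), 1.0 / num_actions)
        action_plan = np.concatenate([action_plan[:, 1:], pad], axis=1)
        commit_plan = np.concatenate([commit_plan[1:], [1.0]])
    return actions, replanned
```

Calling `straw_rollout(4, 10, 50)` returns the executed actions together with a flag per step saying whether the plan was rewritten; runs of `False` flags correspond to macro-actions executed without replanning.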

The experiments on Atari games show significant improvement using STRAWe on games requiring long-term planning like Frostbite and Ms. Pacman. However, STRAWe performs worse on more reactive games like Breakout, indicating that there is some undesirable tradeoff in introducing this architecture for long-term vs. short-term decision making. It seems like this tradeoff would become even worse for more reactive games like Pong, but unfortunately the paper does not provide any comparison. The read and write patches are A x 10 dimensional, which restricts the length of planning to 10 time steps into the future; this still seems short-sighted. Figure 6 also shows that STRAWe is not much better than replanning at random, indicating the algorithm might not be learning the best times to replan. It is unclear how complex the multi-step plans can be, for example whether they involve many different actions or only a sequence of the same repeated action. The authors also test their model with only one reinforcement learning method, A3C, when their model might work better with other current state-of-the-art reinforcement learning methods. For example, Dueling Network Architectures for Deep Reinforcement Learning achieves comparable results to STRAW. Finally, it would also be an interesting experiment to see if their method can work on continuous control problems requiring long-term planning, such as manipulation tasks like stacking blocks with a gripper.

This paper is worth reading for anyone interested in hierarchical reinforcement learning problems or more generally sequence prediction tasks with long time horizons. Learning temporal abstractions over actions will most likely be required for enabling systems with complex behavior which can plan over temporally extended episodes.

In conclusion, the paper proposes a novel architecture for learning temporally abstracted macro-actions and demonstrates significant performance improvements on several Atari games requiring long-term planning. Unlike most other work, the model learns macro-actions in an end-to-end fashion without hand-crafted subgoals. It is unclear how far the authors can push their architecture toward longer-term plans and continuous domains, but it is a step in the right direction toward enabling more complex behavior.


“The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables” and “Categorical Reparameterization with Gumbel-Softmax”

The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
by Chris J. Maddison, Andriy Mnih, Yee Whye Teh
ArXiv, 2016

Categorical Reparameterization with Gumbel-Softmax
by Eric Jang, Shixiang Gu, Ben Poole
ArXiv, 2016

The two papers present a method for the estimation of gradients in a computation graph with discrete stochastic nodes. The method is based on a combination of two established tricks: the reparametrization trick (in which stochastic variables are rewritten as a deterministic function of some (learned) parameters and a known fixed noise distribution) and the Gumbel-Max trick (in which a sample from a discrete distribution is produced by adding Gumbel-distributed noise to a logit observation and taking the argmax).

In particular, using the Gumbel-Max trick, we can refactor the sampling of a discrete random variable into a deterministic function of some parameters and the Gumbel distribution. By doing so, we get around the fact that simple back propagation cannot be applied in the case of a stochastic computation graph with discrete random variables, as the function defined by the graph is not differentiable. The trick allows a low-variance but biased estimator of the gradient to be obtained via Monte Carlo sampling.
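As a rough illustration of the two ingredients, the sketch below draws an exact categorical sample with the Gumbel-Max trick and a relaxed, differentiable-in-principle sample by replacing the argmax with a temperature-controlled softmax. The function names are hypothetical and the code uses plain NumPy, so it shows the sampling path only, not the gradient computation.

```python
import numpy as np

def sample_gumbel(rng, shape):
    # Gumbel(0, 1) noise via the inverse CDF: -log(-log(U)), U ~ Uniform(0, 1).
    u = rng.uniform(low=1e-12, high=1.0, size=shape)
    return -np.log(-np.log(u))

def gumbel_max_sample(rng, logits):
    # Exact sample from Categorical(softmax(logits)): add noise, take the argmax.
    return int(np.argmax(logits + sample_gumbel(rng, logits.shape)))

def gumbel_softmax_sample(rng, logits, temperature):
    # Relaxed sample: the argmax is replaced by a softmax with a temperature,
    # giving a point on the probability simplex instead of a one-hot vector.
    y = (logits + sample_gumbel(rng, logits.shape)) / temperature
    y = y - y.max()  # subtract the max for numerical stability
    e = np.exp(y)
    return e / e.sum()
```

With `logits = np.log([0.1, 0.3, 0.6])`, the empirical frequencies of `gumbel_max_sample` should approach the three class probabilities, while `gumbel_softmax_sample` returns a non-negative vector summing to one.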

The estimator performs on par with previously established methods in tasks of density estimation. However, neither paper presents results for datasets more complex than MNIST. The reason for this is not made clear. An immediate extension of these works therefore is to consider more complex data that deal with discrete random variables. For example, the Gumbel-Softmax trick could be applied in the case of sequences such as language data.

The differences between the papers are slight. Jang et al. use a vector of Gumbels passed through a softmax activation, instead of the Gumbel density, in defining a variational lower bound, which results in a looser bound. Maddison et al. compare with a stronger baseline and find the Concrete / Gumbel-Softmax estimator to perform on par with other estimators on average (whereas Jang et al. find it performs slightly better, in terms of both accuracy and efficiency, than a weaker baseline). Lastly, Jang et al. describe the Straight-Through (ST) Gumbel Estimator, which allows discrete samples (useful for robotics tasks) with a continuous backward pass.

A final point about both pieces of work is that there is a bias-variance tradeoff in the choice of the temperature parameter: with a small temperature, samples are close to a Categorical variable but the variance of the gradient estimates is large, and with a large temperature, samples are smooth (non-Categorical) but the variance of the gradients is small. The papers keep the temperature fixed and thus do not fully investigate how performance varies as a function of this parameter. However, they do suggest an annealing schedule such that the Gumbel-Softmax nodes behave more and more like Categorical random variables further on in training.
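The effect of the temperature on the samples themselves (not on the gradient variance, which would need an autodiff framework) can be checked with a small NumPy experiment. This is a sketch under assumed settings: the helper names, the class probabilities, and the three temperatures are illustrative choices, not values from either paper.

```python
import numpy as np

def relaxed_samples(logits, temperature, n, seed=0):
    # Draw n relaxed (Gumbel-Softmax / Concrete) samples at a given temperature.
    rng = np.random.default_rng(seed)
    u = rng.uniform(low=1e-12, high=1.0, size=(n, logits.size))
    y = (logits - np.log(-np.log(u))) / temperature
    y = y - y.max(axis=1, keepdims=True)
    e = np.exp(y)
    return e / e.sum(axis=1, keepdims=True)

def mean_peak(logits, temperature, n=5000):
    # Average size of the largest component: 1.0 means exactly one-hot,
    # 1/K means a uniform vector over K classes.
    return float(relaxed_samples(logits, temperature, n).max(axis=1).mean())

logits = np.log(np.array([0.2, 0.3, 0.5]))
peaks = {tau: mean_peak(logits, tau) for tau in (0.1, 1.0, 10.0)}
```

Because the same noise is reused for every temperature, `peaks` decreases monotonically as the temperature grows: low temperatures give nearly one-hot samples and high temperatures give nearly uniform ones.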

SceneNet: Understanding Real World Indoor Scenes with Synthetic Data

by: Ankur Handa, Viorica Patraucean, Vijay Badrinarayanan, Simon Stent & Roberto Cipolla

ArXiv, 2015

This paper addresses the problem of insufficient training data for indoor scene understanding deep learning models. Presently, the only indoor depth datasets with per-pixel labels are NYUv2 and SUN RGB-D which contain only 795 and 5285 training images, respectively. Since both of these datasets relied on humans to label them, dataset creation in this fashion is expensive and time consuming. The datasets also suffer from human error with missing or incorrect labels. To overcome some of these challenges, the authors compile a new set of fully annotated large 3D (basis) scenes from the internet and generate new scenes by adding many objects from shape repositories. This allows the authors to render many videos of the scenes.

To add variation to the basis scenes, objects can be removed, added, or perturbed from their original positions. In order to make the scenes as physically realistic as possible, many different constraints are used on an object’s potential location. By solving an optimization of bounding box intersection, pairwise distance, visibility, distance to wall, and angle to wall a realistic scene can be generated. Since the objects are retrieved from an object repository, the label of each object is already known, meaning that any additions to the scene will not affect the completely labeled nature of the basis scenes. The authors recognize that the noise distributions for these newly generated scenes may not mimic the real world, so they apply the simulated Kinect noise model. This paper is not concerned with correctly texturing the objects as it only uses depth maps for the model training.

In order to see the improvements synthetic data brings to semantic segmentation, the authors use a state-of-the-art semantic segmentation algorithm built on a VGG network. Since these algorithms are designed for RGB images, they are modified to work on a three-channel depth-based input (DHA) and are then trained on synthetic data and fine-tuned using existing datasets. Fine-tuning is required to make the results compelling and, when used on NYUv2, yields a 5 point advantage over the dataset without additional information. A similar trend holds for fine-tuning on SUN RGB-D data. While the results are not always as good as the results of Eigen et al. [1] or Hermans et al. [2], they are comparable on a fair number of classes. The main failure classes are paintings, televisions, and windows. This can be accounted for by the reliance on depth information alone.

It would be interesting to see the effects of adding object textures to these models. Since the models perform well based on depth alone, adding color data should further improve performance. The authors say that applying textures from OpenSurfaces did not mimic the real world and that ray tracing was too time consuming. It might be interesting to see whether models using these incorrect textures would help performance; they should at a minimum improve detection of the “flat” objects that this technique had trouble with before.

In summary, this paper proposes an interesting solution to the problem of not having enough data to adequately train deep models for indoor scene understanding. The method, even without using RGB data, performs reasonably well and can train faster than models using only real-world data.

[1] David Eigen, Rob Fergus. “Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture”

[2] A. Hermans, G. Floros and B. Leibe, “Dense 3D semantic mapping of indoor scenes from RGB-D images”

Recurrent Batch Normalization

by: Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre & Aaron Courville

ArXiv, 2016

This paper investigates the effect of batch normalization in recurrent neural networks. Since it was first proposed in Google's Inception v2 network, batch normalization has become a standard technique in training deep neural networks. However, it has been reported in previous work [2] that batch normalization of hidden-to-hidden transitions may hurt the performance of LSTMs. This paper, in contrast with previous work [2], shows that batch normalization can improve the performance of LSTMs when applied properly, and that the scale parameter (gamma) in the batch normalization is crucial to avoid vanishing gradients.

Techniques such as dropout and batch normalization have been widely applied in training convolutional neural networks. In theory, one could directly apply dropout and batch normalization to each time step in recurrent neural networks, since a recurrent neural network is just a very deep feed-forward network with shared parameters over time steps (and is often implemented as such through unrolling). In practice, however, the large depth (number of time steps) of recurrent networks compared with ordinary ConvNets may prohibit naive application of these techniques.

Previous work [1] applies dropout only to input-to-hidden transitions in recurrent networks, but not to hidden-to-hidden transitions, so that input data are corrupted by only a fixed number of dropout operations, independent of the actual number of time steps. [2] follows the same intuition and applies batch normalization only to input-to-hidden transitions.

This paper, however, applies batch normalization to both input-to-hidden and hidden-to-hidden transitions in recurrent LSTM networks, and also to the cell state vector before the output. In other words, in this paper batch normalization is applied everywhere except for the cell state update (so that the dynamics of the LSTM cell are still preserved).

The paper empirically shows that when the scale parameter (gamma) is initialized to 1.0, as is common practice in batch normalization, the gradient vanishes when backpropagating through time. However, this problem can be fixed by initializing gamma to a smaller value. It is conjectured that tanh is the main reason for the vanishing gradient; the paper recommends using 0.1 as the initial scale parameter and 0 as the initial bias parameter. Several experiments then show that the batch-normalized LSTM proposed in this paper outperforms the vanilla LSTM in different scenarios.
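The placement of the normalization terms can be sketched as a single training-time step in NumPy. This is an illustrative reading of the description above rather than the authors' code: the gate ordering, the parameter names, and the choice to fold the biases of the two pre-activation normalizations into a single bias `b` are assumptions of the sketch.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Training-mode batch normalization over the batch axis (axis 0).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_in, n_hid, rng, gamma_init=0.1):
    # gamma starts at 0.1 and the bias terms at 0, as recommended in the paper.
    return {
        "W_x": rng.normal(scale=0.1, size=(n_in, 4 * n_hid)),
        "W_h": rng.normal(scale=0.1, size=(n_hid, 4 * n_hid)),
        "b": np.zeros(4 * n_hid),
        "gamma_x": np.full(4 * n_hid, gamma_init),
        "gamma_h": np.full(4 * n_hid, gamma_init),
        "gamma_c": np.full(n_hid, gamma_init),
        "beta_c": np.zeros(n_hid),
    }

def bn_lstm_step(x, h_prev, c_prev, p):
    # Input-to-hidden and hidden-to-hidden terms are normalized separately.
    pre = (batch_norm(h_prev @ p["W_h"], p["gamma_h"], 0.0)
           + batch_norm(x @ p["W_x"], p["gamma_x"], 0.0)
           + p["b"])
    f, i, o, g = np.split(pre, 4, axis=1)
    # The cell update itself is left un-normalized to preserve the LSTM dynamics.
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    # The cell state is normalized only on its way to the output.
    h = sigmoid(o) * np.tanh(batch_norm(c, p["gamma_c"], p["beta_c"]))
    return h, c
```

A full implementation would also track population statistics for test time; that part is omitted here.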

In its analysis, the paper empirically investigates the vanishing gradient problem in the batch-normalized LSTM and links it to the tanh function. However, it would be better to see a mathematical analysis of the gradient flow. Also, it seems to me that the cell state normalization in the output (BN in Eqn. 8 in the paper) is especially worth investigating, since all other batch normalization terms (those in Eqn. 6) can be merged into the weight matrices at test time, but the one in Eqn. 8 cannot, so an extra affine transform has to be introduced into the LSTM at test time. It is also worth investigating, through an ablation study, the effect of each individual batch normalization term on the final performance.
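The remark about merging normalization terms into the weights follows from the fact that, at test time, batch normalization with fixed population statistics is an affine map applied to a linear map. A minimal sketch, assuming the convention `x @ W` and hypothetical argument names:

```python
import numpy as np

def fold_batch_norm(W, gamma, beta, pop_mean, pop_var, eps=1e-5):
    # BN(x @ W) = gamma * (x @ W - mean) / sqrt(var + eps) + beta
    #           = x @ (W * scale) + (beta - mean * scale)
    scale = gamma / np.sqrt(pop_var + eps)
    return W * scale, beta - pop_mean * scale
```

The same folding is not available for the normalization of the cell state, because its input is not a plain matrix product of the previous activations, which is why an explicit affine transform remains there.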

In summary, this paper provides useful advice on how batch normalization can be applied to LSTM networks. It would be interesting to see further investigation of the gradient flow in recurrent neural networks with batch normalization.

[1] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent Neural Network Regularization,” arXiv Prepr., arXiv:1409.2329, 2014.
[2] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, “Batch Normalized Recurrent Neural Networks,” arXiv Prepr., arXiv:1510.01378, 2015.

Perceptual Losses for Real-Time Style Transfer and Super-Resolution

by Justin Johnson, Alexandre Alahi & Fei-Fei Li
ArXiv, 2016

This paper proposes the use of “perceptual losses” for training a feed-forward network for the applications of style transfer and superresolution. The perceptual losses used in these applications are raw features and 2nd order statistics computed from a pre-defined CNN, namely the VGG network. This work builds primarily on the work by Gatys et al. in style transfer. Gatys et al. proposed a method for transferring the style of one image to the content of another. The method synthesizes a new image which matches the Gram matrix statistics from multiple layers of the style image and the raw features of a single higher layer in the content image. This was performed in an iterative optimization framework, which can take on the order of seconds. In this paper, a feed-forward network is trained to perform this task. The observation that an optimization framework can be replaced by a feed-forward network is similar to the work from Dosovitskiy and Brox, which trained a feed-forward network to invert features, a task previously proposed by Mahendran and Vedaldi.
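The two kinds of statistics can be written down compactly. The sketch below operates on a single feature map of shape (channels, height, width), which in practice would come from a layer of a pre-trained VGG network; here the feature extractor is left out, and the normalization constants and function names are assumptions of the sketch.

```python
import numpy as np

def gram_matrix(features):
    # 2nd order statistics: channel-by-channel correlations, averaged over space.
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def feature_loss(f_out, f_target):
    # "Raw feature" (content) loss: mean squared distance between feature maps.
    return float(np.mean((f_out - f_target) ** 2))

def style_loss(f_out, f_style):
    # Style loss: squared Frobenius distance between Gram matrices.
    diff = gram_matrix(f_out) - gram_matrix(f_style)
    return float(np.sum(diff ** 2))
```

A useful sanity check is that shuffling the spatial positions of a feature map leaves its Gram matrix, and hence the style loss, unchanged, while the feature loss becomes non-zero. This is why the Gram statistics capture texture-like style rather than layout.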

One central claim of the paper is that the method produces “similar qualitative results but…three orders of magnitude faster”, as highlighted in the abstract. Table 1 shows the timing between the feed-forward method proposed compared to the iterative optimization framework proposed by Gatys et al, and the relative speed-up factor. However, the results from the feed-forward method are of lower quality, as indicated quantitatively by the loss values in Figure 5, and qualitatively from the examples in Figure 6. Table 1 would be more meaningful if the time the Gatys et al. method takes to get to the same quality as the feed-forward network, rather than the time the method takes to meet convergence criteria, were highlighted. From Figure 5, this seems to be at less than 100 iterations, which means the speedup factor is closer to 150x rather than 1000x, as highlighted in the table and claimed in the abstract. This is a fairer representation of the results and nonetheless an impressive result.

In addition, a human study, such as a 2AFC test, would help quantify the dropoff in quality between the feed-forward network versus the iterative optimization framework and substantiate the claim that the feed-forward network is of similar quality to the Gatys et al method. An unmentioned application for this work is also as an initialization for the Gatys et al. method, should the user desire results of the higher quality. In this case, the speedup factor of the overall system becomes less impressive, as increasing iterations of the original slower optimization method have to be run to achieve the desired quality. Thus, a speed-up factor vs performance curve would provide a good reference for a potential user.

The paper also trains a similar framework for the application of superresolution, with some results shown in Figure 8. The paper mentions that automated metrics such as the standard PSNR and SSIM do not correlate well with perceived image quality. This is indeed a common problem in image synthesis. However, the paper does not offer any alternatives, such as a human study. Though the results are undoubtedly sharper, as pointed out in the paper, there is a very apparent stippling pattern in the results, which may be displeasing for a human evaluator. This is possibly due to the use of only a 1st order perceptual loss.
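For reference, PSNR is a purely pixel-wise quantity, which helps explain why it can disagree with perceived quality. A minimal sketch of the standard definition, assuming 8-bit images by default:

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    # Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE).
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Since only the mean squared error enters the formula, any two reconstructions with the same MSE, for example a slightly blurred one and one with high-frequency artifacts, receive the same score.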

In summary, the paper proposes the novel use of perceptual losses, extracted from pre-trained networks, to train feed-forward networks, and currently applies the loss to train networks for the tasks of style transfer and superresolution. The use of the proposed perceptual losses as a general framework for other structured output tasks, such as colorization, semantic segmentation, and surface normal prediction, as mentioned in the paper, certainly seems like a plausible and worthy direction of further research and exploration for the community.

Asynchronous Methods for Deep Reinforcement Learning

by Volodymyr Mnih, Adria Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy Lillicrap, David Silver & Koray Kavukcuoglu
ArXiv, 2016

The paper uses asynchronous gradient descent to perform deep reinforcement learning. The authors execute multiple agents in parallel on multiple instances of the environment to decorrelate the agents’ learning processes and make the overall training process more stationary. The authors observe that multiple parallel agents tend to explore different parts of the environment, so each agent learns a policy quite different from the others. This indicates that a proper asynchronous combination of the parallel agents’ policies can create a better shared global policy. The authors choose to implement this asynchronous reinforcement learning framework on a single multi-core CPU machine, rather than in a distributed setting or on GPUs as in the previous Gorila framework. This dramatically reduces the computation cost and improves training speed.
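The basic pattern of several workers applying updates to one shared parameter vector without coordination can be illustrated on a toy problem. This sketch replaces the reinforcement learning loss with a noisy quadratic, uses Python threads purely for illustration, and all names and hyperparameters are hypothetical.

```python
import threading
import numpy as np

def run_async_sgd(target=3.0, num_workers=4, steps=500, lr=0.05):
    theta = np.zeros(1)  # shared parameters, updated in place by every worker

    def worker(seed):
        rng = np.random.default_rng(seed)  # each worker sees its own noise
        for _ in range(steps):
            # Noisy gradient of (theta - target)^2, standing in for a gradient
            # computed from this worker's own environment instance.
            grad = 2.0 * (theta[0] - target) + rng.normal(scale=0.2)
            # Lock-free update of the shared parameters; the gradient may be
            # slightly stale if another worker has written in the meantime.
            theta[0] -= lr * grad

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return float(theta[0])
```

Despite the uncoordinated writes, `run_async_sgd()` ends up close to the target value, since occasional stale gradients only perturb an otherwise contracting update.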

Moreover, the changes (gradients) toward the global policy made by multiple actor-learners are less correlated in time than those of a single agent applying online updates, so the multi-agent strategy also stabilizes the learning process. The authors use multiple agents to update the controller without using experience replay as in the DQN algorithm, resulting in 1) a more efficient algorithm with reduced computation cost, and 2) an algorithm that can also perform one-shot learning.

The authors then describe four asynchronous algorithms for reinforcement learning, namely asynchronous one-step Q-learning, asynchronous one-step Sarsa, asynchronous n-step Q-learning, and asynchronous advantage actor-critic, and report their performance on different games. The new algorithms also succeed on a wide variety of continuous motor control problems. Asynchronous training on a 16-core CPU takes only half the training time of the previous GPU implementation. These algorithms also obtain almost linear speedup in learning when using multiple CPU threads compared to a single CPU thread. Lastly, the authors state that the asynchronous methods they use are complementary to experience replay, the dueling architecture, and other techniques that could make the learned agent perform the different tasks more accurately.
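The n-step variants and the actor-critic method share one small computation: bootstrapped multi-step returns accumulated backward over a short rollout. A minimal sketch, with hypothetical function names and plain Python lists standing in for a worker's rollout buffer:

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    # Walk the rollout backward: R <- r_t + gamma * R, starting from the
    # value estimate of the state that follows the last reward.
    returns = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def advantages(rewards, values, bootstrap_value, gamma=0.99):
    # Advantage estimate used by the actor: n-step return minus the critic's value.
    return [R - v for R, v in zip(n_step_returns(rewards, bootstrap_value, gamma), values)]
```

For example, `n_step_returns([1.0, 0.0, 2.0], 10.0, gamma=0.5)` gives `[2.75, 3.5, 7.0]`: the last step bootstraps from the value estimate, and earlier steps accumulate progressively longer returns.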


This is a well-done study of the authors’ new asynchronous parallel reinforcement learning algorithms. This work makes the observation that multiple agents typically explore different parts of the environment. Leveraging the effects of this phenomenon on a global policy, the authors improve on the previous DQN Gorila framework (Nair, 2015) by implementing asynchronous updates from multiple parallel agents. This dramatically speeds up training in various games, as multiple parallel agents can share their decorrelated experiences to improve learning speed.

Further, due to the decorrelated nature of the policies learned by these parallel agents, the new system no longer needs the experience replay memory used in Gorila to stabilize learning. This simplifies the model and reduces redundant memory use; after all, simpler is better for generalization in learning models.

The authors also indicate that asynchronous one-step Q-learning and Sarsa achieved superlinear speedups. It would be interesting to see the reason or analysis behind why the parallel one-step methods require less data to achieve a similar level of convergence than their counterparts.

The authors not only evaluate on various games (including 3D games) to compare the speed of convergence of their asynchronous algorithms to the previous synchronous algorithm; they also try asynchronous versions of two optimization methods, momentum SGD (Recht 2011) and RMSProp (Tieleman and Hinton 2012), to investigate which performs better. Shared RMSProp is the most robust and stable compared to momentum SGD. It would have been interesting to see a comparison of robustness between the new methods and DQN using these two optimization methods as well, but that may be outside the scope of this paper.
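The "shared" variant of RMSProp keeps a single vector of squared-gradient statistics that every worker reads and updates, instead of one vector per worker. A single-threaded sketch of the update rule, with a hypothetical class name and illustrative hyperparameters:

```python
import numpy as np

class SharedRMSProp:
    def __init__(self, shape, lr=1e-3, alpha=0.99, eps=1e-8):
        self.lr, self.alpha, self.eps = lr, alpha, eps
        # One moving average of squared gradients, shared by all workers.
        self.g = np.zeros(shape)

    def step(self, theta, grad):
        # g <- alpha * g + (1 - alpha) * grad^2
        self.g *= self.alpha
        self.g += (1.0 - self.alpha) * grad ** 2
        # theta <- theta - lr * grad / sqrt(g + eps), applied in place.
        theta -= self.lr * grad / np.sqrt(self.g + self.eps)
        return theta
```

In an asynchronous setting each worker would call `step` on the same optimizer object and the same `theta` array; the per-thread alternative would give every worker its own `g`.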

The paper also mentions a few future directions for improving the new algorithms. We find the direction of better estimating the state-action Q-value interesting. Methods such as reducing overestimation bias (Van Hasselt 2015; Bellemare 2016) and more accurate estimation of the Q-value using the dueling architecture (Wang 2015) should provide further improvements and would be complementary to the current paper. A graph comparing the learned Q-value estimates in the state-action space to the true Q-values would have been great to support their claim. On the other hand, the true Q-values in the state-action space could be expensive and hard to compute.

Dynamic Memory Networks for Visual and Textual Question Answering

by Caiming Xiong, Stephen Merity & Richard Socher
ArXiv, 2016

This paper builds upon the Dynamic Memory Network (DMN) introduced by Kumar et al. in Dynamic Memory Networks for Natural Language Processing, which was aimed at tackling question answering given natural language text input. The original DMN paper [1] proposed the interesting idea of forming an episodic memory from text input and using the final state of the memory, which captures all the information in the previous states, to answer the question. In this paper, Xiong et al. propose several improvements to the input module and the episodic memory module of the DMN, and improve the performance of their network to achieve state-of-the-art results in question answering.

The original DMN paper introduced a model that contains four modules: an input module, which encodes the input information into a set of vectors; a question module, which encodes the question; an episodic memory module, which computes a memory for each time step; and an answer module, which generates a one-word answer. Xiong et al. propose three improvements. The most important and effective improvement in the paper is adding an input fusion layer to the input module, which allows the fact vectors to incorporate information from past and future facts. The second improvement is changing the attention mechanism in the episodic memory module to help capture the ordering information of the fact vectors. This change proposes the creative idea of using gated recurrent units (GRUs) to capture attention information, and is shown to improve performance. The third improvement is changing the update function of the episodic memory to allow different weight updates in each pass.
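One way to read the second improvement is as a GRU whose update gate is replaced by a scalar attention weight per fact, so that the order of the facts matters when they are summarized. The sketch below follows that reading; the parameter names are hypothetical and the attention weights are assumed to be given.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gru(facts, gates, params):
    # facts: (num_facts, d_in); gates: one attention weight in [0, 1] per fact.
    W_r, U_r, b_r = params["W_r"], params["U_r"], params["b_r"]
    W, U, b = params["W"], params["U"], params["b"]
    h = np.zeros(U.shape[0])
    for fact, g in zip(facts, gates):
        r = sigmoid(fact @ W_r + h @ U_r + b_r)          # reset gate, as in a GRU
        h_tilde = np.tanh(fact @ W + r * (h @ U) + b)    # candidate state
        # The attention weight takes the place of the GRU update gate.
        h = g * h_tilde + (1.0 - g) * h
    return h
```

Facts with a gate of zero leave the hidden state untouched, so the final state summarizes only the attended facts, in the order in which they appear.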

Another contribution of this paper is the proposal to modify the input module to take an image as input for the question answering task. It passes the input image to a CNN and treats the output as a 14 by 14 feature map, and then orders the features in a snake-like traversing manner and uses them as an ordered set of input facts.
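The snake-like ordering of the feature map can be written in a few lines. This is a sketch of one plausible reading (left to right on even rows, right to left on odd rows); the function name is hypothetical.

```python
def snake_order(rows, cols):
    # Visit the grid row by row, reversing direction on every other row so
    # that consecutive positions in the sequence are always spatial neighbors.
    order = []
    for r in range(rows):
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in columns)
    return order
```

For a 14 by 14 feature map, `snake_order(14, 14)` yields 196 positions, which can be used to index the feature vectors into an ordered list of input facts.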

Although visual question answering has been studied in the past, the small size of earlier datasets limited the possibility of tackling the problem with neural networks, and the release of the VQA dataset in 2015 removed this obstacle. The use of an attention network fits well in this context; one example is the work of Yang et al. in 2015 [2]. Compared to Yang et al.’s work, the DMN+ paper uses an input fusion layer that captures adjacent information, and in the experiments this change is shown to improve performance substantially on the DAQUAR [3] dataset. The intuition behind this is that the interaction and relationship between adjacent image patches contains helpful information for answering the question. Neural memory models have also been used in several other papers, including the original DMN. Both this paper and the original DMN paper use neural memory models in the input, attention, and memory computation in order to capture temporal and local information.

The model analysis in the experiment section gives a good breakdown of the link between the improvement in performance and each of the proposed changes. Each improvement is tested in isolation to support the corresponding hypothesis, which shows good experiment design. The experiments compare the results with other approaches for both the VQA task and the text QA task, and compare accuracy on several different types of questions.

While the paper is able to adapt the input module to image inputs, there is room for improvement or further exploration. One possible thing to try is to cross-validate across a variety of dimensions for the output of the CNN. Another point for improvement is the way that the input fusion layer traverses the image during the ordering of local patches. Instead of traversing in a snake-like fashion, it would be worthwhile to experiment with something like a z-order curve. For instance, if patch B is located below patch A, then if traversed with a z-order curve, patch B would be closer to patch A in the ordering, and it might help to capture the information between nearby patches.
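The z-order (Morton) traversal suggested here can be obtained by interleaving the bits of the row and column indices and sorting the patches by the resulting code. This is a sketch with hypothetical function names; 4 bits per coordinate is an assumption that covers grids up to 16 by 16.

```python
def morton_index(row, col, bits=4):
    # Interleave the bits of col (even positions) and row (odd positions).
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)
        index |= ((row >> b) & 1) << (2 * b + 1)
    return index

def z_order(rows, cols):
    # Sort all grid positions by their Morton code; this also works when the
    # grid size is not a power of two, such as 14 by 14.
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    return sorted(cells, key=lambda rc: morton_index(rc[0], rc[1]))
```

In `z_order(14, 14)` the patch directly below the top-left patch appears two positions later in the sequence, whereas a row-by-row snake traversal of a 14-wide grid would place it 27 positions later. The benefit is not uniform: a z-order curve keeps many vertical neighbors close but introduces long jumps at block boundaries.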

It is also unclear how the number of passes in the memory update correlates with the final accuracy. It would be valuable to experiment with and discuss how the number of passes affects test accuracy.

In conclusion, this paper presents a valuable improvement to the DMN and achieves state-of-the-art performance. Future work could include trying the model on multi-word open-ended question answering and modifying it to solve harder visual question answering tasks involving temporal information, such as video question answering.

[1] Kumar, A., Irsoy, O., Ondruska, P., Iyyer, M., Bradbury, J., Gulrajani, I., and Socher, R. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. arXiv preprint arXiv:1506.07285, 2015.

[2] Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274, 2015.

[3] Malinowski, M. and Fritz, M. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In NIPS, 2014.