SceneNet: Understanding Real World Indoor Scenes with Synthetic Data

by: Ankur Handa, Viorica Patraucean, Vijay Badrinarayanan, Simon Stent & Roberto Cipolla

ArXiv, 2015

This paper addresses the problem of insufficient training data for deep indoor scene understanding models. Presently, the only indoor depth datasets with per-pixel labels are NYUv2 and SUN RGB-D, which contain only 795 and 5,285 training images, respectively. Since both of these datasets relied on human annotators, dataset creation in this fashion is expensive and time-consuming, and the resulting labels suffer from human error in the form of missing or incorrect annotations. To overcome some of these challenges, the authors compile a new set of fully annotated large 3D (basis) scenes from the internet and generate new scenes by adding objects from shape repositories, which allows them to render many videos of the scenes.

To add variation to the basis scenes, objects can be removed, added, or perturbed from their original positions. To make the scenes as physically plausible as possible, several constraints are placed on an object's potential location: a realistic scene is generated by solving an optimization over bounding-box intersection, pairwise distance, visibility, distance to the wall, and angle to the wall. Since the objects are retrieved from an object repository, the label of each object is already known, so any additions to the scene preserve the fully labeled nature of the basis scenes. The authors recognize that the noise characteristics of these newly generated scenes may not match the real world, so they apply a simulated Kinect noise model. The paper is not concerned with correctly texturing the objects, as only depth maps are used for model training. A rough sketch of what such a placement objective might look like is given below.
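
For intuition, here is a minimal sketch of a placement cost in the spirit of those constraints; the visibility term is omitted, and the term forms and weights are illustrative assumptions, not the paper's:

```python
import numpy as np

def bbox_overlap(a, b):
    """Intersection area of two axis-aligned boxes (x0, y0, x1, y1)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def placement_cost(pos, angle, obj_size, placed_boxes, wall_p0, wall_p1,
                   target_wall_dist=0.5):
    x, y = pos
    box = (x, y, x + obj_size[0], y + obj_size[1])
    # Penalize intersecting the bounding boxes of already-placed objects.
    overlap = sum(bbox_overlap(box, b) for b in placed_boxes)
    # Encourage a minimum pairwise clearance between object centers.
    centers = [((b[0] + b[2]) / 2, (b[1] + b[3]) / 2) for b in placed_boxes]
    clearance = sum(max(0.0, 1.0 - np.hypot(x - cx, y - cy))
                    for cx, cy in centers)
    # Distance from the object center to the wall's supporting line.
    w = np.subtract(wall_p1, wall_p0)
    n = np.array([-w[1], w[0]]) / np.linalg.norm(w)   # wall normal
    wall_dist = abs(np.dot(np.subtract(pos, wall_p0), n))
    dist_term = (wall_dist - target_wall_dist) ** 2   # prefer a stand-off
    wall_angle = np.arctan2(w[1], w[0])
    angle_term = np.sin(angle - wall_angle) ** 2      # 0 when parallel
    return 10.0 * overlap + clearance + dist_term + angle_term

# Crude optimization by random search over candidate placements.
rng = np.random.default_rng(0)
placed = [(1.0, 1.0, 2.0, 2.0)]
candidates = [((rng.uniform(0, 5), rng.uniform(0, 5)), rng.uniform(0, np.pi))
              for _ in range(1000)]
best = min(candidates, key=lambda c: placement_cost(
    c[0], c[1], (0.8, 0.8), placed, (0.0, 0.0), (5.0, 0.0)))
print("best position/orientation:", best)
```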

To measure the improvement synthetic data brings to semantic segmentation, the authors use a state-of-the-art semantic segmentation algorithm built on a VGG network. Since these algorithms are designed for RGB images, they are modified to work on a three-channel depth-based input (DHA: depth, height above ground, and angle with gravity), then trained on synthetic data and fine-tuned on the existing datasets. Fine-tuning is required to make the results compelling; on NYUv2 it yields roughly a 5-point advantage over training on the real dataset alone, and a similar trend holds when fine-tuning on SUN RGB-D. While the results are not always as good as those of Eigen et al. [1] or Hermans et al. [2], they are comparable on a fair number of classes. The main failure classes are paintings, televisions, and windows, which can be attributed to relying solely on depth information: such objects lie flat against walls and are nearly indistinguishable in a depth map.
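
For intuition, a hedged sketch of computing DHA-style channels from a single depth map is shown below; the intrinsics, gravity direction, and ground-plane proxy are simplifying assumptions (the paper estimates such quantities more carefully):

```python
import numpy as np

def depth_to_dha(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project the depth map to camera-space 3D points.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    p = np.stack([x, y, depth], axis=-1)
    # Surface normals from finite differences of the point cloud.
    du = np.gradient(p, axis=1)
    dv = np.gradient(p, axis=0)
    n = np.cross(du, dv)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8
    gravity = np.array([0.0, -1.0, 0.0])      # assumed, not estimated
    angle = np.arccos(np.clip(n @ gravity, -1.0, 1.0))
    # Crude ground proxy: treat the lowest observed point as the floor
    # (image y grows downward, so the largest y is the lowest point).
    height = y.max() - y
    return np.stack([depth, height, angle], axis=-1)

dha = depth_to_dha(np.random.uniform(0.5, 5.0, (480, 640)))  # (480, 640, 3)
```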

It would be interesting to see the effects of adding object textures to these models. Since the models perform well based on depth alone, adding color data should further improve performance. The authors note that applying textures from OpenSurfaces did not mimic the real world, and that ray tracing was too time-consuming. It might still be interesting to see whether models trained on these imperfect textures would help performance; at a minimum, they should improve detection of the "flat" objects that this technique struggled with.

In summary, this paper proposes an interesting solution to the problem of not having enough data to adequately train deep models for indoor scene understanding. The method, even without using RGB data, performs reasonably well and can be trained faster than models using only real-world data.

[1] D. Eigen and R. Fergus, “Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture.”

[2] A. Hermans, G. Floros, and B. Leibe, “Dense 3D Semantic Mapping of Indoor Scenes from RGB-D Images.”

Recurrent Batch Normalization

by: Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre & Aaron Courville

ArXiv, 2016

This paper investigates the effect of batch normalization in recurrent neural networks. Since it was first proposed in Google's Inception v2 network, batch normalization has become a standard technique for training deep neural networks. However, previous work [2] reported that batch normalization on hidden-to-hidden transitions may hurt the performance of LSTMs. In contrast, this paper shows that batch normalization can improve the performance of LSTMs when applied properly, and that the scale parameter (gamma) of batch normalization is crucial to avoiding vanishing gradients.

Techniques such as dropout and batch normalization have been widely applied in training convolutional neural networks. In theory, one could directly apply dropout and batch normalization to each time step of a recurrent neural network, since a recurrent neural network is just a very deep feed-forward network with parameters shared across time steps (and is often implemented that way through unrolling). In practice, however, the large depth (number of time steps) of recurrent networks compared with ordinary ConvNets may prohibit naive application of these techniques.

Previous work [1] applies dropout only to input-to-hidden transitions in recurrent networks, not hidden-to-hidden transitions, so that the input data are corrupted by a fixed number of dropout operations independent of the number of time steps. [2] follows the same intuition and applies batch normalization only to input-to-hidden transitions.

This paper, however, applies batch normalization to both input-to-hidden and hidden-to-hidden transitions in recurrent LSTM networks, and also to the cell state before the output. In other words, batch normalization is applied everywhere except the cell state update, so that the dynamics of the LSTM cell are preserved.

The paper empirically shows that when the scale parameter (gamma) is initialized to 1.0, as is common practice in batch normalization, the gradient vanishes when back-propagating through time. This problem can be fixed by initializing gamma to a smaller value: conjecturing that the saturation of tanh is the main cause of the vanishing gradient, the paper recommends initializing the scale parameter to 0.1 and the bias parameter to 0. Several experiments then show that the batch-normalized LSTM proposed in this paper outperforms a vanilla LSTM in different scenarios. The sketch below illustrates where the normalizers sit in one step.
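
A minimal NumPy sketch of one batch-normalized LSTM step in the spirit of the paper's Eqns. 6-8; per-time-step statistics and the test-time running averages are omitted for brevity:

```python
import numpy as np

def bn(x, gamma, beta=0.0, eps=1e-5):
    """Batch normalization over the batch dimension."""
    mu, var = x.mean(axis=0), x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bn_lstm_step(x, h, c, Wx, Wh, b, g_x, g_h, g_c, beta_c):
    # BN applied separately to the two pre-activation terms (Eqn. 6);
    # their BN biases are redundant with b and folded into it.
    pre = bn(x @ Wx, g_x) + bn(h @ Wh, g_h) + b
    f, i, o, g = np.split(pre, 4, axis=1)
    # The cell update is left un-normalized to preserve the LSTM dynamics.
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    # BN on the cell state before the output (Eqn. 8).
    h_new = sigmoid(o) * np.tanh(bn(c_new, g_c, beta_c))
    return h_new, c_new

batch, d_in, d_hid = 32, 10, 20
rng = np.random.default_rng(0)
Wx = rng.normal(0, 0.1, (d_in, 4 * d_hid))
Wh = rng.normal(0, 0.1, (d_hid, 4 * d_hid))
b = np.zeros(4 * d_hid)
# The paper's recommendation: initialize gamma to 0.1 (not 1.0) and the
# bias to 0, keeping tanh in its non-saturated regime early in training.
g_x, g_h = np.full(4 * d_hid, 0.1), np.full(4 * d_hid, 0.1)
g_c, beta_c = np.full(d_hid, 0.1), np.zeros(d_hid)
h, c = np.zeros((batch, d_hid)), np.zeros((batch, d_hid))
h, c = bn_lstm_step(rng.normal(size=(batch, d_in)), h, c,
                    Wx, Wh, b, g_x, g_h, g_c, beta_c)
```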

In its analysis, the paper empirically investigates the vanishing gradient problem in the batch-normalized LSTM and links it to the tanh function. However, it would be better to see a mathematical analysis of the gradient flow. The cell state normalization at the output (the BN in Eqn. 8 of the paper) also seems especially worth investigating: all the other batch normalization terms (those in Eqn. 6) can be merged into the weight matrices at test time, but the one in Eqn. 8 cannot, so an extra affine transform must be introduced into the LSTM at test time. It would also be worth investigating, with an ablation study, the effect of each individual batch normalization term on the final performance.

In summary, this paper provides useful advice on how batch normalization can be applied to LSTM networks. It would be interesting to see further investigation on the gradient flow in recurrent neural network with batch normalization.

References
[1] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent Neural Network Regularization,” arXiv:1409.2329, 2014.
[2] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, “Batch Normalized Recurrent Neural Networks,” arXiv:1510.01378, 2015.

Perceptual Losses for Real-Time Style Transfer and Super-Resolution

by Justin Johnson, Alexandre Alahi & Fei-Fei Li
Arxiv, 2016

This paper proposes the use of “perceptual losses” for training a feed-forward network for the applications of style transfer and super-resolution. The perceptual losses used in these applications are raw features and second-order statistics computed from a pre-trained CNN, namely the VGG network. This work builds primarily on the style transfer work of Gatys et al., who proposed a method for transferring the style of one image onto the content of another: a new image is synthesized to match the Gram-matrix statistics from multiple layers of the style image and the raw features of a single higher layer of the content image. This was performed in an iterative optimization framework, which can take on the order of seconds. In this paper, a feed-forward network is trained to perform the same task. The observation that an optimization framework can be replaced by a feed-forward network is similar to the work of Dosovitskiy and Brox, who trained a feed-forward network to invert CNN features, a task previously posed as iterative optimization by Mahendran and Vedaldi.
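
As a concrete illustration, here is a hedged PyTorch sketch of the two perceptual losses, assuming a recent torchvision; the layer indices select relu1_2, relu2_2, relu3_3, and relu4_3 in vgg16's `features` stack, the squared-error forms are a simplification, and input normalization to ImageNet statistics is omitted:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Frozen VGG-16 feature extractor; only its activations are used as a loss.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

STYLE_LAYERS = {3, 8, 15, 22}   # relu1_2, relu2_2, relu3_3, relu4_3
CONTENT_LAYER = 15              # relu3_3

def vgg_features(x, wanted):
    feats, out = {}, x
    for i, layer in enumerate(vgg):
        out = layer(out)
        if i in wanted:
            feats[i] = out
    return feats

def gram(f):
    """Second-order feature statistics: the (normalized) Gram matrix."""
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def perceptual_losses(generated, content, style):
    fg = vgg_features(generated, STYLE_LAYERS | {CONTENT_LAYER})
    fc = vgg_features(content, {CONTENT_LAYER})
    fs = vgg_features(style, STYLE_LAYERS)
    content_loss = F.mse_loss(fg[CONTENT_LAYER], fc[CONTENT_LAYER])
    style_loss = sum(F.mse_loss(gram(fg[i]), gram(fs[i]))
                     for i in STYLE_LAYERS)
    return content_loss, style_loss

# These losses would be backpropagated into the feed-forward transformation
# network; the VGG itself stays frozen.
c_loss, s_loss = perceptual_losses(torch.rand(1, 3, 256, 256),
                                   torch.rand(1, 3, 256, 256),
                                   torch.rand(1, 3, 256, 256))
```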

One central claim of the paper is that the method produces “similar qualitative results but…three orders of magnitude faster”, as highlighted in the abstract. Table 1 compares the timing of the proposed feed-forward method against the iterative optimization framework of Gatys et al., along with the relative speed-up factor. However, the results of the feed-forward method are of lower quality, as indicated quantitatively by the loss values in Figure 5 and qualitatively by the examples in Figure 6. Table 1 would be more meaningful if it highlighted the time the Gatys et al. method takes to reach the same quality as the feed-forward network, rather than the time it takes to meet its convergence criteria. From Figure 5, this appears to occur at fewer than 100 iterations, which would put the speed-up factor closer to 150x than the 1000x highlighted in the table and claimed in the abstract. This would be a fairer representation of the results, and still an impressive one.

In addition, a human study, such as a 2AFC test, would help quantify the drop-off in quality between the feed-forward network and the iterative optimization framework, and would substantiate the claim that the feed-forward network produces results of similar quality to the Gatys et al. method. An unmentioned application of this work is as an initialization for the Gatys et al. method, should the user desire higher-quality results. In that case, the speed-up factor of the overall system becomes less impressive, as additional iterations of the original, slower optimization must be run to achieve the desired quality. A speed-up vs. quality curve would therefore provide a good reference for a potential user.

The paper also trains a similar framework for super-resolution, with some results shown in Figure 8. The paper mentions that automated metrics such as the standard PSNR and SSIM do not correlate well with perceived image quality, which is indeed a common problem in image synthesis. However, the paper does not offer any alternatives, such as a human study. Though the results are undoubtedly sharper, as pointed out in the paper, there is a very apparent stippling pattern in the outputs, which may be displeasing to a human evaluator. This is possibly due to the use of only a first-order perceptual loss.

In summary, the paper proposes the novel use of perceptual losses, extracted from pre-trained networks, to train feed-forward networks, and applies the losses to the tasks of style transfer and super-resolution. Using the proposed perceptual losses as a general framework for other structured output tasks, such as colorization, semantic segmentation, and surface normal prediction, as mentioned in the paper, certainly seems like a plausible and worthy direction of further research for the community.

Asynchronous Methods for Deep Reinforcement Learning

by Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy Lillicrap, David Silver & Koray Kavukcuoglu
Arxiv, 2016

The paper uses asynchronous gradient descent to perform deep reinforcement learning. The authors execute multiple agents in parallel on multiple instances of the environment, decorrelating the agents' experience and making the learning process more stationary. They observe that multiple parallel agents tend to explore different parts of the environment, so each agent learns a policy quite different from the others. This suggests that a proper asynchronous combination of each parallel agent's updates can produce a better shared global policy. The authors choose to implement this asynchronous reinforcement learning framework on a single multi-core CPU machine, rather than in a distributed setting or on GPUs as in the earlier Gorila framework. This dramatically reduces the computational cost and improves training speed.

Moreover, the updates (gradients) to the global policy made by multiple actor-learners are less correlated in time than those of a single agent applying online updates, so the multi-agent strategy also stabilizes learning. The authors use multiple agents to update the controller without the experience replay used in the DQN algorithm, resulting in 1) a more efficient algorithm with reduced computational cost, and 2) an algorithm that can also employ on-policy methods.

The authors then describe four asynchronous algorithms for reinforcement learning, namely asynchronous one-step Q-learning, asynchronous one-step Sarsa, asynchronous n-step Q-learning, and asynchronous advantage actor-critic, and evaluate their performance on different games. The new algorithms succeed on a wide variety of continuous motor control problems. Asynchronous training on a 16-core CPU takes only half the time of the previous GPU implementation, and the algorithms obtain an almost linear speed-up in learning when using multiple CPU threads compared to a single thread. Lastly, the authors state that their asynchronous methods are complementary to experience replay, the dueling architecture, and other techniques for making the learned agent perform its tasks more accurately. The toy sketch below illustrates the underlying lock-free update pattern.
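
The following is only a toy illustration of the lock-free asynchronous update pattern underlying these methods (Hogwild-style shared-parameter SGD), not the full A3C algorithm: several worker threads repeatedly compute a gradient against shared parameters and apply it without locks. Each worker's own quadratic "loss" stands in for the decorrelated experience a parallel actor-learner would collect:

```python
import threading
import numpy as np

theta = np.zeros(4)                  # shared global parameters

def worker(target, steps=2000, lr=0.01):
    global theta
    for _ in range(steps):
        local = theta.copy()         # read a (possibly stale) snapshot
        grad = 2.0 * (local - target)
        theta = theta - lr * grad    # lock-free asynchronous update

targets = [np.full(4, t) for t in (0.8, 1.0, 1.2, 1.0)]
threads = [threading.Thread(target=worker, args=(t,)) for t in targets]
for t in threads: t.start()
for t in threads: t.join()
print("converged parameters:", theta)  # near the mean of the targets
```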

Discussion

This is a well-done study of the authors' new asynchronous parallel reinforcement learning algorithms. The work makes the observation that multiple agents typically explore different parts of the environment. Leveraging the effect of this phenomenon on a global policy, the authors improve on the earlier DQN-based Gorila framework (Nair, 2015) by implementing asynchronous updates from multiple parallel agents. This dramatically sped up training in various games, as the parallel agents share their decorrelated experience to improve learning speed.

Further, owing to the decorrelated experience gathered by these parallel agents, the new system no longer needs the experience replay memory that Gorila used to stabilize learning. This simplifies the model and reduces redundant memory use; after all, simpler models often generalize better.

The authors also indicate that asynchronous one-step Q-learning and Sarsa achieve superlinear speed-ups. It would be interesting to see an analysis of why the parallel one-step methods require less data to reach a similar level of convergence than their counterparts.

The authors not only evaluate on various games (including 3D games) to compare the convergence speed of their asynchronous algorithms with previous synchronous algorithms, they also try asynchronous versions of two optimization methods, momentum SGD and RMSProp (Tieleman and Hinton, 2012), to investigate which performs better. Shared RMSProp proves the most robust and stable compared to momentum SGD. It would have been interesting to also compare the robustness of the new methods and DQN under these two optimizers, though that may be out of the scope of this paper.

The paper also mentions a few future directions for improving the new algorithms. We find the direction of better estimating the state-action Q-value interesting. Methods such as reducing overestimation bias (Van Hasselt, 2015; Bellemare, 2016) and more accurate Q-value estimation using the dueling architecture (Wang, 2015) should provide further improvements and would be complementary to the current paper. A graph comparing the learned Q-value estimates in the state-action space against the true Q-values would have been great to support this claim; on the other hand, true Q-values over the state-action space can be expensive and hard to compute.

Dynamic Memory Networks for Visual and Textual Question Answering

by Caiming Xiong, Stephen Merity & Richard Socher
Arxiv, 2016

This paper builds upon the Dynamic Memory Network (DMN) introduced by Kumar et al. in “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing,” which tackled question answering over natural language text. The original DMN paper [1] proposed the interesting idea of forming an episodic memory from the text input and using the final state of the memory, which captures all the information of the previous states, to answer the question. In this paper, Xiong et al. propose several improvements to the input module and the episodic memory module of the DMN, improving the network's performance to state-of-the-art in question answering.

The original DMN contains four modules: an input module that encodes the input information into a set of vectors, a question module that encodes the question, an episodic memory module that computes a memory at each pass, and an answer module that generates a one-word answer. Xiong et al. propose three improvements. The most important and effective is adding an input fusion layer to the input module, which allows each fact vector to incorporate information from past and future facts. The second improvement changes the attention mechanism in the episodic memory module to help capture the ordering of the fact vectors; it introduces the creative idea of a gated recurrent unit whose update gate is driven by the attention weights, and is shown to improve performance (see the sketch below). The third improvement changes the update function of the episodic memory to allow different weights in each pass.
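
A hedged NumPy sketch of that attention-driven GRU step, with random placeholder weights; the scalar attention gate g_i for each fact replaces the GRU's learned update gate, so the episode summary is steered by attention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attn_gru(facts, gates, Wr, Ur, W, U, br, bh):
    """facts: (T, d) fact vectors; gates: (T,) attention weights g_i."""
    h = np.zeros(W.shape[0])
    for f, g in zip(facts, gates):
        r = sigmoid(Wr @ f + Ur @ h + br)          # reset gate as usual
        h_tilde = np.tanh(W @ f + U @ (r * h) + bh)
        h = g * h_tilde + (1.0 - g) * h            # attention replaces the
    return h                                       # update gate

d = 8
rng = np.random.default_rng(0)
facts = rng.normal(size=(5, d))
gates = np.array([0.05, 0.7, 0.1, 0.1, 0.05])      # e.g. softmax attention
params = [rng.normal(0, 0.1, (d, d)) for _ in range(4)]
episode = attn_gru(facts, gates, *params, np.zeros(d), np.zeros(d))
```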

Another contribution of this paper is modifying the input module to take an image as input for visual question answering. The input image is passed through a CNN, the output is treated as a 14-by-14 grid of local feature vectors, and the features are ordered by a snake-like traversal to form an ordered set of input facts, as sketched below.
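
A small sketch of the snake-like ordering: left-to-right on even rows, right-to-left on odd rows, so consecutive facts are always spatially adjacent.

```python
import numpy as np

def snake_order(feature_grid):
    """feature_grid: (H, W, d) -> (H*W, d) ordered facts."""
    rows = [row if i % 2 == 0 else row[::-1]
            for i, row in enumerate(feature_grid)]
    return np.concatenate(rows, axis=0)

facts = snake_order(np.random.rand(14, 14, 512))   # 196 ordered facts
```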

Although visual question answering has been studied in the past, the small size of earlier datasets limited the applicability of neural networks; the release of the VQA dataset in 2015 removed this obstacle. The use of an attention network fits well in this context, one example being the 2015 work of Yang et al. [2]. Compared to Yang et al.'s work, the DMN+ paper uses an input fusion layer that captures information from adjacent image patches, and the experiments show this change substantially improves performance on the DAQUAR [3] dataset. The intuition is that the interactions and relationships between adjacent image patches carry information that is helpful for answering the question. Neural memory models have also been used in several other papers, including the original DMN. Both this paper and the original DMN use neural memory models in the input, attention, and memory-update stages to capture temporal and local information.

The model analysis in the experiments section gives a good breakdown of the link between each proposed change and the resulting improvement in performance. Each improvement is evaluated in isolation, which reflects good experimental design. The experiments compare results against other approaches on both the visual and textual QA tasks, and compare accuracy across several different question types.

While the paper adapts the input module to image inputs, there is room for further exploration. One possibility is to cross-validate over a range of dimensions for the output of the CNN. Another is the way the input fusion layer traverses the image when ordering local patches: instead of a snake-like traversal, it would be worthwhile to experiment with something like a z-order curve. For instance, if patch B is located below patch A, a z-order traversal would place patch B closer to patch A in the ordering, which might better capture the information shared between nearby patches. A sketch of such an ordering follows.
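
A hypothetical z-order (Morton) alternative to the snake traversal, included to make the suggestion concrete: interleaving the bits of each patch's row and column index and sorting by the resulting code keeps most spatially nearby patches nearby in the ordering, including vertical neighbors, which the snake order can separate by up to a full row.

```python
import numpy as np

def morton_code(r, c, bits=4):
    """Interleave the bits of (r, c); 4 bits covers indices up to 15."""
    code = 0
    for b in range(bits):
        code |= ((r >> b) & 1) << (2 * b + 1) | ((c >> b) & 1) << (2 * b)
    return code

def z_order(feature_grid):
    h, w, _ = feature_grid.shape
    idx = sorted(((r, c) for r in range(h) for c in range(w)),
                 key=lambda rc: morton_code(*rc))
    return np.stack([feature_grid[r, c] for r, c in idx])

facts_z = z_order(np.random.rand(14, 14, 512))     # 196 ordered facts
```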

It is also unclear what the correlation is between the number of passes used in the memory update and the final accuracy. It would be valuable to experiment with and discuss how the number of passes affects test accuracy.

In conclusion, this paper presents a valuable improvement to the DMN and achieves state-of-the-art performance. Future work could try the model on open-ended, multi-word question answering, and modify it to solve harder visual question answering tasks involving temporal information, such as video question answering.

[1] Kumar, A., Irsoy, O., Ondruska, P., Iyyer, M., Bradbury, J., Gulrajani, I., and Socher, R. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. arXiv preprint arXiv:1506.07285, 2015.

[2] Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274, 2015.

[3] Malinowski, M. and Fritz, M. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In NIPS, 2014.

Generative Image Modeling using Style and Structure Adversarial Networks

by Xiaolong Wang & Abhinav Gupta
Arxiv, 2016

Generative Adversarial Networks (GANs) have recently been proposed to generate realistic-looking images from random noise vectors. While images generated by GANs achieve state-of-the-art perceptual quality, the approach ignores the basic principle that image formation is a product of different factors (e.g., geometry, lighting, texture), and the lack of explicit control or guidance over the generation process makes the sampled images hard to interpret. This paper shows that by explicitly factorizing the generation process into two components, structure (3D geometry) and style (texture and lighting), it is possible to obtain higher-quality image samples, stronger features for representation learning, and better interpretability of the generative process than with a standard GAN.

The proposed approach is simple and intuitive: first train the Structure-GAN and Style-GAN independently, then merge the two networks for joint training. In the independent stage, the Structure-GAN is trained to generate surface normal maps from random noise vectors, while the Style-GAN is trained to generate images conditioned on ground-truth surface normal maps, with the output images optimized not only to look realistic but also to satisfy pixel-wise surface normal constraints. In the joint stage, the output of the Structure-GAN replaces the ground-truth surface normal maps as input to the Style-GAN, and both networks are fine-tuned end-to-end to generate realistic images. The schematic sketch below shows the factorized sampling path.
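
A schematic sketch of the factorized sampling path only; the architectures are placeholder stubs, not the paper's networks, and the discriminators and losses are omitted:

```python
import torch
import torch.nn as nn

class StructureG(nn.Module):
    """Noise vector -> surface normal map (structure: 3D geometry)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(100, 3 * 32 * 32), nn.Tanh())
    def forward(self, z):
        n = self.net(z).view(-1, 3, 32, 32)
        return n / (n.norm(dim=1, keepdim=True) + 1e-8)  # unit normals

class StyleG(nn.Module):
    """Normal map + style noise -> image (style: texture and lighting)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())
    def forward(self, normals, z_style):
        zmap = z_style.view(-1, 1, 1, 1).expand(-1, 1, 32, 32)
        return self.net(torch.cat([normals, zmap], dim=1))

# Joint-stage wiring: the Structure-GAN's output feeds the Style-GAN,
# so the whole path can be fine-tuned end-to-end.
z_struct, z_style = torch.randn(4, 100), torch.randn(4)
normals = StructureG()(z_struct)
images = StyleG()(normals, z_style)
```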

This paper is one of the first attempts at learning factorized GANs. The explicit factorization of style and structure allows one to probe the network's generative process by varying one factor at a time while fixing the other. However, the effect of the factorization seems more evident when varying structure with style fixed than the other way around (style does not seem to change much while walking in the latent space in Fig. 9). One potential explanation is that, unlike the Structure-GAN, the Style-GAN is trained in a one-to-one conditional setting that limits its ability to model a multi-modal distribution. In other words, the Style-generator might learn to ignore the input noise vector, since the supervision is defined on a normal-image pair (i.e., no randomness).

The use of the multi-task FCN loss for training the Style-GAN is also interesting. In theory, the per-pixel normal reconstruction loss should not be needed to obtain pixel-level correspondence between the input normal maps and the generated images, since the Style-GAN is conditional and treats a normal-image pair as a single training example. The fact that adding a per-pixel loss yields better alignment highlights the difficulty of optimizing the conditional GAN alone. This optimization instability is also implied by several engineering tricks (e.g., BatchNorm is used in the Structure-Generator but not the Structure-Discriminator, the FCN is not fine-tuned initially while the generated images are still poor, and gradients from the Style-GAN are down-weighted during joint training to prevent over-fitting).

Given that quantitative evaluation of generative models is notoriously challenging, this paper proposes to use the response statistics of pre-trained classifiers on the generated images as a pseudo-metric, on the assumption that if the images are realistic enough, classifiers will fire on them with high scores. However, off-the-shelf classifiers are not perfect and exhibit different error modes, which might be particularly sensitive to artifacts unique to a given method. The numbers in Fig. 11 (a) and (b) should therefore be taken with a grain of salt.

The strongest quantitative result comes from the representation learning experiments, especially on scene classification, where the proposed method outperforms DCGAN by a large margin and falls only 3.7% short of Places-AlexNet. However, it is unclear whether the improvement comes from the factorization or simply from extra supervision (ground-truth surface normals) not available to DCGAN (or Places-AlexNet). In other words, it remains to be seen whether factorized generation is indeed better for representation learning than direct generation when both are given the same amount of supervision.

Overall, this paper presents a novel method for factorizing image generation into structure and style using generative adversarial networks, and demonstrates its advantages over a standard GAN in sample quality, interpretability and control of the generative process, and the generalization of the learned features to recognition tasks. Nonetheless, it would be nice to have a more in-depth analysis of the effect of the factorization (e.g., what the Style-GAN distills that is conflated in a standard GAN) and an ablation study of whether the improvement results from the factorization or from the extra surface normal supervision unavailable to the standard GAN.

Dueling Network Architectures For Deep Reinforcement Learning

by Ziyu Wang, Nando de Freitas & Marc Lanctot
Arxiv, 2016

This paper is motivated by recent successes in deep reinforcement learning using the advantage function, a measure of the relative importance of each action from a finite discrete set at a given state. The authors propose a dueling neural network architecture for model-free reinforcement learning that separates the estimation of the state value function from the state-dependent action advantage function, with the final goal of approximating the Q-function.

The proposed framework is particularly useful in situations where actions do not meaningfully change the environment in certain states, which happens when the action space is large or contains redundancy. In such situations, the proposed dueling network approximates the Q-function more accurately than other state-of-the-art Q-learning approaches. Another benefit of the architecture is that it can easily be combined with other RL algorithms.

The first stream, which estimates the state value function, handles situations where estimating the value of each individual action is unnecessary. The second stream, which estimates the advantage function, is useful when the network needs to express a preference among the actions in a given state. The outputs of the two streams are combined by an aggregator to produce the Q-function, and this aggregating layer leads to automatic estimation of both the value and advantage functions through back-propagation, without any extra supervision. The sketch below shows the aggregation.
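
To make the aggregation concrete, a minimal sketch with a placeholder trunk (not the paper's convolutional network): the streams are combined as Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)), where subtracting the mean resolves the V/A identifiability issue so both streams can be learned from the Q-learning loss alone.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, obs):
        h = self.trunk(obs)
        v = self.value(h)                               # (batch, 1)
        a = self.advantage(h)                           # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)      # Q(s, a)

q = DuelingQNet(obs_dim=8, n_actions=18)(torch.randn(32, 8))
```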

The performance of the proposed dueling network is compared against baselines in different settings. First, the authors examine the learned Q-values in a simple corridor environment, where taking any action at each state under the learned policy makes the computation of the Q-value independent of other states and actions. In this environment, they compare the dueling network to a single-stream architecture with the same number of parameters on a policy evaluation task; increasing the number of possible actions widens the performance gap in favor of the dueling network. Second, on a suite of Atari games, the dueling network is trained with the DDQN algorithm, so the authors compare its performance against state-of-the-art DDQN results, again with the same number of parameters (assuming equal capacity). These experiments confirm the improvements brought by the dueling network.

Although it is a novelty of the paper that the value and advantage functions are estimated “automatically” through back-propagation across the proposed aggregator, it is not entirely clear how this back-propagation yields estimates of the two individual functions; additional explanation of the aggregator's properties or a mathematical proof would be useful. Furthermore, the paper claims that the proposed network is complementary to other Q-network algorithms such as DQN and DDQN, and that modifications to those methods (such as replay memories) apply here as well; it would be interesting to see how well this claim holds in practice. In addition, some detailed commentary on the games for which the dueling network did particularly well or poorly against DDQN or human players would be useful. Why is it so good at Atlantis and Breakout, but so bad at Freeway and Asteroids? Any common features or patterns among these games might highlight the situations in which the dueling network performs particularly well. It would also be useful to know in which other situations (beyond large action spaces) this dueling architecture can be applied.