Dynamic Memory Networks for Visual and Textual Question Answering

by Caiming Xiong, Stephen Merity & Richard Socher
Arxiv, 2016

This paper builds upon the Dynamic Memory Network (DMN) introduced by Kumar et al. in Ask Me Anything: Dynamic Memory Networks for Natural Language Processing, which was aimed at question answering over natural language text. The original DMN paper[1] proposed the interesting idea of forming an episodic memory from the text input and using the final state of the memory, which captures the information in all previous states, to answer the question. In this paper, Xiong et al. propose several improvements to the input module and the episodic memory module of the DMN, and the resulting network achieves state-of-the-art performance in question answering.

The original DMN paper introduced a model that contains four modules: an input module that encodes the input into a set of vectors, a question module that encodes the question, an episodic memory module that computes a memory at each step, and an answer module that generates a one-word answer. Xiong et al. propose three improvements. The most important and effective one is adding an input fusion layer to the input module, which allows each fact vector to incorporate information from past and future facts. The second improvement changes the attention mechanism in the episodic memory module so that it captures the ordering of the fact vectors. This change introduces the creative idea of using a gated recurrent unit (GRU) to carry the attention information, and it is shown to improve performance. The third improvement changes the update function of the episodic memory so that each pass uses its own weights rather than sharing them across passes.
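To make the second improvement concrete, here is a minimal NumPy sketch of one pass of an attention-gated GRU: the scalar attention weight for each fact takes the place of the usual update gate, so the order of the facts affects the final state. The function names, the parameter layout, and the choice of a hidden size equal to the fact size are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(d, seed=0):
    """Hypothetical helper: small random weights and zero biases for size d."""
    rng = np.random.default_rng(seed)
    W_r, U_r, W, U = (rng.normal(scale=0.1, size=(d, d)) for _ in range(4))
    return W_r, U_r, np.zeros(d), W, U, np.zeros(d)

def attention_gru_pass(facts, gates, params, h0=None):
    """Run one pass of an attention-gated GRU over ordered fact vectors.

    facts: (n_facts, d) array; gates: (n_facts,) attention weights in [0, 1].
    Returns the final hidden state, used as the context vector for this pass.
    """
    W_r, U_r, b_r, W, U, b = params
    h = np.zeros(facts.shape[1]) if h0 is None else h0
    for f, g in zip(facts, gates):
        r = sigmoid(W_r @ f + U_r @ h + b_r)        # reset gate
        h_tilde = np.tanh(W @ f + r * (U @ h) + b)  # candidate state
        h = g * h_tilde + (1.0 - g) * h             # attention weight acts as the update gate
    return h
```

A fact with attention weight 0 leaves the state untouched, while a weight of 1 overwrites it with the candidate state, which is how the attention scores decide what the pass remembers.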

Another contribution of this paper is a modified input module that takes an image as input for the question answering task. It passes the input image through a CNN, treats the output as a 14 by 14 grid of local feature vectors, traverses the grid in a snake-like order, and uses the resulting sequence as an ordered set of input facts.
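The snake-like traversal can be sketched in a few lines of NumPy. This is only an illustration: the function name and the (height, width, channels) array layout are assumptions, and the CNN itself is left out.

```python
import numpy as np

def snake_order(feature_map):
    """Flatten an (H, W, C) feature map into an (H*W, C) sequence of facts.

    Even rows are read left to right and odd rows right to left, so
    consecutive facts in the sequence are always neighbouring patches.
    """
    rows = [row if i % 2 == 0 else row[::-1] for i, row in enumerate(feature_map)]
    return np.concatenate(rows, axis=0)

# A tiny 3 x 3 map with one channel, labelled 0..8 in row-major order.
toy = np.arange(9).reshape(3, 3, 1)
order = snake_order(toy)[:, 0].tolist()
```

For the toy map the resulting order is 0, 1, 2, 5, 4, 3, 6, 7, 8, and a 14 by 14 map yields a sequence of 196 facts.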

Although visual question answering has been studied in the past, the small size of the available datasets limited the possibility of tackling the problem with neural networks, and the release of the VQA dataset in 2015 removed this obstacle. Attention networks fit well in this context; one example is the work of Yang et al. in 2015[2]. Compared to Yang et al.’s work, the DMN+ paper uses an input fusion layer that captures information from adjacent patches, and the experiments show that this change substantially improves performance on the DAQUAR[3] dataset. The intuition is that the interactions and relationships between adjacent image patches contain information that helps answer the question. Neural memory models have also been used in several other papers, including the original DMN. Both this paper and the original DMN paper use neural memory models in the input module, the attention mechanism, and the memory computation in order to capture temporal and local information.

The model analysis in the experiment section gives a good breakdown of how much each proposed change contributes to the improvement in performance. Each change is evaluated in isolation to support the reasoning behind it, which shows good experiment design. The experiments compare the results with other approaches on both the VQA task and the text QA task, and compare accuracy across several different types of question.

While the paper is able to adapt the input module to image inputs, there is room for improvement or further exploration. One thing to try is to cross-validate over a range of dimensions for the output of the CNN. Another point for improvement is the way the input module traverses the image when ordering the local patches. Instead of traversing in a snake-like fashion, it would be worthwhile to experiment with something like a z-order curve. For instance, if patch B is located directly below patch A, then in a z-order traversal patch B would tend to be closer to patch A in the ordering, which might help capture the information shared between nearby patches.
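A small sketch can make the z-order suggestion measurable. It builds both orderings for a grid and reports how far apart vertically adjacent patches end up in each sequence. All names here are hypothetical, and the mean vertical gap is just one possible measure of locality.

```python
def morton_code(r, c, bits):
    """Interleave the bits of the row and column index (z-order / Morton code)."""
    code = 0
    for i in range(bits):
        code |= ((c >> i) & 1) << (2 * i)
        code |= ((r >> i) & 1) << (2 * i + 1)
    return code

def z_order_cells(h, w):
    """Grid cells sorted by Morton code; also works when h, w are not powers of two."""
    bits = max(h, w).bit_length()
    cells = [(r, c) for r in range(h) for c in range(w)]
    return sorted(cells, key=lambda rc: morton_code(rc[0], rc[1], bits))

def snake_cells(h, w):
    """Grid cells in snake order: even rows left to right, odd rows right to left."""
    return [(r, c if r % 2 == 0 else w - 1 - c) for r in range(h) for c in range(w)]

def mean_vertical_gap(cells):
    """Average distance in the sequence between a patch and the patch directly below it."""
    pos = {rc: i for i, rc in enumerate(cells)}
    gaps = [abs(pos[(r + 1, c)] - pos[(r, c)]) for (r, c) in cells if (r + 1, c) in pos]
    return sum(gaps) / len(gaps)

if __name__ == "__main__":
    for name, cells in [("snake", snake_cells(14, 14)), ("z-order", z_order_cells(14, 14))]:
        print(name, mean_vertical_gap(cells))
```

On a 4 by 4 grid the mean vertical gap is 4 for the snake order and about 3.33 for the z-order. The trade-off is that horizontal neighbours, which are always adjacent in the snake order, are no longer always adjacent under the z-order, so whether the change helps accuracy would have to be tested empirically.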

It is also unclear how the number of passes used for the memory update correlates with the final accuracy. It would be valuable to run experiments on, and discuss, how the number of passes affects test accuracy.
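For reference, the role of the number of passes can be sketched as follows, assuming a ReLU update with a separate weight matrix per pass and the memory initialised to the question vector. The function name, the per-pass bias, and the precomputed context vectors are assumptions for illustration; in the full model each context would come from the attention mechanism for that pass.

```python
import numpy as np

def run_memory_passes(contexts, q, weights, biases):
    """Apply m_t = ReLU(W_t [m_{t-1}; c_t; q] + b_t) once per pass, starting from m_0 = q.

    contexts: list of per-pass context vectors c_t, each of size d.
    weights:  list of (d, 3d) matrices, one per pass (untied across passes).
    biases:   list of size-d vectors, one per pass.
    """
    m = q
    for c, W, b in zip(contexts, weights, biases):
        m = np.maximum(0.0, W @ np.concatenate([m, c, q]) + b)
    return m

# Toy run with d = 2 and three passes of random weights.
rng = np.random.default_rng(0)
d, num_passes = 2, 3
q = rng.normal(size=d)
contexts = [rng.normal(size=d) for _ in range(num_passes)]
weights = [rng.normal(scale=0.1, size=(d, 3 * d)) for _ in range(num_passes)]
biases = [np.zeros(d) for _ in range(num_passes)]
memory = run_memory_passes(contexts, q, weights, biases)
```

In this form the number of passes is simply the length of the lists, so an ablation over it would amount to training and evaluating models with lists of different lengths.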

In conclusion, this paper presents a valuable improvement to the DMN and achieves state-of-the-art performance. Future work could include applying the model to open-ended question answering with multi-word answers, and extending it to harder visual question answering tasks that involve temporal information, such as video question answering.

[1] Kumar, A., Irsoy, O., Ondruska, P., Iyyer, M., Bradbury, J., Gulrajani, I., and Socher, R. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. arXiv preprint arXiv:1506.07285, 2015.

[2] Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274, 2015.

[3] Malinowski, M. and Fritz, M. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In NIPS, 2014.

