End-to-End Memory Networks

by Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston & Rob Fergus
arXiv, 2015

There has been a recent surge of interest in computational models with easily accessible and scalable memory, which would allow them to perform tasks that require multi-hop computation and to process variable-length inputs. In particular, this work adds to the efforts on allowing neural networks to capture long-term dependencies in sequential data.

The standard RNN or LSTM architectures, while performing well on various sequential tasks like captioning or translation, can only capture 'memory' via a latent state that is unstable over long timescales. This prevents such models from being applied in settings that require recalling facts after a long time period (e.g. question answering after reading a story). To overcome this limitation and obtain a more explicit, longer-term memory representation, this work leverages explicit storage and attention-based ideas explored previously, in particular by two related recent methods, 'Neural Turing Machines' [1] and 'Memory Networks' [2]. While [1] demonstrates applications of memory models to tasks like sorting and exploits both memory-read and memory-write operations, this work and [2] address text-based reasoning and focus only on memory-read operations. This paper relaxes the strong supervision requirement of [2] (where the index of the memory to be looked up was required during training) and shows interesting quantitative as well as qualitative results on the tasks addressed.

Given a sequence of sentences and a question, the aim is to produce an answer from a vocabulary V. The main module of the architecture is a memory layer, which takes an input u (for example, an embedded question) and has access to the stored sequential input (e.g. the sentences). This layer first linearly embeds all the sentences and computes a memory output via a weighted average of the embedded sentences. The weights model soft attention and are computed as a softmax over the similarities of the embedded sentences with the layer input. The final output of this layer is the sum of its input and the memory output.
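A single memory read as described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: sentences are assumed to be bag-of-words vectors, and the names A and C (separate embedding matrices for addressing memories and for composing the output) follow the paper's notation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(u, sentences, A, C):
    """One memory read.
    u         : (d,)   layer input, e.g. the embedded question
    sentences : (n, V) bag-of-words rows, one per stored sentence
    A, C      : (V, d) embedding matrices (addressing / output)
    """
    m = sentences @ A       # (n, d) memory vectors for matching
    c = sentences @ C       # (n, d) memory vectors for output
    p = softmax(m @ u)      # soft attention over the n memories
    o = p @ c               # attention-weighted memory output
    return u + o            # layer output: input plus memory read
```

The sum `u + o` is what lets layers be stacked: each hop refines the query with whatever it read from memory.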

The architecture for the question answering tasks is as follows: the input sentences are stored, and the question is linearly embedded to serve as input to the first of K (K <= 3) memory layers. The output of the k-th memory layer serves as input to the (k+1)-th memory layer. Finally, the output of the last memory layer, followed by a linear projection and a softmax, yields a probability for each word in V.
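The full K-hop pipeline can then be sketched end to end. Again this is a simplified illustration under bag-of-words assumptions: the paper's layer-wise weight-sharing schemes (Sec 2.2) are omitted, and the final matrix W (mapping the last hop's state to vocabulary scores) follows the paper's notation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(u, sentences, A, C):
    # One soft-attention memory read (see the sketch above).
    m = sentences @ A
    c = sentences @ C
    p = softmax(m @ u)
    return u + p @ c

def answer_distribution(question_bow, sentences, A, C, W, K=3):
    """Run K memory hops and score every word in the vocabulary.
    question_bow : (V,)   bag-of-words question
    sentences    : (n, V) stored input sentences
    A, C         : (V, d) embedding matrices (shared across hops here)
    W            : (d, V) final linear projection
    Returns a (V,) probability distribution over candidate answers.
    """
    u = question_bow @ A          # embed the question as the initial input
    for _ in range(K):
        u = memory_hop(u, sentences, A, C)
    return softmax(u @ W)         # probability for each word in V
```

Because every step is differentiable, the whole stack trains with ordinary backpropagation from the answer loss alone, which is exactly how the strong supervision of [2] is relaxed.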

There are several finer points in the implementation which are crucial for performance, including: parameter sharing across layers (Sec 2.2), representing sentences as bags of words vs. position-encoded vectors (Sec 4.1), modifying embeddings by adding a sentence-index-specific bias (Sec 4.1), random noise (Sec 4.1), and finally the use of a particular training schedule (Sec 4.2).
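Of these, the position-encoding scheme is concrete enough to sketch. Per Sec 4.1 of the paper, a sentence representation becomes a position-weighted (rather than uniform) sum of its word embeddings, with weights l_kj = (1 - j/J) - (k/d)(1 - 2j/J) for word position j of J and embedding dimension k of d (1-indexed):

```python
import numpy as np

def position_encoding(J, d):
    """Position-encoding weight matrix l of shape (J, d):
    l[j, k] = (1 - j/J) - (k/d) * (1 - 2j/J), with j, k 1-indexed.
    A sentence of J words is encoded as sum_j l[j] * embed(word_j)
    instead of a plain bag-of-words sum, so word order matters."""
    j = np.arange(1, J + 1)[:, None]   # word positions, shape (J, 1)
    k = np.arange(1, d + 1)[None, :]   # embedding dims,  shape (1, d)
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)
```

Note the fixed, hand-designed form; the review's later suggestion of a learned RNN encoder is one natural replacement for this.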

The paper first evaluates on the QA task of [2], demonstrating improvements over an LSTM variant and empirically validating the various training choices. The quantitative results, while still short of the strongly supervised method of [2], highlight the importance of allowing multiple memory-read operations, and the qualitative visualizations of the soft attention provide additional insights. The paper further adapts the architecture to a word prediction task, demonstrating improvements over standard approaches and exploring the effect of a larger memory.

The paper also mentions that the model performance is high-variance and that the run with the lowest training error is chosen – it would be instructive to know whether a similar strategy was required for the baseline methods. Another claim made is that joint training across tasks helps performance, but this seems not to hold in the regime of large training data, and the paper could speculate on why. It would also be interesting to see some examples and more details on how the system deals with variable-length sequential output – the exposition in the paper does not make this immediately clear. This work could also be extended to leverage the sequential nature of the task more explicitly: the current method uses a fixed "Position Encoding" of sentences and a sentence-index-based bias feature in the memory module, and these could be replaced by learned RNNs, possibly similar to [2]. Incorporating a simple baseline based on sentence parsing and logical reasoning would further serve to highlight the benefits of the learning-based methods presented.

Overall, this is an interesting direction. The soft-attention-based memory lookup relaxes strong supervision requirements and can potentially enable applications to real question answering – it would be really exciting to see further developments along these directions.



[1] Graves, Alex, Greg Wayne, and Ivo Danihelka. “Neural Turing Machines.” arXiv preprint arXiv:1410.5401 (2014).

[2] Weston, Jason, Sumit Chopra, and Antoine Bordes. “Memory networks.” arXiv preprint arXiv:1410.3916 (2014).

