## Introduction

* Neural network with a recurrent attention model over a large external memory.
* Continuous form of Memory Network, but trained end-to-end, so it can be applied to more domains.
* Extension of RNNSearch that can perform multiple hops (computational steps) over the memory per output symbol.
* [Link to the paper](http://arxiv.org/pdf/1503.08895v5.pdf).
* [Link to the implementation](https://github.com/facebook/MemNN).

## Approach

* The model takes as input $x_1, ..., x_n$ (to store in memory) and a query $q$, and outputs an answer $a$.

### Single Layer

* The input set ($x_i$) is embedded in a $d$-dimensional space using embedding matrix $A$ to obtain the memory vectors ($m_i$).
* The query is also embedded, using matrix $B$, to obtain the internal state $u$.
* Compute the match between each memory $m_i$ and $u$ in the embedding space, followed by a softmax, to obtain a probability vector $p$ over the inputs.
* Each $x_i$ also maps to an output vector $c_i$ (using embedding matrix $C$).
* The output $o$ is the sum of the $c_i$, weighted by the $p_i$.
* The sum of the output vector $o$ and the internal state $u$ is passed through the weight matrix $W$, followed by a softmax, to produce the predicted answer: $\hat{a} = \text{softmax}(W(o + u))$.
* $A$, $B$, $C$ and $W$ are learnt by minimizing the cross-entropy loss.

### Multiple Layers

* For layers above the first, the input is $u^{k+1} = u^k + o^k$.
* Each layer has its own $A^k$ and $C^k$ - with constraints.
* At the final layer, the prediction is $\hat{a} = \text{softmax}(W(o^K + u^K))$.

### Constraints On Embedding Vectors

* Adjacent
    * The output embedding of one layer is the input embedding of the next, i.e. $A^{k+1} = C^k$.
    * $W^T = C^K$.
    * $B = A^1$.
* Layer-wise (RNN-like)
    * Same input and output embeddings across layers, i.e. $A^1 = A^2 = ... = A^K$ and $C^1 = C^2 = ... = C^K$.
    * A linear mapping $H$ is added to the update of $u$ between hops: $u^{k+1} = Hu^k + o^k$.
    * $H$ is also learnt.
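Assuming bag-of-words inputs and the layer-wise weight tying described above, the forward pass can be sketched in NumPy (all names, shapes and the choice of 3 hops are illustrative assumptions, not the authors' Torch implementation):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_forward(x, q, A, B, C, W, H, hops=3):
    """One forward pass of a layer-wise (RNN-like) MemN2N sketch.

    x: (n, V) bag-of-words input sentences, q: (V,) bag-of-words query,
    A, C: (d, V) shared input/output embeddings, B: (d, V) query embedding,
    W: (V, d) answer prediction matrix, H: (d, d) linear mapping between hops.
    """
    m = x @ A.T             # memory vectors m_i = A x_i, shape (n, d)
    c = x @ C.T             # output vectors c_i = C x_i, shape (n, d)
    u = B @ q               # internal state u, shape (d,)
    for _ in range(hops):
        p = softmax(m @ u)  # match of each m_i with u, softmaxed over memories
        o = p @ c           # output o = sum_i p_i c_i
        u = H @ u + o       # layer-wise update u^{k+1} = H u^k + o^k
    return softmax(W @ u)   # answer distribution; o^K is already folded into u
```

Note that with the layer-wise update, the final $u$ already contains $o^K$, so the last line is equivalent to $\text{softmax}(W(o^K + Hu^K))$.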
* Think of this as a traditional RNN with two outputs:
    * Internal output - used for memory consideration.
    * External output - the predicted result.
* $u$ becomes the hidden state.
* $p$ is an internal output which, combined with $C$, is used to update the hidden state.

## Related Architectures

* RNN - Memory is stored as the state of the network and is unusable over long temporal contexts.
* LSTM - Locks the network state using local memory cells, but still fails over longer temporal contexts.
* Memory Networks - Use a global memory, but need supervision at each layer rather than training end-to-end.
* Bidirectional RNN (RNNSearch) - Uses a small neural network with a sophisticated gated architecture (attention model) to find useful hidden states but, unlike MemN2N, performs only a single pass over the memory.

## Sentence Representation for Question Answering Task

* Bag-of-Words (BoW) representation
    * Input sentences and questions are embedded as bags of words.
    * Cannot capture the order of the words.
* Position Encoding (PE)
    * Takes the order of the words into account via element-wise position weights: $m_i = \sum_j l_j \cdot Ax_{ij}$.
* Temporal Encoding
    * Temporal information is encoded by a matrix $T_A$, and the memory vectors are modified as $m_i = \sum_j Ax_{ij} + T_A(i)$.
* Random Noise (RN)
    * Dummy memories (empty memories) are added at training time to regularize $T_A$.
* Linear Start (LS) training
    * Removes the softmax layers (other than the final one) at the start of training and reinserts them when the validation loss stops decreasing.

## Observations

* The best MemN2N models are close to supervised models in performance.
* Position Encoding improves over the bag-of-words approach.
* Linear Start helps to avoid local minima.
* Random Noise gives a small yet consistent boost in performance.
* More computational hops lead to improved performance.
* For the language modelling task, some hops concentrate on recent words while other hops have a broader attention span over all memory locations.
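The Position Encoding scheme can be made concrete with a short sketch; the function names here are mine, but the weights follow the paper's $l_{kj} = (1 - j/J) - (k/d)(1 - 2j/J)$ with 1-based $j$ and $k$:

```python
import numpy as np

def position_encoding(J, d):
    """Position-encoding weights l_{kj} = (1 - j/J) - (k/d)(1 - 2j/J),
    for word positions j = 1..J and embedding dimensions k = 1..d."""
    j = np.arange(1, J + 1)[:, None]   # word positions, shape (J, 1)
    k = np.arange(1, d + 1)[None, :]   # embedding dims, shape (1, d)
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)

def encode_sentence(E):
    """Fold a (J, d) matrix of word embeddings into one memory vector,
    weighting each word's embedding element-wise by its position."""
    J, d = E.shape
    return (position_encoding(J, d) * E).sum(axis=0)
```

Unlike a plain bag-of-words sum, reversing the word order changes the resulting memory vector.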