# Key-Value Memory Networks for Directly Reading Documents
Paper summary

### Introduction

* Knowledge Bases (KBs) are effective tools for Question Answering (QA) but are often too restrictive (due to their fixed schema) and too sparse (due to the limitations of Information Extraction (IE) systems).
* The paper proposes Key-Value Memory Networks, a neural network architecture based on [Memory Networks](https://gist.github.com/shagunsodhani/c7a03a47b3d709e7c592fa7011b0f33e) that can leverage both KBs and raw text for QA.
* The paper also introduces MOVIEQA, a new QA dataset whose questions can be answered using a perfect KB, using Wikipedia pages, or using an imperfect KB obtained with IE techniques, thereby allowing a comparison between systems using any of the three sources.
* [Link to the paper](https://arxiv.org/abs/1606.03126).

### Related Work

* TRECQA and WIKIQA are two benchmarks where systems must select the sentence containing the correct answer, given a question and a list of candidate sentences.
* These datasets are small and make it difficult to compare systems that use different knowledge sources.
* The best results on these benchmarks are reported by CNNs and RNNs with attention mechanisms.

### Key-Value Memory Networks

* An extension of the [Memory Networks model](https://gist.github.com/shagunsodhani/c7a03a47b3d709e7c592fa7011b0f33e).
* Generalises the way context is stored in memory.
* The memory comprises slots in the form of pairs of vectors $(k_1, v_1), \ldots, (k_m, v_m)$ that encode long-term and short-term context.

#### Reading the Memory

* **Key Hashing** - The question *x* is used to preselect a subset of memories $(k_{h_1}, v_{h_1}), \ldots, (k_{h_N}, v_{h_N})$ whose keys share at least one word with *x*, where the shared word has frequency less than 1000 (to avoid matching on stop words).
* **Key Addressing** - Each candidate memory is assigned a relevance probability:
    * $p_{h_i} = \text{softmax}(A\phi_X(x) \cdot A\phi_K(k_{h_i}))$
    * $\phi$ is a feature map of dimension *D* and *A* is a *d×D* matrix.
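As an illustration, the key hashing and key addressing steps can be sketched in plain Python. This is a toy sketch, not the paper's implementation: the bag-of-words feature map, the vocabulary, and the random projection standing in for the learned matrix $A$ are all invented for the example, and the paper's word-frequency filter is omitted.

```python
import math
import random

random.seed(0)

# Toy vocabulary and dimensions; all illustrative.
VOCAB = ["blade", "runner", "was", "directed", "by", "who",
         "ridley", "scott", "starred", "harrison", "ford"]
D = len(VOCAB)   # feature dimension
d = 4            # embedding dimension
# A random d x D projection stands in for the learned matrix A.
A = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(d)]

def phi(text):
    """Bag-of-words feature map of dimension D."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def embed(text):
    """A * phi(text): a d-dimensional embedding of the text."""
    x = phi(text)
    return [sum(A[i][j] * x[j] for j in range(D)) for i in range(d)]

def key_hash(question, memories):
    """Preselect memories whose key shares at least one word with the question."""
    q_words = set(question.lower().split())
    return [(k, v) for (k, v) in memories if q_words & set(k.lower().split())]

def key_address(question, candidates):
    """p_hi = softmax(A phi_X(x) . A phi_K(k_hi)) over the hashed candidates."""
    q = embed(question)
    scores = [sum(qi * ki for qi, ki in zip(q, embed(k))) for (k, _) in candidates]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

memories = [("blade runner directed", "ridley scott"),
            ("blade runner starred", "harrison ford")]
cands = key_hash("who directed blade runner", memories)
probs = key_address("who directed blade runner", cands)
print(len(cands), [round(p, 3) for p in probs])
```

Both toy memories survive hashing because their keys share words with the question; addressing then produces a probability distribution over the two candidates.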
* **Value Reading** - The values of the selected memories are read by taking their weighted sum using the addressing probabilities, returning a vector *o*:
    * $o = \sum_{i} p_{h_i} A\phi_V(v_{h_i})$
* The memory access process is conducted by a "controller" neural network using $q = A\phi_X(x)$ as the query.
* The query is updated after each access:
    * $q_2 = R_1(q + o)$
* The addressing and reading steps are repeated, using new matrices $R_i$, to retrieve more pertinent information in subsequent accesses.
* After a fixed number of hops *H*, the resulting state of the controller is used to compute a final prediction:
    * $a = \text{argmax}_i(\text{softmax}(q_{H+1}^{\top} B\phi_Y(y_i)))$, where the $y_i$ are the possible candidate outputs and $B$ is a $d \times D$ matrix.
* The network is trained end-to-end using a cross-entropy loss, backpropagation and stochastic gradient descent.
* End-to-End Memory Networks can be viewed as a special case of Key-Value Memory Networks where the key and value are the same for all memories.

#### Variants of Key-Value Memories

* $\phi_X$ and $\phi_Y$ - the feature maps corresponding to the query and the answer - are fixed as bag-of-words representations.

##### KB Triple

* Triples of the form "subject relation object" can be represented in Key-Value Memory Networks with the subject and relation as the key and the object as the value.
* In standard Memory Networks, the whole triple would have to be embedded in the same memory slot.
* The reversed relations "object is_related_to subject" can also be stored.

##### Sentence Level

* A document can be split into sentences, with each sentence encoded as a bag-of-words in both the key and the value of its memory slot.

##### Window Level

* Split the document into windows of W words and represent each window as a bag-of-words.
* The window becomes the key and its central word becomes the value.

##### Window + Centre Encoding

* Instead of mixing the window centre with the rest of the words, double the size of the dictionary and encode the centre of the window and the value using the second dictionary.
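A minimal sketch of the window-level and centre-encoding variants, assuming a toy whitespace tokenizer; a `"_c"` suffix stands in for the paper's doubled dictionary, and all names are illustrative:

```python
def window_memories(document, W=3, centre_encoding=False):
    """Split a document into windows of W words.

    Each window (as a string of words) becomes a key and its central word
    becomes the value. With centre_encoding=True, the centre word is marked
    with a "_c" suffix -- a stand-in for encoding it with a second dictionary.
    """
    words = document.lower().split()
    half = W // 2
    memories = []
    for i in range(half, len(words) - half):
        window = words[i - half:i + half + 1]
        centre = words[i]
        if centre_encoding:
            # Mark the centre so the model can tell it apart from context words.
            window = window[:half] + [centre + "_c"] + window[half + 1:]
            centre = centre + "_c"
        memories.append((" ".join(window), centre))
    return memories

doc = "blade runner was directed by ridley scott"
mems = window_memories(doc, W=3)
print(mems[0])   # -> ('blade runner was', 'runner')
```

For the Window + Title variant, the same keys would be paired with the document title as the value instead of the centre word.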
##### Window + Title

* Since the title of a document can contain useful information, the word window can be encoded as the key and the document title as the value.
* The key can be augmented with features like "_window_" and "_title_" to distinguish between the two cases.

### MOVIEQA Benchmark

#### Knowledge Representation

* Doc - Raw documents (from Wikipedia) related to movies.
* KB - A graph-based KB made of entities and relations.
* IE - A KB created by performing Information Extraction on Wikipedia.
* The QA pairs should be answerable from both the raw documents and the KB, so that the three approaches can be compared and the gap between them can be closed.
* The dataset has more than 100,000 QA pairs, making it much larger than most existing datasets.

### Experiments

#### MOVIEQA

##### Systems Compared

* [Bordes et al's QA system](TBD)
* [Supervised Embeddings](TBD) (without a KB)
* [Memory Networks](TBD)
* Key-Value Memory Networks

##### Observations

* Key-Value Memory Networks outperform all other methods on all data sources.
* KB > Doc > IE in terms of performance.
* The best memory representation for directly reading documents uses "Window Level + Centre Encoding + Title".

##### KB vs Synthetic Document Analysis

* Given the KB triples, construct synthetic "Wikipedia" articles using templates, conjunctions and coreferences, to determine the causes of the performance gap between using the KB and using documents.
* The loss on one-template sentences is due to the difficulty of extracting the subject, relation and object from the artificial documents.
* Using multiple templates does not deteriorate performance much, but conjunctions and coreferences cause a dip in performance.

#### WIKIQA

* Given a question, select the sentence (from a Wikipedia document) that best answers it.
* Key-Value Memory Networks outperform all other solutions, though they are only marginally better than LDC and attentive models based on CNNs and RNNs.
Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes and Jason Weston. arXiv e-Print archive, 2016. Keywords: cs.CL
