Teaching Machines to Read and Comprehend on ShortScience.org


Teaching Machines to Read and Comprehend
Hermann, Karl Moritz and Kociský, Tomás and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp


[link] Summary by Hugo Larochelle 8 years ago

This paper explores the problem of question answering based on natural text. While this has been explored recently in the context of Memory Networks, the problems tackled so far have been synthetically generated. In this paper, the authors propose to extract more realistic question answering examples from news sites, by treating the main body of a news article as the content (the "facts") and extracting questions from the article's bullet point summaries. Specifically, by detecting the entities in these bullet points and replacing them with a question placeholder (e.g. "Producer X will not press charges"), they are able to generate queries which, while not grammatically questions, do require performing a form of question answering. Thanks to this procedure, two large *supervised* datasets are created, with hundreds of thousands of questions each, based on the CNN and Daily Mail news sites.
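
The query construction described above can be sketched in a few lines. This is only an illustration, not the authors' pipeline: it assumes entities have already been detected and rewritten as markers of the form `entN`, and the names `cloze_queries`, `ENTITY` and `PLACEHOLDER` are hypothetical.

```python
import re

# Assumption: entities were already detected and replaced by markers like "ent7".
ENTITY = re.compile(r"\bent\d+\b")
PLACEHOLDER = "X"  # the placeholder symbol used in the example above


def cloze_queries(bullet):
    """Yield one (query, answer) pair per entity mention in a summary bullet."""
    for match in ENTITY.finditer(bullet):
        # Blank out exactly one entity; the removed marker becomes the answer.
        query = bullet[:match.start()] + PLACEHOLDER + bullet[match.end():]
        yield query, match.group()


bullet = "producer ent7 will not press charges against ent212"
pairs = list(cloze_queries(bullet))
for query, answer in pairs:
    print(query, "->", answer)
```

Each bullet point therefore yields as many training examples as it has entity mentions, which is one reason the procedure scales to large datasets.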

Then, the authors investigate neural network based systems for solving this task. They consider a fairly simple Deep LSTM network, which is first fed the article's content and then the query. They also consider two architectures that incorporate an attentional mechanism based on softmax weighting. The first ("Attentive Reader") attends once over the document (i.e. uses a single softmax weight vector) while the second ("Impatient Reader") attends after every word in the query (akin to the soft attention architecture in the "Show, Attend and Tell" paper).

These neural network architectures are also compared with simpler baselines, which are closer to what a more "classical" statistical NLP solution might look like.

Results on both datasets demonstrate that the neural network approaches have superior performance, with the attentional models being significantly better than the simpler Deep LSTM model.

#### My two cents

This is a welcome development in the research on reasoning models based on neural networks. I've always thought it was unfortunate that the best benchmark available is based on synthetically generated cases. This work fixes this problem in a really clever way, while still being able to generate a large amount of training data. Particularly clever is the random permutation of entity markers when processing each case. Thanks to that, a system cannot simply use general statistics on words to answer questions (e.g. just from the query "The hi-tech bra that helps you beat breast X" it's obvious that "cancer" is an excellent answer). In this setup, the system is forced to exploit the content of the article, thus ensuring that the benchmark is indeed measuring the system's question-answering abilities.

Since the dataset itself is an important contribution of this paper, I hope the authors release it publicly in the near future.

The evaluation of the different neural architectures is also really thoroughly done. The non-neural baselines are reasonable and the comparison between the neural nets is itself interesting, bringing more evidence that the softmax weighted attentional mechanism (which has been gaining in popularity) indeed brings something over a regular LSTM approach.


[link] Summary by Denny Britz 7 years ago

TLDR; The authors generate a large dataset (~1M examples) for question answering by using cloze deletion on summaries of crawled CNN and Daily Mail articles. They evaluate 2 baselines, 2 symbolic models (frame semantic, word distance), and 4 neural models (Deep LSTM, Uniform Reader, Attentive Reader, Impatient Reader) on the dataset. The neural models, particularly those with attention, beat the symbolic models.

- Deep LSTM: 2-layer bidirectional LSTM without attention mechanism
- Attentive Reader: 1-layer bidirectional LSTM with attention mechanism for the whole query
- Impatient Reader: 1-layer bidirectional LSTM with attention mechanism for each token in the query (can be interpreted as being able to re-read the document at each token)
- Uniform Reader: Uniform attention to all document tokens
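
To make the difference between attention and uniform weighting concrete, here is a minimal sketch using toy numbers. It is a simplification under stated assumptions: the relevance score is a plain dot product between a token encoding and a query encoding (the paper scores tokens with a small learned network instead), and `softmax`, `dot` and `pool` are hypothetical helper names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pool(token_vectors, weights):
    """Weighted sum of per-token vectors, giving one document vector."""
    dim = len(token_vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, token_vectors))
            for k in range(dim)]


# Toy values, not taken from the paper: three document tokens with 2-d encodings.
doc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.0, 4.0]

# Attentive Reader style: one softmax weight vector computed from the whole query.
attentive_w = softmax([dot(vec, query) for vec in doc])
# Uniform Reader style: every token gets the same weight.
uniform_w = [1.0 / len(doc)] * len(doc)

attentive_doc = pool(doc, attentive_w)
uniform_doc = pool(doc, uniform_w)
```

With these numbers the attentive weights concentrate on the second token, whose encoding matches the query best, while the uniform document vector is just the average of the token encodings.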

In their experiments, the authors randomize document entities to avoid letting the models rely on world knowledge or co-occurrence statistics, and instead purely test document comprehension. This is done by replacing entities with consistent ids *within* a document, but using different ids across documents.
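
A rough sketch of this randomization, under simplifying assumptions: entities are single tokens, coreference has already been resolved, and the marker format `entN` and the function name `anonymise` are made up for illustration.

```python
import random


def anonymise(doc_tokens, query_tokens, entities, rng):
    """Map each entity to a marker, using a fresh random permutation per example."""
    ids = list(range(len(entities)))
    rng.shuffle(ids)  # ids differ from one document to the next
    marker = {name: f"ent{i}" for name, i in zip(sorted(entities), ids)}

    def swap(tokens):
        # The same mapping is applied to document and query, so ids stay
        # consistent within a single example.
        return [marker.get(tok, tok) for tok in tokens]

    return swap(doc_tokens), swap(query_tokens), marker


rng = random.Random(0)
doc = "Clarkson was suspended by the BBC".split()
query = "X works for the BBC".split()
anon_doc, anon_query, marker = anonymise(doc, query, {"Clarkson", "BBC"}, rng)
```

Because the mapping is redrawn for every example, a marker such as `ent1` carries no meaning across documents, so a model cannot memorise facts about a particular entity.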

#### Data and model performance

All numbers are accuracies on two datasets (CNN, Daily Mail)

- Maximum Frequency Entity Baseline: 33.2 / 25.5
- Exclusive Frequency Entity Baseline: 39.3 / 32.8
- Frame-semantic model: 40.2 / 35.5
- Word distance model: 50.9 / 55.5
- Deep LSTM Reader: 57.0 / 62.2
- Uniform Reader: 39.4 / 34.4
- Attentive Reader: 63.0 / 69.0
- Impatient Reader: 63.8 / 68.0

#### Key Takeaways

- The input to the RNN is defined as `QUERY <DELIMITER> DOCUMENT`, which is then embedded with or without attention and run through `softmax(W*x)`.
- Some sequences are very long, up to 2000 tokens, and the average length was 763 tokens. All LSTM models seem to be able to deal with this, but the attention models show significantly higher accuracy.
- Very nice attention visualizations and negative examples analysis that show the attention-based models focusing on the relevant parts of the document to answer the questions.

#### Notes / Questions

- How does document length affect the Deep LSTM reader? The appendix shows an analysis for attention models, but not for the Deep LSTM. A goal of the paper was to show that attention mechanisms are well suited for long documents because the fixed vector encoding is a bottleneck. The results here aren't clear.
- Are the gradients truncated? I can't imagine the network is unrolled for 2000 steps. The training parameter details don't mention this.
- The mathematical notation in this paper needs some love. The concepts are relatively simple, but the formulas are hard to parse.
- What if you limited the output vocabulary to words appearing in the query document?
- Can you apply the same "attention-based embedding" mechanism to text classification?


[link] Summary by NIPS Conference Reviews 7 years ago

This paper deals with the formal question of machine reading. It proposes a novel methodology for automatically building datasets for evaluating machine reading models. To do so, the authors leverage news resources that come with summaries, generating a large number of questions about articles by replacing named entities in the summaries. Furthermore, an attention-enhanced, LSTM-inspired reading model is proposed and evaluated. The paper is well written and clear, and its originality lies in two aspects. The first is an original methodology for creating question answering datasets, in which context-query-answer triples are automatically extracted from news feeds. This proposal is important because it opens the way to training and evaluating large models. The second contribution is the addition of an attention mechanism to an LSTM reading model. The empirical results seem to show a relevant improvement with respect to an up-to-date list of machine reading models.

Given the lack of an appropriate dataset, the authors provide a new dataset scraped from CNN and the Daily Mail, using both the full text and the abstract summaries/bullet points. The dataset was then anonymised (i.e. entity names removed). Next, the authors present two novel deep long short-term memory models which perform well on the Cloze query task.


[link] Summary by Shagun Sodhani 7 years ago

#### Introduction

* Build a supervised reading comprehension data set using a news corpus.
* Compare the performance of neural models and state-of-the-art natural language processing models on the reading comprehension task.
* [Link to the paper](http://arxiv.org/abs/1506.03340v3)

#### Reading Comprehension

* Estimate conditional probability $p(a|c, q)$, where $c$ is a context document, $q$ is a query related to the document, and $a$ is the answer to that query.
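
As a sketch of how the neural models parameterise this probability (written with the symbols used above; $g$, $W$ and $V$ are notation introduced here, following my reading of the paper rather than this summary): the document and query are embedded jointly into a vector $g(c, q)$, and each candidate answer $a$ in the vocabulary $V$ is scored against it through a row $W(a)$ of an output weight matrix.

$$p(a \mid c, q) \propto \exp\big(W(a)\, g(c, q)\big), \quad a \in V$$

The models described below differ only in how they compute $g(c, q)$.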

#### Dataset Generation

* Use online newspapers (CNN and DailyMail) and their matching summaries.
* Parse summaries and bullet points into Cloze style questions.
* Generate corpus of document-query-answer triplets by replacing one entity at a time with a placeholder.
* Anonymise and randomise the data using coreference systems, abstract entity markers, and random permutation of the entity markers.
* The processed data set is more focused on evaluating reading comprehension, as models cannot exploit co-occurrence.

#### Models

##### Baseline Models

* **Majority Baseline**
  * Picks the most frequently observed entity in the context document.
* **Exclusive Majority**
  * Picks the most frequently observed entity in the context document which is not observed in the query.
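
Both baselines fit in a few lines. The sketch below assumes anonymised entities appear as tokens of the form `entN`; that format and the function names are illustrative choices, not the paper's.

```python
from collections import Counter


def is_entity(token):
    # Assumption: anonymised entities look like "ent12".
    return token.startswith("ent") and token[3:].isdigit()


def majority(doc_tokens):
    """Most frequent entity marker in the document."""
    counts = Counter(tok for tok in doc_tokens if is_entity(tok))
    return counts.most_common(1)[0][0]


def exclusive_majority(doc_tokens, query_tokens):
    """Most frequent document entity that does not also occur in the query."""
    seen_in_query = set(query_tokens)
    counts = Counter(
        tok for tok in doc_tokens if is_entity(tok) and tok not in seen_in_query
    )
    return counts.most_common(1)[0][0]


doc = "ent1 met ent2 after ent1 called ent3 and ent2 thanked ent1".split()
query = "X was called by ent1".split()

print(majority(doc))                   # ent1 (three mentions)
print(exclusive_majority(doc, query))  # ent2 (ent1 is ruled out by the query)
```

The exclusive variant relies on the heuristic that the placeholder is unlikely to be an entity already mentioned in the query. Both functions raise an `IndexError` when no candidate entity is left, which a real implementation would need to handle.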

##### Symbolic Matching Models

* **Frame-Semantic Parsing**
  * Parse the sentence to find predicates to answer questions like "who did what to whom".
  * Extract entity-predicate triples $(e_1, V, e_2)$ from query $q$ and context document $d$.
  * Resolve queries using rules like `exact match`, `matching entity` etc.

* **Word Distance Benchmark**
  * Align the placeholder of the Cloze form question with each possible entity in the context document and calculate the distance between the question and the context around the aligned entity.
  * Sum the distance of every word in $q$ to its nearest aligned word in $d$.
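
The Word Distance Benchmark can be read in more than one way. The sketch below is one plausible interpretation, not the authors' implementation: it matches words by exact string equality only, caps the per-word distance at a hypothetical `max_penalty`, and returns the entity with the lowest total distance.

```python
def is_entity(token):
    # Assumption: anonymised entities look like "ent12".
    return token.startswith("ent") and token[3:].isdigit()


def word_distance_answer(doc, query, placeholder="X", max_penalty=8):
    """Return the document entity whose surrounding context best matches the query."""
    slot = query.index(placeholder)
    positions = {}
    for i, tok in enumerate(doc):
        positions.setdefault(tok, []).append(i)

    best_entity, best_score = None, float("inf")
    for anchor, candidate in enumerate(doc):
        if not is_entity(candidate):
            continue
        score = 0
        for j, word in enumerate(query):
            if j == slot:
                continue
            # Where this query word would sit if the placeholder were at `anchor`.
            expected = anchor + (j - slot)
            gaps = [abs(p - expected) for p in positions.get(word, [])]
            # Words that are missing or far away cost at most max_penalty.
            score += min(gaps + [max_penalty])
        if score < best_score:
            best_entity, best_score = candidate, score
    return best_entity


doc = "ent1 praised the film but ent2 will not press charges".split()
query = "X will not press charges".split()
print(word_distance_answer(doc, query))  # ent2
```

Aligning the placeholder with `ent2` puts every query word at distance 0 from its expected position, whereas aligning it with `ent1` leaves each word 5 positions away, so `ent2` wins.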

##### Neural Network Models

* **Deep LSTM Reader**
  * Test the ability of Deep LSTM encoders to handle significantly longer sequences.
  * Feed the document query pair as a single large document, one word at a time.
  * Use Deep LSTM cell with skip connections from input to hidden layers and hidden layer to output.

* **Attentive Reader**
  * Employ an attention model to overcome the bottleneck of a fixed width hidden vector.
  * Encode the document and the query using separate bidirectional single layer LSTMs.
  * Query encoding is obtained by concatenating the final forward and backwards outputs.
  * Document encoding is obtained by a weighted sum of output vectors (obtained by concatenating the forward and backwards outputs).
  * The weights can be interpreted as the degree to which the network attends to a particular token in the document.
  * Model completed by defining a non-linear combination of document and query embedding.

* **Impatient Reader**
  * As an add-on to the attentive reader, the model can re-read the document as each query token is read.
  * Model accumulates the information from the document as each query token is seen and finally outputs a joint document query representation in the form of a non-linear combination of document embedding and query embedding.
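
For the Attentive Reader, the steps above can be written out as follows. The symbols are introduced here for illustration and follow my reading of the paper's notation: $y_d(t)$ is the concatenated forward and backward output at document token $t$, $u$ is the query encoding, and the $W$ matrices and the vector $w_{ms}$ are learned parameters.

$$m(t) = \tanh\big(W_{ym}\, y_d(t) + W_{um}\, u\big)$$

$$s(t) \propto \exp\big(w_{ms}^{\top} m(t)\big)$$

$$r = \sum_t s(t)\, y_d(t)$$

$$g(d, q) = \tanh\big(W_{rg}\, r + W_{ug}\, u\big)$$

Here $s(t)$ is the normalised attention weight on token $t$, $r$ is the resulting document encoding, and $g(d, q)$ is the joint document-query embedding. The Impatient Reader recomputes the attention weights after each query token and updates $r$ recurrently, rather than computing them once.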

#### Result

* Attentive and Impatient Readers outperform all other models, highlighting the benefits of attention modelling.
* The Frame-Semantic pipeline does not scale to cases where several sentences are needed to answer a query.
* Moreover, it provides poor coverage, as many relations do not adhere to the default predicate-argument structure.
* The Word Distance approach outperformed the Frame-Semantic approach, as there was significant lexical overlap between the query and the document.
* The paper also includes heat maps over the context documents to visualise the attention mechanism.


[link] Summary by mashayekhi 6 years ago

https://i.imgur.com/mYFkCxk.png
Main contributions:
The paper proposes a new method for building large-scale supervised reading comprehension datasets, and develops attention-based deep neural networks that can answer complex questions about real documents.

Importance:
Obtaining supervised natural language comprehension data at large scale is difficult. On the other hand, reading comprehension methods built on synthetic data fail in real environments when facing real data. This work addresses the lack of real supervised reading comprehension data. In addition, the authors build novel deep learning models for reading comprehension by incorporating an attention mechanism into recurrent neural networks. The attention mechanism allows a model to focus on the parts of a document that it believes will help it answer a question.

Method:
In the first part, two machine reading corpora are created by exploiting CNN and Daily Mail articles along with their corresponding summaries in the form of bullet points. These bullet points are abstractive: they paraphrase important parts of the article rather than copying sentences from the text. The bullet points are turned into Cloze-type questions by replacing one entity at a time with a placeholder, for example, “producer X will not press charges against ent212, his lawyer says.” All the entities are replaced by entity markers using a coreference system, and the entity markers are permuted for each data point to avoid world knowledge and co-occurrence effects in the reading comprehension task.

In the second part, for the reading comprehension task, they use 2 simple baseline models (A), 2 symbolic matching models (B), and 4 recurrent neural network models (C):
A1) Majority Baseline: It picks the most frequently observed entity in the context document.

A2) Exclusive Majority: It picks the most frequently observed entity in the context document which is not observed in the query.

B1) Frame-Semantic Parsing: This method parses the sentence to find "who did what to whom", using a state-of-the-art frame-semantic parser on the anonymized data points.

B2) Word Distance Benchmark: It aligns the placeholder of the Cloze form question with each possible entity in the context document and calculates the distance between the question and the context around the aligned entity. The score is the sum of the distances of every word in the query to its nearest aligned word in the document.

C1) Deep LSTM Reader (2-layer LSTM)
This model feeds the [document | query] pair, separated by a delimiter, as a single large document, one word at a time. The LSTM cells have skip connections from the input to the hidden layers and from the hidden layers to the output.

C2) Attentive Reader (bi-directional LSTM with attention)
This model employs an attention mechanism to overcome the bottleneck of a fixed width hidden vector. First, it encodes the document and the query using separate bi-directional single layer LSTMs. The query encoding is obtained by concatenating the final forward and backwards outputs. The document encoding is obtained by a weighted sum of output vectors (obtained by concatenating the forward and backwards outputs). The weights can be interpreted as the degree to which the network attends to a particular token in the document. Finally, the model is completed by defining a non-linear combination of the document and query embeddings.

C3) Uniform Reader (bi-directional LSTM)
It is the Attentive Reader without the attention mechanism, used here to isolate the effect of attention on the results.

C4) Impatient Reader (bi-directional LSTM with attention for each query token)
This one is similar to the Attentive Reader, except that the attention weights are computed for each query token. The intuition is that for each token the model finds which part of the context document is most relevant. The model accumulates the information from the document as each query token is seen and finally outputs a joint document query representation using a non-linear combination of the document embedding and the query embedding.

Results:
As expected, the Attentive and Impatient Readers outperform all other models, which shows the benefit of the attention mechanism. The much lower accuracy of the Uniform Reader supports this hypothesis as well. The accuracies on the two datasets (CNN / Daily Mail) are Maximum Frequency: 33.2 / 25.5, Exclusive Frequency: 39.3 / 32.8, Frame-semantic model: 40.2 / 35.5, Word distance model: 50.9 / 55.5, Deep LSTM Reader: 57.0 / 62.2, Uniform Reader: 39.4 / 34.4, Attentive Reader: 63.0 / 69.0, Impatient Reader: 63.8 / 68.0.
