Teaching Machines to Read and Comprehend on ShortScience.org

Teaching Machines to Read and Comprehend
Hermann, Karl Moritz and Kociský, Tomás and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp


Summary by mashayekhi 6 years ago

https://i.imgur.com/mYFkCxk.png
Main contributions:
The paper proposes a new method for building large-scale supervised reading comprehension datasets, and develops attention-based deep neural networks that can answer complex questions about real documents.

Importance:
Obtaining supervised natural language comprehension data at large scale is difficult. Meanwhile, reading comprehension methods built on synthetic data have failed when applied to real data. This work addresses the lack of real supervised reading comprehension data. In addition, the authors build novel deep learning models for reading comprehension by incorporating an attention mechanism into recurrent neural networks. The attention mechanism allows a model to focus on the parts of a document that it believes will help it answer a question.

Method:
First, two machine reading corpora are created from CNN and Daily Mail articles and their accompanying bullet-point summaries. These bullet points are abstractive: they paraphrase important parts of the article rather than copying sentences from the text. Each bullet point is turned into Cloze-style questions by replacing one entity at a time with a placeholder, for example, “producer X will not press charges against ent212, his lawyer says.” All entities are replaced by entity markers using a coreference system, and the entity markers are permuted for each data point so that the model cannot rely on world knowledge or co-occurrence statistics and has to read the document.
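The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the assumption that every entity is a single token already resolved by a coreference step, and the "X" placeholder and "ent<N>" marker formats (borrowed from the example sentence) are all choices made for this sketch.

```python
import random

def anonymize(tokens, marker):
    # Replace every known entity by its marker; leave other tokens alone.
    return [marker.get(t, t) for t in tokens]

def make_cloze_examples(article, bullet, entities, placeholder="X", seed=0):
    """Build one Cloze query per entity mention in a summary bullet point.

    `entities` is a hypothetical list of single-token entity strings,
    assumed to come from an upstream coreference step.
    """
    rng = random.Random(seed)
    examples = []
    for position, token in enumerate(bullet):
        if token not in entities:
            continue
        # Draw a fresh entity-to-marker mapping for every data point, so a
        # marker id carries no information from one example to the next.
        ids = list(range(len(entities)))
        rng.shuffle(ids)
        marker = {ent: f"ent{i}" for ent, i in zip(entities, ids)}
        query = anonymize(bullet, marker)
        query[position] = placeholder  # hide the entity to be predicted
        examples.append({
            "context": anonymize(article, marker),
            "query": query,
            "answer": marker[token],
        })
    return examples

# Toy data, invented for illustration only.
article = "Alice said Bob will not press charges".split()
bullet = "Bob will not press charges against Alice".split()
for ex in make_cloze_examples(article, bullet, ["Alice", "Bob"]):
    print(" ".join(ex["query"]), "->", ex["answer"])
```

Because the mapping is reshuffled per example, the same real-world entity receives different markers in different data points, which is what prevents a model from memorizing facts about specific entities.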

Second, for the reading comprehension task, they evaluate 2 simple baseline models (A), 2 symbolic matching models (B), and 4 recurrent neural network models (C):
A1) Majority Baseline: It picks the most frequently observed entity in the context document.

A2) Exclusive Majority: It picks the most frequently observed entity in the context document which is not observed in the query.
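Both baselines amount to counting entity markers. A minimal sketch, assuming the context and query are already tokenized and that anonymized entities look like "ent212" (the `is_entity` helper is a convention of this sketch, not something the paper specifies):

```python
from collections import Counter

def is_entity(token):
    # Assumption: anonymized entities are written "ent" followed by digits.
    return token.startswith("ent") and token[3:].isdigit()

def majority_baseline(context):
    # A1: the most frequent entity in the context document.
    counts = Counter(t for t in context if is_entity(t))
    return counts.most_common(1)[0][0] if counts else None

def exclusive_majority(context, query):
    # A2: the most frequent context entity that does not occur in the query.
    in_query = {t for t in query if is_entity(t)}
    counts = Counter(t for t in context if is_entity(t) and t not in in_query)
    return counts.most_common(1)[0][0] if counts else None

# Toy data, invented for illustration only.
context = "ent1 met ent2 and ent1 thanked ent2 before ent1 left".split()
query = "X met ent1".split()
print(majority_baseline(context), exclusive_majority(context, query))
```

The exclusive variant encodes the heuristic that the answer to a Cloze question is unlikely to be an entity already mentioned in the question itself.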

B1) Frame-Semantic Parsing: This method parses sentences to find "who did what to whom", using a state-of-the-art frame-semantic parser on the anonymized data points.

B2) Word Distance Benchmark: It aligns the placeholder of the Cloze-form question with each possible entity in the context document and calculates a distance between the question and the context around the aligned entity. The score is the sum of the distances from every word in the query to its nearest aligned word in the document.
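One possible reading of the word distance benchmark is sketched below. The details are assumptions of this sketch rather than the paper's exact procedure: words are aligned by exact string match only, the distance for a query word is measured from the position it would occupy if the query were laid over the document at the candidate entity, and each per-word distance is capped at a hypothetical `max_penalty`.

```python
def word_distance_answer(context, query, candidates,
                         placeholder="X", max_penalty=8):
    """Return (entity, score) for the candidate with the lowest distance."""
    k = query.index(placeholder)
    # Index every position at which each context token occurs.
    positions = {}
    for i, tok in enumerate(context):
        positions.setdefault(tok, []).append(i)

    best_entity, best_score = None, float("inf")
    for entity in candidates:
        for p in positions.get(entity, []):
            # Align the placeholder with this occurrence of the entity.
            score = 0
            for j, word in enumerate(query):
                if j == k:
                    continue
                expected = p + (j - k)  # where the word would sit if aligned
                dists = [abs(i - expected) for i in positions.get(word, [])]
                # Missing or distant words cost at most max_penalty.
                score += min(dists + [max_penalty])
            if score < best_score:
                best_entity, best_score = entity, score
    return best_entity, best_score

# Toy data, invented for illustration only.
context = "ent1 praised the film but ent2 will not press charges".split()
query = "X will not press charges".split()
print(word_distance_answer(context, query, ["ent1", "ent2"]))
```

Lower scores mean the words of the question appear close to, and in a similar order around, the candidate entity.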

C1) Deep LSTM Reader (2-layer LSTM)
This model feeds the [document | query] pair, separated by a delimiter, into the network as a single long sequence, one word at a time. The LSTM cells have skip connections from the input to each hidden layer and from each hidden layer to the output.

C2) Attentive Reader (bi-directional LSTM with attention)
This model uses an attention mechanism to overcome the bottleneck of a fixed-width hidden vector. First, it encodes the document and the query using separate single-layer bi-directional LSTMs. The query encoding is obtained by concatenating the final forward and backward outputs. The document encoding is a weighted sum of the per-token output vectors (each obtained by concatenating the forward and backward outputs at that token). The weights can be interpreted as the degree to which the network attends to a particular token in the document. Finally, the model is completed by a non-linear combination of the document and query embeddings.
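The attention step can be sketched in plain Python. This assumes the bi-directional LSTM outputs are already computed: `y_d` is the list of per-token document vectors and `u` is the query encoding. The parameter names (`W_ym`, `W_um`, `w_ms`, `W_rg`, `W_ug`) follow the paper's notation as best recalled here, and the random weights are placeholders for illustration, not trained values.

```python
import math
import random

def matvec(W, x):
    # Matrix-vector product with W given as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def attentive_reader(y_d, u, W_ym, W_um, w_ms, W_rg, W_ug):
    """Return attention weights s, document encoding r, joint embedding g."""
    Wu = matvec(W_um, u)
    scores = []
    for y_t in y_d:
        # m(t) = tanh(W_ym y_d(t) + W_um u)
        m_t = [math.tanh(a + b) for a, b in zip(matvec(W_ym, y_t), Wu)]
        scores.append(sum(w * m for w, m in zip(w_ms, m_t)))
    # s(t) is a softmax over document tokens (shifted for stability).
    top = max(scores)
    exps = [math.exp(sc - top) for sc in scores]
    total = sum(exps)
    s = [e / total for e in exps]
    # r = sum_t s(t) y_d(t): the attention-weighted document encoding.
    dim = len(y_d[0])
    r = [sum(s[t] * y_d[t][d] for t in range(len(y_d))) for d in range(dim)]
    # g = tanh(W_rg r + W_ug u): non-linear combination of document and query.
    g = [math.tanh(a + b) for a, b in zip(matvec(W_rg, r), matvec(W_ug, u))]
    return s, r, g

# Toy sizes and random inputs, for illustration only.
rng = random.Random(0)
D, H, G, T = 4, 3, 5, 6  # encoding size, attention size, output size, tokens
y_d = rand_matrix(T, D, rng)
u = [rng.uniform(-0.5, 0.5) for _ in range(D)]
s, r, g = attentive_reader(
    y_d, u,
    rand_matrix(H, D, rng), rand_matrix(H, D, rng),
    [rng.uniform(-0.5, 0.5) for _ in range(H)],
    rand_matrix(G, D, rng), rand_matrix(G, D, rng),
)
print([round(w, 3) for w in s])
```

Replacing `s` with the constant weights `1 / len(y_d)` removes the dependence on the query and gives an unweighted average of the document outputs.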

C3) Uniform Reader (bi-directional LSTM)
It is the Attentive Reader without the attention mechanism, included to measure the effect of attention on the results.

C4) Impatient Reader (bi-directional LSTM with attention per query token)
This model is similar to the Attentive Reader, except that the attention weights are recomputed for each query token. The intuition is that, for each query token, the model finds which part of the context document is most relevant. The model accumulates information from the document as each query token is read, and finally outputs a joint document-query representation through a non-linear combination of the document embedding and the query embedding.
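Written out, the recurrence looks roughly as follows. The notation is reproduced from memory of the paper and should be checked against it: $y_d(t)$ is the bi-directional LSTM output for document token $t$, $y_q(i)$ the output for query token $i$, $u$ the final query encoding, $s(i)$ the attention weights after reading query token $i$, and $r(i)$ the accumulated document representation.

```latex
\begin{align}
m(i, t) &= \tanh\left(W_{dm}\, y_d(t) + W_{rm}\, r(i-1) + W_{qm}\, y_q(i)\right), \quad 1 \le i \le |q| \\
s(i, t) &\propto \exp\left(w_{ms}^{\top}\, m(i, t)\right) \\
r(0) &= r_0, \qquad r(i) = y_d^{\top} s(i) + \tanh\left(W_{rr}\, r(i-1)\right), \quad 1 \le i \le |q| \\
g(d, q) &= \tanh\left(W_{rg}\, r(|q|) + W_{qg}\, u\right)
\end{align}
```

The dependence of $m(i, t)$ on $r(i-1)$ is what lets the model carry forward what it has already gathered from the document each time it reads a new query token.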

Results:
As expected, the Attentive and Impatient Readers outperform all other models, which shows the benefit of attention. The Uniform Reader, which has no attention, scores far lower than both, which supports this conclusion. The accuracies on the two datasets (CNN / Daily Mail) are:
A1) Maximum Frequency: 33.2 / 25.5
A2) Exclusive Frequency: 39.3 / 32.8
B1) Frame-semantic model: 40.2 / 35.5
B2) Word distance model: 50.9 / 55.5
C1) Deep LSTM Reader: 57.0 / 62.2
C2) Attentive Reader: 63.0 / 69.0
C3) Uniform Reader: 39.4 / 34.4
C4) Impatient Reader: 63.8 / 68.0
