WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia
Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey and David Berthelot, 2016
Paper summary by shagunsodhani
#### Introduction
* Large-scale natural language understanding task - predict textual values from a structured knowledge base by reading text.
* Accompanied by a large dataset generated using Wikipedia
* [Link to the paper](http://www.aclweb.org/anthology/P/P16/P16-1145.pdf)
* WikiReading dataset built using Wikidata and Wikipedia.
* Wikidata consists of statements of the form (property, value) about different items
* 80M statements, 16M items and 884 properties.
* These statements are grouped by items to get (item, property, answer) tuples where the answer is a set of values.
* Items are further replaced by their Wikipedia documents to generate 18.58M statements of the form (document, property, answer).
* Task is to predict answer given document and property.
* Properties are divided into 2 classes:
* **Categorical properties** - properties with a small number of possible answers, e.g. gender.
 * **Relational properties** - properties with essentially unique answers, e.g. date of birth.
* This classification is done on the basis of the entropy of answer distribution.
* Properties with entropy less than 0.7 are classified as categorical properties.
* The answer distribution has a small number of very high-frequency answers (the head) and a large number of very low-frequency answers (the tail).
* 30% of the answers do not appear in the training set and must be inferred from the document.
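The categorical/relational split described above can be sketched with a simple entropy check. This is a minimal illustration, not the paper's exact procedure: the helper names are invented here, and the choice of natural-log entropy against the 0.7 threshold is an assumption.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in nats) of a property's answer distribution."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def classify_property(answers, threshold=0.7):
    """Categorical if answer entropy falls below the threshold, else relational."""
    return "categorical" if answer_entropy(answers) < threshold else "relational"

skewed = ["female"] * 5 + ["male"] * 95             # few high-frequency answers
diverse = ["1901", "1954", "1912", "1936", "1988"]  # mostly unique answers
classify_property(skewed)   # -> "categorical"
classify_property(diverse)  # -> "relational"
```

A skewed distribution like gender has low entropy and lands in the categorical bucket; a near-uniform one like dates of birth has high entropy and is treated as relational.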
##### Answer Classification
* Treat WikiReading as a classification task, with each answer as a class label.
* Linear model over Bag of Words (BoW) features.
* Two BoW vectors are computed - one for the document and one for the property. These are concatenated into a single feature vector.
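The concatenated BoW featurization can be sketched as follows (a toy vocabulary and helper of my own; the paper's exact feature pipeline is not specified in this summary):

```python
import numpy as np

def bow_vector(tokens, vocab):
    """Bag-of-words count vector over a fixed vocabulary (OOV tokens are dropped)."""
    v = np.zeros(len(vocab))
    for tok in tokens:
        if tok in vocab:
            v[vocab[tok]] += 1
    return v

vocab = {w: i for i, w in enumerate(["paris", "capital", "of", "france", "country"])}
doc_vec = bow_vector("paris is the capital of france".split(), vocab)
prop_vec = bow_vector("country".split(), vocab)
features = np.concatenate([doc_vec, prop_vec])  # input to the linear classifier
```

The linear model then scores each candidate answer class from this single concatenated vector.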
###### Neural Networks Method
* Encode property and document into a joint representation which is fed into a softmax layer.
* **Average Embeddings BoW**
* Average the word embeddings for the document and the property separately, then concatenate the two averages to get the joint representation.
* **Paragraph Vectors**
* As a variant of the previous method, encode document as a paragraph vector.
* **LSTM Reader**
* LSTM reads the property and document sequence, word-by-word, and uses the final state as joint representation.
* **Attentive Reader**
* Use attention mechanism to focus on relevant parts of the document for a given property.
* **Memory Networks**
* Maps a property p and a list of sentences x<sub>1</sub>, x<sub>2</sub>, ...x<sub>n</sub> into a joint representation by attending over the sentences in the document.
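The simplest of the neural methods above, Average Embeddings BoW, can be sketched like this. The embedding table is random and the dimensions are invented for illustration; a real model learns the embeddings jointly with the softmax layer.

```python
import numpy as np

# Toy embedding table with assumed sizes; real models learn these parameters.
rng = np.random.default_rng(0)
vocab_size, dim = 50, 16
embeddings = rng.normal(size=(vocab_size, dim))

def avg_embedding(token_ids):
    """Mean of the embeddings of the given tokens."""
    return embeddings[token_ids].mean(axis=0)

doc_ids, prop_ids = [3, 17, 42, 7], [5]
joint = np.concatenate([avg_embedding(doc_ids), avg_embedding(prop_ids)])
# `joint` (length 2 * dim) would be fed into a softmax over answer classes.
```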
##### Answer Extraction
* For relational properties, it makes more sense to model the problem as information extraction than classification.
* Use an RNN to read the sequence of words and estimate whether each word is part of the answer.
* **Basic SeqToSeq (Sequence to Sequence)**
* Similar to LSTM Reader but augmented with a second RNN to decode answer as a sequence of words.
* **Placeholder SeqToSeq**
* Extends Basic SeqToSeq to handle OOV (Out of Vocabulary) words by adding placeholders to the vocabulary.
* OOV words in the document and answer are replaced by placeholders so that input and output sentences are a mixture of words and placeholders only.
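The placeholder substitution can be sketched as below. The function name, the pool size, and the `<PH_k>` token format are assumptions for illustration; at prediction time the mapping is inverted to turn predicted placeholders back into the original words.

```python
def add_placeholders(doc_tokens, answer_tokens, vocab, n_placeholders=10):
    """Replace OOV tokens with shared placeholder symbols so the model can copy them."""
    mapping = {}

    def sub(tok):
        if tok in vocab:
            return tok
        if tok not in mapping:
            mapping[tok] = f"<PH_{len(mapping) % n_placeholders}>"
        return mapping[tok]

    return [sub(t) for t in doc_tokens], [sub(t) for t in answer_tokens], mapping

vocab = {"was", "born", "in"}
doc, ans, mapping = add_placeholders(
    ["Turing", "was", "born", "in", "London"], ["London"], vocab)
# doc -> ['<PH_0>', 'was', 'born', 'in', '<PH_1>'], ans -> ['<PH_1>']
```

Because the same placeholder appears in both document and answer, the decoder only needs to learn to copy it rather than generate an unseen word.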
* **Basic Character SeqToSeq**
* Property encoder RNN reads the property, character-by-character and transforms it into a fixed length vector.
* This becomes the initial hidden state for the second layer of a 2-layer document encoder RNN.
* Final state of this RNN is used by answer decoder RNN to generate answer as a character sequence.
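The property encoder step above can be illustrated with a toy character-level RNN. The weights are random and the tanh recurrence and dimensions are assumptions; the paper's model uses learned RNN parameters rather than this minimal cell.

```python
import numpy as np

# Toy character-level RNN encoder with random weights (shapes assumed, not the paper's).
chars = sorted(set("date of birth"))
char_to_id = {c: i for i, c in enumerate(chars)}
dim = 8
rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.1, size=(len(chars), dim))
Wh = rng.normal(scale=0.1, size=(dim, dim))

def encode_property(text):
    """Simple tanh RNN over characters; the final state is the fixed-length vector."""
    h = np.zeros(dim)
    for c in text:
        h = np.tanh(Wx[char_to_id[c]] + Wh @ h)
    return h

prop_state = encode_property("date of birth")
# prop_state would initialize the second layer of the document encoder RNN.
```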
* **Character SeqToSeq with pretraining**
* Train a character-level language model on input character sequences from the training set and use its weights to initialize the first layer of the encoder and decoder.
* Evaluation metric is the F1 score (harmonic mean of precision and recall).
* All models perform well on categorical properties with neural models outperforming others.
* In the case of relational properties, SeqToSeq models have a clear edge.
* SeqToSeq models also strike a good balance between relational and categorical properties.
* Language model pretraining enhances the performance of character SeqToSeq approach.
* Results demonstrate that end-to-end SeqToSeq models are the most promising for WikiReading-like tasks.
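Since answers are sets of values, the F1 evaluation can be sketched as a set-based score per instance. This is a minimal sketch; the paper's exact aggregation over instances may differ.

```python
def f1(predicted, gold):
    """Set-based F1 between predicted and gold answer value sets."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

f1(["paris"], ["paris"])          # exact match -> 1.0
f1(["paris", "lyon"], ["paris"])  # one spurious value -> 2/3
```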
arXiv e-Print archive - 2016. First published: 2016/08/11.
Abstract: We present WikiReading, a large-scale natural language understanding task and
publicly-available dataset with 18 million instances. The task is to predict
textual values from the structured knowledge base Wikidata by reading the text
of the corresponding Wikipedia articles. The task contains a rich variety of
challenging classification and extraction sub-tasks, making it well-suited for
end-to-end models such as deep neural networks (DNNs). We compare various
state-of-the-art DNN-based architectures for document classification,
information extraction, and question answering. We find that models supporting
a rich answer space, such as word or character sequences, perform best. Our
best-performing model, a word-level sequence to sequence model with a mechanism
to copy out-of-vocabulary words, obtains an accuracy of 71.8%.