Summaries from Empirical Methods on Natural Language Processing (EMNLP) on ShortScience.org


Addressing the Rare Word Problem in Neural Machine Translation
Luong, Thang and Sutskever, Ilya and Le, Quoc V. and Vinyals, Oriol and Zaremba, Wojciech
Association for Computational Linguistics - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Shagun Sodhani 7 years ago

# Addressing the Rare Word Problem in Neural Machine Translation

## Introduction

* NMT (Neural Machine Translation) systems perform poorly on OOV (out-of-vocabulary) or rare words.
* The paper presents a word-alignment based technique for translating such rare words.
* [Link to the paper](https://arxiv.org/abs/1410.8206)

## Technique

* Annotate the training corpus with information about which source words the OOV words in the target sentence correspond to.
* The NMT system learns to track the alignment of rare words across source and target sentences and emits such alignments for the test sentences.
* As a post-processing step, use a dictionary to map rare words from the source language to the target language.
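
The post-processing step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the decoder emits tokens of the form `unkpos_d` (as in the PosUnk model described later in this summary), that `d` is the target position minus the source position, and that a source word missing from the dictionary is simply copied. The function name, the token format and the copy fallback are all assumptions made for the example.

```python
def replace_unknowns(src_words, out_tokens, dictionary):
    """Replace unkpos_d tokens using the aligned source word (hypothetical helper)."""
    result = []
    for j, tok in enumerate(out_tokens):
        if not tok.startswith("unkpos_"):
            result.append(tok)  # ordinary in-vocabulary word
            continue
        offset = tok[len("unkpos_"):]
        if offset == "null":
            result.append("unk")  # no alignment was emitted, nothing to recover
            continue
        i = j - int(offset)  # assumed convention: d = j - i
        if 0 <= i < len(src_words):
            src_word = src_words[i]
            # Translate with the dictionary, or copy the source word (assumption).
            result.append(dictionary.get(src_word, src_word))
        else:
            result.append("unk")
    return result


src = ["the", "ecotax", "portico", "in", "pont-de-buis"]
out = ["le", "unkpos_-1", "unkpos_1", "de", "unkpos_0"]
toy_dict = {"portico": "portique", "ecotax": "écotaxe"}
print(replace_unknowns(src, out, toy_dict))
```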

## Annotating the Corpus

### Copy Model

* Annotate the OOV words in the source sentence with the tokens *unk1*, *unk2*, etc., such that repeated words get the same token.
* In the target sentence, each OOV word that is aligned to an OOV word in the source sentence is assigned the same token as that source word.
* An OOV word in the target sentence that has no alignment, or is aligned to a known word in the source sentence, is assigned the null token.
* Pros
    * Very straightforward
* Cons
    * Cannot recover target OOV words whose aligned source word is not labelled as OOV.
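
A minimal sketch of the Copy annotation, under stated assumptions: the alignment is given as a dict from target index to source index, the vocabularies are plain sets, and the null token is spelled `unk_null`. These names and formats are chosen for the example and do not come from the paper.

```python
def annotate_copy(src, tgt, alignment, src_vocab, tgt_vocab):
    """Annotate one sentence pair in the style of the Copy model (illustrative)."""
    src_ids = {}   # OOV source word -> index, so repeated words share a token
    src_out = []
    for w in src:
        if w in src_vocab:
            src_out.append(w)
        else:
            if w not in src_ids:
                src_ids[w] = len(src_ids) + 1
            src_out.append(f"unk{src_ids[w]}")

    tgt_out = []
    for j, w in enumerate(tgt):
        if w in tgt_vocab:
            tgt_out.append(w)
            continue
        i = alignment.get(j)
        if i is not None and src[i] not in src_vocab:
            tgt_out.append(src_out[i])   # reuse the token of the aligned source OOV
        else:
            tgt_out.append("unk_null")   # unaligned, or aligned to a known word
    return src_out, tgt_out


src = ["the", "ecotax", "portico", "in", "pont-de-buis"]
tgt = ["le", "portique", "écotaxe", "de", "pont-de-buis"]
alignment = {0: 0, 1: 2, 2: 1, 3: 3, 4: 4}   # target index -> source index
print(annotate_copy(src, tgt, alignment, {"the", "portico", "in"}, {"le", "de"}))
```

In this toy pair, "portique" is aligned to the known word "portico" and therefore gets the null token, which is the weakness listed under Cons.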

### PosAll - Positional All Model

* All OOV words in the source language are assigned a single *unk* token.
* Every word in the target sentence is followed by a positional token which denotes that the *jth* word in the target sentence is aligned to the *ith* word in the source sentence.
* Words that are unaligned, or aligned to a word that is too far away, are assigned a null token.
* Pros
    * Captures complete alignment between source and target sentences.
* Cons
    * It doubles the length of target sentences.
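
The PosAll annotation might look roughly like this. The sketch assumes the positional token encodes the offset `d = j - i`, that offsets beyond a maximum distance (here 7) count as unaligned, and that tokens are spelled `p_d` and `p_null`; the sign convention, the cutoff and the spellings are assumptions for illustration.

```python
def annotate_posall(src, tgt, alignment, src_vocab, tgt_vocab, max_dist=7):
    """Insert a positional token after every target word (illustrative)."""
    src_out = [w if w in src_vocab else "unk" for w in src]
    tgt_out = []
    for j, w in enumerate(tgt):
        tgt_out.append(w if w in tgt_vocab else "unk")
        i = alignment.get(j)
        if i is None or abs(j - i) > max_dist:
            tgt_out.append("p_null")        # unaligned or too far apart
        else:
            tgt_out.append(f"p_{j - i}")    # relative position of the source word
    return src_out, tgt_out


src = ["the", "ecotax", "portico", "in", "pont-de-buis"]
tgt = ["le", "portique", "écotaxe", "de", "pont-de-buis"]
alignment = {0: 0, 1: 2, 2: 1, 4: 4}   # "de" is left unaligned in this toy example
src_out, tgt_out = annotate_posall(src, tgt, alignment, {"the", "portico", "in"}, {"le", "de"})
print(tgt_out)
```

The annotated target has exactly twice as many tokens as the original sentence, which is the cost noted under Cons.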

### PosUnk - Positional Unknown Model

* All OOV words in the source language are assigned a single *unk* token.
* Each OOV word in the target sentence is replaced by an *unk* token annotated with the relative position of the word with respect to its aligned source word.
* Pros
    * Faster than the PosAll model.
* Cons
    * Does not capture alignment for all words.
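
A comparable sketch for PosUnk, where only OOV target words carry positional information. As before, the offset convention `d = j - i`, the distance cutoff of 7 and the token spellings `unkpos_d` / `unkpos_null` are assumptions made for the example.

```python
def annotate_posunk(src, tgt, alignment, src_vocab, tgt_vocab, max_dist=7):
    """Annotate only OOV target words with a relative position (illustrative)."""
    src_out = [w if w in src_vocab else "unk" for w in src]
    tgt_out = []
    for j, w in enumerate(tgt):
        if w in tgt_vocab:
            tgt_out.append(w)               # known words are left untouched
            continue
        i = alignment.get(j)
        if i is None or abs(j - i) > max_dist:
            tgt_out.append("unkpos_null")
        else:
            tgt_out.append(f"unkpos_{j - i}")
    return src_out, tgt_out


src = ["the", "ecotax", "portico", "in", "pont-de-buis"]
tgt = ["le", "portique", "écotaxe", "de", "pont-de-buis"]
alignment = {0: 0, 1: 2, 2: 1, 4: 4}
src_out, tgt_out = annotate_posunk(src, tgt, alignment, {"the", "portico", "in"}, {"le", "de"})
print(tgt_out)
```

Unlike the Copy sketch, the target OOV word aligned to the known source word "portico" still keeps a usable pointer, and the target sentence length is unchanged.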

## Experiments

* Dataset
    * Subset of WMT'14 dataset
    * Alignment computed using the [Berkeley Aligner](https://code.google.com/archive/p/berkeleyaligner/)
* Used the architecture from the [Sequence to Sequence Learning with Neural Networks paper](https://gist.github.com/shagunsodhani/a2915921d7d0ac5cfd0e379025acfb9f).

## Results

* All three approaches improve the performance of existing NMT systems, with PosUnk giving the largest gain; the ordering is PosUnk > PosAll > Copy.
* Ensemble models benefit more than individual models, as the ensemble of NMT models works better at aligning the OOV words.
* Performance gains are larger when using a smaller vocabulary.
* Rare word analysis shows that performance gains are larger when the proportion of OOV words is higher.


WikiQA: A Challenge Dataset for Open-Domain Question Answering
Yang, Yi and Yih, Wen-tau and Meek, Christopher
Empirical Methods on Natural Language Processing (EMNLP) - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Shagun Sodhani 7 years ago

#### Introduction

* Presents WikiQA - a publicly available set of question and sentence pairs for open-domain question answering.
* [Link to the paper](https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/)

#### Dataset

* 3047 questions sampled from Bing query logs.
* Each question is associated with a Wikipedia page.
* All sentences in the summary paragraph of the page become the candidate answers.
* Only about one-third of the questions have a correct answer in the candidate answer set.
* Answer labels were crowdsourced through an MTurk-like platform.
* Answer sentences are associated with *answer phrases* (shortest substring of a sentence that answers the question) though this annotation is not used in the experiments reported by the paper.

#### Other Datasets

* [QASent dataset](http://homes.cs.washington.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf)
    * Uses questions from the TREC-QA dataset (questions from both query logs and human editors) and selects sentences which share at least one non-stopword with the question.
    * Lexical overlap makes QA task easier.
    * Does not support evaluating for *answer triggering* (detecting if the correct answer even exists in the candidate sentences).

#### Experiments

##### Baseline Systems

* **Word Count** - Counts the number of non-stopwords common to question and answer sentences.
* **Weighted Word Count** - Re-weight word counts by the IDF values of the question words.
* **[LCLR](https://www.microsoft.com/en-us/research/publication/question-answering-using-enhanced-lexical-semantic-models/)** - Uses rich lexical semantic features like WordNet and vector-space lexical semantic models.
* **Paragraph Vectors** - Considers cosine similarity between question vector and sentence vector.
* **Convolutional Neural Network (CNN)** - Bigram CNN model with average pooling.
* **PV-Cnt** and **CNN-Cnt** - Logistic regression classifier combining PV (and CNN) models and Word Count models.
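
The two count-based baselines can be sketched as below. This is a rough illustration only: the stopword list is a tiny hypothetical one, tokenization is plain whitespace splitting, overlap is computed on sets of distinct words, and IDF is estimated from the candidate sentences themselves. The paper may differ on each of these points.

```python
import math

# Tiny hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "is", "in", "what", "who", "to", "by", "was"}


def content_words(text):
    """Lowercased, whitespace-tokenized non-stopwords as a set."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def word_count(question, sentence):
    """Number of distinct non-stopwords shared by question and sentence."""
    return len(content_words(question) & content_words(sentence))


def idf_table(sentences):
    """IDF of each non-stopword over a list of sentences (assumed corpus)."""
    df = {}
    for s in sentences:
        for w in content_words(s):
            df[w] = df.get(w, 0) + 1
    n = len(sentences)
    return {w: math.log(n / c) for w, c in df.items()}


def weighted_word_count(question, sentence, idf):
    """Shared non-stopwords, each weighted by its IDF value."""
    shared = content_words(question) & content_words(sentence)
    return sum(idf.get(w, 0.0) for w in shared)


candidates = [
    "hamlet is a play by shakespeare",
    "shakespeare wrote many plays",
    "hamlet was written around 1600",
]
idf = idf_table(candidates)
question = "who wrote hamlet"
for s in candidates:
    print(word_count(question, s), round(weighted_word_count(question, s, idf), 3), s)
```

In the toy data, the plain count ties the three candidates at one shared word each, while the IDF weighting ranks the sentence sharing the rarer word "wrote" first.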

##### Metrics

* MAP and MRR for answer selection problem.
* Precision, recall and F1 scores for answer triggering problem.
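
For reference, MAP and MRR can be computed from ranked relevance labels as in the following sketch. The input format (one list of 0/1 labels per question, already sorted by model score) and the function names are assumptions for the example; how questions without any correct answer are handled is left to the caller.

```python
def average_precision(labels):
    """Average precision for one question; labels are 0/1 in ranked order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            total += hits / rank   # precision at this correct answer
    return total / hits if hits else 0.0


def reciprocal_rank(labels):
    """1 / rank of the first correct answer, or 0 if there is none."""
    for rank, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean_over_questions(metric, ranked_labels):
    """Average a per-question metric over all questions."""
    return sum(metric(labels) for labels in ranked_labels) / len(ranked_labels)


ranked = [[0, 1, 0, 1], [1, 0, 0]]
print("MAP:", mean_over_questions(average_precision, ranked))
print("MRR:", mean_over_questions(reciprocal_rank, ranked))
```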

#### Observations

* CNN-Cnt outperforms all other models on both tasks.
* Three additional features, namely the length of the question (QLen), the length of the sentence (SLen), and the class of the question (QClass), are added to track question hardness and sentence comprehensiveness.
* Adding QLen improves performance significantly, adding SLen improves it marginally, and adding QClass degrades it marginally.
* For the same model, the performance on the WikiQA dataset is inferior to that on the QASent dataset.
* Note: The dataset is too small to train end-to-end networks.


A Neural Attention Model for Abstractive Sentence Summarization
Rush, Alexander M. and Chopra, Sumit and Weston, Jason
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Denny Britz 7 years ago

TLDR; The authors apply a neural seq2seq model to sentence summarization. The model uses an attention mechanism (soft alignment).


#### Key Points

- Summaries generated on the sentence level, not paragraph level
- Summaries have fixed length output
- Beam search decoder
- Extractive tuning for scoring function to encourage the model to take words from the input sequence
- Training data: Headline + first sentence pair.
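
As a generic illustration of a beam search decoder with fixed-length output, consider the sketch below. It is not the paper's decoder: `score_next` is a hypothetical stand-in for the model's log-probability of the next word given the prefix, and the toy scorer exists only to make the example runnable.

```python
def beam_search(score_next, vocab, length, beam_size):
    """Return the best (sequence, score) of exactly `length` tokens."""
    beams = [((), 0.0)]   # (prefix, cumulative score)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for w in vocab:
                candidates.append((prefix + (w,), score + score_next(prefix, w)))
        # Keep only the highest-scoring partial summaries.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


def toy_score(prefix, word):
    """Hypothetical scorer: a per-word bias plus a penalty for repeating a word."""
    bias = {"a": -0.1, "b": -0.3}
    repeat_penalty = -5.0 if prefix and prefix[-1] == word else 0.0
    return bias[word] + repeat_penalty


best_seq, best_score = beam_search(toy_score, ["a", "b"], length=3, beam_size=2)
print(best_seq, best_score)
```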


A Neural Network Approach to Context-Sensitive Generation of Conversational Responses
Sordoni, Alessandro and Galley, Michel and Auli, Michael and Brockett, Chris and Ji, Yangfeng and Mitchell, Margaret and Nie, Jian-Yun and Gao, Jianfeng and Dolan, Bill
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Denny Britz 7 years ago

TLDR; The authors propose three neural models to generate a response (r) based on a context and message pair (c,m). The context is defined as a single message. The first model, RLMT, is a basic Recurrent Language Model that is fed the whole (c,m,r) triple. The second model, DCGM-1, encodes context and message into a single BoW representation, passes it through a feedforward neural network encoder, and then generates the response using an RNN decoder. The last model, DCGM-2, is similar but keeps the representations of context and message separate instead of encoding them into a single BoW vector. The authors train their models on a data set of 29M triples from Twitter and evaluate using BLEU, METEOR and human evaluator scores.

#### Key Points:

- 3 Models: RLMT, DCGM-1, DCGM-2
- Data: 29M triples from Twitter
- Because (c,m) is very long on average, the authors expect RLMT to perform poorly.
- Vocabulary: 50k words, trained with NCE loss
- Generated responses degrade in quality after ~8 tokens
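
The difference between the encoder inputs of DCGM-1 and DCGM-2 can be illustrated with plain count vectors. This sketch covers only the bag-of-words inputs as described in the TLDR above; the feedforward encoder and the RNN decoder are omitted, and the function names and toy vocabulary are made up for the example.

```python
def bag_of_words(tokens, vocab):
    """Count vector over a fixed vocabulary (word -> index); unknown words are skipped."""
    vec = [0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] += 1
    return vec


def dcgm1_input(context, message, vocab):
    """DCGM-1 style: context and message share a single BoW vector."""
    return bag_of_words(context + message, vocab)


def dcgm2_input(context, message, vocab):
    """DCGM-2 style: separate BoW vectors for context and message, concatenated."""
    return bag_of_words(context, vocab) + bag_of_words(message, vocab)


vocab = {"hi": 0, "there": 1, "you": 2}
context = ["hi", "there"]
message = ["hi", "you"]
print(dcgm1_input(context, message, vocab))
print(dcgm2_input(context, message, vocab))
```

The second representation is twice as wide but preserves which words came from the context and which from the message.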


#### Notes/Questions:

- Limiting the context to a single message kind of defeats the purpose of this. No real conversations have only a single message as context, and who knows how well the approach works with a larger context?
- Authors complain that dealing with long sequences is hard, but they don't even use an LSTM/GRU. Why?


On Using Very Large Target Vocabulary for Neural Machine Translation
Jean, Sébastien and Cho, KyungHyun and Memisevic, Roland and Bengio, Yoshua
Association for Computational Linguistics - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Denny Britz 7 years ago

TLDR; The authors propose an importance-sampling approach to deal with large vocabularies in NMT models. During training, the corpus is partitioned, and for each partition only target words occurring in that partition are chosen. To improve decoding speed over the full vocabulary, the authors build a dictionary mapping from source sentence to potential target vocabulary. The authors evaluate their approach on standard MT tasks and perform better than the baseline models with smaller vocabulary.

#### Key Points:

- Computing the partition function is the bottleneck. Use a sampling-based approach.
- Dealing with a large vocabulary during training is separate from dealing with a large vocabulary during decoding. Training is handled with importance sampling. Decoding is handled with a source-based candidate list.
- Decoding with the candidate list takes around 0.12s (0.05s) per token on CPU (GPU). Without the candidate list it takes 0.8s (0.25s).
- Issue: The candidate list depends on the source sentence, so it must be re-computed for each sentence.
- Reshuffling the data set is expensive as new partitions need to be calculated (not necessary, but improved scores).
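
Decoding with a source-based candidate list can be sketched as a softmax restricted to a per-sentence subset of the target vocabulary. The sketch assumes the candidate list is the union of a fixed set of frequent target words and dictionary translations of each source word; that construction, the function names and the toy data are assumptions for illustration rather than details confirmed by this summary.

```python
import math


def candidate_list(source_words, translation_dict, frequent_words):
    """Target words considered for one source sentence (assumed construction)."""
    candidates = set(frequent_words)
    for w in source_words:
        candidates.update(translation_dict.get(w, ()))
    return sorted(candidates)


def restricted_softmax(logits, candidates):
    """Normalize only over the candidates instead of the full vocabulary."""
    m = max(logits[w] for w in candidates)   # subtract the max for numerical stability
    exps = {w: math.exp(logits[w] - m) for w in candidates}
    z = sum(exps.values())                   # partition function over the subset
    return {w: e / z for w, e in exps.items()}


# Toy scores for a 5-word target vocabulary (hypothetical values).
logits = {"le": 2.0, "la": 1.5, "chat": 1.0, "chien": 0.5, "maison": 3.0}
toy_dict = {"the": ["le", "la"], "cat": ["chat"]}
cands = candidate_list(["the", "cat"], toy_dict, frequent_words=["le"])
probs = restricted_softmax(logits, cands)
print(cands, probs)
```

Because the candidate list is built from the source sentence, it has to be rebuilt for every sentence, which is the issue raised in the key points above.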

#### Notes:

- How is the corpus partitioned? What's the effect of the partitioning strategy?
- The authors say that they replace UNK tokens using "another word alignment model" but don't go into detail about what this is. The results show that doing this gives a much larger score bump than increasing the vocab does. (The authors do this for all comparison models though).
- Reshuffling the dataset also results in a significant performance bump, but this operation is expensive. IMO the authors should take all of this into account when reporting performance numbers. A single training update may be a lot faster, but the setup time increases. I would have liked to see the authors assign a global time budget to train/test and then compare the models based on that.
- The authors only briefly mention that re-building the target vocab for each source sentence is an issue and how they solve it; no details are given.