The authors explore the properties of "Deep Averaging Networks" (DANs) on text classification problems, specifically sentiment analysis and question answering tasks. DANs extend neural bag-of-words models: they start with a document representation that is the average of the word embeddings in that document, but feed it through multiple feed-forward layers. The authors argue that these models are much simpler and faster to train than syntax- and composition-based RNNs, while obtaining similar performance. Since the paper argues for simpler models, there is little here that is technically hard to understand, so the real contribution of the paper is the set of interesting experiments exploring how DANs represent various phenomena.

They show that differences between graded sentiment words (awesome, cool, ok, underwhelming, the worst) are magnified as layers are added, which demonstrates the benefit of depth relative to a plain neural bag of words. They then compare against RNNs on examples containing negation and contrastive conjunctions (e.g., "but"), which are traditionally modeled syntactically, and show that existing methods we believe to represent syntax and composition are in fact not strong enough. An example like "not bad" fully exposes the DAN: it simply doubles the negation. And while the RNN-based models can learn not to double the negation, they are not powerful enough to reverse the polarity and get the example correct.
Finally, the authors introduce one novel mechanism for improving training: "word dropout." Similar to standard dropout, they randomly sample a subset of words at the input layer and exclude them from the document representation. This gives the network multiple looks at each example with part of its feature space removed. Another way to think of this is as data augmentation, where new training instances are created by sampling feature vectors from existing data points with some features missing.
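A minimal sketch of how word dropout could be applied when computing the averaged document representation (the function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def averaged_representation(word_embeddings, drop_prob=0.3, rng=None):
    """Average word embeddings, randomly dropping whole words.

    word_embeddings: array of shape (num_words, embedding_dim).
    drop_prob: probability of dropping each word independently.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Sample a binary mask over words (not over embedding dimensions).
    keep = rng.random(len(word_embeddings)) >= drop_prob
    if not keep.any():
        # Guard against dropping everything: keep one random word.
        keep[rng.integers(len(word_embeddings))] = True
    # Each pass over the same example sees a different random subset
    # of its words, which acts like data augmentation.
    return word_embeddings[keep].mean(axis=0)
```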
# Addressing the Rare Word Problem in Neural Machine Translation
* NMT (Neural Machine Translation) systems perform poorly on out-of-vocabulary (OOV) and rare words.
* The paper presents a word-alignment-based technique for translating such rare words.
* [Link to the paper](https://arxiv.org/abs/1410.8206)
* Annotate the training corpus with information about which word in the source sentence each OOV word in the target sentence corresponds to.
* The NMT system learns to track the alignment of rare words across source and target sentences and emits such alignments for test sentences.
* As a post-processing step, use a dictionary to map rare words from the source language to the target language.
## Annotating the Corpus
### Copy Model
* Annotate the OOV words in the source sentence with tokens *unk1*, *unk2*, etc., such that repeated words get the same token.
* In the target sentence, each OOV word that is aligned to some OOV word in the source sentence is assigned the same token as that source word.
* An OOV word in the target sentence that has no alignment, or is aligned to a known word in the source sentence, is assigned the null token.
* Very straightforward to implement (see the sketch below).
* Misses out on target OOV words whose aligned source words are not labelled as OOV.
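A sketch of the copy-model annotation, assuming word alignments come from an external aligner; the exact token spellings (`unk1`, `unknull`) and the function itself are illustrative:

```python
def annotate_copy_model(source, target, alignment, vocab):
    """Copy-model annotation (illustrative re-implementation).

    source, target: lists of tokens.
    alignment: dict mapping target position -> aligned source position.
    vocab: set of in-vocabulary words; everything else is OOV.
    """
    src_unk = {}  # OOV source word -> its unk token
    annotated_src = []
    for w in source:
        if w in vocab:
            annotated_src.append(w)
        else:
            # Repeated OOV words share the same token (unk1, unk2, ...).
            src_unk.setdefault(w, f"unk{len(src_unk) + 1}")
            annotated_src.append(src_unk[w])

    annotated_tgt = []
    for j, w in enumerate(target):
        if w in vocab:
            annotated_tgt.append(w)
            continue
        i = alignment.get(j)
        if i is not None and source[i] not in vocab:
            # Copy the token of the aligned OOV source word.
            annotated_tgt.append(src_unk[source[i]])
        else:
            # Unaligned, or aligned to an in-vocabulary source word.
            annotated_tgt.append("unknull")
    return annotated_src, annotated_tgt
```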
### PosAll - Positional All Model
* All OOV words in the source language are assigned a single *unk* token.
* Every word in the target sentence is followed by a positional token encoding the relative distance *d = j − i* between the *j*th target word and the *i*th source word it is aligned to.
* Aligned words that are too far apart, or are unaligned, are assigned a null token.
* Captures complete alignment between source and target sentences.
* It doubles the length of target sentences.
### PosUnk - Positional Unknown Model
* All OOV words in the source language are assigned a single *unk* token.
* All OOV words in the target sentence are assigned a positional *unk* token, where the position encodes the relative distance between the target word and its aligned source word (see the sketch below).
* Faster than the PosAll model.
* Does not capture alignment for all words.
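A sketch of PosUnk annotation and the dictionary-based post-processing step, again assuming alignments from an external aligner; the token spellings (`unkpos_d`) follow the paper, but the functions and the `max_dist` parameterization are illustrative:

```python
def annotate_posunk(source, target, alignment, vocab, max_dist=7):
    """PosUnk annotation: every source OOV becomes a bare 'unk'; every
    target OOV becomes a token encoding its relative distance d = j - i
    to its aligned source word (or a null token if unaligned/too far)."""
    annotated_src = [w if w in vocab else "unk" for w in source]
    annotated_tgt = []
    for j, w in enumerate(target):
        if w in vocab:
            annotated_tgt.append(w)
            continue
        i = alignment.get(j)
        if i is None or abs(j - i) > max_dist:
            annotated_tgt.append("unkpos_null")
        else:
            annotated_tgt.append(f"unkpos_{j - i}")
    return annotated_src, annotated_tgt

def postprocess(translation, source, dictionary):
    """Test-time step: replace each positional unk token with the
    dictionary translation of its aligned source word (or copy it)."""
    output = []
    for j, w in enumerate(translation):
        if w.startswith("unkpos_") and not w.endswith("null"):
            i = j - int(w[len("unkpos_"):])  # recover source position
            if 0 <= i < len(source):
                output.append(dictionary.get(source[i], source[i]))
                continue
        output.append(w)
    return output
```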
## Experiments

* Evaluated on a subset of the WMT'14 English-French dataset.
* Alignments are computed using the [Berkeley Aligner](https://code.google.com/archive/p/berkeleyaligner/).
* Uses the architecture from the [Sequence to Sequence Learning with Neural Networks paper](https://gist.github.com/shagunsodhani/a2915921d7d0ac5cfd0e379025acfb9f).
* All three approaches improve the performance of existing NMT systems, with gains in the order PosUnk > PosAll > Copy.
* Ensemble models benefit more than individual models, as the ensemble of NMT models is better at aligning the OOV words.
* Performance gains are larger when a smaller vocabulary is used.
* Rare word analysis shows that performance gains are larger when the proportion of OOV words is higher.
First published: 2015/11/19. Abstract: The standard recurrent neural network language model (RNNLM) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an RNN-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and well-formed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate its effectiveness in imputing missing words, explore many interesting properties of the model's latent sentence space, and present negative results on the use of the model in language modeling.
TLDR; The authors present an RNN-based variational autoencoder that learns a latent sentence representation while learning to decode. A linear layer that predicts the parameters of a Gaussian distribution is inserted between encoder and decoder. The loss is a combination of the reconstruction objective and the KL divergence from the (Gaussian) prior, just as in the "standard" VAE. The authors evaluate the model on Language Modeling and Imputation (inserting missing words) tasks and also present a qualitative analysis of the latent space.
#### Key Points
- Training is tricky. Vanilla training results in the decoder ignoring the encoder and the KL error term becoming zero.
- Training Trick 1: KL cost annealing. During training, gradually increase the weight on the KL term of the cost from 0 to 1, annealing from a vanilla autoencoder to a VAE (see the sketch after these key points).
- Training Trick 2: Word dropout using a word keep rate hyperparameter. This forces the decoder to rely more on the global representation.
- Results on Language Modeling: The standard model (without cost annealing and word dropout) trails the vanilla RNNLM, but not by much; the KL cost term goes to zero in this setting. In an inputless-decoder setting (word keep probability = 0) the VAE outperforms the RNNLM (unsurprisingly).
- Results on Imputing Missing Words: Benchmarked using an adversarial error classifier. The VAE significantly outperforms the RNNLM. However, the comparison is somewhat unfair since the RNNLM has nothing to condition on and relies on a unigram distribution for the first token.
- Qualitative: Can use a higher word dropout rate to get more diverse sentences.
- Qualitative: Can walk the latent space and get grammatical and meaningful sentences.
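A minimal sketch of the annealed objective; the paper anneals the KL weight from 0 to 1 during training, but the sigmoid schedule shape and the hyperparameter names here are assumptions:

```python
import math

def kl_weight(step, midpoint=10000, steepness=0.001):
    """Sigmoid annealing schedule: ~0 early in training (vanilla
    autoencoder), approaching 1 later (full VAE objective)."""
    return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))

def vae_loss(reconstruction_nll, kl_divergence, step):
    # Total loss = reconstruction term + annealed KL term.
    return reconstruction_nll + kl_weight(step) * kl_divergence
```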
TLDR; The authors train three variants of a seq2seq model to generate responses to social media posts taken from Weibo. The first variant, NRM-glo, is a standard model without an attention mechanism that uses the encoder's last state as the decoder input. The second variant, NRM-loc, uses an attention mechanism. The third variant, NRM-hyb, combines both by concatenating the local and global state vectors. The authors use human annotators to evaluate the responses and compare them to retrieval-based and SMT-based systems. They find that the NRM models generate reasonable responses ~75% of the time.
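A minimal sketch of the hybrid combination, assuming attention weights over the encoder states are already computed (all names are illustrative):

```python
import numpy as np

def hybrid_context(global_state, encoder_states, attn_weights):
    """NRM-hyb idea: concatenate the encoder's final (global) summary
    vector with the attention-weighted (local) context for one
    decoding step.

    global_state:   shape (d,), last encoder hidden state.
    encoder_states: shape (T, d), one hidden state per source token.
    attn_weights:   shape (T,), attention weights summing to 1.
    """
    local_context = attn_weights @ encoder_states  # weighted sum over time
    return np.concatenate([global_state, local_context])  # shape (2d,)
```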
#### Key Points
- STC: Short-text conversation. Generate only a response to a post. Don't need to keep track of a whole conversation.
- Training data: 200k posts, 4M responses.
- Authors use GRU with 1000 hidden units.
- Vocabulary: Most frequent 40k words for both input and response.
- Decoding is done using beam search with beam size 10.
- The hybrid model is difficult to train jointly. The authors train the global and local models individually and then fine-tune the hybrid model.
- Tradeoff with retrieval-based methods: their responses are written by humans and don't have grammatical errors, but they cannot easily generalize to unseen inputs.
TLDR; The authors propose an importance-sampling approach to deal with large vocabularies in NMT models. During training, the corpus is partitioned, and for each partition only the target words occurring in that partition are used in the softmax. To speed up decoding over the full vocabulary, the authors build a dictionary mapping each source sentence to a candidate target vocabulary. The authors evaluate their approach on standard MT tasks and perform better than baseline models that use smaller vocabularies.
#### Key Points:
- Computing the softmax partition function over the full vocabulary is the bottleneck; a sampling-based approach avoids it.
- Dealing with a large vocabulary during training is separate from dealing with a large vocabulary during decoding. Training is handled with importance sampling (see the sketch at the end of these notes); decoding is handled with a source-based candidate list.
- Decoding with a candidate list takes around 0.12s (0.05s) per token on CPU (GPU). Without the candidate list: 0.8s (0.25s).
- Issue: The candidate list depends on the source sentence, so it must be re-computed for each sentence.
- Reshuffling the data set is expensive as new partitions need to be calculated (not strictly necessary, but it improved scores).
- How is the corpus partitioned? What's the effect of the partitioning strategy?
- The authors say that they replace UNK tokens using "another word alignment model" but don't go into detail about what this is. The results show that doing this yields a much larger score bump than increasing the vocabulary does. (The authors do this for all comparison models, though.)
- Reshuffling the dataset also results in a significant performance bump, but this operation is expensive. IMO the authors should take all of this into account when reporting performance numbers. A single training update may be a lot faster, but the setup time increases. I would've liked to see the authors assign a global time budget to training and testing and then compare the models based on that.
- The authors only briefly mention that re-building the target vocabulary for each source sentence is an issue and how they solve it; no details are given.
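A simplified sketch of the sampled-softmax idea behind the training-time approximation. For brevity this scores a sampled subset per example; the paper instead partitions the corpus, uses the target words of each partition as the sample, and applies importance-sampling corrections (all names here are illustrative):

```python
import numpy as np

def sampled_softmax_nll(hidden, target_id, output_weights, sample_ids):
    """Approximate the softmax normalizer over a small word subset.

    hidden:         decoder hidden state, shape (d,).
    target_id:      index of the correct target word.
    output_weights: output embedding matrix, shape (V, d).
    sample_ids:     indices of sampled negative words, shape (k,).
    """
    ids = np.union1d(sample_ids, [target_id])  # subset incl. the target
    logits = output_weights[ids] @ hidden      # score only the subset
    log_z = np.log(np.sum(np.exp(logits)))     # approximate partition fn
    target_logit = output_weights[target_id] @ hidden
    return -(target_logit - log_z)             # negative log-likelihood
```

Because the normalizer is computed over k + 1 words instead of the full vocabulary V, each update is much cheaper; the cost is the bias introduced by the approximation, which the importance weights in the paper are meant to correct.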