TLDR; The authors propose an importance-sampling approach for handling large target vocabularies in NMT models. During training, the corpus is partitioned, and each partition uses only the target words that occur in it. To speed up decoding relative to scoring the full vocabulary, the authors build a dictionary that maps each source sentence to a set of candidate target words. The authors evaluate their approach on standard MT tasks and outperform baseline models that use a smaller vocabulary.
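The partitioning idea can be illustrated with a small sketch. This is not the authors' code: the greedy rule (close a partition once the next sentence would push its target vocabulary above a cap), the name `partition_corpus`, the parameter `max_vocab`, and the toy corpus are all assumptions made for illustration.

```python
def partition_corpus(pairs, max_vocab):
    """Greedily split (source, target) token-list pairs into partitions.

    Assumption: a partition is closed as soon as adding the next sentence
    would push its target vocabulary above max_vocab words.
    Returns a list of (sentence_pairs, target_vocab) tuples.
    """
    partitions = []
    current, vocab = [], set()
    for src, tgt in pairs:
        new_vocab = vocab | set(tgt)
        if current and len(new_vocab) > max_vocab:
            # Close the current partition and start a fresh one.
            partitions.append((current, vocab))
            current, vocab = [], set()
            new_vocab = set(tgt)
        current.append((src, tgt))
        vocab = new_vocab
    if current:
        partitions.append((current, vocab))
    return partitions


# Toy corpus (made up for this example).
corpus = [
    (["le", "chat"], ["the", "cat"]),
    (["le", "chien"], ["the", "dog"]),
    (["un", "oiseau"], ["a", "bird"]),
    (["un", "chat"], ["a", "cat"]),
]
parts = partition_corpus(corpus, max_vocab=3)
for sentences, vocab in parts:
    print(len(sentences), sorted(vocab))
```

During training, the softmax for each partition would then be normalised over that partition's `vocab` only. Note that in this sketch a single sentence with more than `max_vocab` distinct words still forms its own, oversized partition.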
#### Key Points:
- Computing the partition function is the bottleneck, so the authors use a sampling-based approach.
- Dealing with large vocabulary during training is separate from dealing with large vocab during decoding. Training is handled with importance sampling. Decoding is handled with source-based candidate list.
- Decoding with the candidate list takes around 0.12s (0.05s) per token on CPU (GPU). Without the candidate list it takes 0.8s (0.25s).
- Issue: The candidate list depends on the source sentence, so it must be re-computed for each sentence.
- Reshuffling the data set is expensive because new partitions need to be calculated (reshuffling is not necessary, but it improved scores).
- How is the corpus partitioned? What's the effect of the partitioning strategy?
- The authors say that they replace UNK tokens using "another word alignment model" but don't go into detail about what this is. The results show that doing this gives a much larger score bump than increasing the vocab does. (The authors do this for all comparison models though).
- Reshuffling the dataset also results in a significant performance bump, but this operation is expensive. IMO the authors should take all of this into account when reporting performance numbers. A single training update may be a lot faster, but the setup time increases. I would have liked to see the authors assign a global time budget to train/test and then compare the models based on that.
- The authors only briefly mention that re-building the target vocab for each source sentence is an issue and how they solve it, but give no details.
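The decoding-side idea from the key points (a source-dependent candidate list, with the softmax normalised over that list only) can be sketched as below. This illustrates the restricted softmax at decoding time, not the importance-sampling training procedure. The function names, the `k_per_word` parameter, the toy dictionary, and the scores are all hypothetical.

```python
import math


def build_candidates(source_tokens, dictionary, common_words, k_per_word=2):
    """Collect a target candidate set for one source sentence.

    Assumptions: `dictionary` maps a source word to target words ordered by
    likelihood, and `common_words` is a short list that is always included.
    """
    candidates = set(common_words)
    for word in source_tokens:
        candidates.update(dictionary.get(word, [])[:k_per_word])
    return sorted(candidates)


def restricted_softmax(scores, candidates):
    """Softmax over the candidate words only.

    The normaliser sums over len(candidates) terms instead of the full
    vocabulary, which is where the speed-up comes from.
    """
    max_score = max(scores[w] for w in candidates)  # numerical stability
    exps = {w: math.exp(scores[w] - max_score) for w in candidates}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


dictionary = {"chat": ["cat", "chat", "kitty"], "noir": ["black", "dark"]}
common_words = ["the", "a", "<eos>"]
# Toy unnormalised scores for a full target vocabulary (made-up numbers).
scores = {"the": 1.0, "a": 0.5, "<eos>": 0.1, "cat": 2.0, "chat": 0.3,
          "kitty": 0.2, "black": 1.5, "dark": 0.4, "dog": 1.8, "bird": 0.9}

candidates = build_candidates(["le", "chat", "noir"], dictionary, common_words)
probs = restricted_softmax(scores, candidates)
best = max(probs, key=probs.get)
print(candidates)
print(best, round(sum(probs.values()), 6))
```

Words outside the candidate list (here `dog`, despite its high score) can never be produced, which is the price paid for the cheaper normaliser.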
TLDR; The authors train three variants of a seq2seq model to generate responses to social media posts taken from Weibo. The first variant, NRM-glo, is the standard model without an attention mechanism and uses the last encoder state as the decoder input. The second variant, NRM-loc, uses an attention mechanism. The third variant, NRM-hyb, combines both by concatenating local and global state vectors. The authors use human evaluators to judge the responses and compare them to retrieval-based and SMT-based systems. They find that NRM models generate reasonable responses ~75% of the time.
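The difference between the three variants comes down to how the context vector handed to the decoder is built. The sketch below is a simplification under stated assumptions: it uses one shared list of encoder states, dot-product attention scoring, and a local-then-global concatenation order, none of which is confirmed by the summary. All function names are hypothetical.

```python
import math


def global_context(hidden_states):
    """NRM-glo style: use the last encoder state as the context."""
    return list(hidden_states[-1])


def local_context(hidden_states, decoder_state):
    """NRM-loc style: attention-weighted sum of all encoder states.

    Assumption: dot-product scoring stands in for whatever learned
    scoring function the model actually uses.
    """
    scores = [sum(h_i * s_i for h_i, s_i in zip(h, decoder_state))
              for h in hidden_states]
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]


def hybrid_context(hidden_states, decoder_state):
    """NRM-hyb style: concatenate the local and global contexts."""
    return (local_context(hidden_states, decoder_state)
            + global_context(hidden_states))


encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.0, 0.0]  # zero state gives uniform attention weights
glo = global_context(encoder_states)
loc = local_context(encoder_states, decoder_state)
hyb = hybrid_context(encoder_states, decoder_state)
print(glo, loc, hyb)
```

The global context is fixed for the whole response, while the local context is recomputed at every decoding step from the current decoder state.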
#### Key Points
- STC: Short-text conversation. Generate only a response to a post. Don't need to keep track of a whole conversation.
- Training data: 200k posts, 4M responses.
- Authors use GRU with 1000 hidden units.
- Vocabulary: Most frequent 40k words for both input and response.
- Response generation (decoding) is done using beam search with beam size 10.
- The hybrid model is difficult to train jointly. The authors train the models individually and then fine-tune the hybrid model.
- Tradeoff with retrieval based methods: Responses are written by a human and don't have grammatical errors, but cannot easily generalize to unseen inputs.
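The beam search mentioned in the key points can be sketched as follows. This is a generic, minimal version and not the authors' implementation: the `step_fn` interface, the toy probability table, and the absence of length normalisation are all assumptions.

```python
import math


def beam_search(step_fn, start, eos, beam_size=10, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) must return a dict mapping each possible next token to
    its probability given the prefix (a hypothetical model interface).
    """
    beams = [(0.0, [start])]  # (log-probability, tokens)
    finished = []
    for _ in range(max_len):
        expanded = []
        for logp, tokens in beams:
            for token, prob in step_fn(tokens).items():
                if prob > 0.0:
                    expanded.append((logp + math.log(prob), tokens + [token]))
        if not expanded:
            break
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = []
        for logp, tokens in expanded[:beam_size]:
            # Hypotheses ending in eos are set aside, the rest keep growing.
            (finished if tokens[-1] == eos else beams).append((logp, tokens))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda item: item[0])[1]


# Toy next-token table in which the greedy choice is not the best sequence.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 1.0},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}


def toy_step(prefix):
    return TABLE[prefix[-1]]


wide = beam_search(toy_step, "<s>", "</s>", beam_size=10)
greedy = beam_search(toy_step, "<s>", "</s>", beam_size=1)
print(wide, greedy)
```

With `beam_size=1` the search commits to `a` (probability 0.6) and ends with a sequence of total probability 0.3, whereas the wider beam keeps `b` alive and finds the sequence with probability 0.4.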