Exploring the Limits of Language Modeling
Józefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui, 2016
Paper summary by dennybritz

TLDR; The authors train large-scale language modeling LSTMs on the 1B word dataset to achieve new state-of-the-art results for single models (51.3 -> 30 perplexity) and ensemble models (41 -> 24.2 perplexity). The authors evaluate how various architecture choices impact model performance: importance sampling loss, NCE loss, character-level CNN inputs, dropout, character-level CNN output, and character-level LSTM output.
#### Key Points
- 800k vocab, 1B words training data
- Using a CNN on characters instead of a traditional softmax significantly reduces the number of parameters, but lacks the ability to differentiate between similar-looking words with very different meanings. Solution: add a correction factor
- Dropout on non-recurrent connections significantly improves results
- Character-level LSTM for prediction performs significantly worse than softmax or CNN softmax
- Sentences are not pre-processed and are fed in batches of 128 without resetting any LSTM state between examples. Max word length for character-level input: 50
- Training: Adagrad with a learning rate of 0.2. Gradient norm clipping at 1.0. RNN unrolled for 20 steps. A small LSTM beats the state of the art after just 2 hours of training; the largest and best model was trained for 3 weeks on 32 K40 GPUs.
- NCE vs. Importance Sampling: IS is sufficient
- Using character-level CNN word embeddings instead of a traditional matrix is sufficient and performs better
- Exact hyperparameters in table 1 are not clear to me.
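
The training recipe in the key points (Adagrad, learning rate 0.2, gradient norm clipping at 1.0) can be sketched in a few lines of NumPy. This is only an illustration of the two techniques, not the authors' code: the helper names `clip_by_global_norm` and `adagrad_step` are made up here, the clipping is assumed to apply to the joint norm of all gradient arrays passed in, and the Adagrad accumulator is assumed to start at zero.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

def adagrad_step(params, grads, accum, lr=0.2, eps=1e-8):
    """One in-place Adagrad update; steps shrink as squared gradients accumulate."""
    for p, g, a in zip(params, grads, accum):
        a += g ** 2                        # running sum of squared gradients
        p -= lr * g / (np.sqrt(a) + eps)   # per-coordinate scaled step

# Toy parameters and gradients (illustrative values only).
params = [np.array([1.0, -2.0]), np.array([[0.5]])]
accum = [np.zeros_like(p) for p in params]   # assumption: accumulator starts at zero
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]

clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
adagrad_step(params, clipped, accum, lr=0.2)
```

With a zero-initialised accumulator, the very first Adagrad step moves every coordinate by roughly the learning rate regardless of the gradient's magnitude, so clipping mostly matters on later steps, where it bounds how much a single large gradient can add to the accumulator.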
#### This nice paper looks amazing at first sight since it brings a mixture of:
- Fancy models
- State-of-the-art training procedure (considering the 32-GPU distributed training effort, which takes 21 days to get the best result)
- Significant improvement on the theoretical metric (single model: 51.3 -> 30 perplexity; ensemble model: 41.0 -> 23.7)
- Benchmark on a somewhat industry-scale data set (vocabulary of 793471 words, 0.8B words of training data) rather than a pure research one.
#### However, I also want to add some criticism:
- As mentioned, perplexity is a somewhat confusing metric: a large perplexity reduction may not reflect the real improvement and can instead have an "exaggerating" effect.
- This paper only reports improvements in the language model itself. However, LMs are usually embedded in a more complex usage scenario, such as speech recognition or machine translation. It would be more insightful if the authors reported results from integrating these LMs into some end-to-end products. Since the authors work on the Google Brain team, this is not too stringent a requirement.
- As far as I know, the data set used in this paper comes from news stories, which are more formal than spoken language. In real applications, the data we face is usually less formal (such as search engine queries and speech recognition input). It remains an open question how the best model in this paper will perform in a more realistic scenario. Again, for the Google Brain team, it should not be a big obstacle to integrate it with an existing system, just by replacing or complementing the existing LMs.
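
To make the point about perplexity concrete, here is a small sketch that converts the reported perplexities back to average log-loss per word, using the standard definition perplexity = exp(average negative log-likelihood). The function name `relative_drops` is my own, and the numbers are simply the ones quoted above.

```python
import math

def relative_drops(ppl_before, ppl_after):
    """Return (relative perplexity reduction, relative log-loss reduction)."""
    # Perplexity is exp(mean negative log-likelihood), so log() recovers the loss in nats.
    nll_before = math.log(ppl_before)
    nll_after = math.log(ppl_after)
    ppl_drop = 1.0 - ppl_after / ppl_before
    nll_drop = 1.0 - nll_after / nll_before
    return ppl_drop, nll_drop

results = {
    "single model": relative_drops(51.3, 30.0),
    "ensemble": relative_drops(41.0, 23.7),
}

for label, (ppl_drop, nll_drop) in results.items():
    print(f"{label}: perplexity -{ppl_drop:.1%}, log-loss -{nll_drop:.1%}")
```

Because perplexity is an exponential of the loss, a reduction of a bit over 40% in perplexity corresponds to a reduction of only around 14% in per-word log-loss for both the single model and the ensemble. That is still a large gain, but it shows why the headline perplexity numbers can look more dramatic than the underlying change.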
Although I have posted some personal criticism, I still appreciate this nice paper and recommend it as a "must-read" for people working in NLP and related fields, since I think it provides a unifying, comprehensive, survey-style perspective that helps us grasp the latest state-of-the-art language modeling technology efficiently.