#### This nice paper looks impressive at first sight, since it brings together a mixture of:
- Fancy models
- State-of-the-art training procedure (considering the 32-GPU distributed training effort, which takes 21 days to get the best result)
- Significant improvement on the theoretical metric (perplexity reduced from 51.3 to 30 for a single model, and from 41.0 to 23.7 for an ensemble)
- Benchmark on a somewhat industry-scale data set (vocabulary of 793,471 words, 0.8B words of training data) rather than a purely research one.
#### However, I also want to add some criticism:
- As mentioned, perplexity is a somewhat confusing metric: a big perplexity reduction may not reflect the real improvement, and can instead have a kind of "exaggerating" effect.
- This paper only reports the improvement of the language model itself. However, LMs are usually embedded in a more complex usage scenario, such as speech recognition or machine translation. It would be more insightful if the authors also shared results from integrating these LMs into some end-to-end products. Since the authors work for the Google Brain team, this is not too stringent a requirement.
- As far as I know, the data set used in this paper comes from news stories, which are more formal than spoken language. In real applications, what we face is usually less formal data (such as search engine queries and speech recognition input). It remains an open question how the best model in this paper will perform in a more realistic scenario. Again, for the Google Brain team, it should not be a big obstacle to integrate it with an existing system, simply by replacing or complementing the existing LMs.
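
To make the point about perplexity's "exaggerating" effect a bit more concrete, here is a minimal sketch. It assumes the usual definition of perplexity as the exponential of the per-word cross-entropy (in nats), and compares the relative drop in perplexity with the relative drop in cross-entropy for the numbers quoted above. The function names are hypothetical, and the log-space comparison is only my own illustration, not something reported in the paper.

```python
import math


def relative_reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


def compare(ppl_before, ppl_after):
    """Return (relative perplexity drop, relative cross-entropy drop).

    Assumes perplexity = exp(cross-entropy in nats),
    so cross-entropy = ln(perplexity).
    """
    ppl_drop = relative_reduction(ppl_before, ppl_after)
    ce_drop = relative_reduction(math.log(ppl_before), math.log(ppl_after))
    return ppl_drop, ce_drop


# Perplexity figures as quoted in this review.
for name, before, after in [("single", 51.3, 30.0), ("ensemble", 41.0, 23.7)]:
    ppl_drop, ce_drop = compare(before, after)
    print(f"{name}: perplexity -{ppl_drop:.1%}, cross-entropy -{ce_drop:.1%}")
```

Under this assumption, a perplexity drop of a little over 40% corresponds to a cross-entropy drop of only about 14 to 15%, which is one way to see why the headline perplexity number can look bigger than the underlying gain.
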
Although I have posted some personal criticism, I still appreciate this nice paper and recommend it as a "must-read" for people working in NLP and related fields, since I think it provides a unifying, comprehensive, survey-style perspective that helps us grasp the latest state-of-the-art language modeling technology efficiently.