[link]
TLDR; The authors evaluate softmax, hierarchical softmax, target sampling, NCE, selfnormalization and differentiated softmax (novel technique presented in the paper) on data sets with varying vocabulary size (10k, 100k, 800k) with a fixedtime training budget. The authors find that techniques that work best for small vocabluaries are not necessarily the ones that work best for large vocabularies. #### Data and Models Models:  Sotmax  Hierarchical Softmax (crossvalidation of clustering techniques)  Differentiated softmax, adjusting capacity based on token frequency (crossvalidation of number of frequency bands and size)  Target Sampling (crossvalidation of number of distractors)  NCE (crossvalidation of noise ratio)  Selfnormalization (crossvalidation of regularization strenth) Data:  PTB (1M tokens, 10k vocab)  Gigaword (5B tokens, 100k vocab)  billionW (800M tokens, 800k vocab) #### Key Takeaways  Techniques that work best for small vocabluaries are not necessarily the ones that work best for large vocabularies.  Differentiated softmax varies the capacity (size of matrix slice in the last layer) based on token frequency. In practice, it's implemented as separate matrices with different sizes.  Perplexity doesn't seem to improve much after ~500M tokens  Models are trained for 1 week each  The competitiveness of softmax diminishes with vocabulary sizes. It seems to perform relatively well on 10k and 100k, but poorly on 800k since it need more processing time per example.  Traning time, not training data, is the main factor of limiting performance. The authors found that very large models are still making progress after one week and may eventually beat if the other models if allowed to run longer. #### Questions / Notes  What about the hyperparameters for Differentiated Softmax? The paper doesn't show an analysis. Also, the fact that this method introduces two additional hyperparameters makes it harder to apply in practice.  Would've liked to see more comparisons for Softmax, which is the simplest technique of all and doesn't need hyperparameter tuning. It doesn't work well on 800k vocab, but it does for 100k. So, the authors only show how it breaks down for one dataset.
Your comment:
