Distributed representations of words and phrases and their compositionalityDistributed representations of words and phrases and their compositionalityMikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff2013
Paper summarygkcalatThe paper presents the famously known Word2Vec model, which became ubiquitous in numerous NLP applications partially owing to its linear feature space with additive compositionality. To be fair, the paper is an extension to the previously presented work by Tomas Mikolov and his colleagues on distributed representation of words and phrases. The proposed approach is based on the skip-gram model and introduces four novel methods to significantly improve training speed and performance. Particularly, the approach is effective with frequent and rare words and phrases, including idiomatic phrases.
Skip-gram model uses softmax function to define P(w_O | w_I), which becomes impractical when training on a large vocabulary due to expensive calculation of the denominator. To tackle it, the authors employ binary Huffman tree of words and calculate P(w_O | w_I) as a product of the sigmoids along the path from the root to any word w. The argument to the sigmoids is a product of the indicator function and the dot product between the vector representations of the words. The indicator flags whether a word is in the path from the root node to the word w. Tree structure simplifies the calculation from O(W) to average O(logW). Additionally, since Huffman tree assigns short codes to the frequent words, the training gets an extra speed up.
As an alternative to tree-based hierarchical softmax, the authors introduce an extension to Noise Contrastive Estimation (NCE), which they call Negative Sampling. The idea is to use logistic binary regression to distinguish between negative and positive samples, rather than picking one class from the entire vocabulary. They sample k negative examples from a “noise distribution” Pn for each positive example. As Pn authors employ unigram distribution U(w)^(3/4)/Z, which works best for both Negative Sampling and Hierarchical Softmax models, although they do not elaborate on the choice of the noise distribution.
To balance frequent and rare words, authors aggressively subsample words whose frequency is greater than a chosen threshold while preserving the ranking of the frequencies. They demonstrate that a heuristically chosen subsampling strategy accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words. They discard a word w with probability p(w) = 1 – sqrt(t/f(w)), where t is a manually selected threshold (e.g. 10^-5) and f(w) is the word frequency.
Finally, authors learn vector representation of the phrases by combining words into individual tokens. They score bigrams as score(w_i,w_j) = (cnt(w_i, w_j)-d)/(cnt(w_i) – cnt(w_j)) and then choose the ones that have scores above the chosen threshold. To form longer phrases, authors evaluate the scores recursively. The chosen score metric combines words that appear frequently together, and infrequently in other contexts.
Although the paper lacks detailed discussion on critical choice of the subsampling and negative sampling, it creates a foundation for explainable linear feature space in NLP model. The model shows the state-of-the-art performance and significantly outperforms other approaches, though it is partially due to a much bigger size of the training dataset. One of the most outstanding results of the paper is the inherent additive compositionality. This directly comes from the use of the softmax that tries to maximize the product of the probabilities. It forces the word vectors to have a linear relationship with the inputs to the softmax and decreases the distance of between the words that appear lose to each other. As a results, the word vectors can be combined as follows: “Paris” – “France” + “Germany” = “Berlin”.
The paper discusses a number of extensions to the Skip-gram model previously proposed by Mikolov et al (citation  in the paper): which learns linear word embeddings that are particularly useful for analogical reasoning type tasks. The extensions proposed (namely, negative sampling and sub-sampling of high frequency words) enable extremely fast training of the model on large scale datasets. This also results in significantly improved performance as compared to previously proposed techniques based on neural networks. The authors also provide a method for training phrase level embeddings by slightly tweaking the original training algorithm.
This paper proposes 3 improvements for the skip-gram model which allows for learning embeddings for words. The first improvement is subsampling frequent word, the second is the use of a simplified version of noise constrastive estimation (NCE) and finally they propose a method to learn idiomatic phrase embeddings. In all three cases the improvements are somewhat ad-hoc. In practice, both the subsampling and negative samples help to improve generalization substantially on an analogical reasoning task. The paper reviews related work and furthers the interesting topic of additive compositionality in embeddings.
The article does not propose any explanation as to why the negative sampling produces better results than NCE which it is suppose to loosely approximate. In fact it doesn't explain why besides the obvious generalization gain the negative sampling scheme should be preferred to NCE since they achieve similar speeds.