[link]
Summary by Ablaikhan Akhazhanov 1 month ago
The paper presents the famously known Word2Vec model, which became ubiquitous in numerous NLP applications partially owing to its linear feature space with additive compositionality. To be fair, the paper is an extension to the previously presented work by Tomas Mikolov and his colleagues on distributed representation of words and phrases. The proposed approach is based on the skip-gram model and introduces four novel methods to significantly improve training speed and performance. Particularly, the approach is effective with frequent and rare words and phrases, including idiomatic phrases.
Skip-gram model uses softmax function to define P(w_O | w_I), which becomes impractical when training on a large vocabulary due to expensive calculation of the denominator. To tackle it, the authors employ binary Huffman tree of words and calculate P(w_O | w_I) as a product of the sigmoids along the path from the root to any word w. The argument to the sigmoids is a product of the indicator function and the dot product between the vector representations of the words. The indicator flags whether a word is in the path from the root node to the word w. Tree structure simplifies the calculation from O(W) to average O(logW). Additionally, since Huffman tree assigns short codes to the frequent words, the training gets an extra speed up.
As an alternative to tree-based hierarchical softmax, the authors introduce an extension to Noise Contrastive Estimation (NCE), which they call Negative Sampling. The idea is to use logistic binary regression to distinguish between negative and positive samples, rather than picking one class from the entire vocabulary. They sample k negative examples from a “noise distribution” Pn for each positive example. As Pn authors employ unigram distribution U(w)^(3/4)/Z, which works best for both Negative Sampling and Hierarchical Softmax models, although they do not elaborate on the choice of the noise distribution.
To balance frequent and rare words, authors aggressively subsample words whose frequency is greater than a chosen threshold while preserving the ranking of the frequencies. They demonstrate that a heuristically chosen subsampling strategy accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words. They discard a word w with probability p(w) = 1 – sqrt(t/f(w)), where t is a manually selected threshold (e.g. 10^-5) and f(w) is the word frequency.
Finally, authors learn vector representation of the phrases by combining words into individual tokens. They score bigrams as score(w_i,w_j) = (cnt(w_i, w_j)-d)/(cnt(w_i) – cnt(w_j)) and then choose the ones that have scores above the chosen threshold. To form longer phrases, authors evaluate the scores recursively. The chosen score metric combines words that appear frequently together, and infrequently in other contexts.
Although the paper lacks detailed discussion on critical choice of the subsampling and negative sampling, it creates a foundation for explainable linear feature space in NLP model. The model shows the state-of-the-art performance and significantly outperforms other approaches, though it is partially due to a much bigger size of the training dataset. One of the most outstanding results of the paper is the inherent additive compositionality. This directly comes from the use of the softmax that tries to maximize the product of the probabilities. It forces the word vectors to have a linear relationship with the inputs to the softmax and decreases the distance of between the words that appear lose to each other. As a results, the word vectors can be combined as follows: “Paris” – “France” + “Germany” = “Berlin”.

more
less