This paper is a clever but conceptually simple idea to improve the vectors learned for individual words. In this proposed approach, instead of learning a distinct vector per word in the word, the model instead views a word as being composed of overlapping n-grams, which are combined to make the full word. Recall: in the canonical skipgram approach to learning word embeddings, each word is represented by a single vector. The word might be tokenized first (for example, de-pluralized), but, fundamentally, there isn’t any way for the network to share information about the meanings of “liberty” and “liberation”; even though a human could see that they share root structure, for a skipgram model, they are two totally distinct concepts that need to be separately learned. The premise of this paper is that this approach leaves valuable information on the table, and that by learning vectors for subcomponents of words, and combining them to represent the whole word, we can more easily identify and capture shared patterns. On a technical level, this is done by: - For each n value in the range selected (typically 3-6 inclusive), representing each input word as a set of overlapping windows of that n, with special characters for Start and End. For example, if n=3, and the word is “where”, it could be represented as [“<wh”, “whe”, “her”, “ere>”] - In addition to the set of ngrams, additionally representing each word through its full-word representation of “<where>”, to “catch” any leftover meaning not captured in the smaller ngrams. - When you’re calculating loss, representing each word as simply being the sum of all of its component ngrams https://i.imgur.com/NP5qFEV.png This has some interesting consequences. First off, the perplexity of the model, which you can think of as a measure of unsupervised goodness of fit, is equivalent or improved by this approach relative to baselines on all but one model. Intriguingly, and predictably once you think about it, the advantage of the subword approach is much stronger for languages like German, Russian, and Arabic, which have strong re-use and aggregation of root words, and also strong patterns of morphological mutation of words. Additionally, the authors found that the subword model got to its minimum loss value using much less data than the canonical approach. This makes decent sense if you think about the fact that subcomponent re-use means there are fewer meaningful word subcomponents than their are unique words, and seeing a subcomponent used across many words means that you need fewer words to learn the patterns it corresponds to. https://i.imgur.com/EmN167L.png A lot of the benefit of this approach seems to be through better representation of syntax; when tested on an analogy task, embeddings trained with subword information did meaningfully better on syntactic analogies (“swim is to swum as ran is to <>”) but equivalent or worse on semantic analogies (“mother is to girl as father is to <>”). One theory about this is that focusing on the subword elements does a better job of more quickly getting the representation of the word to be close to it’s exact meaning, but has a harder time learning precise semantics, relative to a full-word model.