Enriching Word Vectors with Subword InformationEnriching Word Vectors with Subword InformationPiotr Bojanowski and Edouard Grave and Armand Joulin and Tomas Mikolov2016
Paper summarydecodyngThis paper is a clever but conceptually simple idea to improve the vectors learned for individual words. In this proposed approach, instead of learning a distinct vector per word in the word, the model instead views a word as being composed of overlapping n-grams, which are combined to make the full word.
Recall: in the canonical skipgram approach to learning word embeddings, each word is represented by a single vector. The word might be tokenized first (for example, de-pluralized), but, fundamentally, there isn’t any way for the network to share information about the meanings of “liberty” and “liberation”; even though a human could see that they share root structure, for a skipgram model, they are two totally distinct concepts that need to be separately learned. The premise of this paper is that this approach leaves valuable information on the table, and that by learning vectors for subcomponents of words, and combining them to represent the whole word, we can more easily identify and capture shared patterns.
On a technical level, this is done by:
- For each n value in the range selected (typically 3-6 inclusive), representing each input word as a set of overlapping windows of that n, with special characters for Start and End. For example, if n=3, and the word is “where”, it could be represented as [“<wh”, “whe”, “her”, “ere>”]
- In addition to the set of ngrams, additionally representing each word through its full-word representation of “<where>”, to “catch” any leftover meaning not captured in the smaller ngrams.
- When you’re calculating loss, representing each word as simply being the sum of all of its component ngrams
This has some interesting consequences. First off, the perplexity of the model, which you can think of as a measure of unsupervised goodness of fit, is equivalent or improved by this approach relative to baselines on all but one model. Intriguingly, and predictably once you think about it, the advantage of the subword approach is much stronger for languages like German, Russian, and Arabic, which have strong re-use and aggregation of root words, and also strong patterns of morphological mutation of words. Additionally, the authors found that the subword model got to its minimum loss value using much less data than the canonical approach. This makes decent sense if you think about the fact that subcomponent re-use means there are fewer meaningful word subcomponents than their are unique words, and seeing a subcomponent used across many words means that you need fewer words to learn the patterns it corresponds to.
A lot of the benefit of this approach seems to be through better representation of syntax; when tested on an analogy task, embeddings trained with subword information did meaningfully better on syntactic analogies (“swim is to swum as ran is to <>”) but equivalent or worse on semantic analogies (“mother is to girl as father is to <>”). One theory about this is that focusing on the subword elements does a better job of more quickly getting the representation of the word to be close to it’s exact meaning, but has a harder time learning precise semantics, relative to a full-word model.
First published: 2016/07/15 (3 years ago) Abstract: Continuous word representations, trained on large unlabeled corpora are
useful for many natural language processing tasks. Popular models that learn
such representations ignore the morphology of words, by assigning a distinct
vector to each word. This is a limitation, especially for languages with large
vocabularies and many rare words. In this paper, we propose a new approach
based on the skipgram model, where each word is represented as a bag of
character $n$-grams. A vector representation is associated to each character
$n$-gram; words being represented as the sum of these representations. Our
method is fast, allowing to train models on large corpora quickly and allows us
to compute word representations for words that did not appear in the training
data. We evaluate our word representations on nine different languages, both on
word similarity and analogy tasks. By comparing to recently proposed
morphological word representations, we show that our vectors achieve
state-of-the-art performance on these tasks.