If you’ve been paying any attention to the world of machine learning in the last five years, you’ve likely seen everyone’s favorite example of how Word2Vec word embeddings work: king - man + woman = queen. Given the ubiquity of Word2Vec and similar unsupervised embeddings, it can be easy to start thinking of them as the canonical definition of what a word embedding *is*. But that’s a little oversimplified. In the context of machine learning, an embedding layer simply means any layer structured as a lookup table: there is some predetermined number of discrete objects (for example, a vocabulary of words), each of which corresponds to a d-dimensional vector in the table (where d is a number of dimensions that you, as the model designer, choose). These embeddings are initialized in some way and trained jointly with the rest of the network, using some kind of objective function. Unsupervised, monolingual word embeddings are typically learned by giving a model a sample of the words that come before and after a given target word in a sentence, and then asking it to predict the target word in the center. Conceptually, words that appear in very similar contexts will tend to have similar word vectors. This happens because scores are calculated using the dot product of the target vector with each of the context word vectors; if two words are both to score highly in the same context, the dot product with their shared context vectors must be high for both, which pushes them towards similar values. For the last 3-4 years, unsupervised word vectors like these - which were made widely available for download - have been a canonical starting point for NLP problems. Starting from these representations made it easier to learn from smaller datasets, since knowledge about the relationships between words was transferred, through the embeddings themselves, from the larger corpus they were originally trained on.
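To make the "lookup table plus dot-product scoring" idea concrete, here is a minimal sketch in plain Python. It is not the Word2Vec implementation: the function names (`make_embedding_table`, `predict_center`), the toy vocabulary, the random initialization range, and the choice to keep separate context and target tables are all assumptions made for illustration, and no training step is shown.

```python
import math
import random


def make_embedding_table(vocab, d, seed=0):
    """Build a lookup table: one randomly initialized d-dimensional vector per word."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-0.5, 0.5) for _ in range(d)] for word in vocab}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict_center(context_words, context_table, target_table):
    """Score every candidate center word against the averaged context vectors.

    Returns a softmax distribution over the vocabulary (hypothetical helper).
    """
    d = len(next(iter(context_table.values())))
    # Average the context vectors looked up from the table.
    avg = [
        sum(context_table[w][i] for w in context_words) / len(context_words)
        for i in range(d)
    ]
    # Dot-product score of each candidate target vector with the context.
    scores = {w: dot(vec, avg) for w, vec in target_table.items()}
    # Numerically stable softmax over the scores.
    top = max(scores.values())
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Toy vocabulary and dimensionality, chosen arbitrarily for the example.
vocab = ["the", "teacher", "student", "reads", "book"]
context_table = make_embedding_table(vocab, d=4, seed=0)
target_table = make_embedding_table(vocab, d=4, seed=1)

# Probability of each word being the center word between "the" and "reads".
probs = predict_center(["the", "reads"], context_table, target_table)
```

With untrained vectors the probabilities are essentially arbitrary. Training would raise the dot product between the true center word and its context, which is what pulls words with shared contexts towards similar vectors.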
This paper seeks to challenge the dominance of monolingual embeddings by examining the embeddings learned when the objective is, instead, machine translation: given a sentence in one language, you must produce it in another. Remember: an embedding is just a lookup table of vectors, and you can use it as the first layer of a machine translation model just as you can at the start of a monolingual model. In theory, if the embeddings learned by a machine translation model had desirable properties, they could also be widely shared and used for transfer learning, as Word2Vec embeddings often are. When the authors compare the embeddings from these two approaches, they find some interesting results. For example, while the monolingual embeddings do a better job on analogy-based tests, machine translation embeddings do better at having similarity within their vector space map to true similarity of concept. Put another way, while monolingual systems push together words that appear in similar contexts (Teacher, Student, Principal), machine translation systems push words together when they map to the same or similar words in the target language (Teacher, Professor). The attached image shows some examples of this effect; the first three columns are all monolingual approaches, the final two are machine translation ones. When it comes to analogies, machine translation embeddings perform less well at semantic analogies (Ottawa is to Canada as Paris is to France) but do better at syntactic analogies (fast is to fastest as heavy is to heaviest). While I don’t totally understand why monolingual embeddings would be better at semantic analogies, it does make sense that the machine translation model would do a better job of encoding syntactic information, since such information is necessary to sensibly structure a sentence.
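For readers who have not seen how similarity and analogy tests are run against an embedding table, the sketch below shows the usual mechanics: cosine similarity for closeness, and vector arithmetic followed by a nearest-neighbor search for analogies. The three-dimensional vectors are made up by hand so that the arithmetic works out; they are not taken from the paper or from any trained model, and the helper names (`cosine`, `analogy`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def analogy(a, b, c, table):
    """Answer 'a is to b as c is to ?' using the vector b - a + c.

    The three query words are excluded from the candidates, as is conventional.
    """
    query = [vb - va + vc for va, vb, vc in zip(table[a], table[b], table[c])]
    candidates = (w for w in table if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(table[w], query))


# Hand-made toy vectors; dimensions loosely mean (royalty, gender, food).
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man": [0.1, 1.0, 0.0],
    "woman": [0.1, -1.0, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

# "man is to king as woman is to ?" is the same as king - man + woman.
result = analogy("man", "king", "woman", toy_vectors)
print(result)  # prints "queen" for these hand-made vectors
```

A similarity test, by contrast, compares `cosine(table[x], table[y])` directly against human similarity judgments for the pair, which is the kind of evaluation on which the machine translation embeddings are reported to do better.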