Word Translation Without Parallel DataWord Translation Without Parallel DataAlexis Conneau and Guillaume Lample and Marc'Aurelio Ranzato and Ludovic Denoyer and Hervé Jégou2017
Paper summarymarekInducing word translations using only monolingual corpora for two languages. Separate embeddings are trained for each language and a mapping is learned though an adversarial objective, along with an orthogonality constraint on the most frequent words. A strategy for an unsupervised stopping criterion is also proposed.
Word Translation Without Parallel Data
arXiv e-Print archive - 2017 via Local arXiv
First published: 2017/10/11 (1 year ago) Abstract: State-of-the-art methods for learning cross-lingual word embeddings have
relied on bilingual dictionaries or parallel corpora. Recent works showed that
the need for parallel data supervision can be alleviated with character-level
information. While these methods showed encouraging results, they are not on
par with their supervised counterparts and are limited to pairs of languages
sharing a common alphabet. In this work, we show that we can build a bilingual
dictionary between two languages without using any parallel corpora, by
aligning monolingual word embedding spaces in an unsupervised way. Without
using any character information, our model even outperforms existing supervised
methods on cross-lingual tasks for some language pairs. Our experiments
demonstrate that our method works very well also for distant language pairs,
like English-Russian or English-Chinese. We finally show that our method is a
first step towards fully unsupervised machine translation and describe
experiments on the English-Esperanto language pair, on which there only exists
a limited amount of parallel data.
The core goal of this paper is to perform in an unsupervised (read: without parallel texts) way what other machine translation researchers had previously only effectively performed in a supervised way: the creation of a word-to-word translational mapping between natural languages. To frame the problem concretely: the researchers start with word embeddings learned in each language independently, and their desired output is a set of nearest neighbors for a source word that contains the true target (i.e. translated) word as often a possible.
An interesting bit of background for this paper is that Mikilov, who was the initial progenitor of the word embedding approach, went on to posit, based on experiments he’d conducted, that the embeddings produced by different languages share characteristics in vector space, such that one could expect a linear translation (i.e. taking a set of points and rotating, shifting, and/or scaling them) to be able to map from one language to another. This assumption is relied on heavily in this paper. A notional note: when I refer to “a mapped source embedding” or “mapped source”, that just means that a matrix transformation, captured in a weight matrix W, is being used to do some form of rotation, scaling, or shifting, to “map” between the source embedding space and the shared space.
The three strategies this paper employs are:
1. Using adversarial training to try to force the distributions of the embeddings in source and target languages to be similar to one another
2. Taking examples where method (1) has high confidence, and borrowing a method from supervised word-to-word translation, called the Procrustes method, to further optimize the mapping into the shared vector space
3. Calculating the nearest neighbors of a source word using an approach they develop called “Cross-Domain Similarity Local Scaling”. At a high level, this conducts nearest neighbors, but “normalizes” for density, so that, on an intuitive level, it’s basically scaling distances up in dense regions of the space, and scaling them down in sparse regions
Focusing on (1) first, the notion here goes back to that assumption I mentioned earlier: that internal relationships within embedding space are similar across languages, such that if you able to align the overall distributions of target embedding with a mapped source embedding, then you might - if you take Mikilov’s assumption seriously - reasonably expect this to push words in the mapped-source space close to their corresponding words in target space. And this does work, to some degree, but the researchers found that this approach on it’s own didn’t get them to where they wanted to be in terms of accuracy.
To further refine the mapping created by the adversarial training, the authors use something called the “Procrustes Method”. They go into it in more detail in the paper, but at a high level, it turns out that if you’re trying to solve the problem of minimizing the sum of squared distances between a mapped-source embedding and a target embedding, assuming that that mapping is linear, and that you want the weight matrix to be orthogonal, that problem reduces to doing the singular value decomposition of the matrix of source embeddings multiplied by the (transposed) matrix of target embeddings, for a set of ground truth shared words. Now, you may reasonably note: this is an unsupervised method, we don’t have access to ground truth embeddings across languages. And you would be correct. So, here, what the authors do is take words that are *mutual* nearest neighbors (according to the CSLS metric of nearest neighbors I’ll describe in (3) ) after conducting their adversarially-learned rotation, and take that mutual-nearest-neighbor-dom as a marker of high confidence in that word pair. They took these mutually-nearest-neighbor pairs, and used those as “ground truth” to conduct this singular value decomposition, which was applied on top of the adversarially-learned rotation to get to their final mapping.
(3) is described well in equation form in the paper itself, and is just a way of constructing a similarity metric between a mapped-source embedding and a target embedding that does some clever normalization. Specifically, it takes two times the (cosine) distance between Ws (mapped source) and t (target), and subtracts out the average (cosine) distance of Ws to its k nearest target words, as well as the (average) cosine distance of t to its k nearest source words. In this way, it normalizes the distance between Ws and t based on how dense each of their neighborhoods is.
Using all of these approaches together, the authors really do get quite impressive performance. For EN-ES, ES-EN, EN-FR, FR-EN, EN-DE, DE-EN, and EO (Esperanto)-EN, the performance of the adversarial method is within 0.5 accuracy score of the supervised method, with the adversarial method being higher in 5 of those 7 cases (note: I read this as "functionally equivalent"). Interestingly, though, for EN-RU, RU-EN, EN-CHN, and CHN-EN, the adversarial method was dramatically less effective, with accuracy deltas ranging from 5 to 10 points between the adversarial and the supervised method, with the supervised method prevailing in all cases. This suggests that the assumption of a simple linear mapping between the vector spaces of different languages may be a more valid one when the languages are more closely related, and thus closer in their structure. I'd be really interested in any experiments that try to actually confirm this by testing on a wider array of languages, or testing on subgroups of languages that are closer or farther (i.e. you would expect ES-FR to do even better than EN-FR, and you would expect ES-DE to do worse than EN-DE).
Inducing word translations using only monolingual corpora for two languages. Separate embeddings are trained for each language and a mapping is learned though an adversarial objective, along with an orthogonality constraint on the most frequent words. A strategy for an unsupervised stopping criterion is also proposed.