Summary by CodyWild 6 months ago
The core goal of this paper is to perform in an unsupervised (read: without parallel texts) way what other machine translation researchers had previously only effectively performed in a supervised way: the creation of a word-to-word translational mapping between natural languages. To frame the problem concretely: the researchers start with word embeddings learned in each language independently, and their desired output is a set of nearest neighbors for a source word that contains the true target (i.e. translated) word as often a possible.
An interesting bit of background for this paper is that Mikilov, who was the initial progenitor of the word embedding approach, went on to posit, based on experiments he’d conducted, that the embeddings produced by different languages share characteristics in vector space, such that one could expect a linear translation (i.e. taking a set of points and rotating, shifting, and/or scaling them) to be able to map from one language to another. This assumption is relied on heavily in this paper. A notional note: when I refer to “a mapped source embedding” or “mapped source”, that just means that a matrix transformation, captured in a weight matrix W, is being used to do some form of rotation, scaling, or shifting, to “map” between the source embedding space and the shared space.
The three strategies this paper employs are:
1. Using adversarial training to try to force the distributions of the embeddings in source and target languages to be similar to one another
2. Taking examples where method (1) has high confidence, and borrowing a method from supervised word-to-word translation, called the Procrustes method, to further optimize the mapping into the shared vector space
3. Calculating the nearest neighbors of a source word using an approach they develop called “Cross-Domain Similarity Local Scaling”. At a high level, this conducts nearest neighbors, but “normalizes” for density, so that, on an intuitive level, it’s basically scaling distances up in dense regions of the space, and scaling them down in sparse regions
Focusing on (1) first, the notion here goes back to that assumption I mentioned earlier: that internal relationships within embedding space are similar across languages, such that if you able to align the overall distributions of target embedding with a mapped source embedding, then you might - if you take Mikilov’s assumption seriously - reasonably expect this to push words in the mapped-source space close to their corresponding words in target space. And this does work, to some degree, but the researchers found that this approach on it’s own didn’t get them to where they wanted to be in terms of accuracy.
To further refine the mapping created by the adversarial training, the authors use something called the “Procrustes Method”. They go into it in more detail in the paper, but at a high level, it turns out that if you’re trying to solve the problem of minimizing the sum of squared distances between a mapped-source embedding and a target embedding, assuming that that mapping is linear, and that you want the weight matrix to be orthogonal, that problem reduces to doing the singular value decomposition of the matrix of source embeddings multiplied by the (transposed) matrix of target embeddings, for a set of ground truth shared words. Now, you may reasonably note: this is an unsupervised method, we don’t have access to ground truth embeddings across languages. And you would be correct. So, here, what the authors do is take words that are *mutual* nearest neighbors (according to the CSLS metric of nearest neighbors I’ll describe in (3) ) after conducting their adversarially-learned rotation, and take that mutual-nearest-neighbor-dom as a marker of high confidence in that word pair. They took these mutually-nearest-neighbor pairs, and used those as “ground truth” to conduct this singular value decomposition, which was applied on top of the adversarially-learned rotation to get to their final mapping.
(3) is described well in equation form in the paper itself, and is just a way of constructing a similarity metric between a mapped-source embedding and a target embedding that does some clever normalization. Specifically, it takes two times the (cosine) distance between Ws (mapped source) and t (target), and subtracts out the average (cosine) distance of Ws to its k nearest target words, as well as the (average) cosine distance of t to its k nearest source words. In this way, it normalizes the distance between Ws and t based on how dense each of their neighborhoods is.
Using all of these approaches together, the authors really do get quite impressive performance. For EN-ES, ES-EN, EN-FR, FR-EN, EN-DE, DE-EN, and EO (Esperanto)-EN, the performance of the adversarial method is within 0.5 accuracy score of the supervised method, with the adversarial method being higher in 5 of those 7 cases (note: I read this as "functionally equivalent"). Interestingly, though, for EN-RU, RU-EN, EN-CHN, and CHN-EN, the adversarial method was dramatically less effective, with accuracy deltas ranging from 5 to 10 points between the adversarial and the supervised method, with the supervised method prevailing in all cases. This suggests that the assumption of a simple linear mapping between the vector spaces of different languages may be a more valid one when the languages are more closely related, and thus closer in their structure. I'd be really interested in any experiments that try to actually confirm this by testing on a wider array of languages, or testing on subgroups of languages that are closer or farther (i.e. you would expect ES-FR to do even better than EN-FR, and you would expect ES-DE to do worse than EN-DE).

more
less