[link]
The core goal of this paper is to perform in an unsupervised (read: without parallel texts) way what other machine translation researchers had previously only effectively performed in a supervised way: the creation of a wordtoword translational mapping between natural languages. To frame the problem concretely: the researchers start with word embeddings learned in each language independently, and their desired output is a set of nearest neighbors for a source word that contains the true target (i.e. translated) word as often a possible. An interesting bit of background for this paper is that Mikilov, who was the initial progenitor of the word embedding approach, went on to posit, based on experiments he’d conducted, that the embeddings produced by different languages share characteristics in vector space, such that one could expect a linear translation (i.e. taking a set of points and rotating, shifting, and/or scaling them) to be able to map from one language to another. This assumption is relied on heavily in this paper. A notional note: when I refer to “a mapped source embedding” or “mapped source”, that just means that a matrix transformation, captured in a weight matrix W, is being used to do some form of rotation, scaling, or shifting, to “map” between the source embedding space and the shared space. The three strategies this paper employs are: 1. Using adversarial training to try to force the distributions of the embeddings in source and target languages to be similar to one another 2. Taking examples where method (1) has high confidence, and borrowing a method from supervised wordtoword translation, called the Procrustes method, to further optimize the mapping into the shared vector space 3. Calculating the nearest neighbors of a source word using an approach they develop called “CrossDomain Similarity Local Scaling”. At a high level, this conducts nearest neighbors, but “normalizes” for density, so that, on an intuitive level, it’s basically scaling distances up in dense regions of the space, and scaling them down in sparse regions Focusing on (1) first, the notion here goes back to that assumption I mentioned earlier: that internal relationships within embedding space are similar across languages, such that if you able to align the overall distributions of target embedding with a mapped source embedding, then you might  if you take Mikilov’s assumption seriously  reasonably expect this to push words in the mappedsource space close to their corresponding words in target space. And this does work, to some degree, but the researchers found that this approach on it’s own didn’t get them to where they wanted to be in terms of accuracy. To further refine the mapping created by the adversarial training, the authors use something called the “Procrustes Method”. They go into it in more detail in the paper, but at a high level, it turns out that if you’re trying to solve the problem of minimizing the sum of squared distances between a mappedsource embedding and a target embedding, assuming that that mapping is linear, and that you want the weight matrix to be orthogonal, that problem reduces to doing the singular value decomposition of the matrix of source embeddings multiplied by the (transposed) matrix of target embeddings, for a set of ground truth shared words. Now, you may reasonably note: this is an unsupervised method, we don’t have access to ground truth embeddings across languages. And you would be correct. So, here, what the authors do is take words that are *mutual* nearest neighbors (according to the CSLS metric of nearest neighbors I’ll describe in (3) ) after conducting their adversariallylearned rotation, and take that mutualnearestneighbordom as a marker of high confidence in that word pair. They took these mutuallynearestneighbor pairs, and used those as “ground truth” to conduct this singular value decomposition, which was applied on top of the adversariallylearned rotation to get to their final mapping. (3) is described well in equation form in the paper itself, and is just a way of constructing a similarity metric between a mappedsource embedding and a target embedding that does some clever normalization. Specifically, it takes two times the (cosine) distance between Ws (mapped source) and t (target), and subtracts out the average (cosine) distance of Ws to its k nearest target words, as well as the (average) cosine distance of t to its k nearest source words. In this way, it normalizes the distance between Ws and t based on how dense each of their neighborhoods is. Using all of these approaches together, the authors really do get quite impressive performance. For ENES, ESEN, ENFR, FREN, ENDE, DEEN, and EO (Esperanto)EN, the performance of the adversarial method is within 0.5 accuracy score of the supervised method, with the adversarial method being higher in 5 of those 7 cases (note: I read this as "functionally equivalent"). Interestingly, though, for ENRU, RUEN, ENCHN, and CHNEN, the adversarial method was dramatically less effective, with accuracy deltas ranging from 5 to 10 points between the adversarial and the supervised method, with the supervised method prevailing in all cases. This suggests that the assumption of a simple linear mapping between the vector spaces of different languages may be a more valid one when the languages are more closely related, and thus closer in their structure. I'd be really interested in any experiments that try to actually confirm this by testing on a wider array of languages, or testing on subgroups of languages that are closer or farther (i.e. you would expect ESFR to do even better than ENFR, and you would expect ESDE to do worse than ENDE).
Your comment:
