The problem setting of the paper is the desire to perform translation in a monolingual setting, where datasets exist of each language independently, but little or no paired sentence data (paired here meaning that you know you have the same sentence or text in both languages). The paper outlines the prior methods in this area as being, first, training a single-language language model (i.e. train a model to take in a sentence, and return how coherent of a sentence it is in a given language) and using that to supplement a machine translation system. The authors honestly don’t go into this much, so I can’t tell exactly what they mean by it. The second baseline they talk about is bootstrapping themselves additional training data, by training a model using a small amount of training data, then using that mediocre model to translate additional sentences, which they use as additional training data to train the mediocre model to a higher performance. It doesn’t seem like this should work, but I’ve seen this or similar approaches used in a few cases, and it typically does add benefit. But, the authors claim, they can do better. The core intuition of this paper is pretty simple, and will be familiar to anyone who read my summary of CycleGAN, lo these many weeks ago. Their approach rests on the idea that, even if you can’t push translation models to be objectively correct in a paired sense, you can push translation models to be symmetric with one another, insofar as translating from language A to B (let’s say English to French), and then back from French to English, gets you something in English that looks like your original input. This forces the model to maintain an informative mapping, so that enough information about the English sentence is stored to allow it to be reconstructed.However, unconstrained, the model could just develop a 1:1 word mapping that gives you information about the English input, but doesn’t actually map to the translation in French. If you can additionally confirm that the translation into French looks like a coherent French sentence (which, recall, we can do with a language model trained on French independently), we can get closer to to generating a mapping that is hopefully more coherent. One interesting aspect of this paper is the fact that the model they describe is trained with reinforcement learning. Typically, reinforcement learning is used for scenarios where you don’t have direct visibility into how the actions you take impact your loss function. Compare this to a supervised network (where you can take the derivative of your loss with respect to the last layer, and backpropogate that back through to your inputs), or even a GAN, where you can take the derivative of the discriminator-created loss back through the input the discriminator, and through into the GAN that created it. This model treats the translation models that are learned as policies; that is, probability distributions over sets of words. It samples multiple A -> B translations using something called beam search, which, as it samples the sequence of words, samples several at each timestep, and then keeps that chain alive by continuing to sample along it. This helps the sequential translation not fall into a pit where it samples one highly probable word, but then, as you add more words, it doesn’t lead towards a good sentence. This ultimately results in taking multiple (k=12, in this case) samples from each translation distribution, and so the model uses as its objective the expected rewards over these samples, where the rewards is constructed as a combination of in-language coherence loss (scored by using the log likelihood of a trained single-language model) and reconstruction loss (scored by close the A -> B -> A’ is to the original A). My confusion about the use of reinforcement loss here mostly comes from the question of whether it just wasn’t possible to build an end to end model, where, like a GAN, you backpropogated through a constructed input, back to the model that constructed it (in this case, through both translator models). Is the issue just that a sequence of words is fundamentally discrete, in a way that images aren’t, and in a way that impedes backprop? That seems possible, but also, I think it’s the case that a typical encoder-decoder model, that outputs softmaxes over words, is able to be backpropogated through. Overall, it’s hard for me to tell if I’m missing something basic about why reinforcement learning is the obvious choice here, or if other more GAN-like approaches were an option, but that meme hadn’t spread into the literature yet, and RL was the more historically canonical choice. One other minor disappointing note: it looks like their results are based on a scenario that does use a small number of bilingual training pairs, as a way to pretrain the translation models to a reasonable, non-random starting point. It’s not clear whether this method would have worked with an actual cold start, i.e. a translation model that has no idea what it’s doing, and is using only this as signal. That said, they used a much smaller number of bilingual pairs than a true supervised method, and so even with a need for a warm start, a method like this could still give you leverage over language pairs where there exists some paired data, but not enough to build a full, sophisticated model on top of.