Dual Learning for Machine Translation on ShortScience.org

Dual Learning for Machine Translation
Yingce Xia and Di He and Tao Qin and Liwei Wang and Nenghai Yu and Tie-Yan Liu and Wei-Ying Ma
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL

Summary by Denny Britz 7 years ago

TLDR; The authors finetune an FR -> EN NMT model using an RL-based dual game:

1. Pick a French sentence from a monolingual corpus and translate it to EN.
2. Use an EN language model to get a reward for the translation.
3. Translate the translation back into FR using an EN -> FR system.
4. Get a reward based on the consistency between the original and the reconstructed sentence.

Training this architecture with policy gradient, the authors make efficient use of monolingual data and show that a system trained on only 10% of the parallel data and finetuned with monolingual data achieves BLEU scores comparable to those of a system trained on the full set of parallel data.
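The four steps of the game can be sketched as a single function. This is only an illustration, not the authors' code: `dual_round`, the three callables, the `alpha` weight, and the toy stand-ins are all hypothetical names, and the consistency reward is assumed to be a log-probability of the original sentence given its translation.

```python
from typing import Callable, Dict


def dual_round(
    fr_sentence: str,
    translate_fr_en: Callable[[str], str],
    lm_en_score: Callable[[str], float],
    reconstruct_logprob: Callable[[str, str], float],
    alpha: float = 0.5,
) -> Dict[str, object]:
    """Play one round of the dual game for a single monolingual FR sentence."""
    # Step 1: translate the French sentence to English.
    en_translation = translate_fr_en(fr_sentence)
    # Step 2: score the translation with an English language model.
    lm_reward = lm_en_score(en_translation)
    # Steps 3-4: score how well the EN -> FR system recovers the original
    # (assumed here to be log P(original | translation)).
    recon_reward = reconstruct_logprob(fr_sentence, en_translation)
    # Combine both signals; alpha is an assumed mixing weight.
    total = alpha * lm_reward + (1.0 - alpha) * recon_reward
    return {
        "translation": en_translation,
        "lm_reward": lm_reward,
        "recon_reward": recon_reward,
        "total_reward": total,
    }


# Toy stand-ins so the sketch runs without any trained model.
TOY_DICT = {"bonjour": "hello", "monde": "world"}


def toy_translate(sentence: str) -> str:
    return " ".join(TOY_DICT.get(word, word) for word in sentence.split())


def toy_lm(sentence: str) -> float:
    return -1.0 * len(sentence.split())


def toy_recon(original: str, translated: str) -> float:
    return -0.5 * len(original.split())


result = dual_round("bonjour monde", toy_translate, toy_lm, toy_recon, alpha=0.5)
print(result)
```

In a real system the callables would wrap the pretrained translation models and language model, and the returned reward would drive the policy-gradient update.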

### Key Points

- Making efficient use of monolingual data to improve NMT systems is a challenge
- Two-agent communication game: Agent A only knows language A and agent B only knows language B. A sends a message through a noisy translation channel, B receives the message, checks its correctness, and sends it back through another noisy translation channel. A checks whether it is consistent with the original message. The translation channels are then improved based on the feedback.
- Pieces required: LanguageModel(A), LanguageModel(B), TranslationModel(A->B), TranslationModel(B->A). Monolingual Data.
- Total reward is a linear combination of `r1 = LM(translated_message)` and `r2 = log(P(original_message | translated_message))`
- Samples are generated with beam search, and the average over them is used as the gradient approximation
- EN -> FR pretrained on 100% of parallel data: 29.92 to 32.06 BLEU
- EN -> FR pretrained on 10% of parallel data: 25.73 to 28.73 BLEU
- FR -> EN pretrained on 100% of parallel data: 27.49 to 29.78 BLEU
- FR -> EN pretrained on 10% of parallel data: 22.27 to 27.50 BLEU
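The reward combination and the averaged gradient approximation from the points above can be illustrated with a few lines of plain Python. This is a minimal sketch under stated assumptions: `total_reward`, `averaged_policy_gradient`, and the weight `alpha` are hypothetical names, the estimator shown is a generic score-function (REINFORCE-style) average over K sampled translations, and the gradient vectors are made-up numbers rather than outputs of a real model.

```python
from typing import List, Sequence


def total_reward(lm_score: float, recon_logprob: float, alpha: float) -> float:
    """Linear combination of the LM reward r1 and the reconstruction reward r2."""
    return alpha * lm_score + (1.0 - alpha) * recon_logprob


def averaged_policy_gradient(
    rewards: Sequence[float],
    grad_log_probs: Sequence[Sequence[float]],
) -> List[float]:
    """Average reward_k * grad log P(sample_k) over the K sampled translations."""
    if not rewards or len(rewards) != len(grad_log_probs):
        raise ValueError("need one gradient vector per reward, and at least one sample")
    k = len(rewards)
    dim = len(grad_log_probs[0])
    return [
        sum(r * g[i] for r, g in zip(rewards, grad_log_probs)) / k
        for i in range(dim)
    ]


# Two hypothetical beam-search samples with made-up scores and gradients.
rewards = [
    total_reward(lm_score=-2.0, recon_logprob=-1.0, alpha=0.5),
    total_reward(lm_score=-4.0, recon_logprob=-3.0, alpha=0.5),
]
grads = [[1.0, 0.0], [0.0, 2.0]]
estimate = averaged_policy_gradient(rewards, grads)
print(rewards, estimate)
```

Averaging over several samples is a common way to reduce the variance of a single-sample policy-gradient estimate, which is relevant to the noise concerns raised in the notes.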

### Some Notes

- I think the idea is very interesting and we'll see a lot of related work coming out of this. It would be even more amazing if the architecture were trained from scratch using monolingual data only. Due to the high variance of RL methods this is probably quite hard to do, though.
- I think the key issue is that the rewards are quite noisy, as is the case with MT in general. Neither the language model nor the BLEU scores give good feedback on the "correctness" of a translation.
- I wonder why there is such a huge jump in BLEU scores for FR->EN on 10% of data, but not for EN->FR on the same amount of data.
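To make the size of that jump concrete, the gains can be computed directly from the BLEU numbers listed under Key Points. The snippet below only does the subtraction; the dictionary layout and variable names are chosen for illustration.

```python
# BLEU before and after dual-learning finetuning, copied from the Key Points.
bleu = {
    ("EN->FR", "100%"): (29.92, 32.06),
    ("EN->FR", "10%"): (25.73, 28.73),
    ("FR->EN", "100%"): (27.49, 29.78),
    ("FR->EN", "10%"): (22.27, 27.50),
}

# Absolute BLEU gain per setting, rounded to two decimals.
gains = {
    setting: round(after - before, 2)
    for setting, (before, after) in bleu.items()
}

for (direction, share), gain in gains.items():
    print(f"{direction} pretrained on {share} of parallel data: +{gain:.2f} BLEU")
```

The FR -> EN system pretrained on 10% of the data gains 5.23 BLEU, compared with 3.00 for EN -> FR in the same setting and roughly 2.1 to 2.3 for both systems pretrained on all of the data.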
