Sequence to Sequence Learning with Neural Networks
Sutskever, Ilya and Vinyals, Oriol and Le, Quoc V., 2014
## Paper summary (abhshkdz)

This paper presents a simple approach to predicting
sequences from sequential input. They use a multi-layer
LSTM-based encoder-decoder architecture and show
promising results on the task of neural machine translation.
Their approach beats a phrase-based statistical machine
translation (SMT) system by more than 1 BLEU point, and comes
close to state-of-the-art when used to re-rank the 1000-best
predictions from the SMT system. Main contributions:
- The first LSTM encodes the input sequence into a single
fixed-length vector, which is then decoded by a second LSTM. The end
of a sequence is indicated by a special end-of-sequence token.
- 4-layer deep LSTMs.
- 160k source vocabulary, 80k target vocabulary. Trained on
12M sentences. Words in output sequence are generated by a softmax
over fixed vocabulary.
- Beam search is used at test time to predict translations
(a beam size of 2 performs best).
- Qualitative results (PCA projections) show that learned representations are
fairly insensitive to active/passive voice, as sentences similar in meaning
are clustered together.
- Another interesting observation: reversing the source
sequence gives a significant boost on long sentences and improves
performance overall, most likely because it introduces
short-term dependencies that are more easily captured by the gradients.
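The test-time beam search mentioned above can be sketched generically. This is a minimal sketch, not the paper's implementation; the toy scorer below is a hypothetical stand-in for the decoder LSTM's per-step softmax over the target vocabulary:

```python
import math

def beam_search(score_fn, eos, beam_size=2, max_len=10):
    """Generic left-to-right beam search.

    score_fn(prefix) -> {token: log_prob} stands in for the
    decoder LSTM's softmax over the target vocabulary."""
    beams = [((), 0.0)]               # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for tok, tok_logp in score_fn(seq).items():
                cand = (seq + (tok,), logp + tok_logp)
                (finished if tok == eos else candidates).append(cand)
        if not candidates:
            break
        # keep only the beam_size best partial hypotheses
        beams = sorted(candidates, key=lambda c: -c[1])[:beam_size]
    finished.extend(beams)            # unfinished hypotheses still compete
    return max(finished, key=lambda c: c[1])

# Hypothetical toy scorer: prefers "a", forces end-of-sequence after 3 tokens.
def toy_scorer(prefix):
    if len(prefix) >= 3:
        return {"</s>": math.log(1.0)}
    return {"a": math.log(0.6), "b": math.log(0.3), "</s>": math.log(0.1)}

best, best_logp = beam_search(toy_scorer, eos="</s>", beam_size=2)
# best is the highest-scoring sequence "a a a </s>" under this toy scorer
```

With a softmax decoder, a beam size of 1 reduces this to greedy decoding, which the paper reports already works surprisingly well.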
## Weaknesses / Notes
- The idea of reversing the source input needs better justification;
otherwise it comes across as an 'ugly hack'.
- To re-score the n-best list of predictions of the baseline,
they average the confidences of the LSTM and the baseline model. They should
also have reported re-ranking accuracies using just the LSTM model.
#### TLDR

The authors show that a multilayer LSTM (4 layers, 1000 cells per layer, 1000-dimensional embeddings, 160k source vocabulary, 80k target vocabulary) can achieve competitive results on machine translation tasks. They find that reversing the input sequence leads to significant improvements, most likely due to the introduction of short-term dependencies that are more easily captured by the gradients. Somewhat surprisingly, the LSTM did not have difficulties with long sentences. The model achieves competitive results on its own (34.8 BLEU), and comes close to the state of the art when coupled with an existing baseline system (36.5 BLEU).
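The encoder-decoder pattern described here can be sketched with a toy vanilla RNN standing in for the 4-layer LSTM stack. All names, dimensions, and initializations below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
H, V_SRC, V_TGT = 8, 20, 16   # toy hidden size and vocab sizes (illustrative)

# Separate parameter sets for encoder and decoder, as in the paper.
enc = {"E": rng.normal(0, 0.1, (V_SRC, H)), "W": rng.normal(0, 0.1, (H, H))}
dec = {"E": rng.normal(0, 0.1, (V_TGT, H)), "W": rng.normal(0, 0.1, (H, H)),
       "U": rng.normal(0, 0.1, (H, V_TGT))}

def encode(src_ids):
    """Fold the source sequence into a single fixed-length vector."""
    h = np.zeros(H)
    for i in reversed(src_ids):        # input reversal, as in the paper
        h = np.tanh(enc["E"][i] + enc["W"] @ h)
    return h

def decode_greedy(h, bos=0, eos=1, max_len=5):
    """Greedily emit target tokens from the encoder's summary vector."""
    out, tok = [], bos
    for _ in range(max_len):
        h = np.tanh(dec["E"][tok] + dec["W"] @ h)
        logits = h @ dec["U"]
        tok = int(np.argmax(logits))   # softmax argmax == logits argmax
        if tok == eos:
            break
        out.append(tok)
    return out

summary = encode([3, 7, 2])            # whole sentence -> one vector
translation = decode_greedy(summary)
```

The key design point the sketch illustrates: the entire source sentence must be compressed into the single vector `summary` before any target word is produced.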
#### Key Points
- Reversing the input sequence leads to significant improvements.
- Deep LSTM performs much better than shallow LSTM.
- Use different parameters for the encoder and decoder. This allows parallel training of decoders for multiple target languages.
- 4 layers, 1000 cells per layer, 1000-dimensional word embeddings. 160k source vocabulary, 80k target vocabulary. Trained on 12M sentences (652M words). SGD with a fixed learning rate of 0.7, halved every half epoch after 5 initial epochs. Gradient clipping. Parallelization across GPUs reaches 6.3k words/sec.
- Batching sentences of approximately the same length leads to a 2x speedup.
- PCA projection shows meaningful clusters of sentences robust to passive/active voice, suggesting that the fixed vector representation captures meaning.
- "No complete explanation" for why the LSTM does so much better with the introduced short-range dependencies.
- A beam size of 1 already performs well; a beam size of 2 is best for the deep model.
- The performance here seems largely due to the computational resources available and an optimized implementation. These models are quite big by most standards, and other approaches (e.g. attention) might lead to better results given comparable resources.
- Reversing the input still feels like a hack; there should be a more principled way to handle long-range dependencies.
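The short-term-dependency argument for input reversal can be made concrete with a toy calculation, under the simplifying assumption of a monotonic word-for-word alignment (an illustration, not the paper's analysis). Count the steps between reading source word i and emitting target word i:

```python
def gap(src, i, reverse=False):
    """Steps between reading source word i and emitting target word i,
    assuming a monotonic word-for-word alignment (simplifying assumption)."""
    n = len(src)
    pos = (n - 1 - i) if reverse else i      # position of word i after reversal
    # remaining source tokens after reading word i, plus i decoding steps
    return (n - 1 - pos) + i

src = ["a", "b", "c", "d", "e"]
normal = [gap(src, i, reverse=False) for i in range(len(src))]
flipped = [gap(src, i, reverse=True) for i in range(len(src))]
# normal:  [4, 4, 4, 4, 4] -- every aligned pair equally far apart
# flipped: [0, 2, 4, 6, 8] -- first words become adjacent; average unchanged
```

Reversal leaves the average distance between aligned words the same but makes the first few source words very close to their targets, which is exactly the kind of short-term dependency the summary says the gradients capture more easily.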