You May Not Need Attention
Paper summary

An attention mechanism and a separate encoder and decoder are two properties of almost every neural translation model. The question asked in this paper is: how far can we go without attention and without a separate encoder and decoder? And the answer is: pretty far! The model presented performs just as well as the attention model of Bahdanau on the four language directions studied in the paper.

The translation model presented in the paper is basically a simple recurrent language model. A recurrent language model receives the current input word at every timestep and has to predict the next word in the dataset. To translate with such a model, simply give it the current word from the source sentence and have it try to predict the next word from the target sentence.

Obviously, in many cases such a simple model wouldn't work. For example, if your sentence was "The white dog" and you wanted to translate it to Spanish ("El perro blanco"), at the 2nd timestep the input would be "white" and the expected output would be "perro" (dog). But how could the model predict "perro" when it hasn't seen "dog" yet?

To solve this issue, we preprocess the data before training and insert "empty" padding tokens into the target sentence. When the model outputs such a token, it means that the model would like to read more of the input sentence before emitting the next output word. So in the example above, we would change the target sentence to "El PAD perro blanco". Now, at timestep 2 the model emits the PAD symbol. At timestep 3, when the input is "dog", the model can emit the token "perro". These padding symbols are deleted in post-processing, before the output is returned to the user.

[Figure: visualization of the decoding process]

To enable us to use beam search, our model actually receives the previously output target token in addition to the current source token at every timestep.

PyTorch code for the model is available at
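The padding step described in the summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' preprocessing code: the `PAD` string, the function names `pad_target` and `strip_pads`, and the `needs_source_upto` list (which stands in for the output of an external word aligner) are all hypothetical.

```python
PAD = "<pad>"  # hypothetical padding symbol; the summary writes it as PAD


def pad_target(target, needs_source_upto):
    """Insert PAD tokens so that each target word is emitted only after
    the source word it depends on has been read.

    target: list of target-language tokens.
    needs_source_upto: for each target token, the 0-based index of the
        source token that must have been read before it can be emitted
        (assumed here to come from an external word aligner).
    """
    padded = []
    for token, src_idx in zip(target, needs_source_upto):
        # The token at output position t is produced while the model reads
        # source position t, so pad until t has reached src_idx.
        while len(padded) < src_idx:
            padded.append(PAD)
        padded.append(token)
    return padded


def strip_pads(tokens):
    """Post-processing: drop padding symbols before returning the output."""
    return [t for t in tokens if t != PAD]


# Example from the summary: "The white dog" -> "El perro blanco".
# Assumed alignment: El<-The (0), perro<-dog (2), blanco<-white (1).
target = ["El", "perro", "blanco"]
padded = pad_target(target, [0, 2, 1])
print(padded)              # ['El', '<pad>', 'perro', 'blanco']
print(strip_pads(padded))  # ['El', 'perro', 'blanco']
```

In this sketch the padded target ends up one token longer than the three-word source. A full pipeline would presumably also need to extend the source to the same length, but how that is done is not covered by the summary, so it is left out here.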
You May Not Need Attention
Ofir Press and Noah A. Smith
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CL


Summary by CodyWild 2 years ago
Not all research advances are made with state of the art models. Sometimes new methods are introduced that are slow, parameter-heavy or have some other deficiency. Such ideas are not meant to be introduced into production servers; they are meant to spark a discussion, which could then lead the research community to discover new ideas which will one day be used to improve state of the art models.

This paper does not try to present a state of the art model. This paper was written to question two commonly held beliefs: that separate encoding and decoding are necessary in NMT, and that attention is a required component of all NMT models. If you look at the many NMT models published in the last two years, all of them contain at least one of these properties, and the vast majority contain both.

This paper is not claiming that we should just throw away all of that progress. We just want to show the research community that it is possible to build vastly different translation models that can still perform well. The specific model that we presented also has the advantage of being extremely simple. Usually, new NMT papers introduce a new mechanism that makes an existing model more complex. Here we want to step forward by showing an NMT model that is simpler than almost anything else that came before it.

Yes, it's not state of the art. Yes, it is trained using external alignment data. Yes, it requires special preprocessing. But it shatters two widely held beliefs. It also uses a constant amount of memory. And it works well on long sequences, which are known to be difficult for attention models. We firmly believe that these advantages far outweigh the disadvantages of our model.
And *that* is why we posted this paper. We think the community should start thinking more about models that don’t use attention. Or models that have combined encoding/decoding. Or maybe just take the eagerness property from our model and apply it to an attention model. These research directions could lead to an improvement in performance of state of the art models on long sequences. Or they could be used to lower the memory requirements of simultaneous translation systems. Interesting methods aren’t found only in state of the art models.

Summary by Ofir Press 2 years ago
