First published: 2018/10/31
Abstract: In NMT, how far can we get without attention and without separate encoding
and decoding? To answer that question, we introduce a recurrent neural
translation model that does not use attention and does not have a separate
encoder and decoder. Our eager translation model is low-latency, writing target
tokens as soon as it reads the first source token, and uses constant memory
during decoding. It performs on par with the standard attention-based model of
Bahdanau et al. (2014), and better on long sentences.
An attention mechanism and a separate encoder/decoder are two properties of almost every single neural translation model. The question asked in this paper is: how far can we go without attention and without a separate encoder and decoder? And the answer is: pretty far! The model presented performs just as well as the attention model of Bahdanau et al. (2014) on the four language directions that are studied in the paper.
The translation model presented in the paper is basically a simple recurrent language model. A recurrent language model receives at every timestep the current input word and has to predict the next word in the dataset. To translate with such a model, simply give it the current word from the source sentence and have it try to predict the next word from the target sentence.
Obviously, in many cases such a simple model wouldn't work. For example, if your sentence was "The white dog" and you wanted to translate to Spanish ("El perro blanco"), at the 2nd timestep, the input would be "white" and the expected output would be "perro" (dog). But how could the model predict "perro" when it hasn't seen "dog" yet?
To solve this issue, we preprocess the data before training and insert "empty" padding tokens into the target sentence. When the model outputs such a token, it means that the model would like to read more of the input sentence before emitting the next output word.
So in the example from above, we would change the target sentence to "El PAD perro blanco". Now, at timestep 2 the model emits the PAD symbol. At timestep 3, when the input is "dog", the model can emit the token "perro". These padding symbols are deleted in post-processing, before the output is returned to the user. You can see a visualization of the decoding process below:
To enable beam search, our model actually receives the previously emitted target token in addition to the current source token at every timestep.
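The padding scheme described above can be sketched in a few lines of Python. This is only an illustration, not the code from the repository: the names `insert_padding`, `strip_padding` and `ready_at` are hypothetical, and `ready_at` (the timestep at which each target word is allowed to be emitted) is assumed to be given, since how it is computed is not covered here.

```python
PAD = "PAD"  # placeholder name for the "empty" padding token

def insert_padding(target_tokens, ready_at):
    """Delay each target token until its allowed timestep by inserting PAD.

    ready_at[j] is the 1-based timestep (i.e. the number of source tokens
    that must have been read) before target token j may be emitted.
    This input is an assumption of the sketch.
    """
    padded = []
    for token, timestep in zip(target_tokens, ready_at):
        # Emit PAD until the output position reaches the allowed timestep.
        while len(padded) < timestep - 1:
            padded.append(PAD)
        padded.append(token)
    return padded

def strip_padding(tokens):
    """Post-processing: delete the padding symbols from the model output."""
    return [token for token in tokens if token != PAD]

# Source: "The white dog". "perro" has to wait for "dog" (timestep 3).
padded = insert_padding(["El", "perro", "blanco"], ready_at=[1, 3, 2])
print(" ".join(padded))                 # El PAD perro blanco
print(" ".join(strip_padding(padded)))  # El perro blanco
```

A target word whose allowed timestep has already passed (here "blanco") is simply emitted at the next free position, so the target word order is never changed.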
PyTorch code for the model is available at https://github.com/ofirpress/YouMayNotNeedAttention
RNN language models are composed of:
1. Embedding layer
2. Recurrent layer(s) (RNN/LSTM/GRU/...)
3. Softmax layer (linear transformation + softmax operation)
The embedding matrix and the matrix of the linear transformation just before the softmax operation are of the same size (size_of_vocab * recurrent_state_size).
They both contain one representation for each word in the vocabulary.
## __Weight Tying__
This paper shows that by using the same matrix as both the input embedding and the pre-softmax linear transformation (the output embedding), the performance of a wide variety of language models is improved while the number of parameters is massively reduced.
In weight tied models each word has just one representation that is used in both the input and output embedding.
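As a rough illustration of what tying means, here is a dependency-free Python sketch in which one matrix is used both for the embedding lookup and for the pre-softmax transformation. The helper names and the toy sizes are made up for this example, the recurrent layer is left out, and no bias term is used; in a framework such as PyTorch the same effect is usually obtained by letting both layers share one parameter tensor.

```python
import math
import random

def init_matrix(rows, cols, rng):
    """Small random matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def embed(embedding, word_id):
    """Input embedding: look up the row that represents word_id."""
    return embedding[word_id]

def output_distribution(out_embedding, hidden):
    """Pre-softmax linear transformation (one dot product per word), then softmax."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in out_embedding]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def count_params(*matrices):
    """Count parameters, counting a shared matrix only once."""
    unique = {id(m): m for m in matrices}
    return sum(len(m) * len(m[0]) for m in unique.values())

rng = random.Random(0)
size_of_vocab, recurrent_state_size = 10, 4  # toy sizes, chosen for illustration

# Un-tied model: two separate matrices of the same shape.
untied_in = init_matrix(size_of_vocab, recurrent_state_size, rng)
untied_out = init_matrix(size_of_vocab, recurrent_state_size, rng)

# Tied model: a single matrix plays both roles.
tied_in = tied_out = init_matrix(size_of_vocab, recurrent_state_size, rng)

# The recurrent layer is omitted; the embedding stands in for the hidden state.
hidden = embed(tied_in, 3)
probs = output_distribution(tied_out, hidden)

print(count_params(untied_in, untied_out))  # 80
print(count_params(tied_in, tied_out))      # 40
```

Because both roles point at the same object, any update to the matrix changes the input and the output representation of a word at the same time.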
## __Why does weight tying work?__
1. In the paper we show that in un-tied language models, the output embedding contains much better word representations than the input embedding. We show that when the embedding matrices are tied, the quality of the shared embeddings is comparable to that of the output embedding in the un-tied model. So in the tied model the quality of the input and output embeddings is superior to the quality of those embeddings in the un-tied model.
2. In most language modeling tasks, the models tend to overfit because of the small size of the datasets. When the number of parameters is reduced in a way that makes sense, there is less overfitting because of the reduction in the capacity of the network.
## __Can I tie the input and output embeddings of the decoder of a translation model?__
Yes, we show that this reduces the model's size while not hurting its performance.
In addition, we show that if you preprocess your data using BPE, because of the large overlap between the subword vocabularies of the source and target language, __Three-Way Weight Tying__ can be used. In Three-Way Weight Tying, we tie the input embedding in the encoder to the input and output embeddings of the decoder (so each word has one representation which is used across three matrices).
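To get a feeling for the savings, the following back-of-the-envelope Python sketch counts only the embedding parameters under the three schemes. It assumes a single joint subword vocabulary shared by both languages; the function name and the sizes (32,000 subwords, 512 dimensions) are hypothetical and not taken from the paper.

```python
def embedding_params(vocab_size, dim, scheme):
    """Embedding parameters of an encoder-decoder model with a joint vocabulary.

    "untied":       encoder input, decoder input and decoder output are separate.
    "decoder_tied": decoder input and output share one matrix.
    "three_way":    all three roles share one matrix.
    """
    distinct_matrices = {"untied": 3, "decoder_tied": 2, "three_way": 1}[scheme]
    return distinct_matrices * vocab_size * dim

# Hypothetical sizes, chosen only for illustration.
for scheme in ("untied", "decoder_tied", "three_way"):
    print(scheme, embedding_params(32000, 512, scheme))
# untied 49152000
# decoder_tied 32768000
# three_way 16384000
```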
[This](http://ofir.io/Neural-Language-Modeling-From-Scratch/) blog post contains more details about the weight tying method.
This paper shows how to train a character level RNN to generate text using only the GAN objective (reinforcement learning and the maximum-likelihood objective are not used).
The baseline WGAN is made up of:
* A recurrent **generator** that first embeds the previously emitted token and inputs this into a GRU, which outputs a state that is then transformed into a distribution over the character vocabulary (which represents the model's belief about the next output token).
* A recurrent **discriminator** that embeds each input token and then feeds them into a GRU. A linear transformation is used on the final hidden state in order to give a "score" to the input (a correctly-trained discriminator should give a high score to real sequences of text and a low score to fake ones).
The paper shows that if you try to train this baseline model to generate sequences of length 32, it just won't work (only gibberish is generated).
In order to get the model to work, the baseline model is augmented in three different ways:
1. **Curriculum Learning**: At first the generator has to generate sequences of length 1 and the discriminator only trains on real and generated sequences of length 1. After a while, the models move on to sequences of length 2, and then 3, and so on, until we reach length 32.
2. **Teacher Helping**: In GANs the problem is usually that the generator is too weak. In order to help it, this paper proposes a method in which at stage $i$ in the curriculum, when the generator should generate sequences of length $i$, we feed it a real sequence of length $i-1$ and ask it to just generate 1 character more.
3. **Variable Lengths**: In each stage $i$ in the curriculum learning process, we generate and discriminate sequences of length $k$, for each $1 \leq k \leq i$ in each batch (instead of just generating and discriminating sequences of length exactly $i$).
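The three augmentations can be combined into one data schedule, sketched below in plain Python. This is an illustration under several assumptions, not the paper's training code: `toy_generator` is a hypothetical stand-in for the GRU generator (it just emits random characters), each stage yields a single batch instead of training for many iterations, and teacher helping is applied to every length $k$ in the batch, which is one possible way of combining it with variable lengths.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def toy_generator(prefix, n_new, rng):
    """Hypothetical stand-in for the GRU generator: appends n_new random characters."""
    return prefix + "".join(rng.choice(ALPHABET) for _ in range(n_new))

def fake_sample(real_text, length, rng, teacher_helping=True):
    """Build one generated sequence with `length` characters.

    With teacher helping, the first length - 1 characters are real and
    only the last character is produced by the generator.
    """
    if teacher_helping:
        return toy_generator(real_text[:length - 1], 1, rng)
    return toy_generator("", length, rng)

def stage_batches(real_text, max_len=32, seed=0):
    """Yield (stage, real sequences, generated sequences) for each curriculum stage.

    real_text must contain at least max_len characters.
    """
    rng = random.Random(seed)
    for stage in range(1, max_len + 1):      # curriculum: stage i = 1 .. max_len
        lengths = range(1, stage + 1)        # variable lengths: every k with 1 <= k <= i
        real = [real_text[:k] for k in lengths]
        fake = [fake_sample(real_text, k, rng) for k in lengths]
        yield stage, real, fake

for stage, real, fake in stage_batches("the quick brown fox jumps", max_len=3):
    print(stage, real, fake)
```

In an actual training loop the discriminator would score `real` and `fake` at every stage, and the move to the next stage would happen only after a number of updates.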