**Problem Setting:** Sequence-to-sequence learning (seq2seq) is one of the most successful techniques in machine learning today. The basic idea is to encode a sequence into a vector (or a sequence of vectors when using an attention-based encoder) and then use a recurrent decoder to generate the target sequence conditioned on the encoder output. While researchers have explored various architectural changes to this basic encoder-decoder model, the standard way of training such seq2seq models is to maximize the likelihood of each successive target word conditioned on the input sequence and the *gold* history of target words. This is also known as *teacher-forcing* in the RNN literature. This approach has three major issues:

1. **Exposure Bias:** Since we teacher-force the model with the *gold* history during training, the model is never exposed to its own errors. At test time we do not have access to the *gold* history and instead feed back the history generated by the model. If that history is erroneous, the model has no clue how to recover from it.
2. **Loss-Evaluation Mismatch:** While we evaluate the model using sequence-level metrics (such as BLEU for machine translation), we train it with a word-level cross-entropy loss.
3. **Label Bias:** Since the word probabilities are locally normalized at each time step (via a softmax over the final layer of the decoder), varying the number of possible candidates at each step can result in label bias. More about this later.

**Solution:** This paper proposes an alternative training procedure for seq2seq models which attempts to solve all three issues listed above. The idea is to pose seq2seq learning as a beam-search optimization problem. The authors begin by removing the final softmax activation from the decoder, so instead of a probability distribution we get an unnormalized score for each possible next word.
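To make "removing the softmax" concrete, here is a minimal sketch of a decoder output step that emits raw scores instead of a distribution. All names (`decoder_step_scores`, `W_out`, `b_out`) are illustrative, not from the paper:

```python
import numpy as np

def decoder_step_scores(h, W_out, b_out):
    """One decoder output step WITHOUT the final softmax: returns raw,
    unnormalized scores over the vocabulary instead of probabilities.
    h: decoder hidden state; W_out, b_out: output projection (illustrative)."""
    return W_out @ h + b_out  # no softmax: scores need not sum to 1

rng = np.random.default_rng(0)
vocab_size, hidden_size = 6, 4
W_out = rng.normal(size=(vocab_size, hidden_size))
b_out = rng.normal(size=vocab_size)
h = rng.normal(size=hidden_size)
scores = decoder_step_scores(h, W_out, b_out)  # shape (6,), not a distribution
```

At test time, beam search can rank hypotheses by summing these raw scores directly, with no per-step normalization.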
The training procedure is then changed as follows. At every time step $t$, they maintain a set $S_t$ of $K$ candidate sequences of length $t$, and the loss function is defined with the following characteristics:

1. If the *gold* subsequence of length $t$ is in $S_t$ and its score exceeds the score of the $K$-th ranked candidate by a margin, the model incurs no loss. The candidates for the next time step are then chosen as in regular beam search with beam size $K$.
2. If the *gold* subsequence of length $t$ is in $S_t$ but is the $K$-th ranked candidate, the loss pushes the *gold* sequence up by increasing its score. The candidates for the next time step are chosen as in the first case.
3. If the *gold* subsequence of length $t$ is NOT in $S_t$, the loss increases the score of the *gold* sequence to exceed the $K$-th ranked candidate's score by a margin. In this case, the candidates for the next step are chosen by considering only the *gold* word at time $t$ and taking its top-$K$ successors.
4. Further, since we want the full *gold* sequence to end up at the top of the beam, when $t = T$ the loss is modified to require the score of the *gold* sequence to exceed the score of the *highest* ranked incorrect prediction by a margin.

This non-probabilistic training method has several advantages:

* The model is trained the same way it will be tested, since beam search is used during both training and testing. This helps eliminate exposure bias.
* The score-based loss can easily be scaled by a mistake-specific cost function. For example, in MT one could use a cost function inversely proportional to BLEU score, so there is no loss-evaluation mismatch.
* Each time step can have a different set of successor words based on any hard constraints in the problem. Since the model is non-probabilistic, this varying successor function does not introduce label bias.
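The per-time-step loss behind cases 1–3 above is a margin (hinge) loss between the gold prefix and the $K$-th ranked candidate. The following is my own simplified sketch, not the authors' code; the `cost` argument stands in for the mistake-specific scaling mentioned above:

```python
def bso_step_loss(gold_score, kth_score, cost=1.0, margin=1.0):
    """Hinge loss at one time step of beam-search optimization (a sketch).
    Zero when the gold prefix beats the K-th ranked candidate by the
    margin (case 1); positive when gold is the K-th ranked candidate
    (case 2) or has fallen out of the beam (case 3). `cost` is the
    mistake-specific scaling, e.g. derived from sentence-level BLEU."""
    return cost * max(0.0, margin - gold_score + kth_score)

# Case 1: gold comfortably ahead of the K-th candidate -> no loss.
assert bso_step_loss(gold_score=5.0, kth_score=3.0) == 0.0
# Cases 2/3: gold at or below the K-th candidate -> positive loss.
assert bso_step_loss(gold_score=3.0, kth_score=3.0) == 1.0
assert bso_step_loss(gold_score=2.0, kth_score=4.0, cost=0.5) == 1.5
```

For case 4 ($t = T$), `kth_score` would be replaced by the score of the highest-ranked *incorrect* full prediction, so the gold sequence is forced to the top of the beam.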
Refer to [this set of slides][1] for an excellent illustration of label bias. The cost of forward propagation grows linearly with the beam size $K$, though a GPU implementation should help reduce this cost. The authors also propose a clever way of doing BPTT which makes backpropagation cost almost the same as in ordinary seq2seq training.

**Additional Tricks:**

1. The authors pre-train the seq2seq model with the regular word-level cross-entropy loss. This is crucial: random initialization did not work.
2. They use a "curriculum beam" strategy in training, starting with a beam size of 2 and increasing it by 1 every 2 epochs until the required beam size is reached. The training beam size should be at least the test beam size plus 1 (i.e., $K_{tr} \geq K_{te} + 1$).
3. When using dropout, you need to be careful to apply the same dropout mask during backprop. The authors do this by sharing a single dropout mask across all candidate sequences at a time step.

**Experiments:** The authors compare the proposed model against a basic seq2seq model on word ordering, dependency parsing, and MT tasks. The proposed model achieves significant improvements over the strong baseline.

**Related Work:** The whole idea of the paper is based on the [learning as search optimization (LaSO) framework][2] of Daumé III and Marcu (2005). Other notable related work includes training seq2seq models with a mix of cross-entropy and REINFORCE, called [MIXER][3], and [an actor-critic based approach to seq2seq training][4]. The authors compare with MIXER and do significantly better.

**My two cents:** This is one of the important research directions, in my opinion. While other recent methods attempt to use reinforcement learning to avoid the issues of word-level cross-entropy training, this paper proposes a really simple score-based solution which works very well.
While most language generation research is stuck within the probabilistic framework (I am saying this w.r.t. deep NLP research), this paper highlights the benefits of non-probabilistic generation models. I see this as one potential way of avoiding the nasty scalability issues that come with softmax-based generative models.

[1]: http://www.cs.stanford.edu/~nmramesh/crf
[2]: https://www.isi.edu/~marcu/papers/daume05laso.pdf
[3]: http://arxiv.org/pdf/1511.06732v7.pdf
[4]: https://arxiv.org/pdf/1607.07086v2.pdf
This paper is also covered by the author in this [talk](https://github.com/udibr/notes/blob/master/Talk%20by%20Sasha%20Rush%20%20Interpreting%2C%20Training%2C%20and%20Distilling%20Seq2Seq%E2%80%A6.pdf).