This is a very technical paper, and I only covered the items that interested me.

* Model
  * Encoder
    * 8 LSTM layers
    * only the first encoder layer is bidirectional
    * the top 4 layers add their input to their output (residual connections)
  * Decoder
    * same as the encoder, except all layers are forward-only
    * the encoder state is not passed as the initial decoder state
  * Attention
    * energy is computed with an NN that has one hidden layer, as opposed to a dot product or the usual practice of no hidden layer with a $\tanh$ activation at the output layer
    * computed from the output of the 1st decoder layer
    * the attention context is fed to all decoder layers
* Training has two steps: ML and RL
  * ML (cross-entropy) training:
    * common wisdom: initialize all trainable parameters uniformly in $[-0.04, 0.04]$
    * gradient clipping = 5, batch = 128
    * Adam (lr = 2e-4) for 60K steps, followed by SGD (lr = .5, which is probably a typo!) for 1.2M steps, then the lr is halved every 200K steps, 4 times
    * 12 async machines, each with 8 GPUs (K80) across which the model is spread, for 6 days
    * [dropout](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZarembaSV14) 0.2-0.3 (higher for smaller datasets)
  * RL: [Reinforcement Learning](http://www.shortscience.org/paper?bibtexKey=journals/corr/RanzatoCAZ15)
    * sequence score $\text{GLEU} = r = \min(\text{precision}, \text{recall})$, computed on n-grams of size 1-4
    * mixed loss $\alpha \text{ML} + \text{RL}$, $\alpha = 0.25$
    * mean $r$ computed from $m = 15$ samples
    * SGD, 400K steps, 3 days, no dropout
* Prediction (i.e. decoding)
  * beam search (3 beams)
  * a normalized score is computed for every beam that ended (died)
  * they did not normalize the beam score by $\text{beam\_length}^\alpha$, $\alpha \in [0.6, 0.7]$
  * instead they normalize with a similar formula in which 5 is added to the length, and a coverage factor is added: the sum of the log of the attention weight of every input word (i.e. after summing over all output words)
  * a second pruning is done using the normalized scores
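The attention variant described above (an MLP with one hidden layer producing the energy, instead of a dot product or a single $\tanh$ layer) can be sketched roughly as follows. This is only an illustrative sketch: `mlp_attention`, the weight shapes, and the concatenation of decoder and encoder states are my assumptions, not the paper's exact parameterization.

```python
import numpy as np

def mlp_attention(dec_state, enc_states, W_h, w_e):
    """Attention weights from a one-hidden-layer MLP energy function.

    dec_state:  (d,)   output of the 1st decoder layer (per the summary)
    enc_states: (T, d) encoder outputs, one per input position
    W_h: (2d, h) hidden-layer weights; w_e: (h,) output weights
    (shapes are illustrative assumptions)
    """
    T = enc_states.shape[0]
    # Pair the decoder state with every encoder state.
    x = np.concatenate([np.tile(dec_state, (T, 1)), enc_states], axis=1)  # (T, 2d)
    hidden = np.tanh(x @ W_h)   # the extra hidden layer
    energies = hidden @ w_e     # one scalar energy per input position
    # Softmax over input positions -> attention distribution.
    weights = np.exp(energies - energies.max())
    return weights / weights.sum()
```

The resulting distribution is what gets fed as context to all decoder layers.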
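The GLEU reward used in the RL step, $r = \min(\text{precision}, \text{recall})$ over n-grams of size 1-4, can be sketched like this (a minimal version over token lists; the function name and tokenization are my own):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def gleu(hypothesis, reference, max_n=4):
    """GLEU reward: min(precision, recall) of matching n-grams, n = 1..4."""
    hyp_counts, ref_counts = Counter(), Counter()
    for n in range(1, max_n + 1):
        hyp_counts += ngrams(hypothesis, n)
        ref_counts += ngrams(reference, n)
    overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return min(precision, recall)
```

In training, the mean of this reward over the $m = 15$ samples per source sentence is used as the baseline.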
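The beam-score normalization described above (length penalty with 5 added to the length, plus a coverage term summing the log of each input word's total attention weight) might look like the sketch below. The exact functional form, and the $\alpha$ and $\beta$ values, are my assumptions for illustration:

```python
import math

def normalized_beam_score(log_prob, length, attention_per_input,
                          alpha=0.65, beta=0.2):
    """Normalized score for a finished beam.

    log_prob: total log-probability of the beam
    length:   number of output tokens
    attention_per_input: total attention mass each input word received,
        already summed over all output steps (per the summary)
    alpha, beta: illustrative values, not from the source
    """
    # Length penalty: 5 is added to the length before raising to alpha.
    lp = ((5 + length) / (5 + 1)) ** alpha
    # Coverage: sum of log attention per input word, capped at 1
    # so fully covered words contribute zero penalty.
    cp = beta * sum(math.log(min(a, 1.0)) for a in attention_per_input)
    return log_prob / lp + cp
```

Finished beams are then pruned a second time using these normalized scores rather than the raw log-probabilities.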