Training Tips for the Transformer Model
Martin Popel and Ondřej Bojar, 2018
Paper summary by sudharsansai
**TL;DR:** This paper summarizes practical tips for training a Transformer model for machine translation (MT), though I believe some of the tips are task-agnostic. The parameters considered include the number of GPUs, batch size, learning rate schedule, warm-up steps, checkpoint averaging and maximum sequence length.
**Framework used for the experiments:** [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor)
The effects of varying the most important hyper-parameters on performance are as follows:
**Early Stopping:** Papers usually don't report the stopping criterion except in vague terms (such as the number of days of training). The authors suggest that with a large dataset, even a very large model almost never fully converges and keeps improving by small amounts. So keep training your model for as long as your GPU budget allows.
**Data Preprocessing:** Most neural architectures these days use subword units instead of words. It's better to create the subword vocabulary from a sufficiently large dataset. Also, it's advised to filter the dataset by *max_sequence_length* once and store the result (e.g. as TFRecords) before training, rather than repeating the filtering every epoch, to save precious CPU time.
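The filter-once-and-store idea can be sketched in plain Python. This is only an illustration: it writes JSON lines instead of TFRecords, and the names `filter_by_length`, `save_filtered` and `load_filtered` are hypothetical helpers, not part of Tensor2Tensor.

```python
import json
from typing import Iterable, Iterator, List, Tuple

# A sentence pair as (source subword tokens, target subword tokens).
Pair = Tuple[List[str], List[str]]


def filter_by_length(pairs: Iterable[Pair], max_sequence_length: int) -> Iterator[Pair]:
    """Yield only pairs whose source and target both fit within the limit."""
    for src, tgt in pairs:
        if len(src) <= max_sequence_length and len(tgt) <= max_sequence_length:
            yield src, tgt


def save_filtered(pairs: Iterable[Pair], path: str, max_sequence_length: int) -> int:
    """Filter once and cache the result on disk; returns the number of pairs kept."""
    kept = 0
    with open(path, "w", encoding="utf-8") as f:
        for src, tgt in filter_by_length(pairs, max_sequence_length):
            f.write(json.dumps({"src": src, "tgt": tgt}) + "\n")
            kept += 1
    return kept


def load_filtered(path: str) -> Iterator[Pair]:
    """Read the cached pairs back; no length check is needed at training time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["src"], record["tgt"]


# Toy corpus (made up for illustration).
corpus = [
    (["a", "b"], ["x", "y", "z"]),
    (["a"] * 5, ["x"]),
    (["a"], ["x"]),
]
filtered = list(filter_by_length(corpus, max_sequence_length=3))
```

Every epoch can then iterate over `load_filtered(path)` directly, so the length check is paid for only once.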
**Batch Size:** Computational throughput (number of tokens processed per unit time) increases sub-linearly with the batch size, so beyond a certain point a larger batch yields diminishing throughput gains. From the point of view of model quality, however, increasing the batch size usually leads to faster and better convergence. So try to use the largest batch size that fits, whether you train on a single GPU or several. Keep in mind, however, that because batches are composed randomly, you may suddenly run out of memory even after days of training. So leave some memory headroom when increasing the batch size.
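One way to picture the headroom advice is a batcher that caps the padded size of every batch at a token budget and then shaves a safety margin off that budget. The sketch below assumes the batch size is measured in tokens; the function `token_budget_batches`, the `headroom` parameter and the example lengths are all hypothetical and do not describe how Tensor2Tensor builds its batches.

```python
from typing import List


def token_budget_batches(lengths: List[int], max_tokens: int,
                         headroom: float = 0.1) -> List[List[int]]:
    """Group example indices so each padded batch stays under a reduced budget.

    The padded size of a batch is (longest example) * (number of examples).
    The headroom fraction is held back as a safety margin against OOM errors.
    """
    budget = int(max_tokens * (1.0 - headroom))
    # Sorting by length keeps padding small within each batch.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches: List[List[int]] = []
    current: List[int] = []
    longest = 0
    for i in order:
        new_longest = max(longest, lengths[i])
        if current and new_longest * (len(current) + 1) > budget:
            # Adding this example would exceed the budget: start a new batch.
            batches.append(current)
            current, new_longest = [], lengths[i]
        current.append(i)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


# Sentence lengths in tokens (made up for illustration).
lengths = [5, 12, 7, 30, 8, 3]
batches = token_budget_batches(lengths, max_tokens=40, headroom=0.1)
```

Note that a single example longer than the budget still ends up in a batch of its own here, which is one more reason to filter by maximum sequence length beforehand.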
**Dataset size:** Experiments from the paper reinforce the fact that with BIG models, more data is better. When comparing datasets of different sizes, it's advised to train the models for long enough, because the effect of dataset size usually shows up only after a long period of training.
**Model size:** After a few days of training, a bigger model with a smaller batch size performs better than a smaller model with a larger batch size. For debugging, by the way, use the smallest model!
**Maximum sequence length:** Decreasing *max_sequence_length* excludes more examples from the dataset but allows bigger batch sizes. So it's a trade-off. Often, when training for long enough, the benefit of keeping more examples outweighs the gains from a bigger batch size. But even that benefit plateaus beyond a sufficient sequence length, since very long sentences are often outliers and won't contribute much to performance.
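To get a feel for this trade-off before training, one can check how much of the data each candidate limit would keep. A minimal sketch, with made-up sentence lengths and a hypothetical helper `kept_fraction`:

```python
from typing import Sequence


def kept_fraction(lengths: Sequence[int], max_sequence_length: int) -> float:
    """Fraction of examples whose length does not exceed the limit."""
    kept = sum(1 for n in lengths if n <= max_sequence_length)
    return kept / len(lengths)


# Sentence lengths in subword tokens (made up for illustration).
lengths = [4, 9, 15, 22, 38, 41, 70, 120]
report = {limit: kept_fraction(lengths, limit) for limit in (25, 50, 100, 150)}
for limit, fraction in report.items():
    print(f"max_sequence_length={limit}: keeps {fraction:.1%} of examples")
```

Once the kept fraction stops growing noticeably, raising the limit further mostly costs batch size without adding much data.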
**Learning rate and Warm-up steps:** The usual advice applies here: use a learning rate that is neither too high nor too low. Using a large number of warm-up steps often offsets the damage caused by a large learning rate. So does gradient clipping.
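For intuition, here is a sketch of the inverse-square-root schedule with linear warm-up from the original Transformer paper (Vaswani et al., 2017), together with a simple clip-by-global-norm. Whether the summarized experiments used exactly this formula is an assumption on my part; the default values and the function names are illustrative only.

```python
import math
from typing import List


def transformer_learning_rate(step: int, d_model: int = 512,
                              warmup_steps: int = 4000, scale: float = 1.0) -> float:
    """Linear warm-up for warmup_steps, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    factor = max_norm / norm
    return [g * factor for g in grads]


# The peak is reached at step == warmup_steps, so a longer warm-up lowers it.
peak_short_warmup = transformer_learning_rate(4000, warmup_steps=4000)
peak_long_warmup = transformer_learning_rate(16000, warmup_steps=16000)
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Under this schedule the peak learning rate is proportional to `warmup_steps ** -0.5`, which is one way to see why more warm-up steps can compensate for a learning rate that would otherwise be too large.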
**Number of GPUs:** For the fastest convergence, use as many GPUs as are available. This makes no noticeable difference to the final performance. There is a huge debate about scaling the learning rate when going from a single GPU to multiple GPUs, though the authors report no significant variation when using the same learning rate, independent of the batch size (which increases with more GPUs).
**Checkpoint averaging:** Averaging the last n (= 10) model checkpoints, saved at intervals of 1 hour or 30 minutes, almost always leads to better performance. This is similar to averaged SGD as used in AWD-LSTM (*Merity et al.*).
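Conceptually, checkpoint averaging just takes the element-wise mean of each parameter across the saved checkpoints. The sketch below represents a checkpoint as a plain dictionary from parameter name to a flat list of floats; that representation and the function names are assumptions for illustration, not the Tensor2Tensor implementation.

```python
from typing import Dict, List, Sequence

# A checkpoint as {parameter name: flat list of values} (illustrative format).
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[Checkpoint]) -> Checkpoint:
    """Element-wise mean of every parameter across the given checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # zip(*...) groups the i-th value of this parameter from each checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / len(checkpoints) for values in columns]
    return averaged


def average_last_n(checkpoints: Sequence[Checkpoint], n: int = 10) -> Checkpoint:
    """Average only the n most recent checkpoints (or all, if fewer exist)."""
    return average_checkpoints(checkpoints[-n:])


# Three toy checkpoints, oldest first.
history = [
    {"w": [1.0, 2.0]},
    {"w": [3.0, 4.0]},
    {"w": [5.0, 6.0]},
]
averaged = average_last_n(history, n=2)
```

Here `averaged["w"]` is the mean of the last two checkpoints only; the oldest one is ignored.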