Training Very Deep Networks
Srivastava, Rupesh Kumar and Greff, Klaus and Schmidhuber, Jürgen, 2015
Paper summary (nipsreviews)
Machine learning researchers frequently find that they get better results by adding more and more layers to their neural networks, but the difficulties of initialization and vanishing/exploding gradients have been severely limiting. Indeed, the difficulty of getting information to flow through deep neural networks arguably kept them out of widespread use for 30 years. This paper addresses the problem head on and demonstrates one method for training 100-layer nets.
The paper describes an effective method for training very deep neural networks by means of 'information highways', that is, direct connections to upper network layers. Although the method is a generalization of prior techniques such as cross-layer connections, the authors show experimentally that it is effective. The contributions are quite novel and well supported by experimental evidence.
TLDR; The authors propose "Highway Networks", which use gates (inspired by LSTMs) to determine how much of a layer's activations to transform and how much to pass through unchanged. Highway Networks can be used with any kind of activation function and layer type, including recurrent and convolutional units, and can be trained using plain SGD. The gating mechanism allows highway networks with tens or hundreds of layers to be trained efficiently. The authors show that highway networks with fewer parameters achieve results competitive with the state of the art on the MNIST and CIFAR tasks. Gate outputs vary significantly with the input examples, demonstrating that the network does not just learn a "fixed structure" but dynamically routes data based on the specific example.
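As a compact reference for the gating described above, a single highway layer can be written as follows. The notation follows the paper: H is the layer's usual nonlinear transform, T is the transform gate, and C is the carry gate. The products are element-wise, and the second form assumes the simplification C = 1 - T.

$$
y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C)
$$

$$
y = H(x, W_H) \cdot T(x, W_T) + x \cdot \bigl(1 - T(x, W_T)\bigr), \qquad T(x) = \sigma\!\left(W_T^{\top} x + b_T\right)
$$

When T(x) is close to 0 the layer copies its input (y ≈ x); when T(x) is close to 1 it behaves like a plain layer (y ≈ H(x)).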
Datasets used: MNIST, CIFAR-10, CIFAR-100
#### Key Takeaways
- Apply LSTM-like gating to network layers, using a transform gate T and a carry gate C.
- The gating forces the layer inputs/outputs to be of the same size. We can use additional plain layers for dimensionality transformations.
- Bias weights of the transform gates should be initialized to negative values (-1, -2, -3, etc.) so that the networks initially pass information through and can learn long-term dependencies.
- A highway network does not learn a fixed structure (the same gate outputs for every input) but routes dynamically based on the current input.
- On complex data sets each layer makes an important contribution, which is shown by lesioning (setting to pass-through) individual layers.
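The takeaways above can be illustrated with a minimal NumPy sketch of a fully connected highway layer. This is not the authors' code: the function names (`init_highway_layer`, `highway_layer`), the `tanh` transform, the weight scale, and the `lesion` flag are all assumptions chosen for illustration, and the sketch assumes C = 1 - T.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_highway_layer(dim, gate_bias=-2.0, seed=0):
    """Create parameters for one layer (hypothetical helper).

    Input and output share the same size `dim`, as the gating requires.
    The transform-gate bias starts negative so the layer initially
    carries most of its input through unchanged.
    """
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(dim)  # assumed scale, not taken from the paper
    return {
        "W_H": rng.normal(0.0, scale, (dim, dim)),
        "b_H": np.zeros(dim),
        "W_T": rng.normal(0.0, scale, (dim, dim)),
        "b_T": np.full(dim, gate_bias),
    }

def highway_layer(x, params, lesion=False):
    """Return (output, transform_gate) for a batch x of shape (n, dim)."""
    if lesion:
        # Lesioning: force the layer to pass-through (T = 0 everywhere).
        return x, np.zeros_like(x)
    h = np.tanh(x @ params["W_H"] + params["b_H"])   # transform H(x)
    t = sigmoid(x @ params["W_T"] + params["b_T"])   # transform gate T(x)
    y = h * t + x * (1.0 - t)                        # carry gate C = 1 - T
    return y, t

# Forward pass through a 50-layer stack of untrained layers.
dim = 8
layers = [init_highway_layer(dim, seed=i) for i in range(50)]
x = np.random.default_rng(123).normal(size=(4, dim))
out = x
for p in layers:
    out, _ = highway_layer(out, p)
```

Calling `highway_layer(x, params, lesion=True)` for one layer at a time and measuring the change in task error is one way to reproduce the spirit of the lesioning analysis; changing the width between layers would need an extra plain layer, as noted above.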
#### Notes / Questions
- Seems like the authors did not use dropout in their experiments. I wonder how the two play together. Is dropout less effective for highway networks because the gates already learn efficient paths?
- If we see that certain gate outputs have low variance across examples, can we "prune" the network into a fixed structure to make it more efficient (for production deployments)?
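One way to start exploring the pruning question is to measure how much each gate unit varies across examples. The sketch below is purely illustrative and is not something the paper does: the helper names, the toy gate activations, and the `var_threshold` value are all assumptions.

```python
import numpy as np

def gate_statistics(gate_outputs):
    """Per-unit mean and variance of gate outputs of shape (n_examples, dim)."""
    return gate_outputs.mean(axis=0), gate_outputs.var(axis=0)

def fixed_gate_candidates(gate_outputs, var_threshold=1e-3):
    """Flag units whose gate barely changes across examples.

    `var_threshold` is an arbitrary illustrative cutoff. Flagged units
    could in principle be replaced by their constant mean gate value.
    """
    mean, var = gate_statistics(gate_outputs)
    return var < var_threshold, mean

# Toy gate activations: unit 0 is always nearly closed, unit 1 depends
# strongly on the input, unit 2 is always nearly open.
rng = np.random.default_rng(0)
n = 1000
gates = np.column_stack([
    np.full(n, 0.02),
    rng.uniform(0.0, 1.0, n),
    np.full(n, 0.97),
])
mask, mean = fixed_gate_candidates(gates)
```

Here `mask` marks units 0 and 2 as candidates for a fixed structure, while unit 1 still needs its input-dependent gate. Whether such pruning preserves accuracy on a real network is an open question that would have to be tested.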