Net2Net: Accelerating Learning via Knowledge Transfer on ShortScience.org

Net2Net: Accelerating Learning via Knowledge Transfer
Chen, Tianqi and Goodfellow, Ian J. and Shlens, Jonathon
arXiv e-Print archive - 2015 via Local Bibsonomy

Summary by Hugo Larochelle 8 years ago

This paper presents an approach for initializing a neural network from the parameters of a smaller, previously trained neural network. This is done by increasing the size (in width and/or depth) of the previously trained network in such a way that the function it represents doesn't change (i.e. the output of the larger network is still the same). The motivation is that initializing larger networks in this way accelerates their training, since at initialization the network will already be quite good.

In a nutshell, each hidden layer is made wider by adding copies of randomly selected existing hidden units. To keep the network's output unchanged, every weight leaving a replicated unit must be divided by the number of copies of that unit (the original included), so that each unit in the next layer receives the same total input as before. When not training with dropout, it is also recommended to add some noise to this initialization in order to break the symmetry between copies (though the network's output is then only approximately preserved). To make a network deeper, new layers are added and initialized to the identity function. For ReLU units, this is achieved by using an identity matrix as the connection weight matrix. For units based on sigmoid or tanh activations, unfortunately, it isn't possible to add such identity layers.
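
The widening step can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the authors' implementation: it uses a single hidden ReLU layer with no biases, and the names `net2wider`, `forward`, `mapping` and `counts` are hypothetical. It checks that the widened network gives the same output as the small one.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(W1, W2, x):
    # One hidden ReLU layer, linear output (biases omitted: an assumption)
    return matvec(W2, relu(matvec(W1, x)))

def net2wider(W1, W2, new_width, rng):
    """Widen the hidden layer from len(W1) units to new_width units."""
    n = len(W1)
    # mapping[j] = index of the original unit that unit j of the wide layer copies
    mapping = list(range(n)) + [rng.randrange(n) for _ in range(new_width - n)]
    counts = [mapping.count(i) for i in range(n)]
    # Incoming weights (rows of W1) are copied unchanged
    W1_wide = [list(W1[i]) for i in mapping]
    # Outgoing weights (columns of W2) are copied and divided by the replica count
    W2_wide = [[row[i] / counts[i] for i in mapping] for row in W2]
    return W1_wide, W2_wide

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 3 inputs -> 4 hidden
W2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 4 hidden -> 2 outputs
x = [0.5, -1.2, 2.0]

W1_wide, W2_wide = net2wider(W1, W2, new_width=7, rng=rng)
y_small = forward(W1, W2, x)
y_wide = forward(W1_wide, W2_wide, x)
# Largest difference between the two outputs; only floating-point rounding remains
max_diff = max(abs(a - b) for a, b in zip(y_small, y_wide))
```

Keeping the first `n` entries of `mapping` as the original units guarantees that every unit has at least one copy, so the division by `counts[i]` is always defined.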

In their experiments on ImageNet, the authors show that this initialization allows them to train larger networks faster than from random initialization. More importantly, they were able to surpass their previous ImageNet validation accuracy by initializing a very large network from their best Inception network.

Summary by Abhishek Das 6 years ago

This paper presents a simple method to accelerate the training
of larger neural networks by initializing them with parameters
from a trained, smaller network. Networks are made wider or deeper
while preserving the same output as the smaller network, which
maintains performance when training starts and leads to faster
convergence. Main contributions:

- Net2Deeper
    - Initialize layers with identity weight matrices
    to preserve the same output.
    - Only works when activation function $f$ satisfies
    $f(If(x)) = f(x)$, where $I$ is the identity matrix; this holds
    for ReLU, but not for sigmoid or tanh.

- Net2Wider
    - Additional units in a layer are randomly sampled
    from existing units. Incoming weights are kept the same
    while outgoing weights are divided by the number of
    replicas of that unit so that the output at the next layer
    remains the same.

- Experiments on ImageNet
    - Net2Deeper and Net2Wider models converge faster to the
    same accuracy as networks initialized randomly.
    - A deeper and wider model initialized with Net2Net from
    the Inception model beats the validation accuracy of the
    original Inception model (and converges faster).

## Strengths

- The Net2Net technique avoids the brief period of low performance that exists in
methods that initialize some layers of a deeper network from a trained
network and others randomly.

- This idea is very useful in production systems which essentially have to
be lifelong learning systems. Net2Net presents an easy way to immediately
shift to a model of higher capacity and reuse trained networks.

- Simple idea, clearly presented.


## Weaknesses / Notes

- The random mapping algorithm for different layers was done manually
for this paper. Developing a remapping inference algorithm should be
the next step in making the Net2Net technique more general.

- The final accuracy that Net2Net models achieve seems to depend only
on the model capacity and not the initialization. I think this merits
further investigation. In this paper, it might just be because of randomness
in training (dropout) or noise added to the weights of the new units to
approximately represent the same function (when not using dropout).
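
To make "approximately represent the same function" concrete, here is a small sketch of widening followed by noise. The weights, the noise scale `sigma`, and the choice to perturb only the new unit's incoming weights are all assumptions made for illustration, not details taken from the paper.

```python
import random

def forward(W1, W2, x):
    # One hidden ReLU layer, linear output, no biases (an assumption)
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

rng = random.Random(1)

# Small hand-picked network: 2 inputs -> 2 hidden units -> 1 output
W1 = [[0.8, -0.3], [0.5, 0.9]]
W2 = [[1.2, -0.7]]
x = [1.0, 2.0]

# Widen by duplicating hidden unit 0; its outgoing weight is split across both copies
W1_wide = [W1[0], W1[1], list(W1[0])]
W2_wide = [[W2[0][0] / 2, W2[0][1], W2[0][0] / 2]]

# Perturb only the new copy's incoming weights to break the symmetry
sigma = 1e-3  # assumed noise scale
W1_noisy = [W1_wide[0], W1_wide[1], [w + rng.gauss(0.0, sigma) for w in W1_wide[2]]]

y_small = forward(W1, W2, x)[0]
exact_gap = abs(y_small - forward(W1_wide, W2_wide, x)[0])   # widening alone
noisy_gap = abs(y_small - forward(W1_noisy, W2_wide, x)[0])  # widening plus noise
```

Without noise the gap is at the level of floating-point rounding; with noise it is small but non-zero, so the two copies can start to diverge during training.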
