Understanding the difficulty of training deep feedforward neural networks. Glorot, Xavier and Bengio, Yoshua. 2010
#### Problem addressed:
Rather than asking why pre-training works, this paper asks why the traditional way of training deep neural networks does not.
#### Summary:
The paper is an empirical study of why deep networks fail to train with plain backpropagation when no pre-training is used. To analyse this, the authors track activations and gradient magnitudes across layers over the course of training. Their study shows that with Sigmoid activations the higher-layer units saturate at 0, which blocks gradients from being backpropagated to the lower layers. It takes many iterations to escape this saturation, and only then do the lower layers start to learn.
For this reason the authors suggest activations that are symmetric around 0, such as Tanh and Softsign, to avoid this saturation. For Tanh, they find that units in every layer, initialized on either side of 0, saturate toward their respective sides one layer after another, starting from the lowest layer and moving up. For Softsign, by contrast, units in all layers move towards saturation together. The histograms of final activations further suggest that Tanh units peak at 0 and at the -1/+1 saturation values, while Softsign units mostly lie in the linear region. Note that in the linear region of Tanh/Softsign the activation gradient is non-zero, so information still propagates.
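The difference in how the two activations approach saturation can be illustrated numerically. The sketch below assumes the usual definition softsign(x) = x / (1 + |x|), which this summary does not spell out, and compares its derivative with that of tanh. The function names are illustrative only.

```python
import numpy as np

def softsign(x):
    # Assumed definition: x / (1 + |x|)
    return x / (1.0 + np.abs(x))

def softsign_grad(x):
    # d/dx softsign(x) = 1 / (1 + |x|)^2
    return 1.0 / (1.0 + np.abs(x)) ** 2

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

xs = np.array([0.0, 1.0, 3.0, 6.0])
for x, gs, gt in zip(xs, softsign_grad(xs), tanh_grad(xs)):
    print(f"x={x:4.1f}  softsign'={gs:.4f}  tanh'={gt:.6f}")
```

Both derivatives equal 1 at x = 0. Near zero the tanh derivative is the larger of the two, but it decays exponentially, whereas the softsign derivative decays only polynomially, so for large |x| softsign keeps the larger gradient.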
The most interesting part of this study is how the authors analyse the flow of information from the input layer to the top layer and back. Forward propagation carries information about the input to the higher layers, while backward propagation carries the error gradient. They measure this flow as the variance of the activations (forward) and of the gradients (backward) at each layer. Since we want the information flow to be equal at all layers, the variance should be the same at every layer too. They therefore propose to initialize the weights so that this variance is preserved across layers, and call this "Normalized Initialization". Their empirical results show that both activations and gradients (hence information) propagate better through all layers with this initialization.
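The forward half of this variance argument can be sketched in a few lines of NumPy. The example below assumes a deep linear network with equal layer widths and compares the normalized initialization with a common baseline, U[-1/sqrt(n), 1/sqrt(n)], which this summary does not mention. The layer width, depth, batch size and function names are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def layer_variances(init, n=256, depth=5, batch=512, seed=0):
    """Variance of the activations after each layer of a deep linear net."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, n))  # unit-variance inputs
    variances = []
    for _ in range(depth):
        W = init(n, n, rng)
        h = h @ W  # linear layer: no bias, no nonlinearity
        variances.append(float(h.var()))
    return variances

def heuristic_init(n_in, n_out, rng):
    # Assumed baseline: U[-1/sqrt(n_in), 1/sqrt(n_in)]
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

def normalized_init(n_in, n_out, rng):
    # Normalized initialization: U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]
    bound = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

heur = layer_variances(heuristic_init)
norm = layer_variances(normalized_init)
for i, (a, b) in enumerate(zip(heur, norm), start=1):
    print(f"layer {i}: heuristic var={a:.4f}  normalized var={b:.4f}")
```

In this linear setting each layer multiplies the activation variance by roughly n · Var(W). That factor is 1/3 for the baseline, so the signal shrinks geometrically with depth, and 1 for the normalized initialization, so the variance stays roughly constant.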
#### Novelty:
Analysis of activation values and backpropagated gradients across layers as a tool for diagnosing training difficulties. Also, a new weight initialization method.
#### Drawbacks:
The variance study for activation/gradient is done for linear networks but applied to Tanh and Softsign. How is this justified?
#### Datasets:
Shapeset 3x2, MNIST, CIFAR-10
#### Presenter:
Devansh Arpit
The main contribution of [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot and Bengio is a **normalized weight initialization**:
$$W \sim U \left [ - \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right ]$$
where $n_j \in \mathbb{N}^+$ is the number of neurons in layer $j$.
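As a quick check on what this interval implies, recall that a zero-mean uniform distribution on $[-a, a]$ has variance $a^2/3$. Substituting the bound above gives:

$$\text{Var}(W) = \frac{1}{3} \left ( \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right )^2 = \frac{2}{n_j + n_{j+1}}$$

For example, with $n_j = n_{j+1} = 256$ the bound is $\sqrt{6/512} \approx 0.108$ and the variance is $2/512 \approx 0.0039$.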
The paper is also worth reading for its examples of **how to debug neural networks**.
The paper analyzes standard multilayer perceptrons (MLPs) on an artificial dataset of $32 \text{px} \times 32 \text{px}$ images, each containing one or two of three shapes: triangle, parallelogram and ellipse. The MLPs differ in the activation function used (sigmoid, tanh or softsign).
However, no regularization was used and training ran for many mini-batch epochs. Batch normalization or dropout might substantially change the influence of initialization.
Questions that remain open for me:
* [How is weight initialization done today?](https://www.reddit.com/r/MLQuestions/comments/4jsge9)
* Figure 4: Why is this plot not simply completely dependent on the data?
* Is softsign still used? Why not?
* If the only advantage of softsign is that it reaches its plateau later, why doesn't anybody use $\frac{1}{1+e^{-0.1 \cdot x}}$ or something similar instead of the standard sigmoid activation function?
The weights $W$ of each layer are initialized based on the number of connections the layer has. Each $w \in W$ is drawn from a Gaussian distribution with mean $\mu = 0$ and the following variance:
$$\text{Var}(W) = \frac{2}{n_\text{in}+ n_\text{out}}$$
where $n_\text{in}$ is the number of neurons in the previous layer (the layer feeding into $W$ in the feedforward direction) and $n_\text{out}$ is the number of neurons in the next layer (the layer feeding into $W$ in the backprop direction).
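A minimal NumPy sketch of this Gaussian variant follows, assuming the weights are stored as a matrix of shape `(n_in, n_out)`. The helper name `xavier_normal` and the layer sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    # Hypothetical helper: zero-mean Gaussian with Var(W) = 2 / (n_in + n_out)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(loc=0.0, scale=std, size=(n_in, n_out))

rng = np.random.default_rng(0)
W = xavier_normal(300, 100, rng)  # illustrative layer sizes
target = 2.0 / (300 + 100)
print(f"empirical var={W.var():.5f}  target var={target:.5f}")
```

The empirical variance of the sampled matrix should land close to the target $2 / (n_\text{in} + n_\text{out})$, with the gap shrinking as the matrix grows.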
Reference: [Andy Jones's Blog](http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization)