Understanding the difficulty of training deep feedforward neural networks
Glorot, Xavier and Bengio, Yoshua (2010)

Paper summary (joecohen)

The weights at each layer $W$ are initialized based on the number of connections they have. Each $w \in W$ is drawn from a Gaussian distribution with mean $\mu = 0$ and the following variance:
$$\text{Var}(W) = \frac{2}{n_\text{in}+ n_\text{out}}$$
where $n_\text{in}$ is the number of neurons in the previous layer in the feedforward direction and $n_\text{out}$ is the number of neurons in the next layer, i.e. the "previous" layer from the backprop direction.
Reference: [Andy Jones's Blog](http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization)
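A minimal sketch of this Gaussian variant in NumPy (the function name and layer sizes are illustrative, not from the paper):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng=None):
    """Sample a weight matrix from N(0, 2 / (n_in + n_out))."""
    rng = rng or np.random.default_rng(0)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(loc=0.0, scale=std, size=(n_in, n_out))

# Example: weights for a 256 -> 128 fully connected layer
W = xavier_normal(256, 128)
```

With enough samples the empirical standard deviation of `W` should be close to $\sqrt{2/(n_\text{in}+n_\text{out})}$.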

The main contribution of [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot et al. is a **normalized weight initialization**
$$W \sim U \left [ - \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right ]$$
where $n_j \in \mathbb{N}^+$ is the number of neurons in layer $j$.
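The normalized (uniform) initialization can be sketched as follows; note that a uniform distribution on $[-a, a]$ has variance $a^2/3$, so this limit gives exactly the variance $2/(n_j + n_{j+1})$ above (the function name is illustrative):

```python
import numpy as np

def glorot_uniform(n_j, n_j1, rng=None):
    """Sample weights uniformly from [-sqrt(6/(n_j+n_j1)), +sqrt(6/(n_j+n_j1))]."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (n_j + n_j1))
    return rng.uniform(-limit, limit, size=(n_j, n_j1))

W = glorot_uniform(256, 128)
```

Every entry of `W` lies within the stated limits, and its empirical variance matches the Gaussian variant's target variance.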
The paper also shows several ways **to debug neural networks**, which might be another reason to read it.
The paper analyzed standard multilayer perceptrons (MLPs) on an artificial dataset of $32 \text{px} \times 32 \text{px}$ images containing either one or two of three shapes: triangle, parallelogram, and ellipse. The MLPs varied in the activation function used (sigmoid, tanh, or softsign).
However, no regularization was used and training ran for many epochs of mini-batch updates. Batch normalization or dropout might change the influence of initialization considerably.
Questions that remain open for me:
* [How is weight initialization done today?](https://www.reddit.com/r/MLQuestions/comments/4jsge9)
* Figure 4: Why is this plot not simply completely dependent on the data?
* Is softsign still used? Why not?
* If the only advantage of softsign is that it has its plateau later, why doesn't anybody use $\frac{1}{1+e^{-0.1 \cdot x}}$ or something similar instead of the standard sigmoid activation function?
