Self-Normalizing Neural Networks
Günter Klambauer and Thomas Unterthiner and Andreas Mayr and Sepp Hochreiter, 2017
* They suggest a variation of ELUs, which leads to networks being automatically normalized.
* The effects are comparable to Batch Normalization, while requiring significantly less computation (barely more than a normal ReLU).
* They define Self-Normalizing Neural Networks (SNNs) as neural networks that automatically keep their activations at zero mean and unit variance (per neuron).
* They use SELUs to turn their networks into SNNs.
* ![SELU](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Self-Normalizing_Neural_Networks__SELU.jpg?raw=true "SELU")
* with `alpha = 1.6733` and `lambda = 1.0507`.
* They prove that with properly normalized weights the activations approach a fixed point of zero mean and unit variance. (Different settings for alpha and lambda can lead to other fixed points.)
* They prove that this is still the case when previous layer activations and weights do not have optimal values.
* They prove that this is still the case when the variance of previous layer activations is very high or very low, and argue that the mean of those activations is not so important.
* Hence, SELUs with these hyperparameters should have self-normalizing properties.
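
The SELU definition and the fixed-point claim can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the function name `selu`, the sample size and the seed are illustrative assumptions, and the constants are the rounded values quoted above. It feeds zero-mean, unit-variance Gaussian inputs through a SELU and looks at the output statistics.

```python
import numpy as np

ALPHA = 1.6733   # rounded constants as quoted in the summary
LAMBDA = 1.0507

def selu(x):
    """Scaled ELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=np.float64)
    # Clamp the argument of expm1 so the unused branch cannot overflow.
    negative_branch = ALPHA * np.expm1(np.minimum(x, 0.0))
    return LAMBDA * np.where(x > 0, x, negative_branch)

# Feed zero-mean, unit-variance inputs and inspect the output statistics.
rng = np.random.default_rng(0)  # seed chosen arbitrarily
z = rng.standard_normal(1_000_000)
y = selu(z)
out_mean, out_var = y.mean(), y.var()

# Expectation under these assumptions: mean near 0, variance near 1.
print(f"mean={out_mean:.3f}, var={out_var:.3f}")
```

With other values for `ALPHA` and `LAMBDA` the output statistics would generally differ from the input statistics, which is why the two constants are fixed rather than tuned.
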
* SELUs are here used as a basis because:
1. They can have negative and positive values, which makes it possible to control the mean.
2. They have saturating regions, which makes it possible to dampen high variances from previous layers.
3. They have a slope larger than one, which makes it possible to increase low variances from previous layers.
4. They generate a continuous curve, which ensures that there is a fixed point between variance damping and variance increasing.
* ReLUs, Leaky ReLUs, Sigmoids and Tanhs do not offer the above properties.
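
The variance damping and variance increasing described in the list above can be illustrated with a small experiment. This is a hedged sketch under the assumption that the inputs to the SELU are zero-mean Gaussians; the chosen input variances (0.25, 1 and 4), the seed and all names are illustrative and not taken from the paper.

```python
import numpy as np

ALPHA, LAMBDA = 1.6733, 1.0507  # rounded constants from the summary

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(np.minimum(x, 0.0)))

rng = np.random.default_rng(0)  # seed chosen arbitrarily
z = rng.standard_normal(1_000_000)

# Ratio of output variance to input variance for zero-mean Gaussian inputs.
ratios = {}
for in_var in (0.25, 1.0, 4.0):
    out_var = selu(np.sqrt(in_var) * z).var()
    ratios[in_var] = out_var / in_var
    print(f"input var {in_var:4.2f} -> output var {out_var:.3f} "
          f"(ratio {ratios[in_var]:.2f})")

# Saturation: very negative inputs approach -lambda * alpha.
saturation = selu(np.array([-20.0]))[0]
print(f"selu(-20) = {saturation:.4f}")
```

A ratio above one for the low-variance input and below one for the high-variance input is the behavior that properties 2 and 3 describe; a ratio near one at unit variance corresponds to the fixed point.
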
* SELUs for SNNs work best with normalized weights.
* They suggest ensuring for each layer that:
1. The first moment (sum of weights) is zero.
2. The second moment (sum of squared weights) is one.
* This can be done by drawing weights from a normal distribution `N(0, 1/n)` (mean 0, variance `1/n`), where `n` is the number of inputs to each neuron, i.e. the size of the previous layer.
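
To make the interaction between this initialization and SELUs concrete, the sketch below pushes normalized inputs through a stack of bias-free dense layers whose weights are drawn from `N(0, 1/n)`. It is an illustration under assumptions, not an experiment from the paper: the layer width, depth, batch size, seed and the helper `forward` are hypothetical, and the ReLU run with the same initialization is added only as a contrast.

```python
import numpy as np

ALPHA, LAMBDA = 1.6733, 1.0507  # rounded constants from the summary

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(np.minimum(x, 0.0)))

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights, activation):
    """Propagate a batch through a stack of bias-free dense layers."""
    for w in weights:
        x = activation(x @ w)
    return x

rng = np.random.default_rng(0)       # seed chosen arbitrarily
width, depth, batch = 512, 16, 2048  # hypothetical sizes for illustration

# Each weight is drawn from N(0, 1/n). All layers share the same width,
# so n (the number of inputs per neuron) is simply `width`.
weights = [rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))
           for _ in range(depth)]

# Sum of squared incoming weights per neuron, averaged over the first layer.
second_moment = float(np.mean(np.sum(weights[0] ** 2, axis=0)))

x = rng.standard_normal((batch, width))  # normalized inputs
selu_out = forward(x, weights, selu)
relu_out = forward(x, weights, relu)

print(f"sum of squared weights per neuron: {second_moment:.3f}")
print(f"SELU after {depth} layers: mean={selu_out.mean():.3f}, var={selu_out.var():.3f}")
print(f"ReLU after {depth} layers: mean={relu_out.mean():.5f}, var={relu_out.var():.7f}")
```

Under these assumptions the SELU activations should stay close to zero mean and unit variance at depth 16, while the ReLU activations shrink towards zero because nothing in that network compensates for the variance lost in each layer.
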
* SELUs don't perform as well with normal Dropout, because their point of low variance is not 0.
* They suggest a modification of Dropout called Alpha-dropout.
* In this technique, values are not dropped to 0 but to `alpha' = -lambda * alpha = -1.0507 * 1.6733 = -1.7581`.
* Similar to dropout, activations are changed during training to compensate for the dropped units.
* Each activation `x` is changed to `a(xd+alpha'(1-d))+b`.
* `d ~ B(1, q)` is the dropout variable consisting of 1s (keep) and 0s (drop), so `q` is the keep probability.
* `a = (q + alpha'^2 q(1-q))^(-1/2)`
* `b = -(q + alpha'^2 q(1-q))^(-1/2) ((1-q)alpha')`
* They report good results with dropout rates (`1 - q`) of around 0.05 to 0.1.
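
The Alpha-dropout formulas above translate almost directly into code. The following NumPy sketch is an illustration and not the authors' implementation: the function name `alpha_dropout`, its signature and the seed are assumptions, and `q` is the keep probability as in the formulas. It applies the technique to SELU activations that are roughly at the fixed point and checks that mean and variance are preserved.

```python
import numpy as np

ALPHA, LAMBDA = 1.6733, 1.0507  # rounded constants from the summary
ALPHA_PRIME = -LAMBDA * ALPHA    # value that dropped units are set to

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(np.minimum(x, 0.0)))

def alpha_dropout(x, q, rng):
    """Alpha-dropout with keep probability q (dropout rate 1 - q)."""
    d = rng.binomial(1, q, size=x.shape)  # d ~ B(1, q): 1 = keep, 0 = drop
    a = (q + ALPHA_PRIME ** 2 * q * (1.0 - q)) ** -0.5
    b = -a * (1.0 - q) * ALPHA_PRIME
    # Dropped units take the value alpha'; a and b restore mean 0 and variance 1.
    return a * (x * d + ALPHA_PRIME * (1 - d)) + b

rng = np.random.default_rng(0)  # seed chosen arbitrarily
activations = selu(rng.standard_normal(1_000_000))  # roughly zero mean, unit variance
dropped = alpha_dropout(activations, q=0.9, rng=rng)
drop_mean, drop_var = dropped.mean(), dropped.var()
print(f"after alpha-dropout: mean={drop_mean:.3f}, var={drop_var:.3f}")
```

The affine correction with `a` and `b` only restores zero mean and unit variance if the incoming activations already have those statistics, which is the situation the SELU fixed point is meant to provide.
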
* Note: All of their tests are with fully connected networks. No convolutions.
* Example training results:
* ![MNIST CIFAR10](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Self-Normalizing_Neural_Networks__MNIST_CIFAR10.jpg?raw=true "MNIST CIFAR10")
* Left: MNIST, Right: CIFAR10
* Networks have N layers each, see legend. No convolutions.
* 121 UCI Tasks
* They manage to beat SVMs and RandomForests, while other networks (Layer Normalization, BN, Weight Normalization, Highway Networks, ResNet) perform significantly worse than their network (and usually don't beat SVMs/RFs).
* Tox21 (drug discovery)
* They achieve better results than other networks (again, Layer Normalization, BN, etc.).
* They achieve almost the same result as the best model so far on the dataset, which consists of a mixture of neural networks, SVMs and Random Forests.
* HTRU2 (astronomy)
* They achieve better results than other networks.
* They beat the best non-neural method (Naive Bayes).
* Among all other tested networks, MSRAinit performs best. The name refers to a network without any normalization, using only ReLUs and Microsoft weight initialization (see paper: `Delving deep into rectifiers: Surpassing human-level performance on imagenet classification`).
First published: 2017/06/08
Abstract: Deep Learning has revolutionized vision via convolutional neural networks
(CNNs) and natural language processing via recurrent neural networks (RNNs).
However, success stories of Deep Learning with standard feed-forward neural
networks (FNNs) are rare. FNNs that perform well are typically shallow and,
therefore cannot exploit many levels of abstract representations. We introduce
self-normalizing neural networks (SNNs) to enable high-level abstract
representations. While batch normalization requires explicit normalization,
neuron activations of SNNs automatically converge towards zero mean and unit
variance. The activation function of SNNs are "scaled exponential linear units"
(SELUs), which induce self-normalizing properties. Using the Banach fixed-point
theorem, we prove that activations close to zero mean and unit variance that
are propagated through many network layers will converge towards zero mean and
unit variance -- even under the presence of noise and perturbations. This
convergence property of SNNs allows to (1) train deep networks with many
layers, (2) employ strong regularization, and (3) to make learning highly
robust. Furthermore, for activations not close to unit variance, we prove an
upper and lower bound on the variance, thus, vanishing and exploding gradients
are impossible. We compared SNNs on (a) 121 tasks from the UCI machine learning
repository, on (b) drug discovery benchmarks, and on (c) astronomy tasks with
standard FNNs and other machine learning methods such as random forests and
support vector machines. SNNs significantly outperformed all competing FNN
methods at 121 UCI tasks, outperformed all competing methods at the Tox21
dataset, and set a new record at an astronomy data set. The winning SNN
architectures are often very deep. Implementations are available at: