All you need is a good init
Paper summary: Mean(input) = 0 and Var(input) = 1 are good for learning, as are independent input features. So:

1) Pre-initialize the network weights with (approximately) orthonormal matrices.
2) Do a forward pass with a mini-batch.
3) Divide each layer's weights by $\sqrt{\mathrm{Var}(\text{output})}$, repeating 2–3 per layer until the output variance is close to 1.
4) PROFIT!
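The steps above can be sketched in NumPy for a stack of fully-connected ReLU layers. This is a minimal illustration, not the paper's reference implementation: the function names `orthonormal` and `lsuv_init` are my own, and biases, convolutional layers, and the paper's exact tolerance/iteration settings are omitted.

```python
import numpy as np

def orthonormal(shape, rng):
    # Step 1: (approximately) orthonormal pre-init via QR of a Gaussian matrix.
    a = rng.normal(size=shape)
    if shape[0] >= shape[1]:
        q, _ = np.linalg.qr(a)       # columns are orthonormal
        return q
    q, _ = np.linalg.qr(a.T)
    return q.T                       # rows are orthonormal

def lsuv_init(weights, x, tol=0.05, max_iter=10):
    # Steps 2-3: per layer, forward a mini-batch and rescale W
    # until the layer output has unit variance.
    for W in weights:
        for _ in range(max_iter):
            out = np.maximum(x @ W, 0)   # layer output (ReLU assumed)
            v = out.var()
            if abs(v - 1.0) < tol:
                break
            W /= np.sqrt(v)              # divide weights by sqrt(Var(output))
        x = np.maximum(x @ W, 0)         # propagate the mini-batch onward
    return weights
```

For positively homogeneous activations like ReLU, one rescaling already lands exactly on unit variance; the inner loop matters when the activation is not homogeneous (e.g. tanh), which is why the paper iterates.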
