This paper describes a learning algorithm for deep neural networks that can be understood as an extension of stacked denoising autoencoders. In short, instead of reconstructing one layer at a time and greedily stacking, a unique unsupervised objective involving the reconstruction of all layers is optimized jointly by all parameters (with the relative importance of each layer cost controlled by hyper-parameters). In more details: * The encoding (forward propagation) adds noise (Gaussian) at all layers, while decoding is noise-free. * The target at each layer is the result of noise-less forward propagation. * Direct connections (also known as skip-connections) between a layer and its decoded reconstruction are used. The resulting encoder/decoder architecture thus ressembles a ladder (hence the name Ladder Networks). * Miniature neural networks with a single hidden unit and skip-connections are used to decode the left and top layers into a reconstruction. Each network is applied element-wise (without parameter sharing across reconstructed units). * The unsupervised objective is combined with a supervised objective, corresponding to the regular negative class log-likelihood objective (using an output softmax layer). Two losses are used for each input/target pair: one based on the noise-free forward propagation (which also provides the target of the denoising objective) and one with the noise added (which also corresponds to the encoding stage of the unsupervised autoencoder objective). Batch normalization is used to train the network. Since the model combines unsupervised and supervised learning, it can be used for semi-supervised learning, where unlabeled examples can be used to update the network using the unsupervised objective only. State of the art results in the semi-supervised setting are presented, for both the MNIST and CIFAR-10 datasets. #### My two cents What I find most exciting about this paper is its performance. On MNIST, with only 100 labeled examples, it achieves 1.13% error! That is essentially the performance of stacked denoising autoencoders, trained on the entire training set (though that was before ReLUs and batch normalization, which this paper uses)! This confirms a current line of thought in Deep Learning (DL) that, while recent progress in DL applied on large labeled datasets does not rely on any unsupervised learning (unlike at the "beginning" of DL in the mid 2000s), unsupervised learning might instead be crucial for success in low-labeled data regime, in the semi-supervised setting. Unfortunately, there is one little issue in the experiments, disclosed by the authors: while they used few labeled examples for training, model selection did use all 10k labels in the validation set. This is of course unrealistic. But model selection in the low data regime is arguably, in itself, an open problem. So I like to think that this doesn't invalidate the progress made in this paper, and only suggests that some research needs to be done on doing effective hyper-parameter search with a small validation set. Generally, I really hope this paper will stimulate more research on DL methods to the specific case of small labeled dataset / large unlabeled dataset. While this isn't a problem that is as "flashy" as tasks such as the ImageNet Challenge which comes with lots of labeled data, I think this is a crucial research direction for AI in general. Indeed, it seems naive to me to expect that we will be able to collect large labeled dataset for each and every task, on our way to real AI.