Deep Residual Learning for Image Recognition. He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian. 2015

Paper summary by joecohen

This summary is as ridiculous as this network is long. A good implementation of the network is here: https://github.com/dmlc/mxnet/blob/master/example/image-classification/symbol_resnet-28-small.py
Here is a visualization of this crazy network:
![](http://josephpcohen.com/w/wp-content/uploads/resnet-28-small.png)

Deeper networks should never have a higher **training** error than shallower ones: in the worst case, the extra layers could "simply" learn identity mappings. In practice this is not easy for conventional networks, which get much worse as layers are added. The idea is therefore to add identity shortcuts that skip some layers, so that the skipped layers only have to learn the **residuals**.
Advantages:
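* Learning the identity becomes learning 0, which is simpler
* Information loss in the forward pass is less of a problem, because the shortcut carries the input forward unchanged
* Vanishing / exploding gradients are mitigated, because gradients can flow back through the shortcuts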
* Identities don't have parameters to be learned
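
To make the first point concrete, here is a minimal sketch of a residual block in plain Python. It is a simplification and not the paper's implementation: it uses small fully connected layers instead of convolutions, omits batch normalization and the ReLU after the addition, and the helper names (`linear`, `residual_block`, `plain_block`) are made up for this example. With all weights set to zero, the residual block reproduces its input, while the same layers without the shortcut collapse to zero.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Multiply a weight matrix (list of rows) by the vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def plain_block(x, w1, w2):
    """Two stacked layers without a shortcut: y = F(x)."""
    return linear(w2, relu(linear(w1, x)))

def residual_block(x, w1, w2):
    """The same two layers with an identity shortcut: y = F(x) + x."""
    f = plain_block(x, w1, w2)
    return [fi + xi for fi, xi in zip(f, x)]

dim = 4
zeros = [[0.0] * dim for _ in range(dim)]  # all-zero weights, so F(x) = 0
x = [1.5, -2.0, 0.25, 3.0]

identity_out = residual_block(x, zeros, zeros)  # equals x
plain_out = plain_block(x, zeros, zeros)        # all zeros
```

For the plain block to pass `x` through unchanged, its weights would have to encode an identity mapping through the nonlinearity, which is the harder target described above.
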
## Evaluation
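On ImageNet, the learning rate starts at 0.1 and is divided by 10 when the error plateaus, with a weight decay of 0.0001 ($10^{-4}$), momentum of 0.9 and mini-batches of size 256. The CIFAR-10 experiments use the same weight decay and momentum with mini-batches of size 128.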
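
A "divide by 10 when the error plateaus" rule could be sketched as below. This is only an illustration: the summary does not say how a plateau is detected, so the `patience` parameter, the function name `plateau_schedule` and the error values are all assumptions made for this example.

```python
def plateau_schedule(errors, base_lr=0.1, factor=0.1, patience=2):
    """Return the learning rate used at each epoch.

    The rate is multiplied by `factor` once the error has failed to improve
    for `patience` consecutive epochs. `patience` is a hypothetical knob,
    not a value taken from the paper.
    """
    lr = base_lr
    best = float("inf")
    stale = 0  # epochs since the last improvement
    rates = []
    for err in errors:
        rates.append(lr)
        if err < best:
            best = err
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0
    return rates

# Made-up validation errors: improvement stalls after the third epoch.
errors = [0.60, 0.45, 0.40, 0.40, 0.41, 0.35, 0.34, 0.34, 0.34]
rates = plateau_schedule(errors)
```

With these made-up errors the first five epochs run at 0.1, after which the rate drops to roughly 0.01.
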
* ImageNet ILSVRC 2015: 3.57% top-5 test error (ensemble)
* CIFAR-10: 6.43% test error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5
## See also
* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)

TLDR; The authors present Residual Nets, which achieve 3.57% top-5 error on the ImageNet test set and won 1st place in the ILSVRC 2015 classification challenge. ResNets work by introducing "shortcut" connections across stacks of layers, allowing the optimizer to learn an easier residual function instead of the original mapping. This allows for efficient training of very deep nets without the introduction of additional parameters or training complexity. The authors present results on ImageNet with nets as deep as 152 layers and on CIFAR-10 (including one ~1000-layer net).
#### Key Points
- Problem: Deeper networks experience a *degradation* problem. They don't overfit but nonetheless perform worse than shallower networks on both training and test data due to being more difficult to optimize.
- Because deep nets can in theory learn an identity mapping for their additional layers, they should perform at least as well as shallower nets. In practice, however, optimizers have problems learning identity (or near-identity) mappings. Learning residual mappings is easier, mitigating this problem.
- Residual Mapping: If the desired mapping is H(x), let the layers learn F(x) = H(x) - x and add x back through a shortcut connection H(x) = F(x) + x. An identity mapping can then be learned easily by driving the learned mapping F(x) to 0.
- No additional parameters or computational complexity are introduced by residual nets with identity shortcuts (apart from the cheap element-wise additions).
- Similar to Highway Networks, but gates are not data-dependent (no extra parameters) and are always open.
- Due to the nature of the residual formula, input and output must be of the same size (just like Highway Networks). When the sizes differ, the shortcut can match them by zero-padding or by projections. Projections introduce additional parameters. The authors found that projections perform slightly better, but are "not worth" the large number of extra parameters.
- On ImageNet, the 18- and 34-layer VGG-like plain nets get 27.94 and 28.54 top-1 error (%) respectively; note the higher error for the deeper net. The corresponding ResNets get 27.88 and 25.03, so here the error drops considerably for the deeper net.
- Use a bottleneck architecture for the deeper nets: 1x1 convolutions reduce and then restore the channel dimensions around a 3x3 convolution.
- A single ResNet outperforms previous state-of-the-art ensembles. A ResNet ensemble does even better.
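
The parameter claims in the points above can be checked with a little arithmetic. The sketch below counts convolution weights only (no biases or batch-norm parameters). The channel widths are chosen for illustration, and `conv_params` is a helper made up for this example.

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k convolution from c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out

# Basic block: two 3x3 convolutions on 64 channels.
basic_64 = 2 * conv_params(3, 64, 64)

# Bottleneck block: 1x1 reduces 256 -> 64, 3x3 works on 64 channels,
# 1x1 restores 64 -> 256.
bottleneck_256 = (conv_params(1, 256, 64)
                  + conv_params(3, 64, 64)
                  + conv_params(1, 64, 256))

# For comparison: two 3x3 convolutions applied directly to 256 channels.
basic_256 = 2 * conv_params(3, 256, 256)

# Shortcut cost when the channel count grows from 64 to 128.
zero_padding_shortcut = 0                      # padding adds no weights
projection_shortcut = conv_params(1, 64, 128)  # 1x1 projection

print(basic_64, bottleneck_256, basic_256, projection_shortcut)
# prints: 73728 69632 1179648 8192
```

Under these assumptions, a bottleneck block on 256 channels costs about as much as a basic block on 64 channels, and far less than two 3x3 convolutions on 256 channels. A projection shortcut adds weights at every block that uses it, whereas zero-padding adds none.
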
#### Notes/Questions
- Love the simplicity of this.
- I wonder how performance depends on the number of layers skipped by the shortcut connections. The authors only present results with 2 or 3 layers.
- "Stacked" or recursive residuals?
- In principle Highway Networks should be able to learn the same mappings quite easily. Is this an optimization problem? Do we just not have enough data? What if we made the gates less fine-grained and substituted the sigmoid with something else?
- Can we apply this to RNNs, similar to LSTM/GRU? Seems good for learning long-range dependencies.
