Residual Networks Behave Like Ensembles of Relatively Shallow Networks
Andreas Veit, Michael Wilber, and Serge Belongie
arXiv e-Print archive - 2016
Keywords:
cs.CV, cs.AI, cs.LG, cs.NE
First published: 2016/05/20
Abstract: In this work we propose a novel interpretation of residual networks showing
that they can be seen as a collection of many paths of differing length.
Moreover, residual networks seem to enable very deep networks by leveraging
only the short paths during training. To support this observation, we rewrite
residual networks as an explicit collection of paths. Unlike traditional
models, paths through residual networks vary in length. Further, a lesion study
reveals that these paths show ensemble-like behavior in the sense that they do
not strongly depend on each other. Finally, and most surprising, most paths are
shorter than one might expect, and only the short paths are needed during
training, as longer paths do not contribute any gradient. For example, most of
the gradient in a residual network with 110 layers comes from paths that are
only 10-34 layers deep. Our results reveal one of the key characteristics that
seem to enable the training of very deep networks: Residual networks avoid the
vanishing gradient problem by introducing short paths which can carry gradient
throughout the extent of very deep networks.
This paper introduces an interpretation of deep residual networks as implicit ensembles of exponentially many shallow networks. For residual block $i$, there are $2^{i-1}$ paths from the input to block $i$, so the input to block $i$ is a mixture of $2^{i-1}$ different distributions. The interpretation is backed by a number of experiments, such as removing or re-ordering residual blocks at test time and plotting the gradient norm against the number of residual blocks the gradient signal passes through. Removing $k$ residual blocks (for $k \le 20$) from a network with $n$ blocks decreases the number of paths to $2^{n-k}$, but enough valid paths remain that classification error is barely affected, whereas a sequential CNN has a single viable path which gets corrupted. A plot of the gradient magnitude at the input versus path length shows that almost all of the gradient contribution comes from paths shorter than 20 residual blocks; these are the effective paths. The paper concludes that network "multiplicity", i.e. the number of paths, plays a key role in the network's expressive power.
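For concreteness, the unrolled view follows the paper's recursion $y_i = y_{i-1} + f_i(y_{i-1})$. For a three-block network,

$$y_3 = y_2 + f_3(y_2) = \big[y_1 + f_2(y_1)\big] + f_3\big(y_1 + f_2(y_1)\big),$$

and substituting $y_1 = y_0 + f_1(y_0)$ yields $2^3 = 8$ additive terms, one for each subset of $\{f_1, f_2, f_3\}$ that the signal passes through. Deleting block $f_i$ removes exactly the paths that go through $f_i$ and leaves every other path untouched, which is why single-block lesions are so benign.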
## Strengths
- Extremely insightful set of experiments. They nail down the intuition for why residual networks work, and clarify the connections with stochastic depth (sampling the network multiplicity during training, i.e. an ensemble by training; see the sketch below) and highway networks (gating both the skip connection and the path through the residual block reduces the number of available paths).
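To make the lesion idea concrete, here is a minimal PyTorch-style sketch (not the authors' code; the toy dimensions and the two-layer MLP inside each block are illustrative assumptions) of dropping a residual block at test time by replacing it with the identity:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual block: y = x + f(x), with f a small MLP here."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)

class ResidualStack(nn.Module):
    """A stack of residual blocks; any subset can be lesioned (skipped) at test time."""
    def __init__(self, dim, n_blocks):
        super().__init__()
        self.blocks = nn.ModuleList(ResidualBlock(dim) for _ in range(n_blocks))

    def forward(self, x, drop=()):
        for i, block in enumerate(self.blocks):
            if i in drop:
                continue  # lesion: replace block i with the identity mapping
            x = block(x)
        return x

net = ResidualStack(dim=16, n_blocks=6)
x = torch.randn(4, 16)
full = net(x)                # all 2^6 paths present
lesioned = net(x, drop={2})  # paths through block 2 removed; 2^5 paths remain
print((full - lesioned).norm())  # perturbation caused by removing a single block
```

In a trained residual network, the paper finds that such a deletion barely changes classification error, since only the paths through the removed block disappear; doing the same to a plain sequential CNN (e.g. VGG) breaks its single path and is catastrophic.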
## Weaknesses / Notes
- The connection between effective (short) paths and model compression is not explored and would be worth investigating.