* They describe an architecture that merges classical convolutional networks and residual networks.
* The architecture can (theoretically) learn anything that a classical convolutional network or a residual network can learn, as it contains both of them.
* The architecture can (theoretically) learn how many convolutional layers it should use per residual block (up to the number of convolutional layers in the whole network).

### How

* Just like residual networks, their network consists of "blocks". Each block contains convolutional layers.
* Each block contains residual units and non-residual units.
* The network has two "streams" of data (just matrices generated by each block):
  * Residual stream: The residual units write to this stream (i.e. it is their output).
  * Transient stream: The non-residual units write to this stream.
* Residual and non-residual units receive *both* streams as input, but each writes only to *its own* stream as output.
* Their architecture visualized:
  ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Resnet_in_Resnet__architecture.png?raw=true "Architecture")
* Because of this architecture, their model can learn the number of layers per residual block (though BN and ReLU might cause problems here?):
  ![Learning layercount](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Resnet_in_Resnet__learning_layercount.png?raw=true "Learning layercount")
* The easiest way to implement one block should be along the lines of the following (some of the convolutions can be merged); a code sketch follows at the end of this summary:
  * Input of size CxHxW (both streams, each with C/2 planes).
  * Concatenate both streams (giving C planes).
  * Residual part: Apply C/2 convolutions to the C input planes, with a shortcut addition afterwards.
  * Transient part: Apply C/2 convolutions to the C input planes.
  * Apply BN.
  * Apply ReLU.
  * Output of size CxHxW (again two streams of C/2 planes each).
* The whole operation can also be implemented with just a single convolutional layer, but then one has to make sure that some weights stay at zero.

### Results

* They test on CIFAR-10 and CIFAR-100.
* They search for good hyperparameters (learning rate, optimizer, L2 penalty, initialization method, type of shortcut connection in residual blocks) using a grid search.
* Their model improves upon a wide ResNet and an equivalent non-residual CNN by a good margin (CIFAR-10: 0.5-1%, CIFAR-100: 1-2%).
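
The per-block computation described under "How" can be written down compactly. Below is a minimal PyTorch sketch of one such two-stream block, assuming 3x3 convolutions with stride 1 and the "shortcut addition, then BN, then ReLU" ordering from the list above; class and variable names are illustrative and not taken from the authors' implementation.

```python
# Minimal sketch (not the authors' code) of one two-stream block.
# Assumes 3x3 convolutions with stride 1 and BN + ReLU applied after
# the shortcut addition, as in the step list above.
import torch
import torch.nn as nn


class TwoStreamBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2  # each stream carries C/2 planes
        # Both convolutions read the concatenated streams (C input planes),
        # but each writes only to "its" stream (C/2 output planes).
        self.conv_to_residual = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.conv_to_transient = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.bn_residual = nn.BatchNorm2d(half)
        self.bn_transient = nn.BatchNorm2d(half)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, residual, transient):
        x = torch.cat([residual, transient], dim=1)  # C planes in total
        # Residual stream: convolution + identity shortcut from the residual input.
        r = self.relu(self.bn_residual(self.conv_to_residual(x) + residual))
        # Transient stream: convolution only, no shortcut.
        t = self.relu(self.bn_transient(self.conv_to_transient(x)))
        return r, t


# Usage: a 64-plane feature map split into two 32-plane streams.
block = TwoStreamBlock(channels=64)
residual = torch.randn(1, 32, 16, 16)
transient = torch.randn(1, 32, 16, 16)
residual, transient = block(residual, transient)  # each still 1x32x16x16
```

If the weights connecting the transient stream to the residual stream (and vice versa) were zero, the residual path would reduce to an ordinary residual unit and the transient path to a plain convolution, which roughly matches the claim above that the architecture contains both a residual network and a classical CNN.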