The authors test different variants of CNN architectures, non-linearities, pooling schemes, etc. on ImageNet.

Summary:
- use the ELU non-linearity without batchnorm, or ReLU with it.
- apply a learned colorspace transformation of RGB (two layers of 1x1 convolution).
- use the linear learning rate decay policy.
- use a sum of the average and max pooling layers.
- use a mini-batch size of around 128 or 256. If this is too big for your GPU, decrease the learning rate proportionally to the batch size.
- run the fully-connected layers as convolutional layers and average the resulting predictions for the final decision.
- when investing in a larger training set, check that a plateau has not already been reached.
- cleanliness of the data is more important than its size.
- if you cannot increase the input image size, reduce the stride in the subsequent layers; it has roughly the same effect.
- if your network has a complex and highly optimized architecture, e.g. GoogLeNet, be careful with modifications.
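Three of the recommendations above are simple arithmetic and can be sketched in plain Python. This is only an illustration, not the authors' code: the function names, the reference setting (learning rate 0.01 at batch size 256), and the choice of non-overlapping pooling windows are all assumptions made for the example, and the linear policy is assumed to decay the rate to zero at the final step.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Scale the learning rate in proportion to the mini-batch size.

    base_lr and base_batch are a hypothetical reference setting.
    """
    return base_lr * batch / base_batch


def linear_decay_lr(initial_lr, step, total_steps):
    """Linear decay: initial_lr at step 0, falling to 0 at total_steps (assumed)."""
    step = min(max(step, 0), total_steps)  # clamp to the valid range
    return initial_lr * (1.0 - step / total_steps)


def avg_plus_max_pool(image, k):
    """Sum of average and max pooling over non-overlapping k x k windows.

    image is a list of equal-length rows of numbers (a single channel).
    """
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - k + 1, k):
        pooled_row = []
        for c in range(0, cols - k + 1, k):
            window = [image[r + i][c + j] for i in range(k) for j in range(k)]
            pooled_row.append(sum(window) / len(window) + max(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical usage: batch size 64 instead of the reference 256.
start_lr = scaled_lr(base_lr=0.01, base_batch=256, batch=64)
halfway_lr = linear_decay_lr(start_lr, step=500, total_steps=1000)

feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled_map = avg_plus_max_pool(feature_map, k=2)
```

In a real network the pooling would normally come from the framework's own average and max pooling layers, with their outputs added element-wise; the loop version here only makes the arithmetic explicit.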