This paper studies the transferability of features learnt at different layers of a convolutional neural network. Typically, the initial layers of a CNN learn features that resemble Gabor filters or color blobs and are fairly general, while the later layers are more task-specific.

Main contributions:
- They create two splits of the ImageNet dataset (A/B) and explore how performance varies across network design choices such as:
  - Base: a CNN trained on A or B.
  - Selffer: the first n layers are copied from a base network, and the rest of the network is randomly initialized and trained on the same task.
  - Transfer: the first n layers are copied from a base network, and the rest of the network is randomly initialized and trained on a different task.
  - Each of these 'copied' layers can either be fine-tuned or kept frozen.
- Selffer networks without fine-tuning don't perform well when the split is somewhere in the middle of the network (n = 3-6). This is because neurons in these layers co-adapt to each other's activations in complex ways, and these co-adaptations are broken when the network is split.
  - As we approach the final layers, there is less left for the network to learn, so these layers can be trained independently.
  - Fine-tuning a selffer network gives it the chance to re-learn the co-adaptations.
- Transfer networks transferred at lower n perform better than those transferred at larger n, indicating that features get more task-specific as we move to higher layers.
  - Fine-tuning transfer networks, however, results in better performance than training on the target split alone. They argue that the better generalization is due to the effect of having seen the base dataset, which persists even after considerable fine-tuning.
- Fine-tuning works much better than using random features.
- Features are more transferable across related tasks than unrelated tasks.
  - They study transferability by taking two random data splits, and splits of man-made vs. natural data.

## Strengths
- Experiments are thorough, and the results are intuitive and insightful.

## Weaknesses / Notes
- This paper only analyzes transferability across different splits of ImageNet (as similar/dissimilar tasks). They should have reported results on transferability from one task to another (classification/detection) or from one dataset to another (ImageNet/MSCOCO).
- It would be interesting to study the role of dropout in preventing co-adaptations while transferring features.
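
As a note on the setup, the selffer/transfer construction described in the contributions can be sketched in a few lines. This is a toy illustration, not the authors' code: `Layer`, `make_base`, and `copy_first_n` are hypothetical names, each layer is reduced to a short weight list, and the default of 8 layers is an assumption. The sketch only shows which layers are copied, which are re-initialized, and which are frozen; no training is performed.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class Layer:
    """Toy stand-in for a CNN layer: a weight list plus a trainable flag."""
    weights: list
    trainable: bool = True


def random_layer(size, rng):
    # Fresh, randomly initialized layer (always trainable).
    return Layer([rng.gauss(0.0, 0.01) for _ in range(size)])


def make_base(num_layers=8, size=4, seed=0):
    # Stand-in for a base network; in practice this would be trained on A or B.
    rng = random.Random(seed)
    return [random_layer(size, rng) for _ in range(num_layers)]


def copy_first_n(base, n, fine_tune, seed=1):
    """Copy the first n layers from `base` and re-initialize the rest.

    fine_tune=False keeps the copied layers frozen;
    fine_tune=True lets them be updated on the new task.
    """
    rng = random.Random(seed)
    net = []
    for i, layer in enumerate(base):
        if i < n:
            copied = copy.deepcopy(layer)   # do not share weights with base
            copied.trainable = fine_tune
            net.append(copied)
        else:
            net.append(random_layer(len(layer.weights), rng))
    return net


base_a = make_base(seed=0)
# Frozen copy of the first 3 layers; upper layers start from scratch.
frozen = copy_first_n(base_a, n=3, fine_tune=False)
# Same split, but the copied layers may be fine-tuned.
tuned = copy_first_n(base_a, n=3, fine_tune=True)
```

Whether the result is a selffer or a transfer network depends only on the data used afterwards: training the new network on the same split as the base gives a selffer network, and training it on the other split gives a transfer network.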