This is an interesting - and refreshing - paper, in that, instead of trying to go all-in on a particular theoretical point, the authors instead run a battery of empirical investigations, all centered around the question of how to explain what happens to make transfer learning work. The experiments don't all line up to support a single point, but they do illustrate different interesting facets of the transfer process. - An initial experiment tries to understand how much of the performance of fine-tuned models can be explained by (higher-level, and thus larger-scale) features, and how much is driven by lower level (and thus smaller-scale) image statistics. To start with, the authors compare the transfer performance from ImageNet onto three different datasets - clip art, sketches, and real images. As expected, transfer performance is highest with real datasets, which are the most similar to training domain. However, there still *is* positive transfer in terms of final performance across all domains, as well as benefit in optimization speed. - To try to further tease out the difference between the transfer benefits of high and low-level features, the authors run an experiment where blocks of pixels are shuffled around within the image on downstream tasks . The larger the size of the blocks being shuffled, the more that large-scale features of the image are preserved. As predicted, accuracy drops dramatically when pixel block size is small, for both randomly initialized and pretrained models. In addition, the relative value added by pretraining drops, for all datasets except quickdraw (the dataset of sketches). This suggests that in most datasets, the value brought by fine-tuning was mostly concentrated in large-scale features. One interesting tangent of this experiment was the examination of optimization speed (in the form of mean training accuracy over initial epochs). Even at block sizes too small for pretraining to offer a benefit to final accuracy, it did still contribute to faster training. (See transparent bars in right-hand plot below) https://i.imgur.com/Y8sO1da.png - On a somewhat different front, the authors look into how similar pretrained + finetuned models are to one another, compared to models trained on the same dataset from random initializations. First, they look at a measure of feature similarity, and find that the features learned by two pretrained networks are more similar to each other than a pretrained network is to a randomly initalized network, and also more than two randomly initialized networks are to one another. Randomly initialized networks are closest to one another in their final-layer features, but this is still a multiple of 4 or 5 less than the similarity between the pretrained networks - Looking at things from the perspective of optimization, the paper measures how much performance drops when you linearly interpolate between different solutions found by both randomly initialized and pretrained networks. For randomly initialized networks, interpolation requires traversing a region where test accuracy drops to 0%. However, for pretrained networks, this isn't the case, with test accuracy staying high throughout. This suggests that pretraining gets networks into a basin of the loss landscape, and that future training stays within that basin. There were also some experiments on module criticality that I believe were in a similar vein to these, but which I didn't fully follow - Finally, the paper looks at the relationship between accuracy on the original pretraining task and both accuracy and optimization speed on the downstream task. They find that higher original-task accuracy moves in the same direction as higher downstream-task accuracy, though this is less true when the downstream task is less related (as with quickdraw). Perhaps more interestingly, they find that the benefits of transfer to optimization speed happen and plateau quite early in training. Clip Art and Real transfer tasks are much more similar in the optimization speed benefits they get form ImageNet training, where on the accuracy front, the real did dramatically better. https://i.imgur.com/jBCJcLc.png While there's a lot to dig into in these results overall, the things I think are most interesting are the reinforcing of the idea that even very random and noisy pretraining can be beneficial to optimization speed (this seems reminiscent of another paper I read from this year's NeurIPS, examining why pretraining on random labels can help downstream training), and the observation that pretraining deposits weights in a low-loss bucket, from which they can learn more efficiently (though, perhaps, if the task is too divergent from the pretraining task, this difficulty in leaving the basin becomes a disadvantage). This feels consistent with some work in the Lottery Ticket Hypothesis, which has recently suggested that, after a short duration of training, you can rewind a network to a checkpoint saved after that duration, and be successfully able to train to low loss again.