Occasionally, I come across results in machine learning that I'm glad exist, even if I don't fully understand them, precisely because they remind me how little we know about the complicated information architectures we're building, and what kinds of signal they can productively use. This is one such result. The paper tests a method called self-training, and compares it against the more common standard of pre-training. Pre-training works by first training your model on a different dataset, in a supervised way, with the labels attached to that dataset, and then transferring the learned weights on that model model (except for the final prediction head) and using that as initialization for training on your downstream task. Self-training also uses an external dataset, but doesn't use that external data's labels. It works by 1) Training a model on the labeled data from your downstream task, the one you ultimately care about final performance on 2) Using that model to make label predictions (for the label set of your downstream task), for the external dataset 3) Retraining a model from scratch with the combined set of human labels and predicted labels from step (2) https://i.imgur.com/HaJTuyo.png This intuitively feels like cheating; something that shouldn't quite work, and yet the authors find that it equals or outperforms pretraining and self-supervised learning in the setting they examined (transferring from ImageNet as an external dataset to CoCo as a downstream task, and using data augmentations on CoCo). They particularly find this to be the case when they're using stronger data augmentations, and when they have more labeled CoCo data to train with from the pretrained starting point. They also find that self-training outperforms self-supervised (e.g. contrastive) learning in similar settings. They further demonstrate that self-training and pre-training can stack; you can get marginal value from one, even if you're already using the other. They do acknowledge that - because it requires training a model on your dataset twice, rather than reusing an existing model directly - their approach is more computationally costly than the pretrained-Imagenet alternative. This work is, I believe, rooted in the literature on model distillation and student/teacher learning regimes, which I believe has found that you can sometimes outperform a model by training on its outputs, though I can't fully remember the setups used in those works. The authors don't try too hard to give a rigorous theoretical account of why this approach works, which I actually appreciate. I think we need to have space in ML for people to publish what (at least to some) might be unintuitive empirical results, without necessarily feeling pressure to articulate a theory that may just be a half-baked after-the-fact justification. One criticism or caveat I have about this paper is that I wish they'd evaluated what happened if they didn't use any augmentation. Does pre-training do better in that case? Does the training process they're using just break down? Only testing on settings with augmentations made me a little less confident in the generality of their result. Their best guess is that it demonstrates the value of task-specificity in your training. I think there's a bit of that, but also feel like this ties in with other papers I've read recently on the surprising efficacy of training with purely random labels. I think there's, in general, a lot we don't know about what ostensibly supervised networks learn in the face of noisy or even completely permuted labels.