This paper focuses on taking advances from the (mostly heuristic, practical) world of data augmentation for supervised learning, and applying those to the unsupervised setting, as a way of inducing better performance in a semi-supervised environment (with many unlabeled points, and few labeled ones) Data augmentation has been a mostly behind-the-scenes implementation detail in modern deep learning: minor modifications like shifting a dataset by a few pixels, rotating it slightly, or flipping it horizontally, to generate additional pseudoexamples for the model to train on. The core idea motivating such approaches is that the tactics of data augmentation are modifications that should not change the class label in a platonic, ground truth sense, which allows us to use them as training examples of the same class label as the original image from which the perturbations were made. In addition to just giving then network generically more data, this approach communicates to the network the specific kinds of modifications that can be made to an image and have it still be of the same class. The Unsupervised Data Augmentation (UDA) tactic from this paper notices two things: (1) Within the sphere of supervised learning, there has been dataset-specific innovation in generating augmented data that will be particularly helpful to a given dataset. Inlanguage modeling, an example of this is having a sentence go into another language and back again through two well-trained translation networks, and use the resulting sentence as another example of the same class. For ImageNet, there’s an approach called AutoAugment that uses reinforcement learning on a validation set to learn a policy of image operations (rotate, shear, change color) in order to increase validation accuracy. [I am a little confused about this as an approach, since I worry about essentially overfitting to the validation set. That said, I don’t have time to delve specifically into the AutoAugment paper today, so I’ll just leave this here as a caveat] (2) Within semi-supervised learning, there’s a growing tendency to use consistency loss as a way of making use of unlabeled data. The basic idea of consistency loss is that, even if you don’t know the class of a given datapoint, if you modify it in some small way, you can be confident that the model’s prediction should be consistent between the point and its perturbation, even if you don’t have knowledge of what the actual ground truth is. Often, systems like this have been designed using simple Gaussian noise on top of original unlabeled images. The key proposal of this paper is to substitute this more simplified perturbation procedure with the augmentation approaches being iterated on in supervised learning, since the goal of both - to modify inputs so as to capture ways in which they might differ but still be notionally of the same class - is nearly identical On top of this core idea, the UDA paper also proposes an additional clever training tactic: in cases where you have many unlabeled examples and few labeled ones, you may need a large model to capture the information from the unlabeled examples, but this may result in overfitting on the labeled ones. To avoid this, they use an approach called “Training Signal Annealing,” where at each point in training they remove from the loss calculation any examples that the model is particularly confident about: where the prediction of the true class is above some threshold eta. As training goes on, the network is gradually allowed to see more of the training signals. In this kind of a framework, the model can’t overfit as easily, because once it starts getting the right answer on supervised examples, they drop out of the loss calculation. In terms of empirical results, the authors find that in UDA they’re able to improve on many semi-supervised benchmarks with quite small numbers of labeled examples. At one point, they use as a baseline a BERT model that was fine-tuned in an unsupervised way prior to its semi-supervised training, and show that their augmentation method can add value even on top of the value that the unsupervised pre-training adds.