Contrastive learning works by applying augmentations to a batch of images and training a network to pull the representations of the two augmented views of each image together, while pushing the representations of views that do not form a pair farther apart. Historically, these algorithms have benefitted from stronger augmentations, which make the two positive elements of a pair more visually distinct from one another. This paper tries to build on that success: beyond just using a strong augmentation, it tries to learn a way to perturb images that adversarially increases the contrastive loss. As with adversarial training in the normal supervised setting, the thinking here is that the examples that push the loss up the most are the hardest, and thus the most informative for the network to learn from.

While the concept of this paper made some sense, I found the notation and the explanation of the mechanics a bit confusing, particularly when it came to the choice to frame the contrastive loss as a cross-entropy loss, with the "weights" of the dot product in the cross-entropy loss being, in fact, the learned encoder's projections of various examples in the batch.

https://i.imgur.com/iQXPeXk.png

This notion of the learned representations being "weights" is just odd and counter-intuitive, and I didn't totally succeed in wrapping my mind around it. I think the point of using this frame is that it provides an easy analogue to the Fast Gradient Sign Method used for adversarial examples in normal supervised learning, even though it has the weird effect that, as the authors say, "your weights vary by batch...rather than being consistent across training."

Notational weirdness aside, my understanding is that the method of this paper:

- Runs a forward pass of the normal contrastive loss (framed as a cross-entropy loss), which takes augmentations p and q and runs both forward through an encoder.
- Calculates a delta to apply to each input image in q that will increase the loss the most, taken over all the images in the p set
    - I think the delta is per-image in q, and is just aggregated over all images in p, but I'm not fully confident of this because of the notational confusion. It could also be one delta applied to all images in q.
- Calculates the loss that results when you run the adversarially generated q forward against the normal p
- Trains on a combined loss that is a weighted combination of the normal p/q contrastive part and the adversarial p/q contrastive part

https://i.imgur.com/UWtJpVx.png

The authors show a small but relatively consistent improvement in performance using their method. Notably, this improvement is much stronger when using larger encoders (presumably because they have more capacity to learn from harder examples). One frustration I have with the empirics of the paper is that, at least in the main paper, they don't discuss the increase in training time required to calculate these perturbations, which, a priori, I would imagine to be nontrivial.
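
To make my reading of the method's steps concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The linear `encode` function, the temperature `tau`, the step size `eps`, the mixing weight `alpha`, and all function names are assumptions of mine for illustration. It also bakes in the per-image-delta interpretation that I said I wasn't sure about.

```python
import math
import random

def encode(W, x):
    """Hypothetical linear encoder z = W x, a stand-in for the real network."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_loss(W, P, Q, tau=0.5):
    """Mean cross-entropy in which the encoded p-views play the role of the
    classifier "weights" and the label of q_i is the index of its partner p_i."""
    zp = [encode(W, p) for p in P]
    loss = 0.0
    for i, q in enumerate(Q):
        probs = softmax([dot(encode(W, q), z) / tau for z in zp])
        loss -= math.log(probs[i])
    return loss / len(Q)

def fgsm_deltas(W, P, Q, eps=0.1, tau=0.5):
    """One sign-of-gradient delta per q image, holding the p-views fixed
    (assumed per-image interpretation)."""
    zp = [encode(W, p) for p in P]
    deltas = []
    for i, q in enumerate(Q):
        zq = encode(W, q)
        probs = softmax([dot(zq, z) / tau for z in zp])
        # d(loss_i)/d(z_q) = (sum_j probs_j * z_p_j - z_p_i) / tau
        g_z = [(sum(probs[j] * zp[j][k] for j in range(len(zp))) - zp[i][k]) / tau
               for k in range(len(zq))]
        # Chain rule through the linear encoder: d(loss_i)/d(x) = W^T g_z
        g_x = [sum(W[k][d] * g_z[k] for k in range(len(W))) for d in range(len(q))]
        # FGSM step: eps times the sign of the input gradient
        deltas.append([eps * ((g > 0) - (g < 0)) for g in g_x])
    return deltas

def combined_loss(W, P, Q, alpha=1.0, eps=0.1, tau=0.5):
    """Clean p/q loss plus alpha times the loss on adversarially perturbed q."""
    deltas = fgsm_deltas(W, P, Q, eps, tau)
    Q_adv = [[x + d for x, d in zip(q, dq)] for q, dq in zip(Q, deltas)]
    clean = contrastive_loss(W, P, Q, tau)
    adv = contrastive_loss(W, P, Q_adv, tau)
    return clean + alpha * adv, clean, adv

# Toy data: 5 "images" of dimension 4, with two noisy views (p and q) of each.
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
base = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
P = [[x + random.gauss(0, 0.1) for x in img] for img in base]
Q = [[x + random.gauss(0, 0.1) for x in img] for img in base]

total, clean, adv = combined_loss(W, P, Q)
print("clean loss:", clean, "adversarial loss:", adv, "combined:", total)
```

With this toy linear encoder the loss is convex in each q input, so the sign-gradient step cannot lower it. With a real nonlinear encoder the step is only a first-order approximation of the worst-case perturbation. The sketch also hints at the cost question: each training step needs an extra gradient computation with respect to the inputs, plus a second forward pass on the perturbed q.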