Adversarial Examples Are Not Bugs, They Are Features on ShortScience.org

arxiv.org
scholar.google.com

Adversarial Examples Are Not Bugs, They Are Features
Ilyas, Andrew and Santurkar, Shibani and Tsipras, Dimitris and Engstrom, Logan and Tran, Brandon and Madry, Aleksander
- 2019 via Local Bibsonomy
Keywords: adversarial

Summaries/Notes 2

[link] Summary by CodyWild 4 years ago

It didn’t hit me how much this paper was a pun until I finished it, and in retrospect, I say, bravo. 

This paper focuses on adversarial examples, and argues that, at least in some cases, adversarial perturbations aren’t purely overfitting failures on behalf of the model, but actual features that generalize to the test set. This conclusion comes from a set of two experiments: 

- In one, the authors create a dataset that only contains what they call “robust features”. They do this by taking a classifier trained to be robust using adversarial training (training on adversarial examples), and doing gradient descent to modify the input pixels until the final-layer robust model activations of the modified inputs match the final layer activations when the unmodified inputs are passed in. Operating under the premise that features identified by a robust model are themselves robust, because by definition they don’t change in the presence of an adversarial perturbation, creating a training set that matches these features means that you’ve created some kind of platonic, robust version of the training set, with only robust features present. They then take this dataset, and train a new model on it, and show that it has strong test set performance, in both normal settings, and adversarial ones. This is not enormously surprising, since the original robust classifier performed well, but still interesting. 

- The most interesting and perhaps surprising experiment is where the authors create a dataset by taking normal images, and layering on top an adversarial perturbation. They then label these perturbed images with the label corresponding to the perturbation class, and train a model off of that. They then find that this model, which is trained on images which correspond to their labeled class only in their perturbation features, and not in the underlying visual features a human would recognize, achieves good test set performance under normal conditions. However, it performs poorly on adversarial perturbations of the test set. 

https://i.imgur.com/eJQXb0i.png
Overall, the authors claim that the perturbations that are “tricking” models are features that can genuinely provide some amount of test set generalization, due to real but unintuitive regularities in the data, but that these features are non-robust, in that small amounts of noise can cause them to switch sign.

Your comment:

Write your summary here (You can use $\LaTeX$ and markdown syntax):

Anon Private