Tsipras et al. investigate the trade-off between classification accuracy and adversarial robustness. In particular, on a very simple toy dataset, they prove that such a trade-off exists: very accurate models will also have low robustness. On this dataset, they find a sweet spot where both the accuracy and the adversarial accuracy (i.e., accuracy on adversarial examples) are 70%. Using adversarial training to obtain robust networks, they additionally show that robustness is increased by not using “fragile” features, i.e., features that are only weakly correlated with the actual classification task. Focusing on only a few “robust” features also has the advantage of more interpretable gradients and sparser weights (or convolutional kernels). Due to the induced robustness, adversarial examples differ perceptually much more from the original examples, as illustrated in Figure 1 on MNIST.

https://i.imgur.com/OP2TOOu.png

Figure 1: Illustration of adversarial examples for a standard model, a model trained using $L_\infty$ adversarial training, and a model trained using $L_2$ adversarial training. Especially in the $L_2$ case, it is visible that adversarial examples need to change important class characteristics to fool the network.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
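
The trade-off between “robust” and “fragile” features described above can be illustrated with a small simulation. This is only a minimal sketch: the summary does not spell out the toy dataset, so the data distribution (one feature that agrees with the label with probability `P`, plus `D` Gaussian features with a small mean `ETA` in the direction of the label), the perturbation budget `EPS`, and all function names are assumptions chosen for illustration, not the authors' exact setup. It compares two fixed linear classifiers under a worst-case $L_\infty$ perturbation.

```python
import random


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sample(n, d, p, eta, rng):
    """Draw n labeled points with one robust and d weak features.

    Assumed construction: the label y is +1 or -1 with equal probability,
    the first feature equals y with probability p (and -y otherwise), and
    each remaining feature is Gaussian with mean eta * y and unit variance.
    """
    data = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        robust = y if rng.random() < p else -y
        weak = [rng.gauss(eta * y, 1.0) for _ in range(d)]
        data.append(([robust] + weak, y))
    return data


def accuracy(w, data, eps=0.0):
    """Accuracy of the linear classifier sign(w . x) under an L_inf attack.

    For a linear model, the worst-case perturbation of size eps moves every
    feature by eps against the label: x_i - eps * y * sign(w_i).
    With eps = 0 this is the ordinary (clean) accuracy.
    """
    correct = 0
    for x, y in data:
        x_adv = [xi - eps * y * sign(wi) for xi, wi in zip(x, w)]
        score = sum(wi * xi for wi, xi in zip(w, x_adv))
        pred = 1 if score >= 0 else -1
        correct += pred == y
    return correct / len(data)


# Illustrative values, not taken from the paper or the summary.
D, P, ETA, EPS = 100, 0.9, 0.2, 0.4

rng = random.Random(0)
data = sample(2000, D, P, ETA, rng)

w_robust = [1.0] + [0.0] * D    # uses only the robust feature
w_weak = [0.0] + [1.0 / D] * D  # averages the weakly correlated features

for name, w in (("robust feature only", w_robust), ("weak features only", w_weak)):
    print(name, "clean:", accuracy(w, data), "adversarial:", accuracy(w, data, EPS))
```

With these assumed values, averaging the weak features should give the higher clean accuracy. A perturbation with `EPS` larger than `ETA` flips the mean of every weak feature, so that classifier's adversarial accuracy collapses. The classifier that relies only on the robust feature keeps its accuracy for any `EPS` below 1.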