This paper focuses on the well-known fact that adversarial examples are often transferable: that is, that an adversarial example created by optimizing loss on a surrogate model trained on similar data can often still induce increased loss on the true target model, though typically not to the same magnitude as an example optimized against the target itself. Its goal is to come up with clearer theoretical formulation for transferred examples, and more clearly understand what kinds of models transfer better than others. The authors define their two scenarios of interest as white box (where the parameters of the target model are known), and limited knowledge, or black box, where only the data type and feature representation is known, but the exact training dataset is unknown, as well as the parameters of the target model. Most of the mathematics of this paper revolve around this equation, which characterizes how to find a delta to maximize loss on the surrogate model: https://i.imgur.com/Y0mD35x.png In words: you’re finding a delta (perturbations of each input value) such that the p-norm of delta is less than some radius epsilon, and such that delta maximizes the dot product between delta and the model gradient with respect to the inputs. The closer two vectors are to one another, the higher their dot product. So, having your delta just *be* the model gradient w.r.t inputs maximizes that quantity. However, we also need to meet the requirement of having our perturbation’s norm be less than epsilon, so we in order to find the actual optimal value, we divide by the norm of the gradient (to get ourselves a norm of 1), and multiply by epsilon (to get ourselves a norm of epsilon). This leads to the optimal value of delta being, for a norm of 2: https://i.imgur.com/Op0H7KL.png An important thing to remember is that all of the above has been using w-hat, meaning it’s been an examination of what the optimal delta is when we’re calculating against the surrogate model. But, if we plug in the optimal transfer value of delta we found above, how does this compare to the increase in loss if we were able to optimize against the true model? https://i.imgur.com/RHILZK1.png Loss on the true model is, as above, calculated as the dot product of the delta perturbation with the gradient w.r.t inputs of the true model. Using the same logic as above, this quantity is maximized when our perturbation is as close as possible to the target model’s gradient vector. So, the authors show, the degree to which adversarial examples calculated on one model transfer to another is mediated by the cosine distance between surrogate model’s gradient vector and the target model’s one. The more similar these gradients w.r.t the input are to one another, the closer surrogate-model loss increase will be to target-model loss increase. This is one of those things that makes sense once it’s laid out, but it’s still useful to have a specific conceptual quality to point to when predicting whether adversarial examples will transfer, rather than just knowing that they do, at least some of the time, to at least some extent. Another interesting thing to notice from the above equation, though not directly related to transfer examples, is the right hand of the equation, the upper bound on loss increase, which is the p-norm of the gradient vector of the target model. In clearer words, this means that the amount of loss that it’s possible to induce on a model using a given epsilon of perturbation is directly dependent on the norm of that model’s gradient w.r.t inputs. This suggests that more highly regularized models, which are by definition smoother and have smaller gradients with respect to inputs, will be harder to attack. This hypothesis is borne out by the authors’ experiments. However, they also find, consistent with my understanding of prior work, that linear models are harder to attack than non-linear ones. This draws a line between two ways we’re used to thinking about model complexity/simplicity: having a less-smooth function with bigger gradients increases your vulnerability, but having nonlinear model structure seems to decrease it. https://i.imgur.com/mw9exLU.png One final intriguing empirical finding of this paper is that, in addition to being the hardest models to attack when they are the target, highly regularized models work the best as surrogate models. There’s a simplistic way in which this makes sense, in that if you create your examples against a “harder” adversary to begin with, they’ll be in some sense stronger, and transfer better. However, I’m not sure that intuition is a correct one here.