Certified Defenses against Adversarial ExamplesCertified Defenses against Adversarial ExamplesAditi Raghunathan and Jacob Steinhardt and Percy Liang2018
Paper summarydavidstutzRaghunathan et al. provide an upper bound on the adversarial loss of two-layer networks and also derive a regularization method to minimize this upper bound. In particular, the authors consider the scoring functions $f^i(x) = V_i^T\sigma(Wx)$ with bounded derivative $\sigma'(z) \in [0,1]$ which holds for Sigmoid and ReLU activation functions. Still, the model is very constrained considering recent, well-performng deep (convolutional) neural networks. The upper bound is then derived by considering $f(A(x))$ where $A(x)$ is the optimal attacker $A(x) = \arg\max_{\tilde{x} \in B_\epsilon(x)} f(\tilde{x})$. For a linear model $f(x) = (W_1 – W_2)^Tx$, an upper bound can be derived as follows:
$f(\tilde{x}) = f(x) + (W_1 – W_2)^T(\tilde{x} – x) \leq f(x) + \epsilon\|W_1 – W_2\|_1$.
For two-layer networks a bound is derived by considering
$f(\tilde{x}) = f(x) + \int_0^1 \nabla f(t\tilde{x} + (1-t)x)^T (\tilde{x} – x) dt \leq f(x) + \max_{\tilde{x}\in B_\epsilon(x)} \epsilon\|\nabla f(\tilde{x})\|_1$.
In this case, Raghunathan rewrite the second term, i.e. $\max_{\tilde{x}\in B_\epsilon(x)} \epsilon\|\nabla f(\tilde{x})\|_1$ to derive an upper bound in the form of a semidefinite program, see the paper for details. For $v = V_1 – V_2$, this semidefinite program is based on the matrix
$M(v,W) = \left[\begin{array}0 & 0 & 1^T W^R \text{diag}(v)\\0 & 0 & W^T\text{diag}(v)\\ \text{diag}(v)^T W 1 & \text{diag}(v)^T W & 0\end{array}\right]$.
By deriving the dual objective, the upper bound can then be minimized by constraining the eigenvalues of $M(v, W)$ (specifically, the largest eigenvalue; note that the dual also involves dual variables – see the paper for details). Overall, the proposed regularize involves minimizing the largest eigenvalue of $M(v, W) – D$ where $D$ is a diagonal matrix based on the dual variables. In practice, this is implemented using SciPy's implementation of the Lanczos algorithm.
Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
First published: 2018/01/29 (10 months ago) Abstract: While neural networks have achieved high accuracy on standard image
classification benchmarks, their accuracy drops to nearly zero in the presence
of small adversarial perturbations to test inputs. Defenses based on
regularization and adversarial training have been proposed, but often followed
by new, stronger attacks that defeat these defenses. Can we somehow end this
arms race? In this work, we study this problem for neural networks with one
hidden layer. We first propose a method based on a semidefinite relaxation that
outputs a certificate that for a given network and test input, no attack can
force the error to exceed a certain value. Second, as this certificate is
differentiable, we jointly optimize it with the network parameters, providing
an adaptive regularizer that encourages robustness against all attacks. On
MNIST, our approach produces a network and a certificate that no attack that
perturbs each pixel by at most \epsilon = 0.1 can cause more than 35% test
error.
Raghunathan et al. provide an upper bound on the adversarial loss of two-layer networks and also derive a regularization method to minimize this upper bound. In particular, the authors consider the scoring functions $f^i(x) = V_i^T\sigma(Wx)$ with bounded derivative $\sigma'(z) \in [0,1]$ which holds for Sigmoid and ReLU activation functions. Still, the model is very constrained considering recent, well-performng deep (convolutional) neural networks. The upper bound is then derived by considering $f(A(x))$ where $A(x)$ is the optimal attacker $A(x) = \arg\max_{\tilde{x} \in B_\epsilon(x)} f(\tilde{x})$. For a linear model $f(x) = (W_1 – W_2)^Tx$, an upper bound can be derived as follows:
$f(\tilde{x}) = f(x) + (W_1 – W_2)^T(\tilde{x} – x) \leq f(x) + \epsilon\|W_1 – W_2\|_1$.
For two-layer networks a bound is derived by considering
$f(\tilde{x}) = f(x) + \int_0^1 \nabla f(t\tilde{x} + (1-t)x)^T (\tilde{x} – x) dt \leq f(x) + \max_{\tilde{x}\in B_\epsilon(x)} \epsilon\|\nabla f(\tilde{x})\|_1$.
In this case, Raghunathan rewrite the second term, i.e. $\max_{\tilde{x}\in B_\epsilon(x)} \epsilon\|\nabla f(\tilde{x})\|_1$ to derive an upper bound in the form of a semidefinite program, see the paper for details. For $v = V_1 – V_2$, this semidefinite program is based on the matrix
$M(v,W) = \left[\begin{array}0 & 0 & 1^T W^R \text{diag}(v)\\0 & 0 & W^T\text{diag}(v)\\ \text{diag}(v)^T W 1 & \text{diag}(v)^T W & 0\end{array}\right]$.
By deriving the dual objective, the upper bound can then be minimized by constraining the eigenvalues of $M(v, W)$ (specifically, the largest eigenvalue; note that the dual also involves dual variables – see the paper for details). Overall, the proposed regularize involves minimizing the largest eigenvalue of $M(v, W) – D$ where $D$ is a diagonal matrix based on the dual variables. In practice, this is implemented using SciPy's implementation of the Lanczos algorithm.
Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/).