[link]
Xie et al. propose to improve the transferability of adversarial examples by computing them on transformed input images. In particular, they adapt I-FGSM such that, in each iteration, the update is computed on a transformed version of the current image with probability $p$. When an ensemble of networks is attacked at the same time, this is shown to improve transferability further. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
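The iteration described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the transformation here is a random circular shift, chosen only because its gradient is easy to write by hand, and `grad_loss`, `eps`, `alpha`, `steps` and `seed` are hypothetical names introduced for the example.

```python
import numpy as np

def di_ifgsm(x, grad_loss, eps=0.1, alpha=0.01, steps=20, p=0.5, seed=0):
    """I-FGSM where, with probability p, the gradient is taken on a transformed input.

    x         : image as an array with values in [0, 1]
    grad_loss : hypothetical callable returning d(loss)/d(input) for a given input
    """
    rng = np.random.default_rng(seed)
    x_adv = x.copy()
    for _ in range(steps):
        if rng.random() < p:
            # Assumed transformation: circular shift along the first axis.
            shift = int(rng.integers(1, x.shape[0]))
            g_t = grad_loss(np.roll(x_adv, shift, axis=0))
            # Chain rule: undo the shift to get the gradient w.r.t. x_adv.
            g = np.roll(g_t, -shift, axis=0)
        else:
            g = grad_loss(x_adv)
        x_adv = x_adv + alpha * np.sign(g)        # signed gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay within the L_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay a valid image
    return x_adv
```

For an ensemble attack, `grad_loss` could return the gradient of a loss averaged over several models; that choice is also an assumption of this sketch.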
[link]
The main contribution of this paper is a new transformation that the authors call Batch Normalization (BN). The need for BN comes from the fact that, during the training of deep neural networks (DNNs), the distribution of each layer’s inputs changes. This phenomenon is called internal covariate shift (ICS).

#### What is BN?

Normalize each (scalar) feature independently with respect to the mean and variance of the mini-batch. Scale and shift the normalized values with two new parameters (per activation) that are learned. BN thus makes normalization part of the model architecture.

#### What do we gain?

According to the authors, BN greatly speeds up the training of DNNs. In particular, the gains are greater when it is combined with higher learning rates. In addition, BN acts as a regularizer for the model, which allows using less dropout or less L2 regularization. Furthermore, since the distribution of the inputs is normalized, it also allows using sigmoids as activation functions without the saturation problem.

#### What follows?

This seems to be especially promising for training recurrent neural networks (RNNs). The vanishing and exploding gradient problems \cite{journals/tnn/BengioSF94} have their origin in the repeated application of transformations that scale the activations up or down in certain directions (eigenvectors). It seems that this normalization would be especially useful in this context, since it would allow the gradient to flow more easily. When we unroll RNNs, we usually obtain very deep networks.

#### Like

* Simple idea that seems to improve training.
* Makes training faster.
* Simple to implement. Probably.
* You can be less careful with initialization.

#### Dislike

* Does not work with stochastic gradient descent (mini-batch size = 1).
* This could reduce the parallelism of the algorithm, since all the examples in a mini-batch are now tied together.
* Results on an ensemble of networks for ImageNet make it harder to evaluate the relevance of BN by itself (although they do mention the performance of a single model). |
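The training-time transformation described under "What is BN?" can be sketched as below. This is a minimal NumPy illustration for a fully connected layer, not the paper's reference code; the function name, the `eps` constant and the array shapes are assumptions, and the running statistics used at inference time are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x     : (batch_size, num_features) activations
    gamma : (num_features,) learned scale
    beta  : (num_features,) learned shift
    """
    mean = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, (almost) unit variance
    return gamma * x_hat + beta
```

With a mini-batch of size 1, `var` is zero and `x_hat` collapses to zeros, which illustrates the first item under "Dislike".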
[link]
CNN predictions are known to be very sensitive to adversarial examples, which are samples generated to be wrongly classified with high confidence. On the other hand, probabilistic generative models such as `PixelCNN` and `VAEs` learn a distribution over the input domain and hence could be used to detect ***out-of-distribution inputs***, e.g., by estimating their likelihood under the data distribution. This paper provides interesting results showing that distributions learned by generative models are not yet robust enough to be employed in this way.

* **Pros (+):** convincing experiments on multiple generative models, more detailed analysis in the invertible flow case, interesting negative results.
* **Cons (-):** It would be interesting to provide further results for different datasets / domain shifts to observe whether this property can be quantified as a characteristic of the model or of the input data.

---

## Experimental negative result

Three classes of generative models are considered in this paper:

* **Auto-regressive** models such as `PixelCNN` [1]
* **Latent variable** models, such as `VAEs` [2]
* Generative models with **invertible flows** [3], in particular `Glow` [4].

The authors train a generative model $G$ on input data $\mathcal X$ and then use it to evaluate the likelihood on both the training domain $\mathcal X$ and a different domain $\tilde{\mathcal X}$. Their main (negative) result is that **a model trained on the CIFAR-10 dataset yields a higher likelihood when evaluated on the SVHN test dataset than on the CIFAR-10 test (or even train) split**. Interestingly, the converse, when training on SVHN and evaluating on CIFAR, is not true. This result was consistently observed for various architectures including [1], [2] and [4], although the effect is weaker in the `PixelCNN` case.
Intuitively, this could come from the fact that both of these datasets contain natural images and that CIFAR-10 is strictly more diverse than SVHN in terms of semantic content. Nonetheless, these datasets differ vastly in appearance, and this result is counter-intuitive as it goes against the idea that generative models can reliably be used to detect out-of-distribution samples. Furthermore, this observation also confirms the general idea that higher likelihoods do not necessarily coincide with better generated samples [5].

---

## Further analysis for invertible flow models

The authors further study this phenomenon for invertible flow models, as they provide a more rigorous analytical framework (exact likelihood inference, unlike VAEs, which only provide a bound on the true likelihood). More specifically, invertible flow models are characterized by a ***diffeomorphism*** (a differentiable, invertible function), $f(x; \phi)$, between input space $\mathcal X$ and latent space $\mathcal Z$, and by the choice of the latent distribution $p_z(z; \psi)$. The ***change of variable formula*** links the densities of $x$ and $z$ as follows:

$$ \int_x p_x(x)\, dx = \int_x p_z(f(x)) \left| \frac{\partial f}{\partial x} \right| dx $$

The training objective under this transformation becomes

$$ \arg\max_{\theta} \log p_x(\mathbf{x}; \theta) = \arg\max_{\phi, \psi} \sum_i \left[ \log p_z(f(x_i; \phi); \psi) + \log \left| \frac{\partial f_{\phi}}{\partial x_i} \right| \right] $$

Typically, $p_z$ is chosen to be Gaussian, and samples are built by inverting $f$, i.e., $z \sim p_z(z),\ x = f^{-1}(z)$. Moreover, $f_{\phi}$ is built such that the log determinant of the Jacobian in the previous equation can be computed efficiently.

First, they observe that the log-likelihood under the flow can be decomposed into a ***density*** element (left term) and a ***volume*** element (right term), resulting from the change of variables formula.
Experimental results with Glow [4] show that the higher likelihood on SVHN mostly comes from the ***volume element contribution***. Secondly, they try to directly analyze the difference in likelihood between two domains $\mathcal X$ and $\tilde{\mathcal X}$; this can be done by a second-order expansion of the log-likelihood locally around the expectation of the distribution (assuming $\mathbb{E} (\mathcal X) \sim \mathbb{E}(\tilde{\mathcal X})$). For the constant-volume Glow module, the resulting analytical formula indeed confirms that the log-likelihood of SVHN should be higher than CIFAR's, as observed in practice.

---

## References

* [1] Conditional Image Generation with PixelCNN Decoders, van den Oord et al., 2016
* [2] Auto-Encoding Variational Bayes, Kingma and Welling, 2013
* [3] Density estimation using Real NVP, Dinh et al., ICLR 2017
* [4] Glow: Generative Flow with Invertible 1x1 Convolutions, Kingma and Dhariwal, 2018
* [5] A Note on the Evaluation of Generative Models, Theis et al., ICLR 2016 |
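To make the density/volume decomposition from the flow analysis above concrete, here is a small sketch for a one-dimensional affine flow $f(x) = (x - \mu)/\sigma$ with a standard normal latent. This toy flow is an assumption made purely for illustration (it is not Glow), and the names `affine_flow_log_likelihood`, `mu` and `sigma` are hypothetical.

```python
import numpy as np

def affine_flow_log_likelihood(x, mu, sigma):
    """Return the two terms of log p_x(x) = log p_z(f(x)) + log |df/dx|.

    Assumes the toy flow f(x) = (x - mu) / sigma and p_z = N(0, 1).
    """
    z = (x - mu) / sigma
    log_density = -0.5 * (z ** 2 + np.log(2.0 * np.pi))  # log p_z(f(x))
    log_volume = np.full_like(z, -np.log(sigma))         # log |df/dx| = -log(sigma)
    return log_density, log_volume
```

Summing the two terms recovers the usual Gaussian log-density with mean `mu` and standard deviation `sigma`; for this flow, the volume term is the same for every input and depends only on the scale `sigma`.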
[link]
Zhao et al. propose a generative adversarial network (GAN) based approach to generate meaningful and natural adversarial examples for images and text. By natural adversarial examples, the authors mean meaningful changes in the image content instead of seemingly random/adversarial noise, as illustrated in Figure 1. These natural adversarial examples can be crafted by first learning a generative model of the data, e.g., using a GAN together with an inverter (similar to an encoder), see Figure 2. Then, given an image $x$ and its latent code $z$, adversarial examples $\tilde{z} = z + \delta$ can be found in the latent space. The hope is that these correspond to meaningful, natural-looking adversarial examples in the image space.

https://i.imgur.com/XBhHJuY.png

Figure 1: Illustration of natural adversarial examples in comparison to regular FGSM adversarial examples.

https://i.imgur.com/HT2StGI.png

Figure 2: Generative model (GAN) together with the required inverter.

In practice, e.g., on MNIST, any black-box classifier can be attacked by randomly sampling possible perturbations $\delta$ in the latent space (with increasing norm) until an adversarial perturbation is found. Here, the inverter from Figure 2 is trained on top of the critic of the GAN (although specific details are missing in the paper). Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
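The black-box search described above might look roughly like the sketch below. It only illustrates sampling $\delta$ with increasing norm. The `generator` and `classifier` callables, the shell width `r_step`, and the other parameters are hypothetical stand-ins, not the paper's actual procedure.

```python
import numpy as np

def latent_search(z, generator, classifier, r_step=0.1, n_samples=64,
                  max_radius=5.0, seed=0):
    """Sample perturbations of growing norm until the predicted label changes.

    z          : (d,) latent code of the original input
    generator  : hypothetical callable mapping a latent code to an input
    classifier : hypothetical black-box callable returning a label
    Returns the adversarial latent code, or None if none is found.
    """
    rng = np.random.default_rng(seed)
    y = classifier(generator(z))
    r = 0.0
    while r < max_radius:
        # Random directions on the unit sphere.
        d = rng.normal(size=(n_samples, z.shape[0]))
        d /= np.linalg.norm(d, axis=1, keepdims=True)
        # Random norms inside the current shell [r, r + r_step).
        norms = rng.uniform(r, r + r_step, size=(n_samples, 1))
        candidates = z + d * norms
        # Try the smallest perturbations of this shell first.
        for z_tilde in candidates[np.argsort(norms[:, 0])]:
            if classifier(generator(z_tilde)) != y:
                return z_tilde
        r += r_step
    return None
```

Only label queries are needed, which is why the classifier can be treated as a black box.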
[link]
This paper presents a variational approach to the maximisation of mutual information in the context of a reinforcement learning agent. Mutual information in this context can provide a learning signal to the agent that is "intrinsically motivated", because it relies solely on the agent's state/beliefs and does not require from the ("outside") user an explicit definition of rewards. Specifically, the learning objective, for a current state s, is the mutual information between the sequence of K actions a proposed by an exploration distribution $w(a|s)$ and the final state s' of the agent after performing these actions. To understand the properties of this objective, it is useful to consider the form of this mutual information as a difference of conditional entropies: $$I(a,s'|s) = H(a|s) - H(a|s',s)$$ where $I(.|.)$ is the (conditional) mutual information and $H(.|.)$ is the (conditional) entropy. This objective thus asks that the agent find an exploration distribution that explores as much as possible (i.e. has high entropy $H(a|s)$) but is such that these actions have predictable consequences (i.e. lead to a predictable state s', so that $H(a|s',s)$ is low). One could therefore think of the agent as trying to learn to control as much of the environment as possible, which is why this objective has also been called "empowerment". The main contribution of this work is to show how to train with this objective on a large scale (i.e. larger state and action spaces), using neural networks. They build on a variational lower bound on the mutual information and derive from it a stochastic variational training algorithm. The procedure has 3 components: the exploration distribution $w(a|s)$, the environment $p(s'|s,a)$ (which can be thought of as an encoder, but which isn't modeled and is only interacted with/sampled from) and the planning model $p(a|s',s)$ (which is modeled and can be thought of as a decoder). 
The main technical contribution is in how to update the exploration distribution (see section 4.2.2 for the technical details). This approach exploits neural networks of various forms. Neural autoregressive generative models are used as models for the exploration distribution as well as for the decoder or planning distribution. Interestingly, the framework also allows learning the state representation s as a function of some "raw" representation x of states. For raw states corresponding to images (e.g. the pixels of the screen image in a game), CNNs are used. |
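For a small discrete toy problem, the objective $I(a,s'|s) = H(a|s) - H(a|s',s)$ and a variational lower bound of the kind mentioned above can be computed exactly. The sketch below is such an illustration under stated assumptions: a single fixed state s, tabular distributions, and a bound of the form $H(a|s) + \mathbb{E}[\log q(a|s',s)]$, which is tight when the planning model $q$ equals the true posterior. All function and variable names are hypothetical, and this is not the paper's neural training algorithm.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(w, P):
    """I(a, s'|s) for a fixed state s.

    w : (A,) exploration distribution w(a|s)
    P : (A, S) environment transitions p(s'|s, a)
    """
    joint = w[:, None] * P                 # p(a, s'|s)
    p_next = joint.sum(axis=0)             # p(s'|s)
    posterior = joint / np.where(p_next > 0, p_next, 1.0)  # p(a|s', s)
    mask = joint > 0
    h_a_given_next = -np.sum(joint[mask] * np.log(posterior[mask]))
    return entropy(w) - h_a_given_next

def variational_lower_bound(w, P, q):
    """H(a|s) + E[log q(a|s', s)], with q an (A, S) planning model.

    q must be positive wherever the joint p(a, s'|s) is positive.
    """
    joint = w[:, None] * P
    mask = joint > 0
    return entropy(w) + np.sum(joint[mask] * np.log(q[mask]))
```

With two equally likely actions that lead to two distinct states, the mutual information is $\log 2$; if both actions lead to the same distribution over next states, it drops to zero, matching the "predictable consequences" reading of the objective.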