Do Deep Generative Models Know What They Don't Know?

Eric Nalisnick and Akihiro Matsukawa and Yee Whye Teh and Dilan Gorur and Balaji Lakshminarayanan

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/10/22 (1 month ago)

**Abstract:** A neural network deployed in the wild may be asked to make predictions for
inputs that were drawn from a different distribution than that of the training
data. A plethora of work has demonstrated that it is easy to find or synthesize
inputs for which a neural network is highly confident yet wrong. Generative
models are widely viewed to be robust to such mistaken confidence as modeling
the density of the input features can be used to detect novel,
out-of-distribution inputs. In this paper we challenge this assumption. We find
that the density learned by flow-based models, VAEs, and PixelCNNs cannot
distinguish images of common objects such as dogs, trucks, and horses (i.e.
CIFAR-10) from those of house numbers (i.e. SVHN), assigning a higher
likelihood to the latter when the model is trained on the former. Moreover, we
find evidence of this phenomenon when pairing several popular image data sets:
FashionMNIST vs MNIST, CelebA vs SVHN, ImageNet vs CIFAR-10 / CIFAR-100 / SVHN.
To investigate this curious behavior, we focus analysis on flow-based
generative models in particular since they are trained and evaluated via the
exact marginal likelihood. We find such behavior persists even when we restrict
the flow models to constant-volume transformations. These transformations admit
some theoretical analysis, and we show that the difference in likelihoods can
be explained by the location and variances of the data and the model curvature.
Our results caution against using the density estimates from deep generative
models to identify inputs similar to the training distribution until their
behavior for out-of-distribution inputs is better understood.

CNN predictions are known to be very sensitive to adversarial examples, which are samples generated to be wrongly classified with high confidence. On the other hand, probabilistic generative models such as `PixelCNN` and `VAEs` learn a distribution over the input domain and hence could be used to detect ***out-of-distribution inputs***, e.g., by estimating their likelihood under the data distribution. This paper provides interesting results showing that distributions learned by generative models are not yet robust enough to be employed in this way.

* **Pros (+):** Convincing experiments on multiple generative models, more detailed analysis in the invertible flow case, interesting negative results.
* **Cons (-):** It would be interesting to provide further results for different datasets / domain shifts to observe whether this property can be quantified as a characteristic of the model or of the input data.

---

## Experimental negative result

Three classes of generative models are considered in this paper:

* **Auto-regressive** models such as `PixelCNN` [1]
* **Latent variable** models, such as `VAEs` [2]
* Generative models with **invertible flows** [3], in particular `Glow` [4].

The authors train a generative model $G$ on input data $\mathcal X$ and use it to evaluate the likelihood on both the training domain $\mathcal X$ and a different domain $\tilde{\mathcal X}$. Their main (negative) result is showing that **a model trained on the CIFAR-10 dataset yields a higher likelihood when evaluated on the SVHN test dataset than on the CIFAR-10 test (or even train) split**. Interestingly, the converse, when training on SVHN and evaluating on CIFAR, is not true. This result was consistently observed for various architectures including [1], [2] and [4], although the effect is weaker in the `PixelCNN` case.
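
As a toy analogy for the result above (not the paper's experiment), the sketch below fits a 1-D Gaussian to a spread-out "training" set and evaluates it on a narrower set with the same mean. The data values and the names `wide_train` and `narrow_ood` are hypothetical stand-ins chosen for illustration; the point is only that a density model can assign a higher average log-likelihood to data it was never trained on.

```python
import math

def fit_gaussian(samples):
    """Maximum-likelihood mean and variance of a 1-D Gaussian."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var

def avg_log_likelihood(samples, mean, var):
    """Average log-density of `samples` under N(mean, var)."""
    total = sum(
        -0.5 * math.log(2 * math.pi * var) - 0.5 * (s - mean) ** 2 / var
        for s in samples
    )
    return total / len(samples)

# Hypothetical data: a diverse training set and a narrow, different domain.
wide_train = [-4.0, -2.0, 0.0, 2.0, 4.0]
narrow_ood = [-0.5, -0.25, 0.0, 0.25, 0.5]

mean, var = fit_gaussian(wide_train)
ll_train = avg_log_likelihood(wide_train, mean, var)
ll_ood = avg_log_likelihood(narrow_ood, mean, var)

# The unseen narrow set scores higher than the training set itself.
print(ll_ood > ll_train)  # prints True
```

Thresholding the likelihood to flag out-of-distribution inputs would therefore accept every point of `narrow_ood` in this toy setting.
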
Intuitively, this could come from the fact that both of these datasets contain natural images and that CIFAR-10 is strictly more diverse than SVHN in terms of semantic content. Nonetheless, these datasets vastly differ in appearance, and this result is counter-intuitive as it goes against the idea that generative models can reliably be used to detect out-of-distribution samples. Furthermore, this observation also confirms the general idea that higher likelihoods do not necessarily coincide with better generated samples [5].

---

## Further analysis for invertible flow models

The authors further study this phenomenon in the case of invertible flow models, as they provide a more rigorous analytical framework (exact likelihood inference, unlike VAEs, which only provide a bound on the true likelihood).

More specifically, invertible flow models are characterized by a ***diffeomorphism*** (invertible function), $f(x; \phi)$, between the input space $\mathcal X$ and the latent space $\mathcal Z$, and a choice of latent distribution $p(z; \psi)$. The ***change of variable formula*** links the densities of $x$ and $z$ as follows:

$$ \int_x p_x(x) dx = \int_x p_z(f(x)) \left| \frac{\partial f}{\partial x} \right| dx $$

And the training objective under this transformation becomes

$$ \arg\max_{\theta} \log p_x(\mathbf{x}; \theta) = \arg\max_{\phi, \psi} \sum_i \log p_z(f(x_i; \phi); \psi) + \log \left| \frac{\partial f_{\phi}}{\partial x_i} \right| $$

Typically, $p_z$ is chosen to be Gaussian, and samples are built by inverting $f$, i.e., $z \sim p(\mathbf z),\ x = f^{-1}(z)$. Moreover, $f_{\phi}$ is built such that computing the log determinant of the Jacobian in the previous equation can be done efficiently.

First, they observe that the contribution of the flow can be decomposed into a ***density*** element (left term) and a ***volume*** element (right term), resulting from the change of variables formula.
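
The density/volume decomposition can be made concrete with a minimal sketch, assuming a 1-D affine flow $f(x) = \text{scale} \cdot x + \text{shift}$ and a standard normal latent. This is a hypothetical toy flow chosen for illustration, not Glow; the function names and parameter values are assumptions.

```python
import math

def log_standard_normal(z):
    """Log-density of N(0, 1) at z."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * z * z

def affine_flow_log_likelihood(x, scale, shift):
    """Return the (density, volume) terms of log p_x(x) for f(x) = scale * x + shift."""
    z = scale * x + shift
    density_term = log_standard_normal(z)   # log p_z(f(x))
    volume_term = math.log(abs(scale))      # log |df/dx|
    return density_term, volume_term

x, scale, shift = 1.5, 0.5, -0.25
density, volume = affine_flow_log_likelihood(x, scale, shift)
log_px = density + volume

# Cross-check: x = (z - shift) / scale with z ~ N(0, 1) is Gaussian with
# mean -shift / scale and standard deviation 1 / |scale|.
mu, sigma = -shift / scale, 1.0 / abs(scale)
reference = -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2
print(abs(log_px - reference) < 1e-12)  # prints True
```

In this toy case the volume term is the same for every input; in a general flow it depends on $x$, which is why the two terms can be examined separately.
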
Experimental results with Glow [4] show that the higher density on SVHN mostly comes from the ***volume element contribution***.

Secondly, they try to directly analyze the difference in likelihood between two domains $\mathcal X$ and $\tilde{\mathcal X}$, which can be done by a second-order expansion of the log-likelihood locally around the expectation of the distribution (assuming $\mathbb{E} (\mathcal X) \approx \mathbb{E}(\tilde{\mathcal X})$). For the constant-volume Glow module, the resulting analytical formula indeed confirms that the log-likelihood of SVHN should be higher than CIFAR's, as observed in practice.

---

## References

* [1] Conditional Image Generation with PixelCNN Decoders, van den Oord et al., 2016
* [2] Auto-Encoding Variational Bayes, Kingma and Welling, 2013
* [3] Density estimation using Real NVP, Dinh et al., ICLR 2017
* [4] Glow: Generative Flow with Invertible 1x1 Convolutions, Kingma and Dhariwal, 2018
* [5] A Note on the Evaluation of Generative Models, Theis et al., ICLR 2016

The Reversible Residual Network: Backpropagation Without Storing Activations.

Aidan N. Gomez and Mengye Ren and Raquel Urtasun and Roger B. Grosse

Neural Information Processing Systems Conference - 2017 via Local dblp

Keywords:


Residual Networks (ResNets) have greatly advanced the state of the art in Deep Learning by making it possible to train much deeper networks via the addition of skip connections. However, in order to compute gradients during the backpropagation pass, all the units' activations have to be stored during the feed-forward pass, leading to high memory requirements for these very deep networks. Instead, the authors propose a **reversible architecture** based on ResNets, in which activations at one layer can be computed from the ones of the next. Leveraging this invertibility property, they design a more memory-efficient implementation of backpropagation, effectively trading compute for memory.

* **Pros (+):** The change does not negatively impact model accuracy (for an equivalent number of model parameters) and it only requires a small change in the backpropagation algorithm.
* **Cons (-):** Increased number of parameters, thus the need to change the unit depth to match the "equivalent" ResNet.

---

# Proposed Architecture

## RevNet

This paper proposes to incorporate ideas from previous reversible architectures, such as NICE [1], into a standard ResNet. The resulting model is called **RevNet** and is composed of reversible blocks, inspired by *additive coupling* [1, 2]:

$$
\begin{array}{r|r}
\texttt{RevNet Block} & \texttt{Inverse Transformation}\\
\hline
\mathbf{input }\ x & \mathbf{input }\ y \\
x_1, x_2 = \mbox{split}(x) & y_1, y_2 = \mbox{split}(y)\\
y_1 = x_1 + \mathcal{F}(x_2) & x_2 = y_2 - \mathcal{G}(y_1) \\
y_2 = x_2 + \mathcal{G}(y_1) & x_1 = y_1 - \mathcal{F}(x_2)\\
\mathbf{output}\ y = (y_1, y_2) & \mathbf{output}\ x = (x_1, x_2)
\end{array}
$$

where $\mathcal F$ and $\mathcal G$ are residual functions, composed of sequences of convolutions, ReLU and Batch Normalization layers, analogous to the ones in a standard ResNet block, although operations in the reversible blocks need to have a stride of 1 to avoid information loss and preserve invertibility.
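
The block and its inverse can be sketched in a few lines. This is a minimal illustration on plain Python lists, assuming hypothetical element-wise stand-ins `residual_f` and `residual_g` for $\mathcal F$ and $\mathcal G$ (the real ones are stacks of convolution, Batch Normalization and ReLU layers); note that neither function needs to be invertible for the block to be.

```python
import math

def residual_f(v):
    """Hypothetical stand-in for the residual function F."""
    return [math.tanh(2.0 * a) for a in v]

def residual_g(v):
    """Hypothetical stand-in for the residual function G (ReLU-like, not invertible)."""
    return [0.5 * max(a, 0.0) for a in v]

def split(x):
    """Split a flat list of 'channels' into two halves."""
    half = len(x) // 2
    return x[:half], x[half:]

def rev_block_forward(x):
    x1, x2 = split(x)
    y1 = [a + f for a, f in zip(x1, residual_f(x2))]
    y2 = [a + g for a, g in zip(x2, residual_g(y1))]
    return y1 + y2  # list concatenation: y = (y1, y2)

def rev_block_inverse(y):
    y1, y2 = split(y)
    x2 = [a - g for a, g in zip(y2, residual_g(y1))]
    x1 = [a - f for a, f in zip(y1, residual_f(x2))]
    return x1 + x2

x = [0.3, -1.2, 0.8, 2.0]
y = rev_block_forward(x)
x_rec = rev_block_inverse(y)

# Reconstruction is exact up to floating-point rounding.
print(max(abs(a - b) for a, b in zip(x, x_rec)) < 1e-12)
```
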
Finally, for the `split` operation, the authors consider splitting the input tensor across the channel dimension as in [1, 2].

Similarly to ResNet, the final RevNet architecture is composed of these invertible residual blocks, as well as non-reversible subsampling operations (e.g., pooling) for which activations have to be stored. However, the number of such operations is much smaller than the number of residual blocks in a typical ResNet architecture.

## Backpropagation

### Standard

The backpropagation algorithm is derived from the chain rule and is used to compute the total gradients of the loss with respect to the parameters in a neural network: given a loss function $L$, we want to compute the gradients of $L$ with respect to the parameters of each layer, indexed by $n \in [1, N]$, i.e., the quantities $\overline{\theta_{n}} = \partial L / \partial \theta_n$ (where $\forall x, \bar{x} = \partial L / \partial x$). We roughly summarize the algorithm in the left column of **Table 1**: in order to compute the gradients for the $n$-th block, backpropagation requires the input and output activations of this block, $y_{n - 1}$ and $y_{n}$, which have been stored, and the derivative of the loss with respect to the output, $\overline{y_{n}}$, which has been computed in the backpropagation iteration of the upper layer; hence the name backpropagation.

### RevNet

Since activations are not stored in RevNet, the algorithm needs to be slightly modified, which we describe in the right column of **Table 1**. In summary, we first need to recover the input activations of the RevNet block using its invertibility. These activations will be propagated to the earlier layers for further backpropagation. Secondly, we need to compute the gradients of the loss with respect to the inputs, i.e. $\overline{y_{n - 1}} = (\overline{y_{n -1, 1}}, \overline{y_{n - 1, 2}})$, using the fact that:

$$
\begin{align}
\overline{y_{n - 1, i}} = \overline{y_{n, 1}}\ \frac{\partial y_{n, 1}}{\partial y_{n - 1, i}} + \overline{y_{n, 2}}\ \frac{\partial y_{n, 2}}{\partial y_{n - 1, i}}
\end{align}
$$

Once again, this result will be propagated further down the network. Finally, once we have computed both these quantities, we can obtain the gradients with respect to the parameters of this block, $\theta_n$.

$$
\begin{array}{|c|l|l|}
\hline
& \mathbf{ResNet} & \mathbf{RevNet} \\
\hline
\mathbf{Block} & y_{n} = y_{n - 1} + \mathcal F(y_{n - 1}) & y_{n - 1, 1}, y_{n - 1, 2} = \mbox{split}(y_{n - 1})\\
&& y_{n, 1} = y_{n - 1, 1} + \mathcal{F}(y_{n - 1, 2})\\
&& y_{n, 2} = y_{n - 1, 2} + \mathcal{G}(y_{n, 1})\\
&& y_{n} = (y_{n, 1}, y_{n, 2})\\
\hline
\mathbf{Params} & \theta = \theta_{\mathcal F} & \theta = (\theta_{\mathcal F}, \theta_{\mathcal G})\\
\hline
\mathbf{Backprop} & \mathbf{in:}\ y_{n - 1}, y_{n}, \overline{ y_{n}} & \mathbf{in:}\ y_{n}, \overline{y_{n }}\\
& \overline{\theta_n} =\overline{y_n} \frac{\partial y_n}{\partial \theta_n} &\texttt{# recover activations} \\
&\overline{y_{n - 1}} = \overline{y_{n}}\ \frac{\partial y_{n}}{\partial y_{n-1}} &y_{n, 1}, y_{n, 2} = \mbox{split}(y_{n}) \\
&\mathbf{out:}\ \overline{\theta_n}, \overline{y_{n -1}} & y_{n - 1, 2} = y_{n, 2} - \mathcal{G}(y_{n, 1})\\
&&y_{n - 1, 1} = y_{n, 1} - \mathcal{F}(y_{n - 1, 2})\\
&&\texttt{# gradients wrt. inputs} \\
&&\overline{y_{n -1, 1}} = \overline{y_{n, 1}} + \overline{y_{n,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \\
&&\overline{y_{n -1, 2}} = \overline{y_{n, 1}} \frac{\partial \mathcal F}{\partial y_{n - 1,2}} + \overline{y_{n,2}} \left(1 + \frac{\partial \mathcal F}{\partial y_{n - 1,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \right) \\
&&\texttt{# gradients wrt. parameters} \\
&&\overline{\theta_{n, \mathcal G}} = \overline{y_{n, 2}} \frac{\partial \mathcal G}{\partial \theta_{n, \mathcal G}}\\
&&\overline{\theta_{n, \mathcal F}} = \overline{y_{n,1}} \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}} + \overline{y_{n, 2}} \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}} \frac{\partial \mathcal G}{\partial y_{n,1}}\\
&&\mathbf{out:}\ \overline{\theta_{n}}, \overline{y_{n -1}}, y_{n - 1}\\
\hline
\end{array}
$$

**Table 1:** Backpropagation in the standard case and for reversible blocks

---

## Experiments

**Computational Efficiency.** RevNets trade off memory requirements, by avoiding storing activations, against computations. Compared to other methods that focus on improving memory requirements in deep networks, RevNet provides the best trade-off: no activations have to be stored, so the space complexity is $O(1)$. The computational complexity is linear in the number of layers, i.e. $O(L)$. One small disadvantage is that RevNets introduce additional parameters, as each block is composed of two residuals, $\mathcal F$ and $\mathcal G$, and their number of channels is also halved as the input is first split into two.

**Results.** In the experiments section, the authors compare ResNet architectures to their RevNet "counterparts": they build a RevNet with roughly the same number of parameters by halving the number of residual units and doubling the number of channels. Interestingly, RevNets achieve **similar performance** to their ResNet counterparts, both in terms of final accuracy and in terms of training dynamics. The authors also analyze the impact of floating-point errors that might occur when reconstructing activations rather than storing them; however, these errors appear to be of small magnitude and do not seem to negatively impact the model.

To summarize, reversible networks seem like a very promising direction to efficiently train very deep networks under memory budget constraints.
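
As a sanity check on the RevNet column of Table 1, here is a minimal scalar sketch of the reversible backward pass. The residuals `f` and `g`, their single parameters `w_f` and `w_g`, and the squared-norm loss are all hypothetical choices made so the derivatives can be written by hand; the structure (recover activations, then input gradients, then parameter gradients) follows the table.

```python
import math

def f(v, w_f):
    """Hypothetical residual F with one parameter."""
    return w_f * math.tanh(v)

def g(v, w_g):
    """Hypothetical residual G with one parameter."""
    return w_g * v * v

def forward(x1, x2, w_f, w_g):
    y1 = x1 + f(x2, w_f)
    y2 = x2 + g(y1, w_g)
    return y1, y2

def backward(y1, y2, dy1, dy2, w_f, w_g):
    """Reversible backward pass: only the block outputs and their gradients are given."""
    # recover activations
    x2 = y2 - g(y1, w_g)
    x1 = y1 - f(x2, w_f)
    # local derivatives of the residuals
    dg_dy1 = 2.0 * w_g * y1
    df_dx2 = w_f * (1.0 - math.tanh(x2) ** 2)
    # gradients wrt. inputs
    dx1 = dy1 + dy2 * dg_dy1
    dx2 = dy1 * df_dx2 + dy2 * (1.0 + df_dx2 * dg_dy1)
    # gradients wrt. parameters
    dw_g = dy2 * y1 * y1
    dw_f = dy1 * math.tanh(x2) + dy2 * math.tanh(x2) * dg_dy1
    return (x1, x2), (dx1, dx2), (dw_f, dw_g)

def loss(x1, x2, w_f, w_g):
    """Hypothetical loss 0.5 * (y1^2 + y2^2), so dL/dy1 = y1 and dL/dy2 = y2."""
    y1, y2 = forward(x1, x2, w_f, w_g)
    return 0.5 * (y1 * y1 + y2 * y2)

x1, x2, w_f, w_g = 0.4, -0.7, 1.3, 0.6
y1, y2 = forward(x1, x2, w_f, w_g)
(rx1, rx2), (dx1, dx2), (dw_f, dw_g) = backward(y1, y2, y1, y2, w_f, w_g)

# Compare one analytic gradient against a central finite difference.
eps = 1e-6
num_dx2 = (loss(x1, x2 + eps, w_f, w_g) - loss(x1, x2 - eps, w_f, w_g)) / (2 * eps)
print(abs(dx2 - num_dx2) < 1e-6)
```

The inputs `x1` and `x2` are never passed to `backward`; they are rebuilt from the outputs, which is the source of the memory saving.
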

---

## References

* [1] NICE: Non-linear Independent Components Estimation, Dinh et al., ICLR 2015
* [2] Density estimation using Real NVP, Dinh et al., ICLR 2017

Gradient Reversal Against Discrimination

Edward Raff and Jared Sylvester

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.LG

**First published:** 2018/07/01 (5 months ago)

**Abstract:** No methods currently exist for making arbitrary neural networks fair. In this
work we introduce GRAD, a new and simplified method to producing fair neural
networks that can be used for auto-encoding fair representations or directly
with predictive networks. It is easy to implement and add to existing
architectures, has only one (insensitive) hyper-parameter, and provides
improved individual and group fairness. We use the flexibility of GRAD to
demonstrate multi-attribute protection.

Given some input data $x$ and attribute $a_p$, the task is to predict label $y$ from $x$ while making $a_p$ *protected*, in other words, such that the model predictions are invariant to changes in $a_p$.

* **Pros (+)**: Simple and intuitive idea, easy to train, naturally extended to protecting multiple attributes.
* **Cons (-)**: Comparison to baselines could be more detailed / comprehensive, in particular the comparison to ALFR [4] which also relies on adversarial training.

---

## Proposed Method

**Domain adversarial networks.** The proposed model builds on the *Domain Adversarial Network* [1], originally introduced for unsupervised domain adaptation. Given some labeled data $(x, y) \sim \mathcal X \times \mathcal Y$, and some unlabeled data $\tilde x \sim \tilde{\mathcal X}$, the goal is to learn a network that solves both classification tasks $\mathcal X \rightarrow \mathcal Y$ and $\tilde{\mathcal X} \rightarrow \mathcal Y$ while learning a shared representation between $\mathcal X$ and $\tilde{\mathcal X}$. The model is composed of a feature extractor $G_f$ which then branches off into a *target* branch, $G_t$, to predict the target label, and a *domain* branch, $G_d$, predicting whether the input data comes from domain $\mathcal X$ or $\tilde{\mathcal X}$.
The model parameters are trained with the following saddle point objective, in which the domain branch minimizes the domain classification loss while the feature extractor maximizes it:

$$
\begin{align}
E(\theta_{G_f}, \theta_{G_t}, \theta_{G_d}) &= \mathbb E_{(x, y) \sim \mathcal X \times \mathcal Y}\ \ell_t \left( G_t \circ G_f(x), y \right) - \lambda \left[ \mathbb E_{x \sim \mathcal X} \ \ell_d\left( G_d \circ G_f(x), 1 \right) + \mathbb E_{\tilde x \sim \tilde{\mathcal X}}\ \ell_d \left(G_d \circ G_f(\tilde x), 0\right) \right]\\
(\theta_{G_f}, \theta_{G_t} ) &= \arg\min E\\
\theta_{G_d} &= \arg\max E\\
\mbox{where } &\ell_t \mbox{ and } \ell_d \mbox{ are classification losses and } \lambda > 0 \mbox{ weights the adversarial term}
\end{align}
$$

The gradient updates for this saddle point problem can be efficiently implemented using the Gradient Reversal Layer introduced in [1].

**GRAD-pred.** In **G**radient **R**eversal **A**gainst **D**iscrimination, samples come only from one domain $\mathcal X$, and the domain classifier $G_d$ is replaced by an *attribute* classifier, $G_p$, whose goal is to predict the value of the protected attribute $a_p$. In other words, the training objective strives to build a feature representation of $x$ that is good enough to predict the correct label $y$ but such that $a_p$ cannot easily be deduced from it.

By contrast, directly training a classification network $G_t \circ G_f$ that is penalized when predicting the correct value of attribute $a_p$ could instead lead to a model that learns $a_p$ and trivially outputs an incorrect value. This situation is prevented by the adversarial training scheme here.

**GRAD-auto.** The authors also consider a variant of the described model where the target branch $G_t$ instead solves the auto-encoding/reconstruction task. The features learned by the encoder $G_f$ can then later be used as the entry point of a smaller network for classification or any other task.
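
The role of the Gradient Reversal Layer in the attribute branch can be illustrated with a minimal scalar sketch. Everything here is an assumption made for illustration: a one-parameter feature extractor `w_f`, a one-parameter attribute classifier `w_p`, a binary attribute encoded as 0/1, and a binary cross-entropy loss. The layer is the identity in the forward pass and multiplies the incoming gradient by $-\lambda$ in the backward pass.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grl_forward(h):
    """Gradient reversal layer: identity in the forward pass."""
    return h

def grl_backward(grad_output, lam):
    """Gradient reversal layer: flip and scale the gradient in the backward pass."""
    return -lam * grad_output

def attribute_loss(x, a, w_f, w_p):
    """Binary cross-entropy of the attribute classifier on top of the features."""
    p = sigmoid(w_p * (w_f * x))
    return -(a * math.log(p) + (1 - a) * math.log(1 - p))

def attribute_branch_grads(x, a, w_f, w_p, lam=1.0):
    """Manual backprop through: x -> features -> GRL -> attribute classifier."""
    h = w_f * x                                # feature extractor
    p = sigmoid(w_p * grl_forward(h))          # attribute classifier
    dlogit = p - a                             # d(BCE) / d(logit)
    grad_w_p = dlogit * h                      # classifier gets the usual gradient
    grad_h = grl_backward(dlogit * w_p, lam)   # features get the reversed gradient
    grad_w_f = grad_h * x
    return grad_w_p, grad_w_f

x, a, w_f, w_p, lam = 2.0, 1, 0.5, 1.0, 1.0
grad_w_p, grad_w_f = attribute_branch_grads(x, a, w_f, w_p, lam)

# Plain gradient of the attribute loss wrt. w_f, by central finite differences.
eps = 1e-6
plain = (attribute_loss(x, a, w_f + eps, w_p) - attribute_loss(x, a, w_f - eps, w_p)) / (2 * eps)
print(abs(grad_w_f + lam * plain) < 1e-6)  # reversed gradient = -lam * plain gradient
```

A single gradient-descent step on both parameters then improves the attribute classifier while pushing the features towards hiding the attribute.
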

---

## Experiments

**Evaluation metrics.** The model is evaluated on four metrics to quantify both accuracy and fairness, following the protocol in [2]:

* *Accuracy*, the proportion of correct classifications
* *Discrimination*, the difference in average score (logit of the ground-truth class) between samples with $a_p = + 1$ and $a_p = -1$ (assuming a binary attribute)
* *Consistency*, the average difference between a sample's score and the mean of its nearest neighbors' scores
* *Delta = Accuracy - Discrimination*, a penalized version of accuracy

**Baselines.**

* **Vanilla** CNN trained without the protected attribute protection branch
* **LFR** [2]: A classifier with an intermediate latent code $Z \in \{1 \dots K\}$ is trained with an objective that combines a classification loss (the model should accurately classify $x$), a reconstruction loss (the learned representation should encode enough information about the input to reconstruct it accurately) and a parity loss (estimate the probability $P(Z=z | x)$ for both populations with $a_p = 1$ and $a_p = -1$ and strive to make them equal)
* **VFA** [3]: A VAE where the protected attribute $a_p$ is factorized out of the latent code $z$, and additional invariance is imposed via an MMD objective which tries to match the moments of the posterior distributions $q(z|a_p = -1)$ and $q(z| a_p = 1)$.
* **ALFR** [4]: As in LFR, this paper proposes a model trained with a reconstruction loss and a classification loss. Additionally, they propose to quantify the dependence between the learned representation and the protected attribute by adding an adversary classifier that tries to extract the attribute value from the representation, formulated and trained as in the Generative Adversarial Network (GAN) setting.

**Results.** GRAD always reaches the highest consistency compared to the baselines. For the other metrics, the results are more mixed, although it usually achieves the best or second-best results.
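
To make the accuracy, discrimination and delta metrics listed above concrete, here is a minimal sketch on made-up numbers. Two choices are assumptions rather than the paper's protocol: scores are treated as probabilities thresholded at 0.5, and discrimination is taken as the absolute difference of the two group means.

```python
def accuracy(scores, labels, threshold=0.5):
    """Proportion of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

def discrimination(scores, attrs):
    """Absolute gap between the mean scores of the a_p = +1 and a_p = -1 groups."""
    pos = [s for s, a in zip(scores, attrs) if a == 1]
    neg = [s for s, a in zip(scores, attrs) if a == -1]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

def delta(scores, labels, attrs):
    """Accuracy penalized by discrimination."""
    return accuracy(scores, labels) - discrimination(scores, attrs)

# Hypothetical model outputs, labels and protected attribute values.
scores = [0.9, 0.8, 0.3, 0.4]
labels = [1, 1, 0, 1]
attrs = [1, 1, -1, -1]

acc = accuracy(scores, labels)
disc = discrimination(scores, attrs)
dlt = delta(scores, labels, attrs)
print(acc, round(disc, 6), round(dlt, 6))
```

In this made-up example the model is fairly accurate but scores the two groups very differently, so its delta is much lower than its accuracy.
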
It's also not clear how to choose between GRAD-pred and GRAD-auto as there does not seem to be a clear winner, although GRAD-pred seems more intuitive when supervision is available, as it directly solves the classification task.

The authors also report a small experiment showing that protecting several attributes at the same time can be more beneficial than protecting a single attribute. This can be expected, as some attributes are highly correlated or interact in meaningful ways. In particular, protecting several attributes at once can easily be done in the GRAD framework, for instance by making the attribute prediction branch multi-class. However, it is not clear from the paper how this is actually done in practice, nor whether the same idea could also be integrated into the baselines for further comparison.

---

## References

* [1] Domain-Adversarial Training of Neural Networks, Ganin et al., JMLR 2016
* [2] Learning Fair Representations, Zemel et al., ICML 2013
* [3] The Variational Fair Autoencoder, Louizos et al., 2016
* [4] Censoring Representations with an Adversary, Edwards and Storkey, ICLR 2016
