Once for All: Train One Network and Specialize it for Efficient Deployment

Cai, Han and Gan, Chuang and Han, Song

arXiv e-Print archive - 2019 via Local Bibsonomy

Keywords: dblp

**Summary**: The goal of this work is to propose a "Once-for-all" (OFA) network: a large network trained such that its subnetworks (subsets of the network with smaller width, smaller convolutional kernel sizes, or shallower units) are also trained for the target task. This makes it possible to adapt the architecture to a given budget at inference time while preserving performance.

**Elastic Parameters.** The goal is to train a large architecture that contains several well-trained subnetworks with different architecture configurations (in terms of depth, width, kernel size, and resolution). One of the key difficulties is to ensure that each subnetwork reaches high accuracy even though it is not trained independently but as part of a larger architecture. This work considers standard CNN architectures (decreasing spatial resolution and increasing number of feature maps), which can be decomposed into units (a unit is a block of layers in which the first layer has stride 2 and the remaining ones have stride 1). The parameters of these units (depth, kernel size, input resolution, width) are called *elastic parameters*: they can take different values, each combination defines a different subnetwork, and all subnetworks still share the same convolutional parameters.

**Progressive Shrinking.** Additionally, the authors consider a curriculum-style training process which they call *progressive shrinking*. First, they train the model with the maximum depth $D$, kernel size $K$, and width $W$, which yields the convolutional parameters of the full network. Then they progressively fine-tune these weights, with an additional distillation term from the largest network, while considering different values for the elastic parameters, in the following order:

* Elastic kernel size: Training for a kernel size $k < K$ is done by applying a linear transformation to the center $k \times k$ patch of the full $K \times K$ kernels. The linear transformation accounts for the fact that different scales might be useful for different tasks.
* Elastic depth: To train for depth $d < D$, simply skip the last $D-d$ layers of the unit (rather than looking at every subset of $d$ layers).
* Elastic width: For a width $w < W$, the channels are first reorganized by importance (decreasing order of the $L1$-norm of their weights), then only the top $w$ channels are used.
* Elastic resolution: Simply train with different image resolutions / resizing. This is actually used for all training processes.

**Experiments.** Having trained the Once-for-all (OFA) network, the goal is now to find an adequate architecture configuration given a specific task and budget constraints. To do this automatically, the authors propose to train a small performance predictor. They randomly sample 16K subnetworks from OFA, evaluate their accuracy on a validation set, and learn to predict accuracy from the architecture and input image resolution. (Note: It seems that this predictor is then used to perform a cheap evolutionary search, given latency constraints, to find the best architecture configuration, but the whole process is not entirely clear to me. Compared to a proper neural architecture search, however, it should be inexpensive.)

The main experiments are on ImageNet, using MobileNetV3 as the base full architecture, with the goal of applying the model across different platforms with different inference budget constraints. Overall, the proposed model achieves comparable or higher accuracy for reduced search time, compared to neural architecture search baselines. More precisely, the model has a fixed training cost (the OFA network) and a small search cost (finding the best configuration for a target latency), which is still lower than doing an exhaustive neural architecture search. Furthermore, progressive shrinking has a significant positive impact on the accuracy of the subnetworks (+4%).
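The three structural shrinking rules described above (center-crop plus linear transform for kernels, dropping trailing layers for depth, L1-norm ranking for width) can be sketched in a few lines. This is a minimal illustration in plain Python on nested lists, not the authors' implementation: the function names, the list-based kernel representation and the explicit `matrix` argument are all assumptions made for the example.

```python
def center_crop(kernel, k):
    """Return the centered k x k patch of a square K x K kernel (list of rows)."""
    K = len(kernel)
    assert 0 < k <= K and (K - k) % 2 == 0, "k must share parity with K"
    off = (K - k) // 2
    return [row[off:off + k] for row in kernel[off:off + k]]


def transform_patch(patch, matrix):
    """Apply a (k*k) x (k*k) linear map to the flattened patch and reshape it.

    `matrix` stands in for the learned kernel transformation (an assumption
    of this sketch; here it is just a list of rows).
    """
    k = len(patch)
    flat = [v for row in patch for v in row]
    out = [sum(m * v for m, v in zip(mrow, flat)) for mrow in matrix]
    return [out[i * k:(i + 1) * k] for i in range(k)]


def elastic_depth(layers, d):
    """Keep the first d layers of a unit, i.e. skip the last D - d layers."""
    return layers[:d]


def top_channels(channel_weights, w):
    """Indices of the w channels with the largest L1 norm, most important first."""
    norms = [sum(abs(v) for v in weights) for weights in channel_weights]
    order = sorted(range(len(norms)), key=lambda i: -norms[i])
    return order[:w]


# Toy usage: a 5 x 5 kernel shrunk to 3 x 3 with an identity transformation.
kernel = [[5 * r + c for c in range(5)] for r in range(5)]
identity = [[1 if i == j else 0 for j in range(9)] for i in range(9)]
small_kernel = transform_patch(center_crop(kernel, 3), identity)
```

In this sketch the transformation matrix is passed in explicitly; in the summary's description it would be learned during the elastic kernel size stage.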

Learning a SAT Solver from Single-Bit Supervision

Selsam, Daniel and Lamm, Matthew and Bünz, Benedikt and Liang, Percy and de Moura, Leonardo and Dill, David L.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

The goal is to solve SAT problems with weak supervision: a model is trained only to predict ***the satisfiability*** of a formula in conjunctive normal form. As a byproduct, when the formula is satisfiable, an actual satisfying assignment can in most cases be worked out by clustering the network's activations.

* **Pros (+):** Weak supervision, interesting structured architecture, seems to generalize nicely to harder problems by increasing the number of message-passing iterations.
* **Cons (-):** Limited practical applicability since it is outperformed by classical SAT solvers.

---

# NeuroSAT

## Inputs

We consider Boolean logic formulas in their ***conjunctive normal form*** (CNF), i.e. each input formula is represented as a conjunction ($\land$) of **clauses**, which are themselves disjunctions ($\lor$) of literals (positive or negative instances of variables). The goal is to learn a classifier that predicts whether such a formula is satisfiable.

A first problem is how to encode the input formula in such a way that the CNF invariances are preserved (invariance to negating a literal in all clauses, invariance to permutations in $\lor$ and $\land$, etc.). The authors use an ***undirected graph representation*** where:

* $\mathcal V$: vertices are the literals (positive and negative forms of variables, denoted as $x$ and $\bar x$) and the clauses occurring in the input formula.
* $\mathcal E$: edges connect (i) the literals with the clauses they appear in and (ii) each literal to its negative counterpart.

The graph relations are encoded as an ***adjacency matrix***, $A$, with as many rows as there are literals and as many columns as there are clauses. In particular, this structure does not constrain the ordering of the vertices and does not treat positive and negative literals differently. However, it still has some caveats, which can be avoided by pre-processing the formula. For instance, when there are disconnected components in the graph, the averaging decision rule (see next section) can lead to false positives.

## Message-passing model

At a high level, the model keeps track of an embedding for each vertex (literals, $L^t$, and clauses, $C^t$), updated via ***message-passing on the graph***, and combined via a Multi-Layer Perceptron (MLP) to output the model's prediction of the formula's satisfiability. The model updates are as follows:

$$
\begin{align}
C^t, h_C^t &= \texttt{LSTM}_\texttt{C}(h_C^{t - 1}, A^T \texttt{MLP}_{\texttt{L}}(L^{t - 1}) ) \qquad &(1)\\
L^t, h_L^t &= \texttt{LSTM}_\texttt{L}(h_L^{t - 1}, \overline{L^{t - 1}}, A\ \texttt{MLP}_{\texttt{C}}(C^{t }) ) \qquad &(2)
\end{align}
$$

where $h$ designates a hidden context vector for the LSTMs. The operator $L \mapsto \overline{L}$ returns the embedding matrix $L$ in which the row of each literal is swapped with the row of the literal's negation. In other words, in **(1)** each clause embedding is updated based on the literals that compose it, while in **(2)** each literal embedding is updated based on the clauses it appears in and on its negated counterpart.

After $T$ iterations of this message-passing scheme, the model computes a ***logit for the satisfiability classification problem***, which is trained via sigmoid cross-entropy:

$$
\begin{align}
L^t_{\mbox{vote}} &= \texttt{MLP}_{\texttt{vote}}(L^t)\\
y^t &= \mbox{mean}(L^t_{\mbox{vote}})
\end{align}
$$

---

# Training and Inference

## Training Set

The training set is built such that for any satisfiable training formula $S$, it also includes an unsatisfiable counterpart $S'$ which differs from $S$ ***only by negating one literal in one clause***. These carefully curated samples should constrain the model to pick up substantial characteristics of the formula. In practice, the model is trained on formulas containing up to ***40 variables*** and, on average, ***200 clauses***. At this size, the SAT problems can still be solved by state-of-the-art solvers (yielding the supervision) but are large enough to prove challenging for Machine Learning models.

## Inferring the SAT assignment

When a formula is satisfiable, one often also wants to know a ***valuation*** (variable assignment) that satisfies it. Recall that $L^t_{\mbox{vote}}$ encodes a "vote" for every literal and its negative counterpart. Qualitative experiments show that those scores cannot be directly used for inferring the variable assignment; however, they do induce a nice clustering of the variables (once the message passing has converged). Hence an assignment can be found as follows:

* (1) Reshape $L^T_{\mbox{vote}}$ to size $(n, 2)$ where $n$ is the number of variables, so that each row holds the votes of a literal and of its negation.
* (2) Cluster the literals into two clusters with centers $\Delta_1$ and $\Delta_2$ using the following criterion:
$$
\begin{align}
\|x_i - \Delta_1\|^2 + \|\overline{x_i} - \Delta_2\|^2 \leq \|x_i - \Delta_2\|^2 + \|\overline{x_i} - \Delta_1\|^2
\end{align}
$$
* (3) Try the two resulting assignments (set $\Delta_1$ to true and $\Delta_2$ to false, or vice-versa) and choose the one that satisfies the formula, if any.

In practice, this method retrieves a satisfying assignment for over 70% of the satisfiable test formulas.

---

# Experiments

In practice, the ***NeuroSAT*** model is trained with embeddings of dimension 128 and 26 message-passing iterations, using standard MLPs: 3 layers followed by ReLU activations. The final model obtains 85% accuracy in predicting a formula's satisfiability on the test set. It can also generalize to ***larger problems***, provided the number of message-passing iterations is increased, although the classification performance decreases as the problem size grows (e.g. 25% for 200 variables). Interestingly, the model also generalizes well to other classes of problems that were first ***reduced to SAT***, although they have a different structure than the random formulas generated for training, which seems to show that the model does learn some general structural characteristics of Boolean formulas.
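The input encoding and the assignment-decoding steps described above can be sketched as follows. This is a minimal illustration in plain Python and not the authors' code: the DIMACS-style integer encoding of literals, the row ordering (all positive literals first, then all negations), the scalar votes and the hand-picked cluster centers `d1` and `d2` are assumptions of the example; in practice the centers would come from a 2-means clustering.

```python
def literal_index(lit, n_vars):
    """Row of a literal: x_i -> i - 1, negation of x_i -> n_vars + i - 1."""
    return lit - 1 if lit > 0 else n_vars + (-lit) - 1


def build_adjacency(clauses, n_vars):
    """(2 * n_vars) x n_clauses 0/1 matrix A; A[l][c] = 1 iff literal l is in clause c."""
    A = [[0] * len(clauses) for _ in range(2 * n_vars)]
    for c, clause in enumerate(clauses):
        for lit in clause:
            A[literal_index(lit, n_vars)][c] = 1
    return A


def flip(L):
    """Swap the row of each literal with the row of its negation."""
    n = len(L) // 2
    return L[n:] + L[:n]


def satisfies(clauses, assignment):
    """assignment[i] is the truth value given to variable i + 1."""
    return all(
        any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def decode(clauses, pos_votes, neg_votes, d1, d2):
    """Steps (2) and (3): cluster with the criterion, then try both labelings."""
    in_first = [
        (p - d1) ** 2 + (q - d2) ** 2 <= (p - d2) ** 2 + (q - d1) ** 2
        for p, q in zip(pos_votes, neg_votes)
    ]
    for candidate in (in_first, [not b for b in in_first]):
        if satisfies(clauses, candidate):
            return candidate
    return None  # neither labeling satisfies the formula


# Toy formula: (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
clauses = [[1, -2], [2, 3], [-1, -3]]
A = build_adjacency(clauses, n_vars=3)
assignment = decode(clauses, [0.9, 0.8, 0.1], [0.1, 0.2, 0.9], d1=1.0, d2=0.0)
```

Because `decode` checks each candidate against the formula, it can only return a correct assignment or `None`, which mirrors step (3) of the procedure.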

Do Deep Generative Models Know What They Don't Know?

Eric Nalisnick and Akihiro Matsukawa and Yee Whye Teh and Dilan Gorur and Balaji Lakshminarayanan

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/10/22

**Abstract:** A neural network deployed in the wild may be asked to make predictions for
inputs that were drawn from a different distribution than that of the training
data. A plethora of work has demonstrated that it is easy to find or synthesize
inputs for which a neural network is highly confident yet wrong. Generative
models are widely viewed to be robust to such mistaken confidence as modeling
the density of the input features can be used to detect novel,
out-of-distribution inputs. In this paper we challenge this assumption. We find
that the density learned by flow-based models, VAEs, and PixelCNNs cannot
distinguish images of common objects such as dogs, trucks, and horses (i.e.
CIFAR-10) from those of house numbers (i.e. SVHN), assigning a higher
likelihood to the latter when the model is trained on the former. Moreover, we
find evidence of this phenomenon when pairing several popular image data sets:
FashionMNIST vs MNIST, CelebA vs SVHN, ImageNet vs CIFAR-10 / CIFAR-100 / SVHN.
To investigate this curious behavior, we focus analysis on flow-based
generative models in particular since they are trained and evaluated via the
exact marginal likelihood. We find such behavior persists even when we restrict
the flow models to constant-volume transformations. These transformations admit
some theoretical analysis, and we show that the difference in likelihoods can
be explained by the location and variances of the data and the model curvature.
Our results caution against using the density estimates from deep generative
models to identify inputs similar to the training distribution until their
behavior for out-of-distribution inputs is better understood.
CNN predictions are known to be very sensitive to adversarial examples, which are samples generated to be wrongly classified with high confidence. On the other hand, probabilistic generative models such as `PixelCNN` and `VAEs` learn a distribution over the input domain and hence could be used to detect ***out-of-distribution inputs***, e.g., by estimating their likelihood under the data distribution. This paper provides interesting results showing that the distributions learned by generative models are not yet robust enough to be employed in this way.

* **Pros (+):** convincing experiments on multiple generative models, more detailed analysis in the invertible flow case, interesting negative results.
* **Cons (-):** It would be interesting to provide further results for different datasets / domain shifts, to see whether this property can be quantified as a characteristic of the model or of the input data.

---

## Experimental negative result

Three classes of generative models are considered in this paper:

* **Auto-regressive** models such as `PixelCNN` [1]
* **Latent variable** models, such as `VAEs` [2]
* Generative models with **invertible flows** [3], in particular `Glow` [4].

The authors train a generative model $G$ on input data $\mathcal X$ and then use it to evaluate the likelihood on both the training domain $\mathcal X$ and a different domain $\tilde{\mathcal X}$. Their main (negative) result is showing that **a model trained on the CIFAR-10 dataset yields a higher likelihood when evaluated on the SVHN test dataset than on the CIFAR-10 test (or even train) split**. Interestingly, the converse, when training on SVHN and evaluating on CIFAR, is not true. This result was consistently observed for various architectures including [1], [2] and [4], although the effect is smaller in the `PixelCNN` case.

Intuitively, this could come from the fact that both of these datasets contain natural images and that CIFAR-10 is strictly more diverse than SVHN in terms of semantic content. Nonetheless, these datasets vastly differ in appearance, and this result is counter-intuitive as it goes against the idea that generative models can reliably be used to detect out-of-distribution samples. Furthermore, this observation also confirms the general idea that higher likelihoods do not necessarily coincide with better generated samples [5].

---

## Further analysis for invertible flow models

The authors further study this phenomenon for invertible flow models, as they provide a more rigorous analytical framework (exact likelihood inference, unlike VAEs which only provide a bound on the true likelihood). More specifically, invertible flow models are characterized by a ***diffeomorphism*** (a smooth invertible function with a smooth inverse), $f(x; \phi)$, between the input space $\mathcal X$ and the latent space $\mathcal Z$, and by the choice of the latent distribution $p(z; \psi)$. The ***change of variables formula*** links the densities of $x$ and $z$ as follows:

$$
\int_x p_x(x)\,dx = \int_x p_z(f(x)) \left| \frac{\partial f}{\partial x} \right| dx
$$

The training objective under this transformation becomes

$$
\arg\max_{\theta} \log p_x(\mathbf{x}; \theta) = \arg\max_{\phi, \psi} \sum_i \log p_z(f(x_i; \phi); \psi) + \log \left| \frac{\partial f_{\phi}}{\partial x_i} \right|
$$

Typically, $p_z$ is chosen to be Gaussian, and samples are built by inverting $f$, i.e., $z \sim p(\mathbf z),\ x = f^{-1}(z)$. Moreover, $f_{\phi}$ is built such that the log determinant of the Jacobian in the previous equation can be computed efficiently.

First, they observe that the contribution of the flow can be decomposed into a ***density*** element (left term) and a ***volume*** element (right term), resulting from the change of variables formula. Experimental results with Glow [4] show that the higher density on SVHN mostly comes from the ***volume element contribution***.

Secondly, they try to directly analyze the difference in likelihood between two domains $\mathcal X$ and $\tilde{\mathcal X}$, which can be done by a second-order expansion of the log-likelihood locally around the expectation of the distribution (assuming $\mathbb{E} (\mathcal X) \approx \mathbb{E}(\tilde{\mathcal X})$). For the constant-volume Glow module, the resulting analytical formula indeed confirms that the log-likelihood of SVHN should be higher than CIFAR's, as observed in practice.

---

## References

* [1] Conditional Image Generation with PixelCNN Decoders, van den Oord et al., 2016
* [2] Auto-Encoding Variational Bayes, Kingma and Welling, 2013
* [3] Density estimation using Real NVP, Dinh et al., ICLR 2017
* [4] Glow: Generative Flow with Invertible 1x1 Convolutions, Kingma and Dhariwal, 2018
* [5] A Note on the Evaluation of Generative Models, Theis et al., ICLR 2016
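As a toy illustration of the density / volume decomposition discussed in the flow analysis above, the sketch below uses a one-dimensional affine flow $f(x) = (x - \mu) / \sigma$ with a standard normal latent distribution. This is an assumption-laden simplification (it is not Glow and not the paper's experiment); the function names and the toy data are made up for the example. It only shows that a model fitted on spread-out data can assign a higher average log-likelihood to a narrower dataset centered at the same location.

```python
import math


def log_standard_normal(z):
    """Log density of N(0, 1) at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def flow_log_likelihood_terms(x, mu, sigma):
    """Return (density_term, volume_term) for the flow f(x) = (x - mu) / sigma."""
    z = (x - mu) / sigma
    density = log_standard_normal(z)   # log p_z(f(x))
    volume = -math.log(sigma)          # log |df/dx| = log(1 / sigma)
    return density, volume


def mean_log_likelihood(xs, mu, sigma):
    """Average of density + volume terms over a dataset."""
    return sum(sum(flow_log_likelihood_terms(x, mu, sigma)) for x in xs) / len(xs)


def fit_gaussian(xs):
    """Maximum-likelihood mean and (population) standard deviation."""
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return mu, sigma


# "Training" data with a large spread, and a narrower dataset at the same location.
wide = [-2.0, -1.0, 0.0, 1.0, 2.0]
narrow = [-0.5, 0.0, 0.5]
mu, sigma = fit_gaussian(wide)
ll_wide = mean_log_likelihood(wide, mu, sigma)
ll_narrow = mean_log_likelihood(narrow, mu, sigma)
```

Here the volume term is constant in $x$, so the gap between `ll_narrow` and `ll_wide` comes entirely from the density term; this matches the constant-volume setting of the analysis rather than the general Glow result about the volume element.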

The Reversible Residual Network: Backpropagation Without Storing Activations.

Aidan N. Gomez and Mengye Ren and Raquel Urtasun and Roger B. Grosse

Neural Information Processing Systems Conference - 2017 via Local dblp

Keywords:

Residual Networks (ResNets) have greatly advanced the state-of-the-art in Deep Learning by making it possible to train much deeper networks via the addition of skip connections. However, in order to compute gradients during the backpropagation pass, all the units' activations have to be stored during the feed-forward pass, leading to high memory requirements for these very deep networks. Instead, the authors propose a **reversible architecture** based on ResNets, in which the activations at one layer can be computed from the ones of the next. Leveraging this invertibility property, they design a more memory-efficient implementation of backpropagation, effectively trading compute power for memory storage.

* **Pros (+):** The change does not negatively impact model accuracy (for an equivalent number of model parameters) and it only requires a small change in the backpropagation algorithm.
* **Cons (-):** Increased number of parameters, thus the unit depth needs to be changed to match the "equivalent" ResNet.

---

# Proposed Architecture

## RevNet

This paper proposes to incorporate ideas from previous reversible architectures, such as NICE [1], into a standard ResNet. The resulting model is called **RevNet** and is composed of reversible blocks, inspired by *additive coupling* [1, 2]:

$$
\begin{array}{r|r}
\texttt{RevNet Block} & \texttt{Inverse Transformation}\\
\hline
\mathbf{input }\ x & \mathbf{input }\ y \\
x_1, x_2 = \mbox{split}(x) & y_1, y_2 = \mbox{split}(y)\\
y_1 = x_1 + \mathcal{F}(x_2) & x_2 = y_2 - \mathcal{G}(y_1) \\
y_2 = x_2 + \mathcal{G}(y_1) & x_1 = y_1 - \mathcal{F}(x_2)\\
\mathbf{output}\ y = (y_1, y_2) & \mathbf{output}\ x = (x_1, x_2)
\end{array}
$$

where $\mathcal F$ and $\mathcal G$ are residual functions, composed of sequences of convolutions, ReLU and Batch Normalization layers, analogous to the ones in a standard ResNet block, although operations in the reversible blocks need to have a stride of 1 to avoid information loss and preserve invertibility. Finally, for the `split` operation, the authors consider splitting the input tensor across the channel dimension as in [1, 2].

Similarly to ResNet, the final RevNet architecture is composed of these invertible residual blocks, as well as non-reversible subsampling operations (e.g., pooling) for which activations have to be stored. However, the number of such operations is much smaller than the number of residual blocks in a typical ResNet architecture.

## Backpropagation

### Standard

The backpropagation algorithm is derived from the chain rule and is used to compute the total gradients of the loss with respect to the parameters of a neural network: given a loss function $L$, we want to compute the gradients of $L$ with respect to the parameters of each layer, indexed by $n \in [1, N]$, i.e., the quantities $\overline{\theta_{n}} = \partial L /\ \partial \theta_n$ (where $\forall x, \bar{x} = \partial L / \partial x$). We roughly summarize the algorithm in the left column of **Table 1**: in order to compute the gradients for the $n$-th block, backpropagation requires the input and output activations of this block, $y_{n - 1}$ and $y_{n}$, which have been stored, and the derivative of the loss with respect to the output, $\overline{y_{n}}$, which has been computed in the backpropagation iteration of the layer above; hence the name backpropagation.

### RevNet

Since activations are not stored in RevNet, the algorithm needs to be slightly modified, as described in the right column of **Table 1**. In summary, we first need to recover the input activations of the RevNet block using its invertibility. These activations will be propagated to the earlier layers for further backpropagation. Secondly, we need to compute the gradients of the loss with respect to the inputs, i.e. $\overline{y_{n - 1}} = (\overline{y_{n -1, 1}}, \overline{y_{n - 1, 2}})$, using the fact that:

$$
\begin{align}
\overline{y_{n - 1, i}} = \overline{y_{n, 1}}\ \frac{\partial y_{n, 1}}{\partial y_{n - 1, i}} + \overline{y_{n, 2}}\ \frac{\partial y_{n, 2}}{\partial y_{n - 1, i}}
\end{align}
$$

Once again, this result will be propagated further down the network. Finally, once we have computed both these quantities we can obtain the gradients with respect to the parameters of this block, $\theta_n$.

$$
\begin{array}{|c|l|l|}
\hline
& \mathbf{ResNet} & \mathbf{RevNet} \\
\hline
\mathbf{Block} & y_{n} = y_{n - 1} + \mathcal F(y_{n - 1}) & y_{n - 1, 1}, y_{n - 1, 2} = \mbox{split}(y_{n - 1})\\
&& y_{n, 1} = y_{n - 1, 1} + \mathcal{F}(y_{n - 1, 2})\\
&& y_{n, 2} = y_{n - 1, 2} + \mathcal{G}(y_{n, 1})\\
&& y_{n} = (y_{n, 1}, y_{n, 2})\\
\hline
\mathbf{Params} & \theta = \theta_{\mathcal F} & \theta = (\theta_{\mathcal F}, \theta_{\mathcal G})\\
\hline
\mathbf{Backprop} & \mathbf{in:}\ y_{n - 1}, y_{n}, \overline{ y_{n}} & \mathbf{in:}\ y_{n}, \overline{y_{n }}\\
& \overline{\theta_n} =\overline{y_n} \frac{\partial y_n}{\partial \theta_n} &\texttt{# recover activations} \\
&\overline{y_{n - 1}} = \overline{y_{n}}\ \frac{\partial y_{n}}{\partial y_{n-1}} &y_{n, 1}, y_{n, 2} = \mbox{split}(y_{n}) \\
&\mathbf{out:}\ \overline{\theta_n}, \overline{y_{n -1}} & y_{n - 1, 2} = y_{n, 2} - \mathcal{G}(y_{n, 1})\\
&&y_{n - 1, 1} = y_{n, 1} - \mathcal{F}(y_{n - 1, 2})\\
&&\texttt{# gradients wrt. inputs} \\
&&\overline{y_{n -1, 1}} = \overline{y_{n, 1}} + \overline{y_{n,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \\
&&\overline{y_{n -1, 2}} = \overline{y_{n, 1}} \frac{\partial \mathcal F}{\partial y_{n-1,2}} + \overline{y_{n,2}} \left(1 + \frac{\partial \mathcal F}{\partial y_{n-1,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \right) \\
&&\texttt{# gradients wrt. parameters} \\
&&\overline{\theta_{n, \mathcal G}} = \overline{y_{n, 2}} \frac{\partial \mathcal G}{\partial \theta_{n, \mathcal G}}\\
&&\overline{\theta_{n, \mathcal F}} = \overline{y_{n,1}} \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}} + \overline{y_{n, 2}} \frac{\partial \mathcal F}{\partial \theta_{n, \mathcal F}} \frac{\partial \mathcal G}{\partial y_{n,1}}\\
&&\mathbf{out:}\ \overline{\theta_{n}}, \overline{y_{n -1}}, y_{n - 1}\\
\hline
\end{array}
$$

**Table 1:** Backpropagation in the standard case and for reversible blocks

---

## Experiments

**Computational Efficiency.** RevNets trade memory for computation by avoiding the storage of activations. Compared to other methods that focus on reducing memory requirements in deep networks, RevNet provides the best trade-off: since no activations have to be stored for the reversible blocks, the space complexity is $O(1)$ in the number of layers, while the computational complexity remains linear in the number of layers, i.e. $O(L)$. One small disadvantage is that RevNets introduce additional parameters, as each block is composed of two residual functions, $\mathcal F$ and $\mathcal G$, and their number of channels is also halved as the input is first split into two.

**Results.** In the experiments section, the authors compare ResNet architectures to their RevNet "counterparts": they build a RevNet with roughly the same number of parameters by halving the number of residual units and doubling the number of channels. Interestingly, RevNets achieve **similar performance** to their ResNet counterparts, both in terms of final accuracy and in terms of training dynamics. The authors also analyze the impact of the floating-point errors that might occur when reconstructing activations rather than storing them; however, it appears these errors are of small magnitude and do not seem to negatively impact the model. To summarize, reversible networks seem like a very promising direction for efficiently training very deep networks under memory budget constraints.
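The reversible block and its inverse from the first table can be checked numerically with a short sketch. This is a minimal illustration in plain Python, assuming activations are flat lists of floats and that `F` and `G` are arbitrary element-wise functions chosen for the example (the real residual functions are convolutional); the helper names are hypothetical.

```python
import math


def rev_forward(x1, x2, F, G):
    """y1 = x1 + F(x2), then y2 = x2 + G(y1)."""
    y1 = [a + b for a, b in zip(x1, F(x2))]
    y2 = [a + b for a, b in zip(x2, G(y1))]
    return y1, y2


def rev_inverse(y1, y2, F, G):
    """x2 = y2 - G(y1), then x1 = y1 - F(x2): recovers the block input."""
    x2 = [a - b for a, b in zip(y2, G(y1))]
    x1 = [a - b for a, b in zip(y1, F(x2))]
    return x1, x2


def run_stack(x1, x2, blocks):
    """Forward pass through several blocks, keeping only the final output."""
    for F, G in blocks:
        x1, x2 = rev_forward(x1, x2, F, G)
    return x1, x2


def reconstruct_input(y1, y2, blocks):
    """Walk the stack backwards, recomputing activations instead of storing them."""
    for F, G in reversed(blocks):
        y1, y2 = rev_inverse(y1, y2, F, G)
    return y1, y2


# Toy residual functions (assumptions of this sketch).
def F(v):
    return [math.tanh(t) for t in v]


def G(v):
    return [0.5 * t * t for t in v]


blocks = [(F, G), (G, F), (F, F)]
x1, x2 = [0.3, -1.2], [0.7, 2.0]
y1, y2 = run_stack(x1, x2, blocks)
r1, r2 = reconstruct_input(y1, y2, blocks)
```

The reconstruction is exact only up to floating-point rounding, which is the kind of error the experiments section refers to; in this toy setting it stays far below any practically relevant tolerance.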
---

## References

* [1] NICE: Non-linear Independent Components Estimation, Dinh et al., ICLR 2015
* [2] Density estimation using Real NVP, Dinh et al., ICLR 2017

Gradient Reversal Against Discrimination

Edward Raff and Jared Sylvester

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.LG

**First published:** 2018/07/01

**Abstract:** No methods currently exist for making arbitrary neural networks fair. In this
work we introduce GRAD, a new and simplified method to producing fair neural
networks that can be used for auto-encoding fair representations or directly
with predictive networks. It is easy to implement and add to existing
architectures, has only one (insensitive) hyper-parameter, and provides
improved individual and group fairness. We use the flexibility of GRAD to
demonstrate multi-attribute protection.
Given some input data $x$ and attribute $a_p$, the task is to predict label $y$ from $x$ while making $a_p$ *protected*, in other words, such that the model predictions are invariant to changes in $a_p$.

* **Pros (+)**: Simple and intuitive idea, easy to train, naturally extended to protecting multiple attributes.
* **Cons (-)**: Comparison to baselines could be more detailed / comprehensive, in particular the comparison to ALFR [4] which also relies on adversarial training.

---

## Proposed Method

**Domain adversarial networks.** The proposed model builds on the *Domain Adversarial Network* [1], originally introduced for unsupervised domain adaptation. Given some labeled data $(x, y) \sim \mathcal X \times \mathcal Y$, and some unlabeled data $\tilde x \sim \tilde{\mathcal X}$, the goal is to learn a network that solves both classification tasks $\mathcal X \rightarrow \mathcal Y$ and $\tilde{\mathcal X} \rightarrow \mathcal Y$ while learning a shared representation between $\mathcal X$ and $\tilde{\mathcal X}$. The model is composed of a feature extractor $G_f$ which then branches off into a *target* branch, $G_t$, to predict the target label, and a *domain* branch, $G_d$, predicting whether the input data comes from domain $\mathcal X$ or $\tilde{\mathcal X}$. Following [1], the model parameters are trained on a saddle point objective in which the domain classifier minimizes its classification loss while the feature extractor tries to maximize it:

$$
\begin{align}
E(\theta_{G_f}, \theta_{G_t}, \theta_{G_d}) &= \mathbb E_{(x, y) \sim \mathcal X \times \mathcal Y}\ \ell_t \left( G_t \circ G_f(x), y \right) - \lambda \left[ \mathbb E_{x \sim \mathcal X} \ \ell_d\left( G_d \circ G_f(x), 1 \right) + \mathbb E_{\tilde x \sim \tilde{\mathcal X}}\ \ell_d \left(G_d \circ G_f(\tilde x), 0\right) \right]\\
(\theta_{G_f}, \theta_{G_t} ) &= \arg\min_{\theta_{G_f}, \theta_{G_t}} E(\theta_{G_f}, \theta_{G_t}, \theta_{G_d})\\
\theta_{G_d} &= \arg\max_{\theta_{G_d}} E(\theta_{G_f}, \theta_{G_t}, \theta_{G_d})
\end{align}
$$

where $\ell_t$ and $\ell_d$ are classification losses and $\lambda > 0$ weights the adversarial term. The gradient updates for this saddle point problem can be efficiently implemented using the Gradient Reversal Layer introduced in [1].

**GRAD-pred.** In **G**radient **R**eversal **A**gainst **D**iscrimination, samples come only from one domain $\mathcal X$, and the domain classifier $G_d$ is replaced by an *attribute* classifier, $G_p$, whose goal is to predict the value of the protected attribute $a_p$. In other words, the training objective strives to build a feature representation of $x$ that is good enough to predict the correct label $y$ but such that $a_p$ cannot easily be deduced from it. In contrast, directly training a classification network $G_t \circ G_f$ that is penalized whenever it predicts the correct value of attribute $a_p$ could lead to a model that still learns $a_p$ and trivially outputs an incorrect value. The adversarial training scheme prevents this situation.

**GRAD-auto.** The authors also consider a variant of the described model where the target branch $G_t$ instead solves the auto-encoding/reconstruction task. The features learned by the encoder $G_f$ can then later be used as the entry point of a smaller network for classification or any other task.
---

## Experiments

**Evaluation metrics.** The model is evaluated on four metrics to quantify both accuracy and fairness, following the protocol in [2]:

* *Accuracy*, the proportion of correct classifications
* *Discrimination*, the average score difference (logits of the ground-truth class) between samples with $a_p = +1$ and $a_p = -1$ (assuming a binary attribute)
* *Consistency*, the average difference between a sample's score and the mean of its nearest neighbors' scores
* *Delta = Accuracy - Discrimination*, a penalized version of accuracy

**Baselines.**

* **Vanilla**: CNN trained without the protected attribute protection branch
* **LFR** [2]: A classifier with an intermediate latent code $Z \in \{1 \dots K\}$ is trained with an objective that combines a classification loss (the model should accurately classify $x$), a reconstruction loss (the learned representation should encode enough information about the input to reconstruct it accurately) and a parity loss (estimate the probability $P(Z=z | x)$ for both populations with $a_p = 1$ and $a_p = -1$ and strive to make them equal)
* **VFA** [3]: A VAE where the protected attribute $a_p$ is factorized out of the latent code $z$, and additional invariance is imposed via an MMD objective which tries to match the moments of the posterior distributions $q(z|a_p = -1)$ and $q(z| a_p = 1)$.
* **ALFR** [4]: As in LFR, this paper proposes a model trained with a reconstruction loss and a classification loss. Additionally, they propose to quantify the dependence between the learned representation and the protected attribute by adding an adversarial classifier that tries to extract the attribute value from the representation, formulated and trained as in the Generative Adversarial Network (GAN) setting.

**Results.** GRAD always reaches the highest consistency compared to the baselines. For the other metrics, the results are more mixed, although it usually achieves the best or second-best results. It is also not clear how to choose between GRAD-pred and GRAD-auto, as there does not seem to be a clear winner, although GRAD-pred seems more intuitive when supervision is available, as it directly solves the classification task.

The authors also report a small experiment showing that protecting several attributes at the same time can be more beneficial than protecting a single attribute. This can be expected, as some attributes are highly correlated or interact in meaningful ways. In particular, protecting several attributes at once can easily be done in the GRAD framework, for instance by making the attribute prediction branch multi-class; however, it is not clear from the paper how this is actually done in practice, nor whether the same idea could also be integrated into the baselines for further comparison.

---

## References

* [1] Domain-Adversarial Training of Neural Networks, Ganin et al., JMLR 2016
* [2] Learning Fair Representations, Zemel et al., ICML 2013
* [3] The Variational Fair Autoencoder, Louizos et al., 2016
* [4] Censoring Representations with an Adversary, Edwards and Storkey, ICLR 2016
