Given some input data $x$ and attribute $a_p$, the task is to predict label $y$ from $x$ while making $a_p$ *protected*, in other words, such that the model predictions are invariant to changes in $a_p$. * **Pros (+)**: Simple and intuitive idea, easy to train, naturally extended to protecting multiple attributes. * **Cons ()**: Comparison to baselines could be more detailed / comprehensive, in particular the comparison to ALFR [4] which also relies on adversarial training.  ## Proposed Method **Domain adversarial networks.** The proposed model builds on the *Domain Adversarial Network* [1], originally introduced for unsupervised domain adaptation. Given some labeled data $(x, y) \sim \mathcal X \times \mathcal Y$, and some unlabeled data $\tilde x \sim \tilde{\mathcal X}$, the goal is to learn a network that solves both classification tasks $\mathcal X \rightarrow \mathcal Y$ and $\tilde{\mathcal X} \rightarrow \mathcal Y$ while learning a shared representation between $\mathcal X$ and $\tilde{\mathcal X}$. The model is composed of a feature extractor $G_f$ which then branches off into a *target* branch, $G_t$, to predict the target label, and a *domain* branch, $G_d$, predicting whether the input data comes either from domain $\mathcal X$ or $\tilde{\mathcal X}$. The model parameters are trained with the following objective: $$ \begin{align} (\theta_{G_f}, \theta_{G_t} ) &= \arg\min \mathbb E_{(x, y) \sim \mathcal X \times \mathcal Y}\ \ell_t \left( G_t \circ G_f(x), y \right)\\ \theta_{G_d} &= \arg\max \mathbb E_{x \sim \mathcal X} \ \ell_d\left( G_d \circ G_f(x), 1 \right) + \mathbb E_{\tilde x \sim \tilde{\mathcal X}}\ \ell_d \left(G_d \circ G_f(\tilde x), 0\right)\\ \mbox{where } &\ell_t \mbox{ and } \ell_d \mbox{ are classification losses} \end{align} $$ The gradient updates for this saddle point problem can be efficiently implemented using the Gradient Reversal Layer introduced in [1] **GRADpred.** In **G**radient **R**eversal **A**gainst **D**iscrimination, samples come only from one domain $\mathcal X$, and the domain classifier $G_d$ is replaced by an *attribute* classifier, $G_p$, whose goal is to predict the value of the protected attribute $a_p$. In other words, the training objective strives to build a feature representation of $x$ that is good enough to predict the correct label $y$ but such that $a_p$ cannot easily be deduced from it. On the contrary, directly learning classification network $G_y \circ G_f$ penalized when predicting the correct value of attribute $a_p$ could instead lead to a model that learns $a_p$ and trivially outputs an incorrect value. This situation is prevented by the adversarial training scheme here. **GRADauto.** The authors also consider a variant of the described model where the target branch $G_t$ instead solves the autoencoding/reconstruction task. The features learned by the encoder $G_f$ can then later be used as entry point of a smaller network for classification or any other task.  ## Experiments **Evaluation metrics.** The model is evaluated on four metrics to qualify both accuracy and fairness, following the protocol in [2]: * *Accuracy*, the proportion of correct classifications * *Discrimination*, the average score differences (logits of the groundtruth class) between samples with $a_p = + 1$ and $a_p = 1 $ (assuming a binary attribute) * *Consistency*, the average difference between a sample score and the mean of its nearest neighbors' score. * *Delta = Accuracy  Discrimination*, a penalized version of accuracy **Baselines.** * **Vanilla** CNN trained without the protected attribute protection branch * **LFR** [2]: A classifier with an intermediate latent code $Z \in \{1 \dots K\}$ is trained with an objective that combines a classification loss (the model should accurately classify $x$), a reconstruction loss (the learned representation should encode enough information about the input to reconstruct it accurately) and a parity loss (estimate the probability $P(Z=z  x)$ for both populations with $a_p = 1$ and $a_p = 1$ and strive to make them equal) * **VFA** [3]: A VAE where the protected attribute $a_p$ is factorized out of the latent code $z$, and additional invariance is imposed via a MMD objective which tries to match the moments of the posterior distributions $q(za_p = 1)$ and $q(z a_p = 1)$. * **ALFR** [4] : As in LFR, this paper proposes a model trained with a reconstruction loss and a classification loss. Additionally, they propose to quantify the dependence between the learned representation and the protected attribute by adding an adversary classifier that tries to extract the attribute value from the representation, formulated and trained as in the Generative Adversarial Network (GAN) setting. **Results.** GRAD always reaches highest consistency compared to baselines. For the other metrics, the results are more mitigated, although it usually achieves best or second best results. It's also not clear how to choose between GRADpred and GRADauto as there does not seem to be a clear winner, although GRADpred seems more intuitive when supervision is available, as it directly solves the classification task. Authors also report a small experiment showing that protecting several attributes at the same time can be more beneficial than protecting a single attribute. This can be expected as some attributes are highly correlated or interact in meaningful way. In particular, protecting several attributes at once can easily be done in the GRAD framework by making the attribute prediction branch multiclass for instance: however it is not clear in the paper how it is actually done in practice, nor whether the same idea could also be integrated in the baselines for further comparison.  ## References * [1] DomainAdversarial Training of Neural Networks, Ganin et al, JMRL 2016 * [2] Learning Fair Representations, Zemel et al, ICML 2013 * [3] The Variational Fair Autoencoder, Louizos et al, 2016 * [4] Censoring Representations with an Adversary, Edwards and Storkey, ICLR 2016
Your comment:
