Unsupervised Domain Adaptation by Backpropagation
Paper summary

The goal of this method is to learn a feature representation $f$ of an input $x$ that is invariant to the domain $d$. The feature vector $f$ is obtained from $x$ using an encoder network, e.g. $f = G_f(x)$. Domain invariance matters because the input $x$ is correlated with $d$, which can lead the model to extract features that capture differences between domains instead of differences between classes.

Here I will recast the problem differently from the paper:

**Problem:** Given conditional distributions $p(x|d=0)$ and $p(x|d=1)$ that may differ:

$$p(x|d=0) \stackrel{?}{\ne} p(x|d=1)$$

we would like the corresponding feature distributions to be equal:

$$p(G_f(x) |d=0) = p(G_f(x)|d=1)$$

aka:

$$p(f|d=0) = p(f|d=1)$$

Of course, this is a problem if some class label $y$ is correlated with $d$: removing domain information may then hurt a classifier, which may no longer be able to predict $y$ as well as before.

https://i.imgur.com/WR2ujRl.png

The paper proposes attaching a domain classifier network to the feature vector through a gradient reversal layer. This layer acts as the identity in the forward pass and simply flips the sign of the gradient in the backward pass. Here is an example in [Theano](https://github.com/Theano/Theano):

```
import theano

class ReverseGradient(theano.gof.Op):
    ...
    # Backward pass: return the incoming gradient with its sign flipped.
    def grad(self, input, output_gradients):
        return [-output_gradients[0]]
```

You then train this domain network as if you want it to correctly predict the domain (adding its error to your loss function). As the domain network learns new ways to predict the domain, the gradients flowing back into the encoder are flipped, so the information in the feature vector $f$ that the domain network relies on is removed.

There are two major hyperparameters of the method. The first is the number of dimensions at the bottleneck, though this is tied to your network architecture. The second is a scalar on the reversed gradient, which lets you increase or decrease its effect on the embedding.
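As a rough illustration of the reversal layer and the gradient scalar described above, here is a minimal sketch in plain Python with a hand-written backward pass. It is a toy, not the paper's implementation: the scalar "encoder" weight `w_f`, the scalar domain classifier weight `w_d`, the scalar `lam`, and all function names are hypothetical choices made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss(x, d, w_f, w_d):
    """Binary cross-entropy of a toy domain classifier on the feature f = w_f * x."""
    f = w_f * x                      # toy encoder G_f (a single weight, an assumption)
    p = sigmoid(w_d * f)             # predicted probability that d = 1
    return -(d * math.log(p) + (1 - d) * math.log(1 - p))


def grads_with_reversal(x, d, w_f, w_d, lam=1.0):
    """Backward pass with a gradient reversal layer between f and the domain classifier."""
    f = w_f * x
    p = sigmoid(w_d * f)
    dloss_dlogit = p - d             # derivative of cross-entropy w.r.t. the logit
    grad_w_d = dloss_dlogit * f      # domain classifier gets the ordinary gradient
    grad_f = dloss_dlogit * w_d      # gradient arriving at the reversal layer
    grad_f_reversed = -lam * grad_f  # reversal: flip the sign, scale by lam
    grad_w_f = grad_f_reversed * x   # gradient the encoder actually receives
    return grad_w_d, grad_w_f


# One input from domain d = 1 and one gradient-descent step on each weight.
x, d, w_f, w_d, lr = 2.0, 1, 0.5, 1.0, 0.1
grad_w_d, grad_w_f = grads_with_reversal(x, d, w_f, w_d, lam=1.0)

before = domain_loss(x, d, w_f, w_d)
after_classifier_step = domain_loss(x, d, w_f, w_d - lr * grad_w_d)
after_encoder_step = domain_loss(x, d, w_f - lr * grad_w_f, w_d)
print(after_classifier_step < before < after_encoder_step)
```

In this toy setting, the same descent update makes the domain classifier better at predicting the domain while making the encoder's features harder to classify. Setting `lam=0.0` cuts the encoder off from the domain signal entirely.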
Unsupervised Domain Adaptation by Backpropagation
Yaroslav Ganin and Victor Lempitsky
arXiv e-Print archive - 2014 via arXiv
Keywords: stat.ML, cs.LG, cs.NE


I like the interpretation of minimising the classification loss under the constraint that the domain-conditional marginal distributions of the internal representation (the learned features, marginalised over all $x$ from each domain) should match each other. This, though, could be better formulated as a soft constraint (as in the optimisation problem devised in the paper): $$ \min_{\theta_f,\theta_y}-\mathbf{E}_x [p_{\theta_f,\theta_y}(y|x)] + \mathcal{D}(p_{\theta_f}(f|d=0)||p_{\theta_f}(f|d=1)) $$ where the first term is the standard probabilistic loss, regularised by the distance between the internal distributions. Since the domain label and image datapoint come in pairs, we can always marginalise out the datapoint and obtain $p(f|d)=\mathbf{E}_{p(x|d)}[p(f|x)]$. In our case, $p(f|x)$ is deterministic. The original authors use an "adversarial"-like methodology that introduces a discriminator for domain classification, where a possible choice of the distance ($\mathcal{D}$) is the Jensen-Shannon divergence. The adversarial training makes it possible to train the feature extractor like a generator, matching the conditionals $p(f|d=0)$ and $p(f|d=1)$ through sampling.
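To make the link between the domain discriminator and the Jensen-Shannon divergence concrete, here is a sketch of the standard GAN-style argument. It rests on assumptions not stated above: both domains are equally likely, the discriminator is trained with log loss, and it reaches its optimum. The symbol $h$ for the discriminator is introduced here only for illustration.

$$ \max_{h}\; \mathbf{E}_{p(f|d=0)}[\log h(f)] + \mathbf{E}_{p(f|d=1)}[\log(1-h(f))] = 2\,\mathrm{JS}\big(p(f|d=0)\,||\,p(f|d=1)\big) - \log 4 $$

The maximum is attained at

$$ h^*(f) = \frac{p(f|d=0)}{p(f|d=0)+p(f|d=1)} $$

Under these assumptions, a feature extractor that minimises the discriminator's best achievable objective is minimising the Jensen-Shannon divergence, which is zero exactly when $p(f|d=0) = p(f|d=1)$.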
