Unsupervised Domain Adaptation by Backpropagation
Paper summary

_Objective:_ Build a network easily trainable by back-propagation to perform unsupervised domain adaptation while at the same time learning a good embedding for both source and target domains.

_Dataset:_ [SVHN](ufldl.stanford.edu/housenumbers/), [MNIST](yann.lecun.com/exdb/mnist/), [USPS](https://www.otexts.org/1577), [CIFAR](https://www.cs.toronto.edu/%7Ekriz/cifar.html) and [STL](https://cs.stanford.edu/%7Eacoates/stl10/).

#### Architecture:

Very similar to RevGrad but with some differences: a shared encoder feeds both a classifier and a reconstructor.

[![screen shot 2017-05-22 at 6 11 22 pm](https://cloud.githubusercontent.com/assets/17261080/26318076/21361592-3f1a-11e7-9213-9cc07cfe2f2a.png)](https://cloud.githubusercontent.com/assets/17261080/26318076/21361592-3f1a-11e7-9213-9cc07cfe2f2a.png)

The two losses are:

* the usual cross-entropy with softmax for the classifier
* the pixel-wise squared loss for reconstruction

These are combined using a trade-off hyper-parameter between classification and reconstruction.

They also use data augmentation to generate additional training data during the supervised training, using only geometrical deformations: translation, rotation, skewing, and scaling. In addition, they use denoising, reconstructing clean inputs from their noisy counterparts (zero-masked noise and Gaussian noise).

#### Results:

Outperformed the state of the art on most tasks at the time; it has since been outperformed by Generate To Adapt on most tasks.
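The way the two losses in the summary above might be combined can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the function names, the `trade_off` parameter, and the convex-combination form (`trade_off` times classification plus `1 - trade_off` times reconstruction) are assumptions, since the summary only says that a trade-off hyper-parameter is used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy with softmax for one labelled example."""
    return -math.log(softmax(logits)[label])


def squared_reconstruction_loss(x, x_hat):
    """Pixel-wise squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def combined_loss(logits, label, x, x_hat, trade_off=0.5):
    """Weighted sum of both losses.

    ASSUMPTION: a convex combination controlled by `trade_off`;
    the summary does not state the exact form of the weighting.
    """
    classification = cross_entropy(logits, label)
    reconstruction = squared_reconstruction_loss(x, x_hat)
    return trade_off * classification + (1.0 - trade_off) * reconstruction
```

With `trade_off=1.0` the sketch reduces to pure classification, and with `trade_off=0.0` to pure reconstruction, which makes the role of the hyper-parameter easy to inspect.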
Unsupervised Domain Adaptation by Backpropagation
Yaroslav Ganin and Victor Lempitsky
arXiv e-Print archive - 2014 via Local arXiv
Keywords: stat.ML, cs.LG, cs.NE


Summary by Joseph Paul Cohen 1 year ago
I like the interpretation of minimising the classification loss under the constraint that the marginal distributions of the internal representation (the learned features), taken over all $x$ and conditioned on the domain, should match each other. This, though, could be better formulated as a soft constraint (as an optimisation problem devised in the paper):

$$ \min_{\theta_f,\theta_y}-\mathbf{E}_x [p_{\theta_f,\theta_y}(y|x)] + \mathcal{D}(p_{\theta_f}(f|d=0)||p_{\theta_f}(f|d=1)) $$

where the first term is the standard probabilistic loss, regularised by the distance between the internal distributions of the two domains. Since the domain label and the image datapoint come in pairs, we can always marginalise out the datapoint and obtain $p(f|d)=\mathbf{E}_{p(x|d)}[p(f|x)]$. In our case, $p(f|x)$ is deterministic.

The original authors use an "adversarial"-like methodology that introduces a discriminator for domain classification, where a possible choice of the distance measure $\mathcal{D}$ is the Jensen-Shannon divergence. The adversarial training makes it possible to train the feature extractor like a generator, matching the conditionals $p(f|d=0)$ and $p(f|d=1)$ through sampling.
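To make the distance term concrete, here is a minimal sketch of the Jensen-Shannon divergence between two empirical feature distributions, one per domain. It assumes, purely for illustration, that features have already been discretised into bin indices; the helper names `empirical_distribution`, `kl_divergence`, and `js_divergence` are hypothetical and do not come from the paper, which estimates the mismatch with a learned domain classifier rather than with histograms.

```python
import math


def empirical_distribution(feature_bins, num_bins):
    """Estimate p(f|d) from samples of one domain.

    ASSUMPTION: each feature is already mapped to a bin index in
    [0, num_bins). Because f is a deterministic function of x,
    averaging over x reduces to counting bin occurrences.
    """
    counts = [0] * num_bins
    for b in feature_bins:
        counts[b] += 1
    n = len(feature_bins)
    return [c / n for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical binned features for source (d=0) and target (d=1) samples.
source = empirical_distribution([0, 0, 1, 1], num_bins=2)
target = empirical_distribution([0, 1, 1, 1], num_bins=2)
mismatch = js_divergence(source, target)
```

The divergence is zero when the two feature distributions coincide and reaches $\log 2$ when their supports are disjoint, so driving it down corresponds to making the two domains indistinguishable in feature space.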

Summary by Léo Paillier 3 months ago