Disentangling factors of variation in deep representations using adversarial training
Michael Mathieu and Junbo Zhao and Pablo Sprechmann and Aditya Ramesh and Yann LeCun
2016

Paper summary
gngdb
When training a [VAE][] you will have an inference network $q(z|x)$. If you have another source of information you'd like to base the approximate posterior on, like some labels $s$, then you would make $q(z|x,s)$. But $q$ is a complicated function, and it can ignore $s$ if it wants, and still perform well. This paper describes an adversarial way to force $q$ to use $s$.
This is made more complicated in the paper, because $s$ is not necessarily a label, and in fact _is real and continuous_ (because it's easier to backprop in that case). In fact, we're going to _learn_ the representation of $s$, but force it to contain the label information using the training procedure. To be clear, with $x$ as our input (image or whatever):
$$
s = f_{s}(x)
$$
$$
\mu, \sigma = f_{z}(x,s)
$$
We sample $z$ using $\mu$ and $\sigma$ [according to the reparameterization trick, as this is a VAE][vae]:
$$
z \sim \mathcal{N}(\mu, \sigma)
$$
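Concretely, the reparameterized sample is $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. A minimal NumPy sketch of that sampling step (the function name and shapes here are hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, log_sigma):
    # z = mu + sigma * eps with eps ~ N(0, I); sampling this way keeps z
    # differentiable w.r.t. mu and log_sigma, so the encoder trains by backprop.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

mu = np.zeros(4)
log_sigma = np.full(4, np.log(0.5))
z = sample_z(mu, log_sigma)  # one draw from N(mu, diag(sigma^2))
```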
And then we use our decoder to turn these latent variables into images:
$$
\tilde{x} = \text{Dec}(z,s)
$$
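Putting the pieces together, the forward pass looks like the sketch below. Plain linear maps stand in for the convolutional networks the paper actually uses; all the weight names and dimensions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_s, dim_z = 8, 3, 2

# Stand-in linear "networks" (hypothetical; the paper uses conv nets).
W_s = rng.standard_normal((dim_s, dim_x))            # f_s
W_mu = rng.standard_normal((dim_z, dim_x + dim_s))   # f_z, mean head
W_ls = rng.standard_normal((dim_z, dim_x + dim_s))   # f_z, log-sigma head
W_dec = rng.standard_normal((dim_x, dim_z + dim_s))  # Dec

def forward(x):
    s = W_s @ x                                   # s = f_s(x)
    xs = np.concatenate([x, s])
    mu, log_sigma = W_mu @ xs, W_ls @ xs          # mu, sigma = f_z(x, s)
    eps = rng.standard_normal(dim_z)
    z = mu + np.exp(log_sigma) * eps              # reparameterized sample
    x_tilde = W_dec @ np.concatenate([z, s])      # x~ = Dec(z, s)
    return s, z, x_tilde

s, z, x_tilde = forward(rng.standard_normal(dim_x))
```

Note that $s$ feeds both the $z$-encoder and the decoder, which is what gives $q$ the chance to ignore it; the training procedure below is what stops that from happening.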
Training Procedure
--------------------------
We are going to create four parallel loss functions, and incorporate a discriminator to train this:
1. Reconstruction loss plus variational regularizer: propagate an example $x_1$ through the VAE to get $s_1$, $z_1$ (the latent) and $\tilde{x}_{1}$.
2. Reconstruction loss with a different $s$:
    1. Propagate $x_1'$, a different sample with the __same class__ as $x_1$, to get $s_1'$.
    2. Pass $z_1$ and $s_1'$ to your decoder.
    3. As $s_1'$ _should_ contain the label information, the decoder should have reproduced $x_1$, so apply the reconstruction loss to whatever it gives you (call it $\tilde{x}_1'$).
3. Adversarial loss encouraging realistic examples from the same class, regardless of $z$:
    1. Propagate $x_2$ (a totally separate example) through the network to get $s_2$.
    2. Generate two $\tilde{x}_{2}$ variables: one with $z$ sampled from the prior $p(z)$ and one using $z_{1}$.
    3. Get the adversary to classify these as fake versus the real sample $x_{2}$.
This is pretty well described in Figure 1 in the paper.
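A toy sketch of how the loss terms above fit together. The stub functions stand in for the real encoders, decoder, and discriminator (everything here is hypothetical; the KL regularizer and the discriminator update are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Hypothetical stand-ins for the networks.
def enc_s(x):                 # s = f_s(x)
    return np.tanh(x)

def enc_z(x, s):              # reparameterized z from f_z(x, s)
    mu = x + s
    return mu + 0.1 * rng.standard_normal(dim)

def dec(z, s):                # x~ = Dec(z, s)
    return z - s

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Two samples from the same class (x1, x1p) and one other example (x2).
x1, x1p, x2 = (rng.standard_normal(dim) for _ in range(3))

# 1. Plain VAE reconstruction (KL regularizer omitted in this sketch).
s1 = enc_s(x1)
z1 = enc_z(x1, s1)
loss_recon = mse(dec(z1, s1), x1)

# 2. Swapped s: a same-class sample x1' should carry the same label info,
#    so Dec(z1, s1') should still reconstruct x1.
s1p = enc_s(x1p)
loss_swap = mse(dec(z1, s1p), x1)

# 3. Adversarial term: decode with s2 and either z ~ p(z) or z1; a
#    discriminator (not shown) should call both fake vs. the real x2.
s2 = enc_s(x2)
fake_prior = dec(rng.standard_normal(dim), s2)
fake_z1 = dec(z1, s2)

total = loss_recon + loss_swap  # + adversarial loss from the discriminator
```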
Experiments show that $s$ ends up coding for the class, and $z$ codes for other factors, like the angle of digits or line thickness. They also try classifying using $z$ and $s$, and show that $s$ is useful but $z$ is not (it predicts no better than chance). So, it works.
[vae]: https://jaan.io/unreasonable-confusion/
arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2016/11/10

**Abstract:** We introduce a conditional generative model for learning to disentangle the
hidden factors of variation within a set of labeled observations, and separate
them into complementary codes. One code summarizes the specified factors of
variation associated with the labels. The other summarizes the remaining
unspecified variability. During training, the only available source of
supervision comes from our ability to distinguish among different observations
belonging to the same class. Examples of such observations include images of a
set of labeled objects captured at different viewpoints, or recordings of a set
of speakers dictating multiple phrases. In both instances, the intra-class
diversity is the source of the unspecified factors of variation: each object is
observed at multiple viewpoints, and each speaker dictates multiple phrases.
Learning to disentangle the specified factors from the unspecified ones becomes
easier when strong supervision is possible. Suppose that during training, we
have access to pairs of images, where each pair shows two different objects
captured from the same viewpoint. This source of alignment allows us to solve
our task using existing methods. However, labels for the unspecified factors
are usually unavailable in realistic scenarios where data acquisition is not
strictly controlled. We address the problem of disentanglement in this more
general setting by combining deep convolutional autoencoders with a form of
adversarial training. Both factors of variation are implicitly captured in the
organization of the learned embedding space, and can be used for solving
single-image analogies. Experimental results on synthetic and real datasets
show that the proposed method is capable of generalizing to unseen classes and
intra-class variabilities.