This paper merges a GAN and a VAE to improve pose estimation on depth images of hands. It uses paired data, where both the depth image ($x$) and the pose ($y$) are provided, and combines it with unlabelled data where only the depth image ($x$) is given. The model is shown below:

https://i.imgur.com/BvjZekU.png

The VAE takes $y$, projects it to a latent space ($z_y$) with its encoder, and then reconstructs it as $\bar y$. Ali maps between the latent space of the VAE ($z_y$) and the latent space of the GAN ($z_x$). The depth image synthesizer takes $z_x$ and generates a depth image $\bar x$. The discriminator does three tasks:

1. $L_{gan}$: distinguishing between a true sample ($x$) and a generated sample ($\bar x$).
2. $L_{pos}$: predicting the pose of the true depth image $x$.
3. $L_{smo}$: a smoothing loss that enforces the difference between two latent codes in the generator to match the difference predicted by the discriminator (see below for more details).

$\textbf{Here is how the data flows and losses are defined:}$

Given a pair of labelled data $(x,y)$, the pose $y$ is projected to the latent space $z_y$ and then projected back to the estimated pose $\bar y$. Using the VAE, a reconstruction loss $L_{recons}$ is defined on the pose.

Using Ali, the latent variable $z_y$ is projected to $z_x$, and the depth image is then generated as $\bar{x} = Gen(Ali(z_y))$. A reconstruction loss between $x$ and $\bar{x}$ is defined ($d_{self}$).

A random noise vector $\hat{z_y}$ is sampled from the pose latent space and projected to a depth map using $\hat{x} = Gen(Ali(\hat{z_y}))$.

The discriminator then takes $x$ and $\hat{x}$. It estimates the pose on $x$ using $L_{pos}$. It also distinguishes between $x$ and $\hat{x}$ with $L_{gan}$. Finally, it measures the latent space difference between $x$ and $\hat{x}$, $smo(x, \hat x)$, which should be similar to the difference between $z_y$ and $\hat{z_y}$, so the smoloss is: $L_{smo} = \| smo(x, \hat x) - (z_y - \hat{z_y}) \|^2 + d_{self}$.
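The data flow above can be sketched in a few lines of plain Python. This is only a toy illustration, not the paper's implementation: every network is replaced by a random linear map, the VAE encoder is deterministic (no sampling or KL term), the sizes `POSE`, `LATENT` and `PIXELS` are made up, and `smo` is assumed to take the two images concatenated. $L_{gan}$ is left out; the sketch covers only $L_{recons}$, $L_{pos}$, $d_{self}$ and $L_{smo}$.

```python
import random

random.seed(0)  # fixed seed so the toy example is repeatable

def linear(in_dim, out_dim):
    """Toy random linear map standing in for a trained network (assumption)."""
    w = [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    def apply(v):
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]
    return apply

def sub(a, b):
    return [ai - bi for ai, bi in zip(a, b)]

def sq_dist(a, b):
    return sum(d * d for d in sub(a, b))

POSE, LATENT, PIXELS = 6, 3, 16    # hypothetical sizes, not from the paper

enc = linear(POSE, LATENT)         # VAE encoder: y -> z_y (mean only)
dec = linear(LATENT, POSE)         # VAE decoder: z_y -> y_bar
ali = linear(LATENT, LATENT)       # Ali: z_y -> z_x
gen = linear(LATENT, PIXELS)       # depth image synthesizer: z_x -> image
pos = linear(PIXELS, POSE)         # discriminator pose head
smo = linear(2 * PIXELS, LATENT)   # discriminator smoothness head (assumed input: [x, x_hat])

def forward(x, y):
    """One labelled pair (x, y) through the generator-side flow."""
    z_y = enc(y)
    y_bar = dec(z_y)                       # pose reconstruction
    x_bar = gen(ali(z_y))                  # image generated from the true pose code
    z_y_hat = [random.gauss(0.0, 1.0) for _ in range(LATENT)]
    x_hat = gen(ali(z_y_hat))              # image generated from a random pose code
    l_recons = sq_dist(y, y_bar)
    l_pos = sq_dist(pos(x), y)
    d_self = sq_dist(x, x_bar)
    # Predicted latent difference should match the true difference z_y - z_y_hat.
    l_smo = sq_dist(smo(x + x_hat), sub(z_y, z_y_hat)) + d_self
    return {"y_bar": y_bar, "x_bar": x_bar, "x_hat": x_hat,
            "l_recons": l_recons, "l_pos": l_pos,
            "d_self": d_self, "l_smo": l_smo}
```

Calling `forward(x, y)` with a list of `PIXELS` floats and a list of `POSE` floats returns the intermediate outputs and the four loss terms; in a real model each `linear` would be a trained network and the losses would be averaged over a batch.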
In general, the VAE and the depth image synthesizer together can be considered the generator of the network. The total loss can be written as:

$L_G = L_{recons} + L_{smo} - L_{gan}$

$L_D = L_{pos} + L_{smo} + L_{gan}$

The generator loss contains pose reconstruction, the smoloss, and the GAN loss on generated depth maps. The discriminator loss contains the pose estimation loss, the smoloss, and the GAN loss on distinguishing fake and real depth images. The GAN term enters the two losses with opposite signs because the generator and the discriminator optimise it adversarially.

Note that in the generator and discriminator losses, every term except the GAN loss needs paired data, so the unlabelled data can be used only for the GAN loss. The unlabelled data would still train the lowest layers of the discriminator (used for pose estimation) and the image synthesis part of the generator. But for pose estimation (the final target of the paper), for training the VAE, and for the mapping between VAE and GAN through Ali, labelled data must be provided. Also note that $L_{smo}$ trains both the generator and the discriminator parameters.

In terms of performance, the model improves the results on partially labelled data. On fully labelled data it shows either an improvement or comparable results w.r.t. previous models. I find the strongest aspect of the paper to be semi-supervised learning, where only a smaller portion of labelled data is provided. However, because of the way the parameters are bound together, the model needs some labelled data to be trained completely.
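To make the labelled/unlabelled split concrete, here is a minimal sketch of how the per-batch terms could be combined into $L_G$ and $L_D$. The function name `total_losses` and the `labelled` flag are hypothetical, and the sketch assumes the GAN term enters the two objectives with opposite signs; it only illustrates the point that an unlabelled batch contributes nothing but the GAN term.

```python
def total_losses(l_recons, l_pos, l_smo, l_gan, labelled):
    """Combine scalar loss terms into (L_G, L_D) for one batch.

    Assumption: terms that need a pose label (l_recons, l_pos, l_smo)
    are dropped for unlabelled batches; only l_gan remains.
    """
    paired = 1.0 if labelled else 0.0
    l_g = paired * (l_recons + l_smo) - l_gan   # generator objective
    l_d = paired * (l_pos + l_smo) + l_gan      # discriminator objective
    return l_g, l_d

# A labelled batch uses every term; an unlabelled batch only the GAN term.
labelled_lg, labelled_ld = total_losses(1.0, 2.0, 3.0, 0.5, labelled=True)
unlabelled_lg, unlabelled_ld = total_losses(1.0, 2.0, 3.0, 0.5, labelled=False)
```

With these toy numbers the unlabelled batch leaves the pose-related terms untouched, which is why some labelled data is still required to train the VAE, Ali and the pose head.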
