Everybody Dance Now
**Paper summary**

This paper presents a per-frame image-to-image translation system that transfers the motion of a person in a source video onto a target person. For example, the source video might show a professional dancer performing complicated moves, while the target person is you. With this approach, it is possible to generate a video of you dancing like a professional. Check the authors' [video](https://www.youtube.com/watch?v=PCBTZh41Ris) for a visual explanation.

**Data preparation**

The authors recorded a high-resolution video (at 120 fps) of the target person performing a variety of moves. The video is split into frames, and the person's pose keypoints (body joints, hands, face) are extracted for each frame. These keypoints are then connected to form a stick figure of the person. In practice, pose estimation is performed with the open-source project [OpenPose](https://github.com/CMU-Perceptual-Computing-Lab/openpose).

**Training**

![](https://i.imgur.com/VZCXZMa.png)

Once the data is prepared, training proceeds in two stages:

1. **Training a pix2pixHD model with temporal smoothing**. The core model is the original [pix2pixHD](https://tcwang0509.github.io/pix2pixHD/) [1] model with temporal smoothing added. Specifically, if we were to use vanilla pix2pixHD, the input to the model would be a stick-person image, and the target would be the image of the person in the corresponding pose. The network's objective would be $\min_{G} (Loss_1 + Loss_2 + Loss_3)$, where:

   - $Loss_1 = \max_{D_1, D_2, D_3} \sum_{k=1,2,3} \alpha_{GAN}(G, D_k)$ is the adversarial loss;
   - $Loss_2 = \lambda_{FM} \sum_{k=1,2,3} \alpha_{FM}(G, D_k)$ is the feature-matching loss;
   - $Loss_3 = \lambda_{VGG} \, \alpha_{VGG}(G(x), y)$ is the VGG perceptual loss.

   However, this objective does not account for the fact that we want to generate a video whose frames are temporally coherent. The authors ensure *temporal smoothing* between adjacent frames by also conditioning on the pose, the corresponding image, and the generated image from the previous step (a zero image for the first frame), as shown in the figure below (a rough code sketch of this conditioning is given after the Testing section):

   ![](https://i.imgur.com/0NSeBVt.png)

   Since the generated output $G(x_t; G(x_{t-1}))$ at time step $t$ is now conditioned on the previously generated frame $G(x_{t-1})$ as well as on the current stick image $x_t$, better temporal consistency is ensured. Consequently, the discriminator now has to judge both image realism and temporal consistency of the fake sequence $[x_{t-1}, x_t, G(x_{t-1}), G(x_t)]$.

2. **Training a FaceGAN model**.

   ![](https://i.imgur.com/mV1xuMi.png)

   To improve face generation, the authors use a specialized FaceGAN. In practice, this is another, smaller pix2pixHD model (with a global generator only, instead of local + global) that is fed the cropped face area of the stick image and the cropped face area of the corresponding generated image (from stage 1), and it generates a residual that is added to the previously generated full image (see the second sketch below).

**Testing**

During testing, we extract frames from the input video, obtain a pose stick image for each frame, normalize the stick image, and feed it to pix2pixHD (with temporal smoothing) and then to FaceGAN to produce the final generated image with improved face features. Normalization is needed to account for pose differences between the source and the target videos.
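As a rough illustration of the temporal-smoothing idea from stage 1, here is a minimal PyTorch-style sketch (module and function names are my own, not from the authors' code): the generator is conditioned on the previous pose, the current pose, and the previously generated frame, with a zero image used at the first time step.

```python
import torch
import torch.nn as nn

class TemporalGenerator(nn.Module):
    """Wraps any pix2pixHD-style generator so it is conditioned on the previous
    pose, the current pose, and the previously generated frame (sketch only)."""
    def __init__(self, generator: nn.Module):
        super().__init__()
        self.generator = generator  # assumed to accept 9 input channels (3 stacked RGB images)

    def forward(self, pose_prev, pose_curr, fake_prev):
        # Stack the conditioning images along the channel dimension
        return self.generator(torch.cat([pose_prev, pose_curr, fake_prev], dim=1))

def generate_sequence(model, poses):
    """Generate frames one by one; each frame is conditioned on the previous output."""
    fakes = []
    fake_prev = torch.zeros_like(poses[0])   # zero image for the first frame
    pose_prev = torch.zeros_like(poses[0])
    for pose_curr in poses:
        fake_curr = model(pose_prev, pose_curr, fake_prev)
        fakes.append(fake_curr)
        pose_prev, fake_prev = pose_curr, fake_curr
    return fakes
```

The discriminator would then be shown the stacked fake sequence $[x_{t-1}, x_t, G(x_{t-1}), G(x_t)]$ versus the corresponding real sequence, which is what enforces temporal consistency.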
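And a similarly hedged sketch of the FaceGAN refinement from stage 2: the face region is cropped from the pose image and from the stage-1 output, a small generator predicts a residual, and the residual is added back into the full frame. The crop box would come from the face keypoints; all names here are illustrative.

```python
import torch

def refine_face(full_fake, pose_img, face_generator, face_box):
    """Add a generated residual to the face region of the stage-1 output (sketch)."""
    top, left, size = face_box                       # square crop around the face keypoints
    pose_crop = pose_img[..., top:top + size, left:left + size]
    fake_crop = full_fake[..., top:top + size, left:left + size]
    # The face generator sees the cropped pose and the cropped stage-1 result
    residual = face_generator(torch.cat([pose_crop, fake_crop], dim=1))
    refined = full_fake.clone()
    refined[..., top:top + size, left:left + size] = fake_crop + residual
    return refined
```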
**Remarks**

While this method produces visually appealing results, it is not perfect. There are several reasons for this:

1. *Quality of the pose stick image*: if the pose detector misses a keypoint, the generator may struggle to render the image properly;
2. *Motion blur*: motion blur causes the pose detector to miss keypoints;
3. *Severe scale change*: if the source person is very far away, the keypoint detector may fail to detect proper keypoints.

Among the video rendering challenges, the authors mention self-occlusion, cloth texture generation, and video jittering (training-test motion mismatch).

References:

[1] "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs"
Everybody Dance Now
Caroline Chan and Shiry Ginosar and Tinghui Zhou and Alexei A. Efros
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.GR, cs.CV


Summary by Oleksandr Bailo 3 months ago
Have people attempted to reproduce? Is there a piece of code somewhere with the end-to-end, or do I need to put the different pieces together (i.e. OpenPose, pix2pixHD, FaceGAN)?

Yes, I have reproduced it (by combining the different pieces together), and it works very well. It is basically [pix2pixHD](https://github.com/NVIDIA/pix2pixHD) with temporal smoothing. For FaceGAN, you can just use the global generator from the pix2pixHD model (instead of the local and global generators together). To get the poses, I used the out-of-the-box executables provided by [OpenPose](https://github.com/CMU-Perceptual-Computing-Lab/openpose).
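For anyone wiring the same pieces together, here is a minimal sketch of turning OpenPose's per-frame JSON output into a keypoint image for the generator (the "pose_keypoints_2d" field name is assumed from recent OpenPose releases; only points are drawn, so connecting them into limbs with your model's skeleton edges, e.g. BODY_25, is left out):

```python
import json
import cv2
import numpy as np

def pose_image_from_json(json_path, height, width):
    """Rasterize the first detected person's body keypoints onto a black canvas (sketch)."""
    with open(json_path) as f:
        data = json.load(f)
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    if not data.get("people"):
        return canvas  # no detection for this frame
    kpts = np.array(data["people"][0]["pose_keypoints_2d"], dtype=np.float32).reshape(-1, 3)
    for x, y, conf in kpts:
        if conf > 0.1:  # low-confidence / missed joints are skipped
            cv2.circle(canvas, (int(x), int(y)), 4, (255, 255, 255), -1)
    return canvas
```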
