This paper synthesizes a high-quality video of Barack Obama speaking, given only the audio. In practice, it synthesizes only the region around the mouth; the remaining pixels come from a target video in a database.
The overall pipeline is the following:
- Given a video, extract the audio and the mouth shape. Audio is represented as MFCC coefficients; the mouth shape as 18 lip markers.
- Train an audio-to-mouth-shape mapping with a time-delayed unidirectional LSTM (a sketch of this mapping follows the list).
- Synthesize the mouth texture: retrieve frames from the database whose mouth shapes are similar to the LSTM output; compute a weighted median over the retrieved mouth textures (see the second sketch below); manually select a teeth target frame (the selection criteria are purely subjective) and use it to sharpen the teeth in the median texture.
- Re-time the target video to avoid moments where Obama's head moves while he is not speaking, which looks very unnatural.
- Compose the result into the target video, applying a jaw correction to make it look more natural.
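A minimal sketch of the audio-to-mouth-shape network, assuming PyTorch (and librosa for MFCCs); the layer sizes, delay length, and output parameterization are illustrative guesses, not the paper's exact hyper-parameters:

```python
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    """Unidirectional LSTM mapping MFCC frames to lip-marker coordinates.
    The time delay lets the causal LSTM see a short window of future audio."""
    def __init__(self, n_mfcc=13, hidden=60, n_markers=18, delay=20):
        super().__init__()
        self.delay = delay
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_markers * 2)  # (x, y) per lip marker

    def forward(self, mfcc):                 # mfcc: (batch, T, n_mfcc)
        out, _ = self.lstm(mfcc)
        coords = self.head(out)              # (batch, T, n_markers * 2)
        # Drop the first `delay` predictions: the output at time t is trained
        # against the mouth shape at time t - delay, so each prediction has
        # effectively seen `delay` frames of future audio context.
        return coords[:, self.delay:, :]

# MFCC extraction (assuming librosa):
#   import librosa
#   audio, sr = librosa.load("speech.wav", sr=16000)
#   mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T  # (T, 13)
```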
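And a NumPy sketch of the weighted-median texture step; `frames` and `weights` are hypothetical names for the retrieved mouth crops and their similarity weights:

```python
import numpy as np

def weighted_median_texture(frames, weights):
    """Per-pixel weighted median over N candidate mouth textures.

    frames:  (N, H, W, C) float stack of retrieved mouth regions
    weights: (N,) similarity weights for the retrieved frames
    """
    order = np.argsort(frames, axis=0)                 # sort pixel values per location
    sorted_vals = np.take_along_axis(frames, order, axis=0)
    sorted_w = weights[order]                          # weight of each sorted sample
    cdf = np.cumsum(sorted_w, axis=0) / weights.sum()  # normalized cumulative weight
    idx = (cdf >= 0.5).argmax(axis=0)                  # first index crossing half the mass
    return np.take_along_axis(sorted_vals, idx[None], axis=0)[0]
```

With uniform weights this reduces to the ordinary per-pixel median.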
The results look ridiculously natural. The authors suggest speech summarization as one application: a speech would be summarized not only with selected parts as text and audio, but also with a synthesized video. Personally, this work inspires me to work on a method that generates a natural sign language interpreter: one that takes sound/text as input and produces sign language moves.
_Objective:_ Transfer visual attributes (color, tone, texture, style, etc.) between two semantically related images, such as a photo and a sketch.
## Inner workings:
### Image analogy
An image analogy A:A′::B:B′ is a relation where:
* B′ relates to B in the same way as A′ relates to A
* A and A′ are in pixel-wise correspondences
* B and B′ are in pixel-wise correspondences
In this paper only the source image A and the example image B′ are given; both A′ and B are latent images to be estimated.
[![screen shot 2017-05-18 at 10 43 48 am](https://cloud.githubusercontent.com/assets/17261080/26193907/f080e212-3bb6-11e7-9441-7b255e4219f5.png)](https://cloud.githubusercontent.com/assets/17261080/26193907/f080e212-3bb6-11e7-9441-7b255e4219f5.png)
### Dense correspondence
To find dense correspondences between the two images, they use features from a pre-trained CNN (VGG-19), taking the activations of its ReLU layers.
The mapping is divided into two sub-mappings that are easier to compute: first a visual attribute transformation, then a space transformation.
[![screen shot 2017-05-18 at 11 04 58 am](https://cloud.githubusercontent.com/assets/17261080/26194835/03ccd94a-3bba-11e7-93ca-9420d4d96162.png)](https://cloud.githubusercontent.com/assets/17261080/26194835/03ccd94a-3bba-11e7-93ca-9420d4d96162.png)
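A small sketch of the feature extraction, assuming torchvision's VGG-19; the indices for the relu1_1 … relu5_1 layers assume torchvision's standard layer ordering:

```python
import torch
import torchvision.models as models

# Indices of relu{1..5}_1 in torchvision's VGG-19 `features` stack
# (assumption: standard torchvision layer ordering).
RELU_IDX = {"relu1_1": 1, "relu2_1": 6, "relu3_1": 11, "relu4_1": 20, "relu5_1": 29}

def vgg_features(img):
    """Return five ReLU feature maps for matching, coarsest (relu5_1) first.
    `img` is a (1, 3, H, W) tensor, ImageNet-normalized."""
    vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()
    feats, x = [], img
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in RELU_IDX.values():
                feats.append(x)
    return feats[::-1]  # the algorithm starts at the coarsest layer
```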
The algorithm proceeds as follows (a rough code sketch follows the figure below):
1. Compute features at each layer for the input images with the pre-trained CNN, and initialize the feature maps of the latent images at the coarsest layer.
2. For the current layer, compute a forward and a reverse nearest-neighbor field (NNF, essentially an offset field).
3. Use this NNF together with the input features of the current layer to compute the features of the latent images.
4. Upsample the NNF and use it as the initialization for the NNF of the next (finer) layer.
[![screen shot 2017-05-18 at 11 14 33 am](https://cloud.githubusercontent.com/assets/17261080/26195178/35277e0e-3bbb-11e7-82ce-037466314640.png)](https://cloud.githubusercontent.com/assets/17261080/26195178/35277e0e-3bbb-11e7-82ce-037466314640.png)
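A heavily simplified NumPy sketch of this coarse-to-fine loop, taking features as (H, W, C) arrays ordered coarsest first. The brute-force NNF search stands in for the paper's PatchMatch, the feature blending/deconvolution steps are omitted, and all function names here are mine, not the authors':

```python
import numpy as np

def upsample_nnf(nnf):
    """Step 4: double the NNF's resolution and scale its offsets."""
    return np.repeat(np.repeat(nnf, 2, axis=0), 2, axis=1) * 2

def warp(feat, nnf):
    """Step 3 (simplified): pull features through the offset field."""
    H, W, _ = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    vy = np.clip(ys + nnf[..., 0], 0, H - 1)
    vx = np.clip(xs + nnf[..., 1], 0, W - 1)
    return feat[vy, vx]

def nnf_search(src, dst, init=None, patch=3):
    """Step 2: forward NNF from `src` to `dst`, brute force for clarity.
    The paper uses PatchMatch, which would be seeded with `init`."""
    H, W, _ = src.shape
    r = patch // 2
    ps = np.pad(src, ((r, r), (r, r), (0, 0)), mode="edge")
    pd = np.pad(dst, ((r, r), (r, r), (0, 0)), mode="edge")
    nnf = np.zeros((H, W, 2), dtype=int)
    for y in range(H):
        for x in range(W):
            p = ps[y:y + patch, x:x + patch]
            costs = np.array([[np.sum((p - pd[v:v + patch, u:u + patch]) ** 2)
                               for u in range(W)] for v in range(H)])
            v, u = np.unravel_index(costs.argmin(), costs.shape)
            nnf[y, x] = (v - y, u - x)
    return nnf

def deep_analogy(feats_A, feats_Bp):
    """Steps 1-4 over the layer pyramid, coarsest layer first."""
    nnf, latent = None, None
    for fA, fBp in zip(feats_A, feats_Bp):
        if nnf is not None:
            nnf = upsample_nnf(nnf)        # initialization for this layer
        nnf = nnf_search(fA, fBp, init=nnf)
        latent = warp(fBp, nnf)            # crude stand-in for the latent features
    return nnf, latent
```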
Impressive quality on all types of visual transfer, but veryyyyy slow! (~3 min on a GPU per image).
[![screen shot 2017-05-18 at 11 36 47 am](https://cloud.githubusercontent.com/assets/17261080/26196151/54ef423c-3bbe-11e7-9433-b29be5091fae.png)](https://cloud.githubusercontent.com/assets/17261080/26196151/54ef423c-3bbe-11e7-9433-b29be5091fae.png)