Synthesizing Obama: learning lip sync from audio
Paper summary

This paper synthesizes a high-quality video of Barack Obama from audio alone. In practice, it only synthesizes the region around the mouth, while the remaining pixels come from a video in a database. The overall pipeline is the following:

- Given a video, the audio and the mouth shape are extracted. Audio is represented as MFCC coefficients; mouth shape is represented as 18 lip markers.
- Train an audio-to-mouth-shape mapping with a time-delayed unidirectional LSTM.
- Synthesize the mouth texture: retrieve a number of video frames from the database whose mouth shape is similar to the output of the LSTM; synthesize a median texture by applying a weighted median to the mouth textures of the retrieved frames; manually select a teeth target frame (the selection criteria are purely subjective) and enhance the teeth in the median texture with the selected target frame.
- Re-time the target video to avoid situations where Obama is not speaking but his head is moving, which looks very unnatural.
- Compose the result into the target video, including jaw correction to make it look more natural.

[Figure: Algorithm flow]

The results look ridiculously natural. The authors suggest that one application of this work is speech summarization, where a speech is summarized not only as selected text and audio but also with a synthesized video. Personally, this work inspires me to work on a method that generates a natural-looking sign language interpreter, taking sound or text as input and producing sign language movements.
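To make the texture-synthesis step of the pipeline above more concrete, here is a minimal NumPy sketch of its two core operations: retrieving the database frames whose mouth shape is closest to a predicted shape, and blending their mouth textures with a per-pixel weighted median. This is an illustration under assumptions, not the authors' implementation: the function names `retrieve_candidates` and `weighted_median`, the use of plain L2 distance between flattened lip-marker vectors, and the way the weights are supplied are all hypothetical choices made for this example.

```python
import numpy as np


def retrieve_candidates(target_shape, db_shapes, k):
    """Return indices and distances of the k database frames whose mouth
    shape is closest to target_shape.

    Assumption: shapes are flattened marker vectors compared with L2 distance.
    target_shape has shape (D,), db_shapes has shape (N, D).
    """
    dists = np.linalg.norm(db_shapes - target_shape, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]


def weighted_median(candidates, weights):
    """Per-pixel weighted median over the first axis.

    candidates: array of shape (n, ...) holding n candidate textures.
    weights: n non-negative weights, one per candidate (hypothetical input;
    how the weights are derived is left to the caller).
    """
    candidates = np.asarray(candidates, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Sort candidate values independently at every pixel.
    order = np.argsort(candidates, axis=0)
    sorted_vals = np.take_along_axis(candidates, order, axis=0)

    # Reorder the weights the same way and accumulate them.
    cum_w = np.cumsum(weights[order], axis=0)

    # First sorted position where the cumulative weight reaches half the total.
    pick = np.argmax(cum_w >= 0.5 * weights.sum(), axis=0)
    return np.take_along_axis(sorted_vals, pick[None, ...], axis=0)[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db_shapes = rng.normal(size=(100, 36))         # 18 markers x (x, y), toy data
    db_textures = rng.uniform(size=(100, 8, 8))    # toy grayscale mouth crops
    predicted = db_shapes[0] + 0.01                # stand-in for an LSTM output

    idx, dists = retrieve_candidates(predicted, db_shapes, k=5)
    weights = np.exp(-dists ** 2)                  # assumed weighting scheme
    texture = weighted_median(db_textures[idx], weights)
    print(texture.shape)
```

A median is used here rather than a mean because it picks an actual observed value at each pixel, which tends to preserve sharper detail than averaging; closer-matching frames simply get more say through their weights.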

Summary by Oleksandr Bailo 3 weeks ago
