Summary by Oleksandr Bailo 1 month ago
This paper presents a method for extracting motion (dynamic) and skeleton / camera-view (static) representations from a video of a person represented as a 2D joint skeleton. This decomposition allows the motion to be transferred to different skeletons (retargeting), among other applications. The method is based on deep neural networks.
https://i.imgur.com/J5jBzcs.png
The architecture consists of motion and skeleton / camera-view encoders that decompose an input sequence of 2D joint positions into latent spaces, and a decoder that reconstructs a sequence from these components. The motion representation varies in length with the input, while the skeleton and camera-view representations have a fixed size.
https://i.imgur.com/QaDksg1.png
This follows from the network design. Specifically, the motion encoder uses strided 1D convolutions, so its output length is proportional to the input length. The static encoder, on the other hand, uses global average pooling in its final layer to produce a fixed-size latent representation:
https://i.imgur.com/Cf7TVKA.png
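To make the difference between the two encoder outputs concrete, here is a minimal NumPy sketch of a single strided 1D convolution followed by global average pooling. The channel counts (30 inputs for a hypothetical 15 joints x 2 coordinates, 8 outputs), the kernel size, and the stride are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def strided_conv1d(x, weight, stride):
    """Valid (unpadded) 1D convolution over time.

    x: (C_in, T) input sequence, weight: (C_out, C_in, K) filters.
    """
    c_out, c_in, k = weight.shape
    t_out = (x.shape[1] - k) // stride + 1
    out = np.empty((c_out, t_out))
    for t in range(t_out):
        window = x[:, t * stride : t * stride + k]  # (C_in, K)
        out[:, t] = np.tensordot(weight, window, axes=([1, 2], [0, 1]))
    return out

def global_average_pool(x):
    """Collapse the time axis: (C, T) -> (C,)."""
    return x.mean(axis=1)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 30, 4))  # hypothetical filter bank

for T in (64, 128):
    seq = rng.standard_normal((30, T))
    features = strided_conv1d(seq, w, stride=2)
    motion_code = features                       # length scales with T
    static_code = global_average_pool(features)  # always shape (8,)
    print(T, motion_code.shape, static_code.shape)
```

Doubling the input length roughly doubles the temporal length of the motion code, whereas the pooled static code keeps the same shape regardless of `T`.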
A more detailed design of the encoders and decoder is shown below:
https://i.imgur.com/cpaveFm.png
**Dataset**. Adobe Mixamo is used to obtain pose sequences of different 3D characters. It allows creating multiple samples in which different characters (with different skeleton structures) perform the same motions. These 3D clips are then projected into 2D by selecting arbitrary view angles and distances to the object. Thus, we can easily create multiple pairs of 2D sequences of characters (same or different) performing various actions (same or different) from various views.
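The projection step can be sketched as follows. This is only an illustration: it assumes a rotation about the vertical axis followed by a weak-perspective scaling (`focal / distance`), and the function name `project_to_2d` and its parameters are hypothetical. The paper's exact camera model may differ.

```python
import numpy as np

def project_to_2d(joints_3d, yaw, distance, focal=1.0):
    """Project a 3D joint sequence to 2D from a chosen view.

    joints_3d: (T, J, 3) array of x, y, z joint positions.
    yaw: camera rotation about the vertical (y) axis, in radians.
    distance: camera distance; larger values shrink the 2D skeleton.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    rotated = joints_3d @ rot_y.T       # rotate every joint in every frame
    scale = focal / distance            # weak-perspective assumption
    return scale * rotated[..., :2]     # keep x, y and drop depth

# The same 3D clip rendered from several views yields several 2D samples
# that share motion and skeleton but differ in camera view.
rng = np.random.default_rng(0)
clip = rng.standard_normal((16, 15, 3))  # toy clip: 16 frames, 15 joints
views = [project_to_2d(clip, yaw, dist)
         for yaw in (0.0, np.pi / 4, np.pi / 2)
         for dist in (2.0, 4.0)]
```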
**Loss functions** used for training (refer to the paper for the detailed formulas):
- *Cross Reconstruction Loss*
It is the sum of two losses. The first is the reconstruction loss, where the network tries to reconstruct the original input. The second is the cross reconstruction loss, where the network tries to reconstruct the sequence in which a different character performs exactly the same action as the input. It is best shown in the figure below:
https://i.imgur.com/ewZOAox.png
- *Triplet Loss*
This loss aims to bring the latent representations of similar motions closer together while pushing apart those of different motions. It takes two triplets, each containing two samples that share the same (or very similar) motion and one with a different motion. The same concept is applied to the static latent space.
- *Foot Velocity Loss*
This loss helps remove the foot-skating artifact; hands and feet exhibit larger errors than the other keypoints.
https://i.imgur.com/DclJEde.png
where $V_{global}$ and $V_{joint_n}$ extract the global and local ($n$th joint) velocities from the reconstructed output $\hat{p}_{ij}$, respectively, and map them back to the image units, and $V_{orig_n}$ returns the original global velocity of the $n$th joint from the ground truth, $p_{ij}$.
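The three losses above can be sketched in NumPy as follows. This is one plausible reading of the descriptions, not the paper's exact formulas: the mean-squared-error distance, the margin value, the index convention ($p_{ij}$ as motion $i$ performed by skeleton $j$), and all function names are assumptions. The velocity sketch also omits the mapping back to image units.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def cross_reconstruction_loss(enc_m, enc_s, dec, p_ij, p_kl, p_il, p_kj):
    """Reconstruction plus cross reconstruction for one pair of samples.

    p_il and p_kj are the ground-truth sequences with swapped skeletons.
    """
    m_i, s_j = enc_m(p_ij), enc_s(p_ij)
    m_k, s_l = enc_m(p_kl), enc_s(p_kl)
    rec = mse(dec(m_i, s_j), p_ij) + mse(dec(m_k, s_l), p_kl)
    cross = mse(dec(m_i, s_l), p_il) + mse(dec(m_k, s_j), p_kj)
    return rec + cross

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard margin-based triplet loss on latent codes (assumed form)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))

def foot_velocity_loss(local_pred, root_vel_pred, global_true, foot_ids):
    """Penalise mismatch between predicted and true global foot velocity.

    local_pred: (T, J, 2) reconstructed root-relative joints.
    root_vel_pred: (T-1, 2) reconstructed global (root) velocity.
    global_true: (T, J, 2) ground-truth joints in global coordinates.
    """
    v_joint = np.diff(local_pred[:, foot_ids], axis=0)   # (T-1, F, 2)
    v_global = root_vel_pred[:, None, :]                 # (T-1, 1, 2)
    v_orig = np.diff(global_true[:, foot_ids], axis=0)   # (T-1, F, 2)
    err = v_global + v_joint - v_orig
    return float(np.mean(np.sum(err ** 2, axis=-1)))

# Toy check with a perfectly separable "model": skeleton = overall scale,
# motion = unit-norm sequence, decoder = their product.
rng = np.random.default_rng(0)
motion_i = rng.standard_normal((8, 4)); motion_i /= np.linalg.norm(motion_i)
motion_k = rng.standard_normal((8, 4)); motion_k /= np.linalg.norm(motion_k)
toy_enc_m = lambda p: p / np.linalg.norm(p)
toy_enc_s = lambda p: np.linalg.norm(p)
toy_dec = lambda m, s: m * s
toy_loss = cross_reconstruction_loss(
    toy_enc_m, toy_enc_s, toy_dec,
    2 * motion_i, 3 * motion_k, 3 * motion_i, 2 * motion_k)
```

In the toy check the encoders disentangle the two factors exactly, so `toy_loss` is zero up to floating-point error; a real network is trained to approach that behavior.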
**Normalization**
- subtract the root position from all joint locations in every frame
- subtract the mean joint position and divide by the standard deviation (averaged over the entire dataset)
- per-frame global velocity is not touched
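The normalization steps above can be sketched as follows. The sketch assumes joint 0 is the root, that the "global velocity" is the frame-to-frame displacement of the root, and that dataset statistics are computed per joint and coordinate. These choices and the helper names are illustrative.

```python
import numpy as np

def to_local(seq, root=0):
    """Subtract the root position from all joints in every frame."""
    return seq - seq[:, root:root + 1]

def dataset_stats(seqs, root=0):
    """Mean and std of root-relative joints over all frames of all clips."""
    frames = np.concatenate([to_local(s, root) for s in seqs], axis=0)
    return frames.mean(axis=0), frames.std(axis=0)

def normalize(seq, mean, std, root=0, eps=1e-8):
    """Return standardized local joints and the untouched root velocity.

    seq: (T, J, 2) joints in global coordinates; mean, std: (J, 2).
    """
    # eps guards the root joint, whose local position is always zero.
    local = (to_local(seq, root) - mean) / np.maximum(std, eps)
    root_velocity = np.diff(seq[:, root], axis=0)  # left unnormalized
    return local, root_velocity

rng = np.random.default_rng(0)
clips = [rng.standard_normal((20, 15, 2)) for _ in range(4)]  # toy data
mean, std = dataset_stats(clips)
local, velocity = normalize(clips[0], mean, std)
```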
**Data Augmentation** applied during training:
- temporal clipping during the batch creation process
- scaling, which is equivalent to using a different camera distance to the object
- flipping symmetrical joints
- dropping joints to simulate the behavior of a real keypoint detector, which often misses some joints
- adding real video data to the training set and using a reprojection loss when no labels are available
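The first four augmentations can be sketched in NumPy as below. The function names, the choice to mark dropped joints by zeroing them, and the mirroring about `x = 0` (which presumes root-centered coordinates) are assumptions for illustration. The real-video augmentation is not covered here.

```python
import numpy as np

def temporal_clip(seq, length, rng):
    """Cut a random window of `length` frames out of a (T, J, 2) sequence."""
    start = rng.integers(0, seq.shape[0] - length + 1)
    return seq[start:start + length]

def scale(seq, factor):
    """Uniform scaling, mimicking a different camera distance."""
    return seq * factor

def flip(seq, left_ids, right_ids):
    """Mirror horizontally and swap the paired left/right joints."""
    out = seq.copy()
    out[..., 0] *= -1
    out[:, left_ids + right_ids] = out[:, right_ids + left_ids]
    return out

def drop_joints(seq, rng, prob):
    """Zero out each joint with probability `prob` for the whole clip."""
    dropped = rng.random(seq.shape[1]) < prob
    out = seq.copy()
    out[:, dropped] = 0.0
    return out, dropped

rng = np.random.default_rng(0)
clip = rng.standard_normal((64, 15, 2))          # toy clip
left, right = [1, 2, 3], [4, 5, 6]               # hypothetical joint pairs
sample = temporal_clip(clip, 32, rng)
sample = scale(sample, rng.uniform(0.5, 1.5))
sample = flip(sample, left, right)
sample, dropped = drop_joints(sample, rng, prob=0.1)
```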
**Results and Evaluation** (to be continued) ...
Although this summary is already getting too long to be called a summary, it is worth mentioning that the approach enables several applications:
- performance cloning - make any 2D skeleton repeat particular motions
- motion retrieval - search for videos that contain a particular target motion
