First published: 2018/02/12. Abstract: This paper addresses the problem of 3D human pose estimation in the wild. A
significant challenge is the lack of training data, i.e., 2D images of humans
annotated with 3D poses. Such data is necessary to train state-of-the-art CNN
architectures. Here, we propose a solution to generate a large set of
photorealistic synthetic images of humans with 3D pose annotations. We
introduce an image-based synthesis engine that artificially augments a dataset
of real images with 2D human pose annotations using 3D motion capture data.
Given a candidate 3D pose, our algorithm selects for each joint an image whose
2D pose locally matches the projected 3D pose. The selected images are then
combined to generate a new synthetic image by stitching local image patches in
a kinematically constrained manner. The resulting images are used to train an
end-to-end CNN for full-body 3D pose estimation. We cluster the training data
into a large number of pose classes and tackle pose estimation as a $K$-way
classification problem. Such an approach is viable only with large training
sets such as ours. Our method outperforms most of the published works in terms
of 3D pose estimation in controlled environments (Human3.6M) and shows
promising results for real-world images (LSP). This demonstrates that CNNs
trained on artificial images generalize well to real images. Compared to data
generated from more classical rendering engines, our synthetic images do not
require any domain adaptation or fine-tuning stage.
Aim: generate realistic-looking synthetic data that can be used to train 3D Human Pose Estimation methods. Instead of rendering 3D models, they choose to combine parts of real images.
Input: RGB images with 2D annotations + a query 3D pose.
Output: A synthetic image, stitched from patches of the images, so that it looks like a person in the query 3D pose.
- Project 3D pose on random camera to get 2D coords
- For each joint, find an image in the 2D annotated dataset whose annotation is locally similar
- Based on the similarity scores, decide for each pixel which source image is most relevant.
- For each pixel, build a histogram over which images were selected in its neighborhood, and use the normalized histogram entries as blending weights to composite the result.
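The first two steps (projection, then per-joint retrieval) can be sketched in numpy. The matching cost used here, the distance between joint-centered poses, is a simplified stand-in for the paper's local similarity measure, and all function names are my own:

```python
import numpy as np

def project(pose_3d, cam):
    """Pinhole projection of a (J, 3) pose with a 3x4 camera matrix -> (J, 2)."""
    homo = np.hstack([pose_3d, np.ones((len(pose_3d), 1))])
    proj = homo @ cam.T
    return proj[:, :2] / proj[:, 2:3]

def pick_source_images(query_2d, dataset_2d):
    """For each joint, pick the dataset image whose 2D annotation best matches
    the projected query pose around that joint.

    query_2d:   (J, 2) projected query pose
    dataset_2d: (N, J, 2) 2D annotations of the real-image dataset
    returns:    (J,) index of the chosen image per joint
    """
    choices = []
    for j in range(query_2d.shape[0]):
        # Center both poses on joint j so the comparison is translation-invariant
        q = query_2d - query_2d[j]                 # (J, 2)
        d = dataset_2d - dataset_2d[:, j:j+1]      # (N, J, 2)
        cost = np.linalg.norm(d - q, axis=-1).sum(axis=-1)  # (N,)
        choices.append(int(np.argmin(cost)))
    return np.array(choices)
```

A translated copy of the query pose then matches with zero cost, while a differently shaped pose does not, which is the behavior the retrieval step needs.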
They also present a method that they trained on this synthetic dataset.
First published: 2018/02/01. Abstract: In this work, we establish dense correspondences between an RGB image and a
surface-based representation of the human body, a task we refer to as dense
human pose estimation. We first gather dense correspondences for 50K persons
appearing in the COCO dataset by introducing an efficient annotation pipeline.
We then use our dataset to train CNN-based systems that deliver dense
correspondence 'in the wild', namely in the presence of background, occlusions
and scale variations. We improve our training set's effectiveness by training
an 'inpainting' network that can fill in missing groundtruth values and report
clear improvements with respect to the best results that would be achievable in
the past. We experiment with fully-convolutional networks and region-based
models and observe a superiority of the latter; we further improve accuracy
through cascading, obtaining a system that delivers highly accurate results in
real time. Supplementary materials and videos are provided on the project page.
They introduce a dense version of the human pose estimation task: predict body surface coordinates for each pixel in an RGB image.
Body surface is represented on two levels:
- Body part label (24 parts)
- Head, torso, hands, feet, etc.
- Each leg split in 4 parts: upper/lower front/back. Same for arms.
- 2 coordinates (u,v) within body part
- head, hands, feet: based on SMPL model
- others: determined by Multidimensional Scaling on geodesic distances
* They annotate COCO for this task
- annotation tool: draw mask, then click on a 3D rendering for each of up to 14 points sampled from the mask
- annotator accuracy on synthetic renderings (average geodesic distance)
- small parts (e.g. feet): ~2 cm
- large parts (e.g. torso): ~7 cm
* Prediction heads (fully-convolutional variant):
  - 25-way body part classification head (cross-entropy loss)
  - Regression head with 24*2 outputs per pixel (Huber loss)
* Region-based variant (DensePose-RCNN):
  - Like Mask-RCNN
  - New branch with same architecture as the keypoint branch
  - ResNet-50-FPN (Feature Pyramid Net) backbone
- Multi-task learning
- Train keypoint/mask and dense pose task at once
- Interaction implicit by sharing backbone net
- Multi-task *cross-cascading*
- Explicit interaction of tasks
- Introduce second stage that depends on the first-stage-output of all tasks
- Ground truth interpolation (distillation)
- Train a "teacher" FCN with the pointwise annotations
- Use its dense predictions as ground truth to train final net
- (To make the teacher as accurate as possible, they use ground-truth mask to remove background)
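The two per-pixel losses noted above (25-way cross-entropy over part labels, Huber loss on the UV coordinates of the ground-truth part) can be sketched as follows. This is a single-pixel illustration with hypothetical names, not the authors' implementation:

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one pixel: logits (25,), label in 0..24 (0 = background)."""
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

def huber(residual, delta=1.0):
    """Huber (smooth L1) loss: quadratic near zero, linear in the tails."""
    a = np.abs(residual)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def densepose_pixel_loss(part_logits, uv_pred, part_label, uv_true):
    """part_logits: (25,); uv_pred: (24, 2) predicted UV per part; uv_true: (2,).
    UV regression is only supervised at the ground-truth part, not on background."""
    loss = cross_entropy(part_logits, part_label)
    if part_label > 0:
        loss += huber(uv_pred[part_label - 1] - uv_true).sum()
    return float(loss)
```

In the full network these losses are summed over all annotated pixels; only the UV channels of the ground-truth part receive a regression gradient.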
**Single-person results (train and test on single-person crops)**
Pointwise eval measure:
- Compute geodesic distance between prediction and ground truth at each annotated point
- For various error thresholds, plot percentage of points with lower error than the threshold
- Compute Area Under this Curve
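The pointwise measure above can be sketched in a few lines of numpy (the function name and threshold step count are my own):

```python
import numpy as np

def auc_of_correctness_curve(geo_errors_cm, max_threshold_cm=10.0, n_steps=100):
    """AUC of the 'ratio of correct points' curve: for thresholds from 0 to
    max_threshold_cm, compute the fraction of annotated points whose geodesic
    error is below the threshold, then average over thresholds (AUC10 for 10 cm)."""
    thresholds = np.linspace(0, max_threshold_cm, n_steps)
    fractions = [(np.asarray(geo_errors_cm) < t).mean() for t in thresholds]
    return float(np.mean(fractions))
```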
Training (non-regional) FCN on new dataset vs. synthetic data improves AUC10 from 0.20 to 0.38
This paper's FCN method vs. model-fitting baseline
- Baseline: Estimate body keypoint locations in 2D (usual "pose estimation" task) + fit 3D model
- AUC10 improves from 0.23 to 0.43
- Speed: 4-25 fps for FCN vs. model-fitting taking 1-3 minutes per frame (!).
- Region-based method outperforms FCN baseline: 0.25 -> 0.32
- FCN cannot deal well with varying person scales (despite multi-scale testing)
- Training on points vs interpolated ground-truth (distillation) 0.32 -> 0.38
- AUC10 with cross-task cascade: 0.39
Also: Per-instance eval ("Geodesic Point Similarity" - GPS)
- Compute a Gaussian function on the geodesic distances
- Average it within each person instance (=> GPS)
- Compute precision and recall of persons for various thresholds of GPS
- Compute average precision and recall over thresholds
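A sketch of GPS and a simplified threshold-averaged precision. The value of the normalizer `kappa` is a stand-in (the paper calibrates it against annotator accuracy), and real COCO-style AP also involves detection-score ranking, which is omitted here:

```python
import numpy as np

def geodesic_point_similarity(geo_errors_cm, kappa=25.5):
    """GPS of one person instance: a Gaussian of the geodesic error at each
    annotated point, averaged over the instance."""
    g = np.asarray(geo_errors_cm, dtype=float)
    return float(np.mean(np.exp(-g**2 / (2 * kappa**2))))

def average_precision(gps_per_instance, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Fraction of instances counted as correct at each GPS threshold,
    averaged over thresholds (a simplified stand-in without score ranking)."""
    gps = np.asarray(gps_per_instance)
    return float(np.mean([(gps > t).mean() for t in thresholds]))
```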
Comparison of multi-task approaches:
1. Just dense pose branch (single-task) (AP 51)
2. Adding keypoint (AP 53) OR mask branch (multi-task without cross-cascade) (AP 52)
3. Refinement stage without cross-links (AP 52)
4. Multi-task cross-cascade (keypoints: AP 56, masks: AP 53)
First published: 2016/11/16. Abstract: We present a simple, highly modularized network architecture for image
classification. Our network is constructed by repeating a building block that
aggregates a set of transformations with the same topology. Our simple design
results in a homogeneous, multi-branch architecture that has only a few
hyper-parameters to set. This strategy exposes a new dimension, which we call
"cardinality" (the size of the set of transformations), as an essential factor
in addition to the dimensions of depth and width. On the ImageNet-1K dataset,
we empirically show that even under the restricted condition of maintaining
complexity, increasing cardinality is able to improve classification accuracy.
Moreover, increasing cardinality is more effective than going deeper or wider
when we increase the capacity. Our models, named ResNeXt, are the foundations
of our entry to the ILSVRC 2016 classification task in which we secured 2nd
place. We further investigate ResNeXt on an ImageNet-5K set and the COCO
detection set, also showing better results than its ResNet counterpart. The
code and models are publicly available online.
* Presents an architecture dubbed ResNeXt
* They use modules built of
* 1x1 conv
* 3x3 grouped conv, keeping the channel count constant. It's like a usual conv, except it is not fully connected along the channel axis; connections exist only within each group
* 1x1 conv
* plus a skip connection coming from the module input
* Fewer parameters, since the full connections are only within the groups
* Allows more feature channels at the cost of more aggressive grouping
* Better performance when keeping the number of params constant
* Instead of keeping the number of parameters constant, how about aiming for constant memory consumption? More feature channels require more RAM, even if the connections are sparser and hence there are fewer parameters
* Not so much improvement over ResNet
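The parameter-count argument (grouping allows more channels at equal parameter cost) can be checked with a small helper. The specific channel/group numbers below are illustrative, not taken from the paper:

```python
def conv_params(c_in, c_out, k=3, groups=1):
    """Weight count of a k x k conv layer: full connections exist only within
    each group, so the dense count is divided by the number of groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k * k

dense = conv_params(64, 64)                       # 64*64*9 = 36864 weights
grouped = conv_params(128, 128, groups=4)         # 4*32*32*9 = 36864 weights
# Twice the channels, same parameter count, thanks to 4-way grouping.
```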
* Semi-supervised method
* There is a teacher net and a student net, with identical architecture.
* The teacher makes predictions on unlabeled data, which are used as ground-truth for training the student net.
* After each gradient descent update on the student, the teacher's weights are updated so that it becomes an exponential moving average of the weights of the student at previous timesteps. It's called a "mean teacher" because of this moving average.
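The teacher update is a one-liner (the `alpha` value here is illustrative; in practice it is a smoothing hyperparameter close to 1):

```python
import numpy as np

def update_teacher(teacher_w, student_w, alpha=0.999):
    """After each student gradient step, move the teacher toward the student:
    teacher = alpha * teacher + (1 - alpha) * student. Over time the teacher's
    weights become an exponential moving average of past student weights."""
    return alpha * teacher_w + (1 - alpha) * student_w
```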
* It's a semi-supervised method (the goal is to make use of unlabeled data in addition to labeled data).
* They first train a neural net normally, in the supervised way, on a labeled dataset.
* Then **they retrain the net using *its own predictions* on the originally unlabeled data as if it was ground truth** (but only when the net is confident enough about the prediction).
* More precisely they retrain on the union of the original dataset and the examples labeled by the net itself. (Each minibatch is on average 60% original and 40% self-labeled)
* When making these predictions (that will subsequently be used for training), they use **multi-transform inference**.
* They apply the net to differently transformed versions of the image (mirroring, scaling), transform the outputs back accordingly and combine the results.
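Multi-transform inference for the mirroring case can be sketched as below. Note that for outputs with left/right-specific channels (e.g. keypoint heatmaps) the flip would also require a channel swap, which this sketch omits; scaling works analogously with a resize and an inverse resize:

```python
import numpy as np

def multi_transform_predict(predict, image):
    """Average predictions over the image and its mirrored version.
    `predict` is any function mapping an (H, W) image to an (H, W) output map;
    the mirrored output is flipped back before averaging."""
    p = predict(image)
    p_flip = predict(image[:, ::-1])[:, ::-1]  # predict on mirror, map result back
    return 0.5 * (p + p_flip)
```

With an identity predictor the back-mapping cancels the mirroring exactly, which is a quick sanity check of the transform logic.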