isarandi's profile - ShortScience.org

arxiv.org
arxiv-vanity.com
scholar.google.com

Image-based Synthesis for Deep 3D Human Pose Estimation
Grégory Rogez and Cordelia Schmid
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CV

[link] Summary by isarandi 6 years ago

Aim: generate realistic-looking synthetic data that can be used to train 3D Human Pose Estimation methods. Instead of rendering 3D models, they choose to combine parts of real images.

Input: RGB images with 2D annotations + a query 3D pose.
Output: A synthetic image, stitched from patches of the images, so that it looks like a person in the query 3D pose.

Steps:
- Project 3D pose on random camera to get 2D coords
- For each joint, find an image in the 2D annotated dataset whose annotation is locally similar
- Based on the similarities, decide for each pixel which image is most relevant.
- For each pixel, take the histogram of the chosen images in a neighborhood, and use this as blending factors to generate the result.
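
The last step can be sketched as follows. This is only an illustration of the idea, not the authors' implementation: it assumes the per-pixel choice is already available as a 2D map of image indices (`label_map`, a hypothetical name), and it uses a square window whose `radius` is an arbitrary choice.

```python
from collections import Counter

def blending_weights(label_map, radius=1):
    """For each pixel, return {image_index: weight}, where the weight is the
    fraction of pixels in the surrounding square window assigned to that image."""
    height, width = len(label_map), len(label_map[0])
    weights = []
    for i in range(height):
        row = []
        for j in range(width):
            # Histogram of chosen image indices in the window (clipped at borders).
            counts = Counter(
                label_map[y][x]
                for y in range(max(0, i - radius), min(height, i + radius + 1))
                for x in range(max(0, j - radius), min(width, j + radius + 1))
            )
            total = sum(counts.values())
            row.append({k: n / total for k, n in counts.items()})
        weights.append(row)
    return weights

# Two source images meeting at a vertical seam.
w = blending_weights([[0, 0, 1, 1]], radius=1)
```

Far from the seam a single image gets weight 1, while pixels near the seam mix both images, which softens the transition between patches.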

They also present a method that they trained on this synthetic dataset.


DensePose: Dense Human Pose Estimation In The Wild
Rıza Alp Güler and Natalia Neverova and Iasonas Kokkinos
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CV

[link] Summary by isarandi 6 years ago

## Task
They introduce a dense version of the human pose estimation task: predict body surface coordinates for each pixel in an RGB image.

The body surface is represented on two levels:
- Body part label (24 parts)
    - Head, torso, hands, feet, etc.
    - Each leg split in 4 parts: upper/lower front/back. Same for arms.
- 2 coordinates (u,v) within body part
    - head, hands, feet: based on SMPL model
    - others: determined by Multidimensional Scaling on geodesic distances

## Data
* They annotate COCO for this task
    - annotation tool: draw mask, then click on a 3D rendering for each of up to 14 points sampled from the mask
    - annotator accuracy on synthetic renderings (average geodesic distance)
       - small parts (e.g. feet): ~2 cm
       - large parts (e.g. torso): ~7 cm

## Method

Fully-convolutional baseline
  - ResNet-50/101
  - 25-way body part classification head (24 parts + background, cross-entropy loss)
  - Regression head with 24*2 outputs per pixel (Huber loss)

Region-based approach
  - Like Mask-RCNN
  - New branch with same architecture as the keypoint branch
  - ResNet-50-FPN (Feature Pyramid Net) backbone

Enhancements tested:

- Multi-task learning
  - Train keypoint/mask and dense pose task at once
  - Interaction implicit by sharing backbone net

- Multi-task *cross-cascading*
  - Explicit interaction of tasks
  - Introduce second stage that depends on the first-stage-output of all tasks

- Ground truth interpolation (distillation)
  - Train a "teacher" FCN with the pointwise annotations
  - Use its dense predictions as ground truth to train final net
  - (To make the teacher as accurate as possible, they use ground-truth mask to remove background)

## Results

**Single-person results (train and test on single-person crops)**

Pointwise eval measure:
   - Compute geodesic distance between prediction and ground truth at each annotated point
   - For various error thresholds, plot percentage of points with lower error than the threshold
   - Compute Area Under this Curve
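
A minimal sketch of this measure, under some assumptions: the thresholds are sampled uniformly, the area is computed with the trapezoidal rule and normalized to [0, 1], and `max_threshold=10.0` is taken to be what the "10" in AUC10 refers to. The function name and the sampling density are illustrative.

```python
def auc_of_error_curve(errors, max_threshold=10.0, steps=100):
    """Normalized area under the curve 'fraction of points with error below t'
    for thresholds t in [0, max_threshold]."""
    thresholds = [max_threshold * k / steps for k in range(steps + 1)]
    # Fraction of annotated points whose error is lower than each threshold.
    fractions = [sum(e < t for e in errors) / len(errors) for t in thresholds]
    # Trapezoidal rule over the sampled curve.
    step = max_threshold / steps
    area = sum((fractions[k] + fractions[k + 1]) / 2 for k in range(steps)) * step
    return area / max_threshold

score = auc_of_error_curve([1.0, 4.0, 25.0])
```

Points with an error beyond the largest threshold contribute nothing, so the score rewards both accurate and reliably non-catastrophic predictions.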

Training the (non-regional) FCN on the new dataset instead of synthetic data improves AUC10 from 0.20 to 0.38

This paper's FCN method vs. model-fitting baseline
- Baseline: Estimate body keypoint locations in 2D (usual "pose estimation" task) + fit 3D model
- AUC10 improves from 0.23 to 0.43
- Speed: 4-25 fps for FCN vs. model-fitting taking 1-3 minutes per frame (!).

**Multi-person results**

- Region-based method outperforms FCN baseline: 0.25 -> 0.32
    - FCN cannot deal well with varying person scales (despite multi-scale testing)
- Training on interpolated ground truth (distillation) instead of the annotated points only: 0.32 -> 0.38
- AUC10 with cross-task cascade: 0.39

Also: Per-instance eval ("Geodesic Point Similarity" - GPS)
   - Compute a Gaussian function on the geodesic distances
   - Average it within each person instance (=> GPS)
   - Compute precision and recall of persons for various thresholds of GPS
   - Compute average precision and recall over thresholds
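
The first two steps can be sketched like this. The Gaussian is assumed to have the usual form exp(-d² / (2κ²)); `kappa` is a free width parameter here and its default is a placeholder, not a value taken from the summary.

```python
import math

def geodesic_point_similarity(distances, kappa=1.0):
    """Mean of a Gaussian of the geodesic errors over one person instance.
    Returns 1.0 for perfect predictions and approaches 0.0 for large errors."""
    return sum(math.exp(-d ** 2 / (2 * kappa ** 2)) for d in distances) / len(distances)

gps = geodesic_point_similarity([0.0, 0.5, 2.0])
```

A person instance then counts as correctly detected at a given threshold if its GPS exceeds that threshold, which is what the precision and recall computation is based on.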

Comparison of multi-task approaches:
1. Just dense pose branch (single-task) (AP 51)
2. Multi-task without cross-cascade: adding a keypoint branch (AP 53) or a mask branch (AP 52)
3. Refinement stage without cross-links (AP 52)
4. Multi-task cross-cascade (keypoints: AP 56, masks: AP 53)


Aggregated Residual Transformations for Deep Neural Networks
Saining Xie and Ross Girshick and Piotr Dollár and Zhuowen Tu and Kaiming He
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CV

[link] Summary by isarandi 6 years ago

* Presents an architecture dubbed ResNeXt
* They use modules built of
    * 1x1 conv
    * 3x3 group conv, keeping the depth constant. It's like a usual conv, except that it is not fully connected along the depth axis: the channels are split into groups, and each output channel is connected only to the input channels of its own group
    * 1x1 conv
    * plus a skip connection coming from the module input

* Advantages:
    * Fewer parameters, since the full connections are only within the groups
    * Allows more feature channels at the cost of more aggressive grouping
    * Better performance when keeping the number of params constant
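
The parameter trade-off can be checked with a bit of arithmetic. This is a rough sketch that ignores biases and batch norm; the helper names are hypothetical, and the widths (256 input channels, bottleneck width 64 without grouping vs. 128 with 32 groups) are chosen for illustration.

```python
def conv_params(in_channels, out_channels, kernel_size, groups=1):
    """Weight count of a 2D convolution without bias. Each output channel
    only sees in_channels / groups input channels."""
    assert in_channels % groups == 0 and out_channels % groups == 0
    return (in_channels // groups) * out_channels * kernel_size ** 2

def bottleneck_params(channels, width, groups=1):
    """1x1 conv down to `width`, 3x3 (group) conv at constant depth, 1x1 conv back up."""
    return (conv_params(channels, width, 1)
            + conv_params(width, width, 3, groups)
            + conv_params(width, channels, 1))

plain = bottleneck_params(256, 64)                # 69632
grouped = bottleneck_params(256, 128, groups=32)  # 70144
```

With 32 groups, the 3x3 layer can be twice as wide while the whole module stays at roughly the same parameter count.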

* Questions/Disadvantages:
    * Instead of keeping the num of params constant, how about aiming at constant memory consumption? Having more feature channels requires more RAM, even if the connections are sparser and hence there are fewer params
    * Not so much improvement over ResNet


Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.
Antti Tarvainen and Harri Valpola
Neural Information Processing Systems Conference - 2017 via Local dblp
Keywords:

[link] Summary by isarandi 6 years ago

* Semi-supervised method
* There is a teacher net and a student net, with identical architecture.
* The teacher makes predictions on unlabeled data, which are used as ground-truth for training the student net.
* After each gradient descent update on the student, the teacher's weights are updated so that it becomes an exponential moving average of the weights of the student at previous timesteps. It's called a "mean teacher" because of this moving average.
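
The teacher update is a plain exponential moving average. A minimal sketch, assuming the weights are given as flat lists of floats; the function name and the `decay` value are illustrative, not taken from the summary.

```python
def update_teacher(teacher_weights, student_weights, decay=0.99):
    """Move each teacher weight a small step toward the current student weight."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]

teacher = [0.0, 0.0]
student = [1.0, -2.0]
# Called once after every gradient step on the student.
teacher = update_teacher(teacher, student, decay=0.9)
```

A `decay` close to 1 makes the teacher change slowly, so it averages over many recent student states rather than following the latest one.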


Data Distillation: Towards Omni-Supervised Learning
Radosavovic, Ilija and Dollár, Piotr and Girshick, Ross B. and Gkioxari, Georgia and He, Kaiming
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by isarandi 6 years ago

* It's a semi-supervised method (the goal is to make use of unlabeled data in addition to labeled data).
* They first train a neural net normally, in the supervised way, on a labeled dataset.
* Then **they retrain the net using *its own predictions* on the originally unlabeled data as if they were ground truth** (but only when the net is confident enough about the prediction).
  * More precisely they retrain on the union of the original dataset and the examples labeled by the net itself. (Each minibatch is on average 60% original and 40% self-labeled)
* When making these predictions (that will subsequently be used for training), they use **multi-transform inference**.
  * They apply the net to differently transformed versions of the image (mirroring, scaling), transform the outputs back accordingly and combine the results.
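
A minimal sketch of multi-transform inference with a single transform, horizontal mirroring. It assumes a hypothetical `model` callable that maps an image (a list of rows) to a per-pixel score map of the same size, and it combines the results by simple averaging; scaling and other ways of combining are left out.

```python
def hflip(grid):
    """Mirror a 2D grid (list of rows) horizontally."""
    return [row[::-1] for row in grid]

def multi_transform_inference(model, image):
    """Average the prediction on the image with the prediction on its mirrored
    version, after mirroring that prediction back into the original frame."""
    direct = model(image)
    flipped_back = hflip(model(hflip(image)))
    return [[(a + b) / 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(direct, flipped_back)]

# Toy stand-in for a network: shifts every row one pixel to the right.
shift_right = lambda img: [[0.0] + row[:-1] for row in img]
combined = multi_transform_inference(shift_right, [[1.0, 0.0, 0.0]])
```

The transform-back step matters: without it, the outputs for the mirrored input would be averaged at the wrong pixel locations.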

isarandi

sciscore: 1.625