The method is a multi-task learning model that performs person detection, keypoint detection, person segmentation, and pose estimation. It is a bottom-up approach: it first localizes identity-free semantics and then groups them into instances.

https://i.imgur.com/kRs9687.png

Model structure:

- **Backbone**. The feature extractor is a ResNet (50 or 101) with two [Feature Pyramid Networks](https://arxiv.org/pdf/1612.03144.pdf) (FPN), one for the keypoint branch and one for the person detection branch. The FPN enhances the extracted features through a multi-level representation.

- **Keypoint detection** predicts keypoints and also produces a pixel-level segmentation mask.

  https://i.imgur.com/XFAi3ga.png

  The FPN features $K_i$ are processed with multiple $3\times3$ convolutions, concatenated, and passed through a final $1\times1$ convolution to obtain a prediction for each keypoint as well as the segmentation mask (see the figure above for details). This results in #keypoints_in_dataset_per_person + 1 output layers. Additionally, intermediate supervision (i.e. a loss) is applied at the FPN outputs. An $L_2$ loss is used between the predictions and Gaussian peaks placed at the keypoint locations. Similarly, an $L_2$ loss is applied between the segmentation predictions and the corresponding ground truth masks.

- **Person detection** is essentially a [RetinaNet](https://arxiv.org/pdf/1708.02002.pdf), a one-stage object detector, modified to handle only the *person* class.

- **Pose estimation**. Given the initial keypoint predictions, the Pose Residual Network (PRN) selects a single keypoint for each class.

  https://i.imgur.com/k8wNP5p.png

  During inference, the PRN takes crops of the keypoint detection branch outputs, defined by the bounding boxes predicted by the person detection branch, resizes them to a fixed size, and forwards them through a multilayer perceptron with a residual connection. During training, the same process is performed, except that the crops are taken from the ground truth keypoint annotations and are defined by the labeled bounding boxes.
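The keypoint supervision described above (Gaussian peaks at the keypoint locations, compared with the prediction under an $L_2$ loss) can be illustrated with a minimal, dependency-free sketch. The function names, the `sigma` value, the map size, and the choice to average the squared error over pixels are assumptions made for illustration; they are not taken from the paper or from the linked implementations.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Build a height x width target map with a Gaussian peak at (cx, cy).

    sigma is an assumed value; the summary does not state the one used.
    """
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def l2_loss(pred, target):
    """Mean squared error between two equally sized 2D maps.

    Averaging over pixels is an assumption; a sum would also be an L2 loss.
    """
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


# One hypothetical keypoint at column 3, row 4 of an 8 x 8 map.
target = gaussian_heatmap(8, 8, cx=3, cy=4)

# A prediction identical to the target gives zero loss;
# an all-zero prediction is penalized.
perfect_loss = l2_loss(target, target)
blank_loss = l2_loss([[0.0] * 8 for _ in range(8)], target)
```

In the full model one such map would exist per keypoint type, and the segmentation output would be compared with its ground truth mask in the same way.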
The model is not end-to-end trainable. While the keypoint and person detection branches can, in theory, be trained simultaneously, the PRN requires separate training.

**Personal note**. Interestingly, when the PRN is trained on ground truth inputs (i.e. "perfect" inputs), it only reaches a validation score of 89.4 mAP, which is surprisingly far from the maximum possible score. This presumably means that even if the preceding networks or branches perform perfectly, the PRN may become the bottleneck. Therefore, more effort should be directed at the PRN itself. Moreover, modifying the network to support end-to-end training might help boost performance.

Open-source implementations used to check my understanding of the paper: [link1](https://github.com/LiMeng95/MultiPoseNet.pytorch), [link2](https://github.com/IcewineChen/pytorch-MultiPoseNet).
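To make the PRN discussion more concrete, here is a rough sketch of the forward pass as summarized above: crop each keypoint map with a person box, resize the crop to a fixed size, and pass the flattened result through an MLP with a residual connection. Everything specific here is an assumption for illustration: the function names, nearest-neighbour resizing, the single hidden layer with ReLU, the $4\times4$ crop size, and the random untrained weights. It is not the authors' implementation.

```python
import random


def crop_and_resize(heatmap, box, out_h, out_w):
    """Crop a 2D map to box=(x0, y0, x1, y1) and resize it by nearest neighbour."""
    x0, y0, x1, y1 = box
    box_h, box_w = y1 - y0, x1 - x0
    return [
        [
            heatmap[y0 + (i * box_h) // out_h][x0 + (j * box_w) // out_w]
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def linear(weights, vec):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def residual_mlp(vec, w_hidden, w_out):
    """One hidden ReLU layer whose output is added back to the input."""
    hidden = [max(0.0, h) for h in linear(w_hidden, vec)]
    update = linear(w_out, hidden)
    return [v + u for v, u in zip(vec, update)]


def prn_forward(heatmaps, box, w_hidden, w_out, out_h=4, out_w=4):
    """Crop every keypoint map, flatten, refine, and split back per keypoint."""
    crops = [crop_and_resize(h, box, out_h, out_w) for h in heatmaps]
    flat = [value for crop in crops for row in crop for value in row]
    refined = residual_mlp(flat, w_hidden, w_out)
    size = out_h * out_w
    return [refined[k * size:(k + 1) * size] for k in range(len(heatmaps))]


def random_weights(rows, cols, rng, scale=0.1):
    """Untrained placeholder weights, used only to exercise the code."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_keypoints, hidden_units = 2, 8
# Hypothetical 10 x 10 keypoint maps and a hypothetical person box.
heatmaps = [
    [[rng.random() for _ in range(10)] for _ in range(10)]
    for _ in range(num_keypoints)
]
box = (2, 3, 8, 9)
dim = num_keypoints * 4 * 4
w_hidden = random_weights(hidden_units, dim, rng)
w_out = random_weights(dim, hidden_units, rng)
refined = prn_forward(heatmaps, box, w_hidden, w_out)
```

Because of the residual connection, the MLP only has to learn a correction to the cropped maps. With all-zero output weights the sketch returns the crops unchanged.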