Mask R-CNN. He, Kaiming and Gkioxari, Georgia and Dollár, Piotr and Girshick, Ross B. 2017
Paper summary (leopaillier)

_Objective:_ Instance segmentation and pose estimation with an extension of Faster R-CNN.
_Dataset:_ [COCO](http://mscoco.org/) and [Cityscapes](https://www.cityscapes-dataset.com/).
## Inner workings:
The core operator of Faster R-CNN is _RoIPool_, which performs coarse spatial quantization for feature extraction. That quantization introduces misalignment between the RoI and the extracted features, which hurts pixel-to-pixel tasks such as segmentation. The paper introduces a new layer, _RoIAlign_, that faithfully preserves exact spatial locations.
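
The difference between a quantized lookup and an aligned one can be illustrated with a small sketch. This is not the paper's implementation: the function names are hypothetical, flooring is used as an assumed stand-in for RoIPool's quantization, and only a single sample point is shown (a full RoIAlign layer would sample several points per bin and aggregate them).

```python
import math

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at a real-valued (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp the coordinate so all four neighbours stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def quantized_sample(feature, y, x):
    """RoIPool-style lookup (assumed): snap the coordinate to an integer cell first."""
    return feature[int(math.floor(y))][int(math.floor(x))]

# Toy feature map whose value equals its x coordinate: feature[y][x] = x.
feature = [[float(x) for x in range(4)] for _ in range(4)]

aligned = bilinear_sample(feature, 1.5, 2.25)     # keeps the fractional offset
quantized = quantized_sample(feature, 1.5, 2.25)  # loses the 0.25 offset
```

On this toy map the aligned lookup recovers the fractional position, while the quantized lookup snaps to the nearest lower cell. That lost offset is the misalignment described above.
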
One important point is that mask and class prediction are decoupled: a binary mask is predicted for each class without competition among classes, and the class predictor then picks which mask to keep.
The architecture is based on Faster R-CNN, with an added mask subnetwork that computes a segmentation mask for each class.
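
A minimal sketch of this decoupling at inference time, under stated assumptions: the helper `select_mask`, the dictionary layout, the toy values, and the 0.5 threshold are all illustrative choices, not the authors' code. Each class gets its own per-pixel sigmoid (no softmax across classes), and the classification branch alone decides which mask is returned.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def select_mask(mask_logits, class_scores, threshold=0.5):
    """Pick the mask of the top-scoring class and binarize it.

    mask_logits:  dict class name -> 2D list of per-pixel logits (one mask per class)
    class_scores: dict class name -> score from the classification branch
    """
    # The class branch elects the winner; mask logits play no part in this choice.
    winner = max(class_scores, key=class_scores.get)
    # Per-pixel sigmoid on the winning class only: classes never compete per pixel.
    probs = [[sigmoid(v) for v in row] for row in mask_logits[winner]]
    binary = [[1 if p > threshold else 0 for p in row] for row in probs]
    return winner, binary

# Toy 2x2 masks for two hypothetical classes.
mask_logits = {
    "cat": [[2.0, -1.0], [3.0, -2.0]],
    "dog": [[-3.0, 4.0], [-1.0, 5.0]],
}
class_scores = {"cat": 0.9, "dog": 0.1}

winner, mask = select_mask(mask_logits, class_scores)
```

Here the "dog" mask has strong activations, but it is ignored because the classifier prefers "cat".
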
Different feature extractors and proposers are tried; two examples are shown below:
[![screen shot 2017-05-22 at 7 25 04 pm](https://cloud.githubusercontent.com/assets/17261080/26320765/659bfd6e-3f24-11e7-9184-393e83e9108d.png)](https://cloud.githubusercontent.com/assets/17261080/26320765/659bfd6e-3f24-11e7-9184-393e83e9108d.png)
Inference runs at about 200 ms per frame on a GPU for segmentation and at 5 fps for pose estimation. Training takes about 2 days on a single 8-GPU machine.
Very impressive segmentation and pose estimation:
[![screen shot 2017-05-22 at 7 26 57 pm 1](https://cloud.githubusercontent.com/assets/17261080/26320824/a9a0909c-3f24-11e7-8e06-b2f132aad2d7.png)](https://cloud.githubusercontent.com/assets/17261080/26320824/a9a0909c-3f24-11e7-8e06-b2f132aad2d7.png)
[![screen shot 2017-05-22 at 7 29 26 pm](https://cloud.githubusercontent.com/assets/17261080/26320929/08b71c4a-3f25-11e7-8eb5-959ceb7b6112.png)](https://cloud.githubusercontent.com/assets/17261080/26320929/08b71c4a-3f25-11e7-8eb5-959ceb7b6112.png)
#### Mask R-CNN framework for instance segmentation
* classify individual objects
* localize each using a bounding box
* semantic segmentation: classify each pixel into a fixed set of categories without differentiating object instances
* extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression.
* FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner
1. RoIAlign:
* fixes the misalignment introduced by RoIPool and faithfully preserves exact spatial locations
* improves mask accuracy by a relative 10% to 50% while remaining fast
2. Decouple mask and class prediction:
* predict a binary mask for each class independently, without competition among classes
* RCNN: The Region-based CNN (R-CNN) approach to bounding-box object detection
* Fast R-CNN: speeding up and simplifying R-CNN
  * RoI (Region of Interest) Pooling
  * jointly train the CNN, classifier, and bounding box regressor in a single model
* Faster R-CNN: speeding up region proposal
  * reuse the same CNN features for region proposals instead of running a separate selective search algorithm; this is done by a Region Proposal Network (RPN)
  * only one CNN needs to be trained
* Instance Segmentation: “fully convolutional instance segmentation” (FCIS)
* Faster R-CNN has two stages:
  * Region Proposal Network (RPN): proposes candidate object bounding boxes
  * Fast R-CNN: extracts features using RoIPool from each candidate box and performs classification and bounding-box regression
* Mask R-CNN: adopts the same two-stage procedure as Faster R-CNN; in the second stage it also outputs a binary mask for each RoI, in parallel with class and box prediction
* Mask representation: pixel-to-pixel mask prediction on RoI features extracted by the RoIAlign layer (7×7)
#### Network Architecture
* convolutional backbone architecture used for feature extraction over an entire image (ResNet-50-C4, FPN)
* network head for bounding-box recognition (classification and regression) and mask prediction
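* images resized so that the shorter edge is 800 pixels
* mini-batch: 2 images per GPU
* N (RoIs sampled per image): 64 for the C4 backbone
* training: on 8 GPUs for 160k iterations
* learning rate: 0.02
* training set: 80K train images plus a 35K subset of val images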
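
The settings above imply an effective batch size and an approximate number of passes over the data. The sketch below only does that arithmetic. The class and field names are hypothetical, and it assumes the 80K train and 35K val images together form the training set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainSchedule:
    # Values taken from the list above; field names are illustrative only.
    images_per_gpu: int = 2
    num_gpus: int = 8
    iterations: int = 160_000
    learning_rate: float = 0.02
    train_images: int = 80_000
    val_images_used: int = 35_000  # assumption: these are also trained on

    @property
    def effective_batch_size(self) -> int:
        # Images processed per optimizer step across all GPUs.
        return self.images_per_gpu * self.num_gpus

    @property
    def approx_epochs(self) -> float:
        # Total images seen divided by the size of the training set.
        total_seen = self.iterations * self.effective_batch_size
        return total_seen / (self.train_images + self.val_images_used)

schedule = TrainSchedule()
batch = schedule.effective_batch_size
epochs = schedule.approx_epochs
```

With these numbers the effective mini-batch is 16 images, and 160k iterations correspond to roughly 22 passes over the 115K images.
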