# Mask R-CNN
He, Kaiming; Gkioxari, Georgia; Dollár, Piotr; Girshick, Ross B. 2017
_Objective:_ Instance segmentation and pose estimation with an extension of Faster R-CNN.
_Dataset:_ [COCO](http://mscoco.org/) and [Cityscapes](https://www.cityscapes-dataset.com/).
## Inner workings:
The core operator of Faster R-CNN is _RoIPool_, which performs coarse spatial quantization for feature extraction but introduces misalignments for the pixel-to-pixel predictions that segmentation requires. The paper introduces a new layer, _RoIAlign_, that faithfully preserves exact spatial locations.
One important point is that mask and class prediction are decoupled: a segmentation mask is proposed for each class independently, without competition between classes, and the class predictor then elects the winner.
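A minimal NumPy sketch of this decoupling (shapes and names are illustrative, not the paper's code): each class gets its own independent sigmoid mask, and the class branch picks which one to keep.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical shapes: the mask head emits one m×m logit map per class,
# and the class head independently scores the K classes.
K, m = 3, 4                                # number of classes, mask resolution
rng = np.random.default_rng(0)
mask_logits = rng.normal(size=(K, m, m))   # one mask per class
class_scores = rng.normal(size=K)

# Per-class sigmoid: each mask is an independent binary prediction,
# so masks do not compete with each other (no softmax across classes).
masks = sigmoid(mask_logits)

# The class branch "elects the winner"; only that class's mask is kept.
predicted_class = int(np.argmax(class_scores))
final_mask = masks[predicted_class] > 0.5
```

The key design choice is that no softmax is taken across the class axis of the masks; classification quality is entirely the class branch's job.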
The architecture is based on Faster R-CNN, with an added mask subnetwork that computes a segmentation mask for each class.
Different feature extractors and proposers are tried, see two examples below:
[![screen shot 2017-05-22 at 7 25 04 pm](https://cloud.githubusercontent.com/assets/17261080/26320765/659bfd6e-3f24-11e7-9184-393e83e9108d.png)](https://cloud.githubusercontent.com/assets/17261080/26320765/659bfd6e-3f24-11e7-9184-393e83e9108d.png)
Runs at about 200 ms per frame on a GPU for segmentation (about 2 days of training on a single 8-GPU machine) and 5 fps for pose estimation.
Very impressive segmentation and pose estimation:
[![screen shot 2017-05-22 at 7 26 57 pm 1](https://cloud.githubusercontent.com/assets/17261080/26320824/a9a0909c-3f24-11e7-8e06-b2f132aad2d7.png)](https://cloud.githubusercontent.com/assets/17261080/26320824/a9a0909c-3f24-11e7-8e06-b2f132aad2d7.png)
[![screen shot 2017-05-22 at 7 29 26 pm](https://cloud.githubusercontent.com/assets/17261080/26320929/08b71c4a-3f25-11e7-8eb5-959ceb7b6112.png)](https://cloud.githubusercontent.com/assets/17261080/26320929/08b71c4a-3f25-11e7-8eb5-959ceb7b6112.png)
Mask RCNN takes off from where Faster RCNN left off, with additions aimed at instance segmentation (which was out of scope for Faster RCNN). Instance segmentation had previously been addressed remarkably well by *DeepMask*, *SharpMask* and later *Feature Pyramid Networks* (FPN).
Faster RCNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool.
Mask RCNN retains most of the architecture of Faster RCNN but adds a third branch for segmentation. This branch takes the output of the RoIAlign layer and predicts a binary mask for each class.
##### Major changes and intuitions
The mask branch predicts a binary mask for each RoI using a small fully convolutional network. The stark difference is the use of a per-pixel *sigmoid* activation for the final mask instead of a *softmax*, which means the masks don't compete with each other. This *decouples* segmentation from classification: the class branch alone handles class prediction, and when calculating the loss only the mask corresponding to the ground-truth class contributes to Lmask.
Also, they show that a single class-agnostic mask prediction works almost as well as a separate mask per class, further supporting their decoupling of classification from segmentation.
RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map; this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally the feature values covered by each bin are aggregated (usually by max pooling). RoIAlign instead performs no quantization of the RoI boundaries or bins: bilinear interpolation is used to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and the results are aggregated (using max or average).
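The steps above can be sketched in NumPy as follows. This is a toy single-channel RoIAlign with a fixed 2×2 sample grid per bin, assuming the RoI box is already expressed in feature-map coordinates; the real implementation operates on batched multi-channel tensors and a configurable sampling ratio.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate feature map `feat` at a fractional (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

def roi_align(feat, roi, out_size=2):
    """Toy RoIAlign: the floating-point RoI is never quantized; each output
    bin averages 4 regularly spaced bilinear samples."""
    y1, x1, y2, x2 = roi                      # floating-point box on the feature map
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            samples = []
            for dy in (0.25, 0.75):           # 2×2 regular sample grid per bin
                for dx in (0.25, 0.75):
                    y = y1 + (i + dy) * bin_h
                    x = x1 + (j + dx) * bin_w
                    samples.append(bilinear(feat, y, x))
            out[i, j] = np.mean(samples)      # average (max also works)
    return out
```

Because nothing is rounded to integer coordinates, gradients flow smoothly with respect to the box position, which is what restores pixel-to-pixel alignment.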
Faster RCNN uses a VGG-like backbone for extracting features from the image, whose weights are shared between the RPN and the region detection layers. Here the authors experiment with two backbone families: a plain ResNet (used like the VGG backbone in Faster RCNN) and a ResNet-based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16). FPN takes convolutional feature maps from earlier layers and recombines them into a pyramid of feature maps used for prediction, instead of the single-scale feature map (the final conv output before the fc layers) used in Faster RCNN.
The training objective is the multi-task loss L = Lcls + Lbox + Lmask, where Lmask is the addition over Faster RCNN; the method to calculate it was mentioned above.
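A minimal sketch of the Lmask computation described above (array shapes are illustrative): per-pixel binary cross-entropy on the sigmoid mask, evaluated only on the mask slice belonging to the ground-truth class, so masks of other classes add nothing to the loss.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l_mask(mask_logits, gt_mask, gt_class):
    """Lmask: mean per-pixel binary cross-entropy, computed only on the
    (m, m) mask slice of the ground-truth class.

    mask_logits: (K, m, m) raw logits, one mask per class
    gt_mask:     (m, m) binary ground-truth mask for this RoI
    gt_class:    index of the ground-truth class
    """
    p = sigmoid(mask_logits[gt_class])     # other classes are ignored
    eps = 1e-12                            # numerical safety for log
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()
```

With a softmax across classes this decoupling would be impossible, since every class's output would influence every pixel's loss.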
Mask RCNN performs significantly better than previous COCO instance segmentation winners *without any bells and whistles*. Detailed results are available in the paper.