Learning to Segment Object Candidates on ShortScience.org

arxiv.org
scholar.google.com

Learning to Segment Object Candidates
Pinheiro, Pedro H. O. and Collobert, Ronan and Dollár, Piotr
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

Summaries/Notes 1

[link] Summary by Qure.ai 7 years ago

Facebook has [released a series of papers](https://research.facebook.com/blog/learning-to-segment/) for object segmentation and detection. This paper is the first in that series.

This is how modern object detection works (think [RCNN](https://arxiv.org/abs/1311.2524), [Fast RCNN](http://arxiv.org/abs/1504.08083)):

1. A rich set of object proposals (i.e., a set of image regions which are likely to contain an object) is generated using a fast (but possibly imprecise) algorithm.
2. A CNN classifier is applied on each of the proposals.

The current paper improves the step 1, i.e., region/object proposals.

Most object proposals approaches fall into three categories:
* Objectness scoring
* Seed Segmentation
* Superpixel Merging

Current method is different from these three.
It share similarities with [Faster R-CNN](https://arxiv.org/abs/1506.01497) in that proposals are generated using a CNN.
The method predicts a segmentation mask given an input *patch* and assigns a score corresponding to how likely the patch is to contain an object.

## Model and Training

Both mask and score predictions are achieved with a single convolutional network but with multiple outputs. All the convolutional layers except the last few are from VGG-A pretrained model.

Each training sample is a triplet of RGB input patch, the binary mask corresponding to the input patch, a label which specifies whether the patch contains an object. A patch is given label 1 only if it satisfies the following constraints:
* the patch contains an object roughly centered in the input patch
* the object is fully contained in the patch and in a given scale range

Note that the network must output a mask for a single object at the center even when multiple objects are present.

Figure 1 shows the architecture and sampling for training.
![figure1](https://i.imgur.com/zSyP0ij.png)

Model is then jointly trained for segmentation and objectness. Negative samples are not used for segmentation.

## Inference

During full image inference, model is applied densely at multiple locations and scales.
This can be done efficiently since all computations are convolutional like in a fully convolutional network (FCN).

![figure2](https://i.imgur.com/dQWfy8R.png)

This approach surpasses the previous state of the art by a large margin in both box and segmentation proposal generation.

Your comment:

Write your summary here (You can use $\LaTeX$ and markdown syntax):

Anon Private