R-FCN: Object Detection via Region-based Fully Convolutional Networks
Dai, Jifeng and Li, Yi and He, Kaiming and Sun, Jian (2016)

Paper summary

* They present a variation of Faster R-CNN, i.e. a model that predicts bounding boxes in images and classifies them.
* In contrast to Faster R-CNN, their model is fully convolutional.
* In contrast to Faster R-CNN, the computation per bounding box candidate (region proposal) is very low.
* The basic architecture is the same as in Faster R-CNN:
* A base network transforms an image to a feature map. Here they use ResNet-101 to do that.
* A region proposal network (RPN) uses the feature map to locate bounding box candidates ("region proposals") in the image.
* A classifier uses the feature map and the bounding box candidates and classifies each one of them into `C+1` classes,
where `C` is the number of object classes to spot (e.g. "person", "chair", "bottle", ...) and `1` is added for the background.
* During that process, small subregions of the feature maps (those that match the bounding box candidates) must be extracted and converted to fixed-size matrices.
The method to do that is called "Region of Interest Pooling" (RoI-Pooling) and is based on max pooling.
It is mostly the same as in Faster R-CNN.
* Visualization of the basic architecture:
* ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/R-FCN__architecture.jpg?raw=true "Architecture")
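The RoI-Pooling step above can be sketched in a few lines. This is a minimal NumPy illustration (single channel, not the authors' implementation): the region matching a bounding box candidate is split into a fixed grid of bins and each bin is max-pooled, so every candidate yields a matrix of the same size regardless of its shape.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Crop an RoI from a feature map and max-pool it to a fixed size.

    feature_map: 2D array (one channel, for simplicity).
    roi: (y0, x0, y1, x1) in feature-map coordinates.
    """
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    out_h, out_w = output_size
    # Split the region into an out_h x out_w grid of bins
    # and take the max of each bin.
    ys = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    xs = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    out = np.empty(output_size, dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)
pooled = roi_max_pool(fmap, (0, 0, 4, 4), output_size=(2, 2))
```

Any `4x4` candidate region is reduced to the same `2x2` output shape, which is what lets a fixed-size classifier head run on arbitrarily shaped proposals.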
* Position-sensitive classification
* Fully convolutional bounding box detectors tend not to work well.
* The authors argue that the problem stems from the translation-invariance of convolutions, which is a desirable property for classification but not when precise localization of objects is required.
* They tackle that problem by generating multiple heatmaps per object class, each one being slightly shifted ("position-sensitive score maps").
* More precisely:
* The classifier generates per object class `c` a total of `k*k` heatmaps.
* In the simplest form `k` is equal to `1`. Then only one heatmap is generated, which signals whether a pixel is part of an object of class `c`.
* They use `k=3*3`. The first of those heatmaps signals whether a pixel is part of the *top left* corner of a bounding box of class `c`. The second heatmap signals whether a pixel is part of the *top center* of a bounding box of class `c` (and so on).
* The RoI-Pooling is applied to these heatmaps.
* For `k=3*3`, each bounding box candidate is converted to `3*3` values. The first one corresponds to the top left corner of the bounding box candidate. Its value is generated by averaging the values in that area of the first heatmap.
* Once the `3*3` values are generated, the final score of class `c` for that bounding box candidate is computed by averaging the values.
* That process is repeated for all classes and a softmax is used to determine the final class.
* The graphic below shows examples for that:
* ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/R-FCN__examples.jpg?raw=true "Examples")
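The position-sensitive pooling for one class can be sketched as follows. This is a simplified NumPy version (single class, no learned weights shown): each of the `k*k` bins of a candidate box is average-pooled from *its own* dedicated score map, and the `k*k` responses are then averaged into one class score.

```python
import numpy as np

def ps_roi_pool_score(score_maps, roi, k=3):
    """Position-sensitive RoI pooling for one class.

    score_maps: array of shape (k*k, H, W) -- one map per grid cell
                (top-left, top-center, ..., bottom-right).
    roi: (y0, x0, y1, x1) in feature-map coordinates.
    Returns the final class score: the average over the k*k bin values,
    where bin (i, j) is pooled only from its own score map.
    """
    y0, x0, y1, x1 = roi
    ys = np.linspace(y0, y1, k + 1).astype(int)
    xs = np.linspace(x0, x1, k + 1).astype(int)
    bin_values = []
    for i in range(k):
        for j in range(k):
            m = score_maps[i * k + j]  # map dedicated to this bin
            bin_values.append(m[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean())
    # "Vote" by averaging the k*k position-sensitive responses.
    return float(np.mean(bin_values))

# Nine 6x6 score maps; map for bin (i, j) is constant with value i*3+j.
maps = np.stack([np.full((6, 6), float(v)) for v in range(9)])
score = ps_roi_pool_score(maps, (0, 0, 6, 6), k=3)
```

In the full model this score is computed per class and the `C+1` scores are fed into a softmax; a sibling set of `4*k*k` maps produces the box-regression outputs in the same way.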
* The RoI-Pooling described above uses only averages and is hence (computationally) almost free.
* They make use of that during the training by sampling many candidates and only backpropagating on those with high losses (online hard example mining, OHEM).
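The OHEM selection step can be sketched like this (a hypothetical helper, not the authors' code): score all candidates, then keep only the highest-loss ones for the backward pass.

```python
import numpy as np

def ohem_select(losses, num_hard=128):
    """Online hard example mining: return indices of the highest-loss RoIs.

    losses: per-RoI loss values (e.g. classification + box regression).
    Only the selected RoIs contribute to the backward pass; since the
    per-RoI forward cost is almost free, many candidates can be scored.
    """
    losses = np.asarray(losses)
    order = np.argsort(losses)[::-1]   # highest loss first
    return np.sort(order[:num_hard])   # indices of the hard examples

losses = [0.1, 2.3, 0.05, 1.7, 0.4]
hard = ohem_select(losses, num_hard=2)  # keeps the two hardest RoIs
```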
* À trous trick
* In order to increase accuracy for small bounding boxes they use the à trous trick.
* That means that they use a pretrained base network (here ResNet-101), then remove a pooling layer and set the à trous rate (aka dilation) of all convolutions after the removed pooling layer to `2`.
* The à trous rate describes the distance between sampling locations of a convolution. Usually that is `1` (sampled locations are right next to each other). If it is set to `2`, there is one value "skipped" between each pair of neighbouring sampling locations.
* By doing that, the convolutions still behave as if the pooling layer existed (and therefore their weights can be reused). At the same time, they work at an increased resolution, making them more capable of classifying small objects. (Runtime increases though.)
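The effect of the dilation rate is easy to see in 1D. This toy NumPy sketch (illustration only, not the paper's 2D convolutions) shows that `dilation=2` skips one value between neighbouring taps, widening the receptive field without adding weights:

```python
import numpy as np

def conv1d_dilated(x, kernel, dilation=1):
    """1D valid convolution (correlation) with a dilation rate.

    dilation=1: taps are adjacent; dilation=2: one value is skipped
    between neighbouring taps, so the same kernel covers a wider
    receptive field without extra weights.
    """
    k = len(kernel)
    span = (k - 1) * dilation + 1  # receptive field of the dilated kernel
    out = []
    for start in range(len(x) - span + 1):
        taps = x[start:start + span:dilation]
        out.append(float(np.dot(taps, kernel)))
    return out

x = np.arange(8, dtype=float)  # [0, 1, ..., 7]
dense = conv1d_dilated(x, [1.0, 1.0, 1.0], dilation=1)  # 3-wide window
atrous = conv1d_dilated(x, [1.0, 1.0, 1.0], dilation=2)  # 5-wide window
```

A kernel trained to run *after* a stride-2 pooling layer sees inputs spaced two apart; running it with `dilation=2` on the un-pooled (higher-resolution) map preserves that spacing, which is why the pretrained weights can be reused.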
* Training of R-FCN happens similarly to Faster R-CNN.
* Accuracy is similar to the most accurate Faster R-CNN configurations, at a lower runtime of roughly 170ms per image.
* Switching to ResNet-50 decreases accuracy by about 2 percentage points mAP (at faster runtime). Switching to ResNet-152 seems to provide no measurable benefit.
* OHEM improves mAP by roughly 2 percentage points.
* À trous trick improves mAP by roughly 2 percentage points.
* Training on `k=1` (one heatmap per class) results in a failure, i.e. a model that fails to predict bounding boxes. `k=7` is slightly more accurate than `k=3`.