Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
Girshick, Ross B. and Donahue, Jeff and Darrell, Trevor and Malik, Jitendra (2014)
Paper summary (evansu): This paper presents an object detection algorithm that improves mAP on the PASCAL VOC dataset by over 20% compared to the previous state of the art. Image classification takes a whole image (or its center crop) as input. Object detection is different: the algorithm must predict a bounding box for each object in an image. To use high-capacity CNN features for detection, the proposed algorithm first generates region proposals. It then extracts CNN features from each proposal and feeds them to a set of class-specific linear SVMs. Each SVM scores whether the region contains an object of its class.
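The following toy sketch shows the proposal → features → per-class SVM pipeline described above. It rests on stated assumptions and is not the paper's implementation. `propose_regions` and `extract_features` are hypothetical stand-ins: the paper uses selective search for proposals and 4096-dimensional CNN features. The SVM weights here are made up for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), inclusive pixel coordinates


def propose_regions(image: List[List[float]]) -> List[Box]:
    """Hypothetical stand-in for a category-independent proposal method.
    Here it simply returns the left and right halves of the image."""
    h, w = len(image), len(image[0])
    return [(0, 0, w // 2 - 1, h - 1), (w // 2, 0, w - 1, h - 1)]


def extract_features(image: List[List[float]], box: Box) -> List[float]:
    """Hypothetical stand-in for the CNN: a 2-d feature (mean and max
    intensity of the region) instead of a 4096-d CNN feature vector."""
    x1, y1, x2, y2 = box
    pixels = [image[y][x] for y in range(y1, y2 + 1) for x in range(x1, x2 + 1)]
    return [sum(pixels) / len(pixels), max(pixels)]


def svm_score(w: List[float], b: float, feat: List[float]) -> float:
    """Linear SVM decision function: w . phi(region) + b."""
    return sum(wi * fi for wi, fi in zip(w, feat)) + b


def detect(
    image: List[List[float]],
    svms: Dict[str, Tuple[List[float], float]],
    threshold: float = 0.0,
) -> List[Tuple[str, Box, float]]:
    """Score every proposal with every class-specific SVM and keep
    (class, box, score) triples whose score exceeds the threshold."""
    detections = []
    for box in propose_regions(image):
        feat = extract_features(image, box)
        for cls, (w, b) in svms.items():
            s = svm_score(w, b, feat)
            if s > threshold:
                detections.append((cls, box, s))
    return detections
```

In the example below, the image has a dark left half and a bright right half. A "bright" SVM fires only on the right-hand proposal, and a "dark" SVM fires only on the left one: `detect(image, {"bright": ([1.0, 1.0], -1.0), "dark": ([-1.0, -1.0], 0.5)})`. Because every class's SVM is applied to the same shared feature vector, the expensive feature extraction runs once per proposal rather than once per class.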
The figure below shows the paper's object detection system.
The PASCAL VOC dataset is not large enough to train a high-capacity CNN. This paper therefore uses supervised pre-training on a large auxiliary dataset (ILSVRC 2012), then fine-tunes the CNN on a portion of the PASCAL VOC dataset.
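The fine-tuning step needs a label for each region proposal. In the paper, a proposal is treated as a positive for a ground-truth box's class when it overlaps that box with intersection-over-union (IoU) of at least 0.5. All other proposals are treated as background. Below is a minimal sketch of that labeling rule. It assumes boxes are given as `(x1, y1, x2, y2)` in continuous coordinates, and the helper names `iou` and `finetune_labels` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), continuous coordinates


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def finetune_labels(
    proposals: List[Box], gt: List[Tuple[str, Box]], fg_thresh: float = 0.5
) -> List[str]:
    """Label each proposal with the class of its best-overlapping ground-truth
    box if that IoU is >= fg_thresh, otherwise as 'background'."""
    labels = []
    for p in proposals:
        best_cls, best_iou = "background", 0.0
        for cls, box in gt:
            overlap = iou(p, box)
            if overlap > best_iou:
                best_cls, best_iou = cls, overlap
        labels.append(best_cls if best_iou >= fg_thresh else "background")
    return labels
```

For example, take a single ground-truth `("cat", (0, 0, 10, 10))`. A proposal covering its top half has IoU exactly 0.5 and is labeled `"cat"`, while a proposal shifted by half the box in both directions (IoU 25/175) is labeled `"background"`.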
The following table shows the detection mAP on VOC 2007 test.
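mAP is the mean, over object classes, of the per-class average precision (AP). The sketch below computes AP under the VOC 2007 convention of 11-point interpolated AP. It assumes detections for one class have already been matched to ground truth with the VOC overlap criterion, so each detection arrives as a `(confidence, is_true_positive)` pair. That input format and the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple


def precision_recall(
    dets: Sequence[Tuple[float, bool]], n_gt: int
) -> Tuple[List[float], List[float]]:
    """Cumulative recall and precision after each detection, ranked by confidence."""
    ranked = sorted(dets, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))
    return recalls, precisions


def eleven_point_ap(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Average, over recall thresholds 0.0, 0.1, ..., 1.0, of the maximum
    precision achieved at any recall >= that threshold (0 if none)."""
    total = 0.0
    for i in range(11):
        t = i / 10
        ps = [p for r, p in zip(recalls, precisions) if r >= t]
        total += max(ps) if ps else 0.0
    return total / 11


def mean_ap(aps: Sequence[float]) -> float:
    """mAP: the unweighted mean of per-class APs."""
    return sum(aps) / len(aps)
```

Consider three ranked detections (hit, miss, hit) against two ground-truth objects. Precision stays at 1.0 for the six thresholds up to recall 0.5 and is 2/3 for the remaining five, so AP = (6 + 5 × 2/3) / 11 = 28/33.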
The [R-CNN](http://arxiv.org/abs/1311.2524) paper presents a method for object detection based on convolutional neural networks (CNNs). It works by classifying region proposals (hence the "R"). The key insight was to train CNNs on a classification task and reuse the learned features for the region proposals. They do *not* use a sliding-window approach such as OverFeat. Instead, they create around 2000 category-independent region proposals per image. For each proposal, they crop that part of the image, resize (warp) the crop to fit the CNN's fixed input size, and classify it.
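The crop-and-resize step can be sketched as an anisotropic warp. The crop is stretched to a fixed output size regardless of its aspect ratio, as R-CNN does when it warps each proposal to the CNN's 227×227 input. This minimal sketch makes two simplifications. It uses nearest-neighbor sampling, which is an assumption chosen for simplicity. It also omits the extra context the paper adds around each box. `crop_and_warp` is a hypothetical name.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), inclusive pixel coordinates


def crop_and_warp(image: List[List[int]], box: Box, out_h: int, out_w: int) -> List[List[int]]:
    """Crop `box` from `image` and resize it to out_h x out_w with
    nearest-neighbor sampling, ignoring the crop's aspect ratio."""
    x1, y1, x2, y2 = box
    crop_h, crop_w = y2 - y1 + 1, x2 - x1 + 1
    out = []
    for oy in range(out_h):
        # Map the center of output row oy back to a source row inside the crop.
        sy = y1 + min(crop_h - 1, int((oy + 0.5) * crop_h / out_h))
        row = []
        for ox in range(out_w):
            sx = x1 + min(crop_w - 1, int((ox + 0.5) * crop_w / out_w))
            row.append(image[sy][sx])
        out.append(row)
    return out
```

For a real proposal, the call would be `crop_and_warp(image, box, 227, 227)`. A 1×4 strip warped to 2×2 is stretched vertically and squeezed horizontally, which is exactly the aspect-ratio distortion this kind of warping accepts in exchange for a fixed-size CNN input.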
Notable follow-ups are:
* [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15)
* [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15)