First published: 2016/04/12 (3 years ago) Abstract: The field of object detection has made significant advances riding on the
wave of region-based ConvNets, but their training procedure still includes many
heuristics and hyperparameters that are costly to tune. We present a simple yet
surprisingly effective online hard example mining (OHEM) algorithm for training
region-based ConvNet detectors. Our motivation is the same as it has always
been -- detection datasets contain an overwhelming number of easy examples and
a small number of hard examples. Automatic selection of these hard examples can
make training more effective and efficient. OHEM is a simple and intuitive
algorithm that eliminates several heuristics and hyperparameters in common use.
But more importantly, it yields consistent and significant boosts in detection
performance on benchmarks like PASCAL VOC 2007 and 2012. Its effectiveness
increases as datasets become larger and more difficult, as demonstrated by the
results on the MS COCO dataset. Moreover, combined with complementary advances
in the field, OHEM leads to state-of-the-art results of 78.9% and 76.3% mAP on
PASCAL VOC 2007 and 2012 respectively.
The problem this paper addresses is the large imbalance between the number of foreground examples and background examples in detection training sets. To make the point concrete: in sliding-window object detectors such as deformable parts models (DPM), the imbalance may be as extreme as 100,000 background examples to one annotated foreground example.
Before going into the details of hard example mining (HEM), note that HEM in essence means sorting your losses during training and training the model on the most difficult examples, which mostly means the ones with the highest loss. (An extension of this idea can be found in the Focal Loss paper.) This is a simple but powerful technique.
With this as background, the authors propose a simple but effective method for training Fast R-CNN.
Their approach is as follows:
1. For an input image at SGD iteration t, they first compute a convolutional feature map using the conv network.
2. The RoI network uses this feature map and all the input RoIs to do a forward pass.
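3. Hard examples are selected by sorting the RoIs by loss and taking the B/N examples for which the current network performs worst. (Here B is the number of RoIs in a mini-batch and N is the number of images in that mini-batch, so B/N RoIs are kept per image.)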
4. The authors note that co-located RoIs with high overlap are likely to have correlated losses. Overlapping RoIs also project onto mostly the same region of the conv feature map, because the feature map has a lower resolution than the input image. This can lead to loss double counting. To deal with this, they use standard non-maximum suppression (NMS).
5. NMS works here as follows: it iteratively selects the RoI with the highest loss and removes all lower-loss RoIs that have high overlap with the selected region. They use an IoU threshold of 0.7.
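The selection in steps 3 to 5 can be sketched in plain Python. This is only a rough illustration, not the authors' implementation: the function names `iou` and `select_hard_rois`, the `(x1, y1, x2, y2)` box format and the `num_keep` parameter (standing in for B/N) are assumptions made for the example. It walks the RoIs from highest to lowest loss, skips any RoI that overlaps an already kept one by more than the IoU threshold, and stops once enough hard examples have been collected.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_hard_rois(
    boxes: Sequence[Box],
    losses: Sequence[float],
    num_keep: int,
    iou_threshold: float = 0.7,
) -> List[int]:
    """Return indices of up to `num_keep` hard RoIs after loss-based NMS."""
    # Visit RoIs from highest loss to lowest loss.
    order = sorted(range(len(boxes)), key=lambda i: losses[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if len(kept) == num_keep:
            break
        # Greedy NMS: drop an RoI that overlaps a higher-loss kept RoI too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (50, 50, 60, 60), (100, 100, 110, 110)]
    losses = [2.0, 1.9, 0.5, 1.0]
    # The second box overlaps the first almost entirely, so it is suppressed
    # even though its loss is the second highest.
    print(select_hard_rois(boxes, losses, num_keep=2))
```

Only the RoIs returned by such a selection would then contribute to the backward pass; the forward pass over all RoIs is used purely to rank them.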
First published: 2017/08/07 (2 years ago) Abstract: The highest accuracy object detectors to date are based on a two-stage
approach popularized by R-CNN, where a classifier is applied to a sparse set of
candidate object locations. In contrast, one-stage detectors that are applied
over a regular, dense sampling of possible object locations have the potential
to be faster and simpler, but have trailed the accuracy of two-stage detectors
thus far. In this paper, we investigate why this is the case. We discover that
the extreme foreground-background class imbalance encountered during training
of dense detectors is the central cause. We propose to address this class
imbalance by reshaping the standard cross entropy loss such that it
down-weights the loss assigned to well-classified examples. Our novel Focal
Loss focuses training on a sparse set of hard examples and prevents the vast
number of easy negatives from overwhelming the detector during training. To
evaluate the effectiveness of our loss, we design and train a simple dense
detector we call RetinaNet. Our results show that when trained with the focal
loss, RetinaNet is able to match the speed of previous one-stage detectors
while surpassing the accuracy of all existing state-of-the-art two-stage
detectors. Code is at: https://github.com/facebookresearch/Detectron.
In object detection, gains in speed and accuracy mostly come from network architecture changes. This paper takes a different route toward that goal: the authors introduce a new loss function called focal loss.
The authors identify class imbalance as the main obstacle preventing one-stage detectors from achieving results as good as those of two-stage detectors.
The loss function they introduce is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as the confidence in the correct class increases.
They add a modulating factor, shown in the image below, to the cross-entropy loss: https://i.imgur.com/N7R3M9J.png
The resulting loss looks like this: https://i.imgur.com/kxC8NCB.png
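For readers who cannot load the images, the formulas can be written out as below. The notation is assumed here rather than taken from the text above: p is the predicted probability of the positive class, y is the ground-truth label, and gamma >= 0 is the focusing parameter that controls how strongly easy examples are down-weighted.

```latex
\begin{align}
p_t &= \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \\
\mathrm{CE}(p_t) &= -\log(p_t) \\
\mathrm{FL}(p_t) &= -(1 - p_t)^{\gamma} \log(p_t)
\end{align}
```

As a quick check of the scaling behavior: with gamma = 2, a well-classified example with p_t = 0.9 gets a modulating factor of (1 - 0.9)^2 = 0.01, so its loss is 100 times smaller than under plain cross-entropy, while a hard example with p_t = 0.1 keeps a factor of 0.81.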
In experiments, though, they add an additional alpha weighting term to it, because it gives them better results.
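A minimal sketch of the alpha-weighted form for a single binary prediction is given below, assuming the usual definition FL = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name `focal_loss`, the `eps` clamp and the scalar interface are choices made for this illustration only; the defaults gamma = 2 and alpha = 0.25 are the settings the paper reports as working best, but they should be treated as tunable.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25,
               eps: float = 1e-12) -> float:
    """Alpha-weighted focal loss for one binary prediction.

    p: predicted probability of the positive class, in [0, 1].
    y: ground-truth label, 1 for foreground and 0 for background.
    """
    # p_t is the probability the model assigns to the correct class.
    p_t = p if y == 1 else 1.0 - p
    # alpha_t weights positives by alpha and negatives by (1 - alpha).
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0); eps is an implementation detail of this sketch.
    p_t = min(max(p_t, eps), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9, 0.99):
        print(p, focal_loss(p, y=1))
```

Setting gamma to 0 removes the modulating factor and leaves an alpha-weighted cross-entropy, which makes it a convenient baseline for comparison.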
The detector, RetinaNet, is a single unified network composed of a backbone network and two task-specific subnetworks. The backbone computes the feature maps for the input image. The first subnetwork performs object classification on the backbone's output, and the second performs bounding box regression.
The backbone they use is a Feature Pyramid Network (FPN), which they build on top of ResNet.