[link]
The problem this paper addresses is that detection training sets have a large imbalance between the number of foreground examples and background examples. To make the point concrete: in sliding-window object detectors such as the deformable parts model, the imbalance may be as extreme as 100,000 background examples to one annotated foreground example.

Before I go into the details of hard example mining (HEM), I just want to note that in essence it means sorting your per-example losses during training and training the model on the most difficult examples, which mostly means the ones with the highest loss. (An extension of this idea can be found in the Focal Loss paper.) This is a simple but powerful technique.

With this as background, the authors propose a simple but effective method to train a Fast R-CNN detector. Their approach is as follows:

1. For an input image at SGD iteration t, they first compute a convolutional feature map using the conv network.
2. The RoI network uses this feature map and all the input RoIs to do a forward pass.
3. The RoIs are sorted by loss, and the B/N examples on which the current network performs worst are taken as the hard examples. (Here B is the mini-batch size in RoIs and N is the number of images per mini-batch.)
4. While doing this, the researchers note that co-located RoIs with high overlap are likely to have correlated losses. Overlapping RoIs also project onto mostly the same region of the conv feature map, because the feature map is a lower-resolution representation of the input image. This can lead to loss double counting. To deal with this, they use standard non-maximum suppression (NMS).
5. NMS works here by iteratively selecting the RoI with the highest loss and removing all lower-loss RoIs that have high overlap with the selected region. They use an IoU threshold of 0.7.
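
The selection procedure described above (sort RoIs by loss, suppress lower-loss RoIs that overlap a selected one by more than 0.7 IoU, keep the hardest few) can be sketched in NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `iou_one_to_many` and `select_hard_rois`, the `[x1, y1, x2, y2]` box format, and the toy boxes and losses are all assumptions made for the example.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def select_hard_rois(boxes, losses, num_keep, iou_thresh=0.7):
    """Return indices of up to num_keep high-loss RoIs after loss-based NMS.

    num_keep plays the role of B/N (hard RoIs kept per image).
    """
    order = np.argsort(-losses)  # indices sorted by loss, highest first
    keep = []
    while order.size > 0 and len(keep) < num_keep:
        top = order[0]
        keep.append(int(top))
        rest = order[1:]
        # Drop lower-loss RoIs that overlap the selected RoI too much.
        overlaps = iou_one_to_many(boxes[top], boxes[rest])
        order = rest[overlaps <= iou_thresh]
    return keep

# Toy data (hypothetical): RoI 1 overlaps RoI 0 with IoU 0.9.
boxes = np.array([[0, 0, 10, 10],
                  [0, 0, 10, 9],
                  [20, 20, 30, 30],
                  [50, 50, 60, 60]], dtype=float)
losses = np.array([0.9, 0.8, 0.3, 0.1])

hard = select_hard_rois(boxes, losses, num_keep=2)
print(hard)  # [0, 2]
```

RoI 1 has the second-highest loss but is dropped because it nearly coincides with RoI 0, so the next hard example picked is RoI 2. In a real training loop, only the selected RoIs would contribute to the backward pass.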
[link]
In object detection, gains in speed and accuracy mostly come from changes to the network architecture. This paper takes a different route toward that goal: the authors introduce a new loss function called focal loss.

The authors identify class imbalance as the main obstacle keeping one-stage detectors from achieving results as good as those of two-stage detectors. The loss function they introduce is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as the confidence in the correct class increases.

They add a modulating factor to the cross-entropy loss, as shown in the image below:

https://i.imgur.com/N7R3M9J.png

which ends up looking like this:

https://i.imgur.com/kxC8NCB.png

In their experiments, though, they add an additional alpha term because it gives them better results.

**RetinaNet**

The detector is a single unified network composed of a backbone network and two task-specific subnetworks. The backbone computes the feature maps for the input image. The first subnetwork performs object classification on the backbone's output, and the second performs bounding box regression. The backbone they use is a Feature Pyramid Network, which they build on top of ResNet.
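
To make the focal loss described above concrete, here is a minimal NumPy sketch of the binary form with the alpha term, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), where p_t is the predicted probability of the true class. The function name `focal_loss`, the `eps` clipping, and the example inputs are assumptions for illustration; the defaults `gamma=2.0` and `alpha=0.25` are the values I recall the paper reporting as working best, so treat them as assumptions too.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Per-example binary focal loss.

    p: predicted probability of the positive class, in [0, 1]
    y: ground-truth labels, 1 for foreground and 0 for background
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    # (1 - p_t) ** gamma is the modulating factor: it goes to zero as the
    # model becomes confident in the correct class.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# Hypothetical predictions: an easy positive (p = 0.9) and a hard positive (p = 0.1).
easy, hard = focal_loss([0.9, 0.1], [1, 1])
print(easy < hard)  # True
```

With `gamma=0` the modulating factor is 1 and the expression reduces to alpha-weighted cross-entropy, which makes it easy to compare the two losses on the same predictions.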