Fast R-CNN on ShortScience.org

Fast RCNN is a region-proposal-based net for object detection tasks.

##### Input & Output
The inputs to a Fast RCNN are the image and its region proposals (generated using Selective Search). The net has 2 outputs per proposal: a probability distribution over all object classes plus background (e.g. 21 classes for Pascal VOC'12) and the corresponding bounding box parameters for each object class.

##### Architecture
Turning a deep net into its Fast RCNN version needs 3 major modifications. E.g. for VGG'16:
1. The final maxpool layer is replaced by a ROI pooling layer, placed before the fully connected layers
2. The final FC layer is replaced by 2 sibling layers: one gives a softmax output over the classes, the other predicts an encoding of 4 bounding box parameters (x, y, width, height) w.r.t. the region proposals
3. The input is modified to take 2 inputs: images and the corresponding proposals

**ROI Pooling layer** - The most notable contribution of the paper. It max-pools the features inside a proposed region into a fixed size (7 x 7 for the VGG'16 version of Fast RCNN). The intuition behind the layer is to make the net faster than SPPNets (which use spatial pyramid pooling) and RCNN.
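
The pooling step can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the function name `roi_max_pool`, the `(C, H, W)` feature-map layout, the inclusive RoI coordinates and the floor/ceil bin edges are all assumptions made for the example.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool one RoI of a (C, H, W) feature map to (C, out_h, out_w).

    roi = (x1, y1, x2, y2), inclusive, in feature-map coordinates
    (assumed to lie inside the feature map).
    """
    x1, y1, x2, y2 = roi
    roi_h = max(y2 - y1 + 1, 1)
    roi_w = max(x2 - x1 + 1, 1)

    def ceil_div(a, b):
        return -(-a // b)

    channels = feature_map.shape[0]
    pooled = np.zeros((channels, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        # Floor/ceil bin edges guarantee every bin covers at least one row.
        ys = y1 + (i * roi_h) // out_h
        ye = y1 + ceil_div((i + 1) * roi_h, out_h)
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = x1 + ceil_div((j + 1) * roi_w, out_w)
            # Max over the sub-window, independently for each channel.
            pooled[:, i, j] = feature_map[:, ys:ye, xs:xe].max(axis=(1, 2))
    return pooled
```

With these bin edges, an RoI smaller than the output grid still produces a full 7 x 7 output, because neighbouring bins simply overlap and reuse the same feature cells. That is a property of this sketch; other bin-edge conventions are possible.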

##### Results
The net is trained with a joint multi-task loss (log loss on the class probabilities + smooth L1 loss on the bounding box parameters).
The results were strong: on the VOC '07, '10 & '12 datasets, Fast RCNN outperformed the compared nets in terms of mAP.
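
As a rough sketch of how such a two-part loss can be computed for a single region proposal: the code below uses the log loss for the class and the smooth L1 form from the paper for the box term, which is only applied to non-background proposals. The function names, the array shapes and the convention that class 0 is background are assumptions made for this example.

```python
import numpy as np

def smooth_l1(x):
    """0.5 * x^2 where |x| < 1, |x| - 0.5 elsewhere (elementwise)."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def multi_task_loss(probs, label, box_pred, box_target, lam=1.0):
    """Loss for one proposal.

    probs:      (K+1,) softmax output, index 0 assumed to be background
    label:      ground-truth class index
    box_pred:   (K+1, 4) per-class box regression outputs
    box_target: (4,) regression target for the true class
    lam:        weight balancing the two terms
    """
    cls_loss = -np.log(probs[label])
    if label == 0:
        # Background proposals have no box target, so only the class term counts.
        return cls_loss
    loc_loss = smooth_l1(box_pred[label] - box_target).sum()
    return cls_loss + lam * loc_loss
```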

[link] Summary by Joseph Paul Cohen 8 years ago

This method is based on improving the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}

1. Where R-CNN trained with separate objective functions in separate stages, Fast R-CNN combines the localization and classification losses into a single "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015} called the RoI pooling layer, which scales each region's features to a fixed size so the image regions don't have to be rescaled before being sent as input to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop through the RoI pooling layer routes each incoming gradient to the input value that was the argmax of its sub-window; gradients from overlapping RoIs that selected the same input value are summed.
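
Written out, the backward rule from point 3 looks as follows. The notation is introduced here for illustration: $x_i$ is an input activation of the RoI pooling layer, $y_{rj}$ is the $j$-th pooled output of RoI $r$, $\mathcal{R}(r, j)$ is the set of input indices in that output's sub-window, and $L$ is the loss.

$$
\frac{\partial L}{\partial x_i} = \sum_{r} \sum_{j} \big[\, i = i^*(r, j) \,\big] \frac{\partial L}{\partial y_{rj}},
\qquad
i^*(r, j) = \operatorname*{argmax}_{i' \in \mathcal{R}(r, j)} x_{i'}
$$

The bracket is 1 when $x_i$ was the maximum selected for $y_{rj}$ and 0 otherwise, so an activation that wins in several overlapping RoIs accumulates all of their gradients.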

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}

[link] Summary by Alexander Jung 6 years ago

* The original R-CNN had three major disadvantages:
1. Two-stage training pipeline: Instead of only training a CNN, one first had to train a CNN and then multiple SVMs.
2. Expensive training: Training was slow and required lots of disk space (feature vectors needed to be written to disk for all region proposals (2000 per image) before training the SVMs).
3. Slow test: Each region proposal had to be handled independently.
* Fast R-CNN is an improved version of R-CNN and tackles the mentioned problems.
* It no longer uses SVMs, only CNNs (single-stage).
* It does one single feature extraction per image instead of per region, making it much faster (9x faster at training, 213x faster at test).
* It is more accurate than R-CNN.

### How
* The basic architecture, training and testing methods are mostly copied from R-CNN.
* For each image at test time they do:
* They generate region proposals via selective search.
* They feed the image once through the convolutional layers of a pre-trained network, usually VGG16.
* For each region proposal they extract the respective region from the features generated by the network.
* The regions can have different sizes, but the following steps need fixed size vectors. So each region is downscaled via max-pooling so that it has a size of 7x7 (so apparently they ignore regions of sizes below 7x7...?).
* This is called Region of Interest Pooling (RoI-Pooling).
* During the backward pass, partial derivatives are passed to the maximum value (as usual in max pooling). Those derivative values are summed up over the different regions (in the same image) that selected the same value.
* They reshape the 7x7 regions to vectors of length `F*7*7`, where `F` was the number of filters in the last convolutional layer.
* They feed these vectors through another network which predicts:
1. The class of the region (including background class).
2. A refinement of the region proposal's bounding box: scale-invariant offsets of the box center (x and y) plus log-space scaling of height and width (i.e. it fine-tunes the region proposal's bounding box). These values are predicted once for every class (so `K*4` values).
* Architecture as image:
* ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Fast_R-CNN__architecture.jpg?raw=true "Architecture")
* Sampling for training
* Efficiency
* If batch size is `B` it is inefficient to sample regions proposals from `B` images as each image will require a full forward pass through the base network (e.g. VGG16).
* It is much more efficient to use few images to share most of the computation between region proposals.
* They use two images per batch (each 64 region proposals) during training.
* This technique introduces correlations between examples in batches, but they did not observe any problems from that.
* They call this technique "hierarchical sampling" (first images, then region proposals).
* IoUs
* Positive examples for specific classes during training are region proposals that have an IoU with ground truth bounding boxes of `>=0.5`.
* Examples for background region proposals during training have a maximum IoU with the ground truth boxes in the interval `[0.1, 0.5)`.
* Not picking IoUs below 0.1 is similar to hard negative mining.
* They use 25% positive examples, 75% negative/background examples per batch.
* They apply horizontal flipping as data augmentation, nothing else.
* Outputs
* For their class predictions they use a simple softmax with negative log likelihood.
* For their bounding box regression they use a smooth L1 loss (similar to absolute error, but switches to squared error for small values, i.e. absolute differences below 1).
* Smooth L1 loss is less sensitive to outliers and less likely to suffer from exploding gradients.
* The smooth L1 loss is only active for positive examples (not background examples). (Not active means that it is zero.)
* Training schedule
* They use SGD.
* They train 30k batches with learning rate 0.001, then 0.0001 for another 10k batches. (On Pascal VOC, they use more batches on larger datasets.)
* They use twice the learning rate for the biases.
* They use momentum of 0.9.
* They use weight decay of 0.0005.
* Truncated SVD
* The final network for class prediction and bounding box regression has to be applied to every region proposal.
* It contains one large fully connected hidden layer and one fully connected output layer (`K+1` classes plus `K*4` regression values).
* For 2000 proposals that becomes slow.
* So they compress the layers after training to fewer weights via truncated SVD.
* A weights matrix is approximated via ![T-SVD equation](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Fast_R-CNN__tsvd.jpg?raw=true "T-SVD equation")
* U (`u x t`) are the first `t` left-singular vectors of W.
* Sigma is a `t x t` diagonal matrix of the top `t` singular values.
* V (`v x t`) are the first `t` right-singular vectors of W.
* W is then replaced by two layers: One contains `Sigma V^T` as weights (no biases), the other contains `U` as weights (with original biases).
* Parameter count goes down to `t(u+v)` from `uv`.
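
The truncated SVD compression described above can be sketched with NumPy. This is an illustration under assumptions: the function names are made up for the example, `W` is taken to have shape `(u, v)` and act on column vectors, and the bias is added after the second layer as in the bullet list.

```python
import numpy as np

def truncated_svd_layers(W, t):
    """Split a (u, v) weight matrix into two smaller layers of rank t."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    first = s[:t, None] * Vt[:t]   # Sigma_t V^T, shape (t, v), no bias
    second = U[:, :t]              # U, shape (u, t), keeps the original bias
    return first, second

def compressed_forward(x, first, second, b):
    """Apply the two compressed layers to an input vector x of length v."""
    return second @ (first @ x) + b
```

For a weight matrix whose rank is at most `t` the two layers reproduce `W @ x + b` up to floating-point error; for larger ranks the result is the best rank-`t` approximation in the least-squares sense.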

### Results
* They try three base models:
* AlexNet (Small, S)
* VGG-CNN-M-1024 (Medium, M)
* VGG16 (Large, L)
* On VGG16 and Pascal VOC 2007, compared to original R-CNN:
* Training time down to 9.5h from 84h (8.8x faster).
* Test rate *with SVD* (1024 singular values) improves from 47 seconds per image to 0.22 seconds per image (213x faster).
* Test rate *without SVD* improves similarly to 0.32 seconds per image.
* mAP improves from 66.0% to 66.6% (66.9% without SVD).
* Per class accuracy results:
* ![VOC2012 results](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Fast_R-CNN__pvoc2012.jpg?raw=true "VOC2012 results")
* Fixing the weights of VGG16's convolutional layers and only fine-tuning the fully connected layers (those are applied to each region proposal), decreases the accuracy to 61.4%.
* This decrease in accuracy is most significant for the later convolutional layers, but marginal for the first layers.
* Therefore they only train the convolutional layers starting with `conv3_1` (9 out of 13 layers), which speeds up training.
* Multi-task training
* Training models on classification and bounding box regression instead of only on classification improves the mAP (from 62.6% to 66.9%).
* Doing this in one jointly trained model instead of two separate models (one for classification, one for bounding box regression) increases mAP by roughly 2-3 percentage points.
* They did not find a significant benefit of training the model on multiple scales (e.g. same image sometimes at 400x400, sometimes at 600x600, sometimes at 800x800 etc.).
* Note that their raw CNN (everything before RoI-Pooling) is fully convolutional, so they can feed the images at any scale through the network.
* Increasing the amount of training data seemed to improve mAP a bit, but not as much as one might hope for.
* Using a softmax loss instead of an SVM seemed to marginally increase mAP (0-1 percentage points).
* Using more region proposals from selective search does not simply increase mAP. Instead it can lead to higher recall, but lower precision.
* ![Proposal schemes](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Fast_R-CNN__proposal_schemes.jpg?raw=true "Proposal schemes")
* Using densely sampled region proposals (as in sliding window) significantly reduces mAP (from 59.2% to 52.9%). If SVMs instead of softmaxes are used, the results are even worse (49.3%).

[link] Summary by Henry Z Lo 8 years ago

This paper is awesome in that it is full of content.

They replace W with its TSVD.  When t, the reduced rank, is small, it saves computation time because you multiply smaller matrices twice rather than multiplying bigger matrices once.

In terms of units in hidden layers, they turn n->m into n->t->m

This only works for the forward pass though.  If you were to train this, you would only learn a rank t matrix.  In which case, there would be no reason to have the t->m layer.  Unless you want more nonlinearities, but less rank; haven't seen that before.
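
To get a feel for the saving, here is the multiply count for one layer in the n->t->m notation above. The sizes are an assumption for illustration: a first fully connected layer with $n = 25088$ inputs and $m = 4096$ outputs (as in VGG16) and $t = 1024$ retained singular values.

$$
\begin{aligned}
\text{original } (n \to m):\quad & n\,m = 25088 \times 4096 = 102{,}760{,}448 \\
\text{compressed } (n \to t \to m):\quad & t\,(n + m) = 1024 \times 29184 = 29{,}884{,}416
\end{aligned}
$$

Under these assumed sizes that is roughly 3.4 times fewer multiplications per region proposal for this layer.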

[link] Summary by Martin Thoma 7 years ago

## See also
Related papers:
* [R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma)
* [Mask R-CNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17)

Blog posts:
* Dhruv Parthasarathy: [A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN](https://blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4)

[link] Summary by Léo Paillier 6 years ago

Improves on [R-CNN](https://arxiv.org/abs/1311.2524) and [SPPnet](https://arxiv.org/abs/1406.4729) with easier and faster training.

A Region-based Convolutional Neural Network (R-CNN) basically takes as input an image and several possible objects (each corresponding to a Region of Interest) and scores each of them.

## Architecture:

The feature map is computed for the whole image and then for each region of interest a new fixed-length feature vector is computed using max-pooling. From it two predictions are made for classification and bounding-box offsets.

[![screen shot 2017-04-14 at 12 46 38 pm](https://cloud.githubusercontent.com/assets/17261080/25041460/6e7cba40-2110-11e7-8650-faae2a6b0a92.png)](https://cloud.githubusercontent.com/assets/17261080/25041460/6e7cba40-2110-11e7-8650-faae2a6b0a92.png)
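
The bounding-box offsets are expressed relative to each region proposal. A minimal sketch of that encoding and its inverse is given below, assuming the parameterization introduced for R-CNN (center offsets scaled by the proposal size, log-space width and height ratios). Boxes are `(center_x, center_y, width, height)` tuples and the function names are made up for this example.

```python
import numpy as np

def encode_box(proposal, gt):
    """Regression target (tx, ty, tw, th) that maps proposal onto gt."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([
        (gx - px) / pw,    # center shift, scaled by proposal width
        (gy - py) / ph,    # center shift, scaled by proposal height
        np.log(gw / pw),   # log-space width ratio
        np.log(gh / ph),   # log-space height ratio
    ])

def decode_box(proposal, t):
    """Apply predicted offsets t to a proposal to obtain the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return np.array([px + tx * pw, py + ty * ph,
                     pw * np.exp(tw), ph * np.exp(th)])
```

At test time the network's predicted offsets for the winning class would be passed through `decode_box` together with the proposal to obtain the final box.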

## Results:

By sharing computation between RoIs of the same image and allowing simple SGD training, it substantially improves training performance, although at test time it's still not as fast as YOLO9000.
