[link]
This method improves on the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}:

1. Where R-CNN has two separate objective functions, Fast R-CNN combines the localization and classification losses into a single "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015}, the RoI pooling layer, which rescales each region of interest to a fixed size so that images do not have to be rescaled before being fed as input to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell." (see the sketch below)
3. Backpropagation through the RoI pooling layer routes each output cell's gradient back to the input element that was the argmax of its sub-window during the forward pass.

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}.
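A minimal NumPy sketch of RoI max pooling as quoted above. The function and variable names, and the floor/ceil rounding of sub-window boundaries, are my own assumptions, not taken from the paper:

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool a single RoI of a feature map (C, height, width) down to (C, H, W).

    roi: (x0, y0, x1, y1) in feature-map coordinates.
    """
    H, W = output_size
    x0, y0, x1, y1 = roi
    region = feature_map[:, y0:y1, x0:x1]
    C, h, w = region.shape
    out = np.zeros((C, H, W), dtype=feature_map.dtype)
    # Divide the h x w RoI into an H x W grid of sub-windows of size ~(h/H x w/W)
    for i in range(H):
        for j in range(W):
            ys, ye = int(np.floor(i * h / H)), int(np.ceil((i + 1) * h / H))
            xs, xe = int(np.floor(j * w / W)), int(np.ceil((j + 1) * w / W))
            # Max-pool each sub-window into the corresponding output grid cell
            out[:, i, j] = region[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out
```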
[link]
# Object detection system overview

https://i.imgur.com/vd2YUy3.png

The system 1. takes an input image, 2. extracts around 2000 bottom-up region proposals, 3. computes features for each proposal using a large convolutional neural network (CNN), and then 4. classifies each region using class-specific linear SVMs.

* R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010.
* On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat, which had the previous best result at 24.3%.

## Two challenges faced in object detection

1. The localization problem
2. Labeling the data

1 Localization problem:
* One approach frames localization as a regression problem; that approach reports a mAP of 30.5% on VOC 2007, compared to the 58.5% achieved by R-CNN.
* An alternative is to build a sliding-window detector. However, with five convolutional layers the units have very large receptive fields (195 x 195 pixels) and strides (32 x 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

2 Labeling the data:
* The conventional solution to scarce labeled data is unsupervised pre-training followed by supervised fine-tuning.
* Here, supervised pre-training on a large auxiliary dataset (ILSVRC) is followed by domain-specific fine-tuning on a small dataset (PASCAL).
* Fine-tuning for detection improves mAP performance by 8 percentage points.
* Stochastic gradient descent via backpropagation is effective for training convolutional neural networks (CNNs).

## Object detection with R-CNN

This system consists of three modules:

* The first generates category-independent region proposals. These proposals define the set of candidate detections available to the detector.
* The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third module is a set of class-specific linear SVMs.

Module design

1 Region proposals
* Related prior work includes detecting mitotic cells by applying a CNN to regularly-spaced square crops.
* R-CNN uses the selective search method in "fast mode" (Capture All Scales, Diversification, Fast to Compute).
* The time spent computing region proposals and features is 13 s/image on a GPU or 53 s/image on a CPU.

2 Feature extraction
* A 4096-dimensional feature vector is extracted from each region proposal using the Caffe implementation of the CNN.
* Features are computed by forward propagating a mean-subtracted 227x227 RGB image through five convolutional layers and two fully connected layers.
* All pixels in a tight bounding box around the proposal are warped to the required size.
* The feature matrix is typically 2000x4096.

3 Test-time detection
* At test time, run selective search on the test image to extract around 2000 region proposals (selective search’s “fast mode” in all experiments).
* Warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, score each extracted feature vector using the SVM trained for that class.
* Given all scored regions in an image, apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold (see the sketch below).
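A minimal NumPy sketch of the class-wise greedy non-maximum suppression step described above. The function names are my own, and the fixed default threshold is a simplification (the paper learns the NMS threshold rather than hard-coding it):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, with boxes given as (x0, y0, x1, y1)."""
    x0 = np.maximum(box[0], boxes[:, 0])
    y0 = np.maximum(box[1], boxes[:, 1])
    x1 = np.minimum(box[2], boxes[:, 2])
    y1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Keep the highest-scoring box, reject boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep
```

This would be run once per class over that class's scored regions, since the suppression is applied for each class independently.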
## Training

1 Supervised pre-training
* The CNN is pre-trained on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding-box labels are not available for this data).

2 Domain-specific fine-tuning
* The CNN parameters are then fine-tuned with stochastic gradient descent (SGD) using only warped region proposals, with a learning rate of 0.001.

3 Object category classifiers
* Regions are labeled using an intersection-over-union (IoU) overlap threshold of 0.3.
* Once features are extracted and training labels are applied, one linear SVM is optimized per class.
* The standard hard negative mining method is adopted, since the training data is too large to fit in memory.

### Results on PASCAL VOC 2010-12

1 VOC 2010
* Compared against four strong baselines: SegDPM, DPM, UVA, and Regionlets.
* R-CNN achieves a large improvement in mAP, from 35.1% to 53.7%, while also being much faster.

https://i.imgur.com/0dGX9b7.png

2 ILSVRC2013 detection
* R-CNN was also run on the 200-class ILSVRC2013 detection dataset.
* R-CNN achieves a mAP of 31.4%.

https://i.imgur.com/GFbULx3.png

#### Performance layer-by-layer, without fine-tuning

1 Layer pool5
* The max-pooled output of the network’s fifth and final convolutional layer.
* The pool5 feature map is 6 x 6 x 256 = 9216-dimensional.
* Each pool5 unit has a receptive field of 195x195 pixels in the original 227x227 pixel input.

2 Layer fc6
* Fully connected to pool5.
* It multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases.

3 Layer fc7
* Implemented by multiplying the features computed by fc6 by a 4096x4096 weight matrix, similarly adding a vector of biases and applying half-wave rectification (a toy sketch of fc6/fc7 appears at the end of this summary).

#### Performance layer-by-layer, with fine-tuning
* The CNN’s parameters are fine-tuned on PASCAL.
* Fine-tuning increases mAP by 8.0 percentage points, to 54.2%.

### Network architectures
* A 16-layer deep network, consisting of 13 layers of 3 x 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. This network is referred to as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet.
* R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%.
* The drawback is compute time: the forward pass through O-Net takes considerably longer than through T-Net.

1 The ILSVRC2013 detection dataset
* The dataset is split into three sets: train (395,918 images), val (20,121), and test (40,152).

#### CNN features for segmentation
* full R-CNN: the first strategy (full) ignores the region’s shape and computes CNN features directly on the warped window. Two regions might have very similar bounding boxes while having very little overlap.
* fg R-CNN: the second strategy (fg) computes CNN features only on a region’s foreground mask. The background is replaced with the mean input so that background regions are zero after mean subtraction.
* full+fg R-CNN: the third strategy (full+fg) simply concatenates the full and fg features.

https://i.imgur.com/n1bhmKo.png
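A minimal NumPy sketch of the fc6/fc7 computations described above. The weights are random placeholders; only the shapes follow the description:

```python
import numpy as np

rng = np.random.default_rng(0)

# pool5 feature map (6 x 6 x 256) reshaped as a 9216-dimensional vector
pool5 = rng.standard_normal((6, 6, 256)).reshape(9216)
W6, b6 = rng.standard_normal((4096, 9216)) * 0.01, np.zeros(4096)
W7, b7 = rng.standard_normal((4096, 4096)) * 0.01, np.zeros(4096)

# fc6: 4096x9216 weight matrix times pool5, plus biases, then half-wave rectification (ReLU)
fc6 = np.maximum(W6 @ pool5 + b6, 0)
# fc7: 4096x4096 weight matrix times fc6, plus biases, then half-wave rectification
fc7 = np.maximum(W7 @ fc6 + b7, 0)
print(fc6.shape, fc7.shape)  # (4096,) (4096,)
```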
[link]
This paper proposes a method for 3D human pose estimation in video based on dilated temporal convolutions applied to 2D keypoints (the input to the network). The 2D keypoints can be obtained using any person keypoint detector; in the paper, Mask R-CNN with a ResNet-101 backbone, pre-trained on COCO and fine-tuned on 2D projections from Human3.6M, is used.

https://i.imgur.com/CdQONiN.png

The poses are represented as 2D keypoint coordinates, in contrast to heatmaps (i.e. a Gaussian placed at each keypoint's 2D location). Thus, 1D convolutions over the time series are applied instead of 2D convolutions over heatmaps. The model is a fully convolutional architecture with residual connections that takes a sequence of 2D poses (the concatenated $(x, y)$ coordinates of the joints in each frame) as input and transforms them through temporal convolutions (a rough sketch of this idea appears at the end of this summary).

https://i.imgur.com/tCZvt6M.png

The `Slice` layer in the residual connection pads (or slices) the sequence with replicas of the boundary frames (on both the left and the right) to match the dimensions of the main block, since zero-padding is not used in the convolution operations.

3D pose estimation is a difficult task, particularly due to the limited amount of data available online. Therefore, the authors propose a semi-supervised approach to training the 2D->3D pose estimation by exploiting unlabeled video. Specifically, 2D keypoints are detected in the unlabeled video with any keypoint detector, 3D keypoints are predicted from them, and these 3D points are reprojected back to 2D (camera intrinsic parameters are required). This idea is similar to the cycle consistency in [CycleGAN](https://junyanz.github.io/CycleGAN/), for instance.

https://i.imgur.com/CBHxFOd.png

In the semi-supervised part (bottom part of the image above), training penalizes reprojected 2D keypoints that are far from the original input. A weighted mean per-joint position error (WMPJPE) loss, weighted by the inverse of the depth to the object (since far objects should contribute less to the training than close ones), is used as the optimization goal. The two networks (`supervised` above, `semi-supervised` below) have the same architecture but do not share any weights. They are jointly optimized, where the `semi-supervised` part serves as a regularizer. They communicate through a path that aims to make the mean bone lengths of the two branches match.

An interesting tendency is observed from the MPJPE analysis with different amounts of supervised and unsupervised data available: the `semi-supervised` approach becomes more effective when less labeled data is available.

https://i.imgur.com/bHpVcSi.png

Additionally, the error is reduced when the ground-truth 2D keypoints are used. This means that a robust and accurate 2D keypoint detector is essential for accurate 3D pose estimation in this setting.

https://i.imgur.com/rhhTDfo.png
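A rough PyTorch sketch of dilated 1D temporal convolutions over a sequence of 2D poses, with the residual path sliced to match the shrinking main path. The layer sizes, dilation choices, and window length here are my own illustrative picks, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Dilated 1D convolutions over time with a sliced residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels), nn.ReLU(),
        )
        self.crop = dilation  # frames lost on each side by the 3-tap dilated conv

    def forward(self, x):                              # x: (batch, channels, frames)
        residual = x[:, :, self.crop:-self.crop]       # slice the skip path to match
        return residual + self.conv(x)

# 17 joints with (x, y) each -> 34 input channels; predict 17 * 3 values per frame
num_joints = 17
model = nn.Sequential(
    nn.Conv1d(num_joints * 2, 1024, kernel_size=3),
    nn.ReLU(),
    TemporalBlock(1024, dilation=3),
    TemporalBlock(1024, dilation=9),
    nn.Conv1d(1024, num_joints * 3, kernel_size=1),
)

poses_2d = torch.randn(1, num_joints * 2, 243)  # a window of 243 frames
print(model(poses_2d).shape)  # fewer frames remain, since no zero-padding is used
```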
[link]
## General stuff about face recognition

Face recognition has 4 main tasks:

* **Face detection**: Given an image, draw a rectangle around every face
* **Face alignment**: Transform a face to be in a canonical pose
* **Face representation**: Find a representation of a face which is suitable for follow-up tasks (small size, computationally cheap to compare, invariant to irrelevant changes)
* **Face verification**: Images of two faces are given. Decide if it is the same person or not (a toy sketch using the learned representation appears at the end of this summary).

The face verification task is sometimes (more simply) a face classification task (given a face, decide which of a fixed set of people it is).

Datasets being used are:

* **LFW** (Labeled Faces in the Wild): 97.35% accuracy; 13,233 web photos of 5,749 celebrities
* **YTF** (YouTube Faces): 3,425 YouTube videos of 1,595 subjects
* **SFC** (Social Face Classification): 4.4 million labeled faces from 4,030 people, 800 to 1,200 faces each
* **USF** (Human-ID database): 3D scans of faces

## Ideas in this paper

This paper deals with face alignment and face representation.

**Face alignment**

They made an average face with the USF dataset. Then, for each new face, they apply the following procedure:

* Find 6 points in the face (2 eyes, 1 nose tip, 2 corners of the lip, 1 middle point of the bottom lip)
* Crop according to those
* Find 67 points in the face / apply them to a normalized 3D model of a face
* Transform (= align) the face to a normalized position

**Representation**

Train a neural network on 152x152 images of faces to classify 4,030 celebrities. Remove the softmax output layer and use the output of the second-to-last layer as the transformed representation. The network is:

* C1 (convolution): 32 filters of size $11 \times 11 \times 3$ (RGB channels) (returns $142 \times 142$ "images")
* M2 (max pooling): $3 \times 3$, stride of 2 (returns $71 \times 71$ "images")
* C3 (convolution): 16 filters of size $9 \times 9 \times 16$ (returns $63 \times 63$ "images")
* L4 (locally connected): $16 \times 9 \times 9 \times 16$ (returns $55 \times 55$ "images")
* L5 (locally connected): $16 \times 7 \times 7 \times 16$ (returns $25 \times 25$ "images")
* L6 (locally connected): $16 \times 5 \times 5 \times 16$ (returns $21 \times 21$ "images")
* F7 (fully connected): ReLU, 4096 units
* F8 (fully connected): softmax layer with 4030 output neurons

The training was done with:

* Stochastic Gradient Descent (SGD)
* Momentum of 0.9
* Performance scheduling (LR starting at 0.01, ending at 0.0001)
* Weight initialization: $w \sim \mathcal{N}(\mu=0, \sigma=0.01)$, $b = 0.5$
* ~15 epochs ($\approx$ 3 days) of training

## Evaluation results

* **Quality**:
    * 97.35% accuracy (or mean accuracy?) with an ensemble of DNNs for LFW
    * 91.4% accuracy with a single network on YTF
* **Speed**: DeepFace runs in 0.33 seconds per image (I'm not sure which size). This includes image decoding, face detection and alignment, **the** feed-forward network (why only one? wasn't the ensemble the best performing?) and the final classification output.

## See also

* Andrew Ng: [C4W4L03 Siamese Network](https://www.youtube.com/watch?v=6jfw8MuKwpI)
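A toy NumPy sketch of face verification with such a learned representation, using plain cosine similarity between the 4096-dimensional F7 outputs of two face crops. This is a simplification of my own (the paper compares several similarity measures), and the threshold value is arbitrary and would be tuned on a validation set:

```python
import numpy as np

def verify(repr_a, repr_b, threshold=0.6):
    """Decide whether two 4096-d face representations belong to the same person."""
    a = repr_a / np.linalg.norm(repr_a)
    b = repr_b / np.linalg.norm(repr_b)
    return float(a @ b) >= threshold  # cosine similarity against a tuned threshold

# Placeholder features standing in for F7 outputs of two aligned face crops
f1, f2 = np.random.rand(4096), np.random.rand(4096)
print(verify(f1, f2))
```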
[link]
#### Introduction

* Introduces a new global log-bilinear regression model which combines the benefits of both global matrix factorization and local context window methods.

#### Global Matrix Factorization Methods

* Decompose large matrices into low-rank approximations.
* e.g. Latent Semantic Analysis (LSA)

##### Limitations

* Poor performance on the word analogy task.
* Frequent words contribute disproportionately highly to the similarity measure.

#### Shallow, Local Context-Based Window Methods

* Learn word representations using adjacent words.
* e.g. the continuous bag-of-words (CBOW) model and the skip-gram model.

##### Limitations

* Since they do not operate directly on the global co-occurrence counts, they cannot utilise the statistics of the corpus effectively.

#### GloVe Model

* To capture the relationship between words $i$ and $j$, word vector models should use ratios of co-occurrence probabilities (with probe words $k$) instead of the raw probabilities themselves.
* In the most general form:
    * $F(w_{i}, w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$
* We want $F$ to encode the information in the vector space (which has a linear structure), so we can restrict it to the difference of $w_{i}$ and $w_{j}$:
    * $F(w_{i} - w_{j}, \tilde{w}_{k}) = P_{ik}/P_{jk}$
* Since the right-hand side is a scalar and the arguments are vectors, we take the dot product of the arguments:
    * $F((w_{i} - w_{j})^{T}\tilde{w}_{k}) = P_{ik}/P_{jk}$
* $F$ should be invariant to the order of the word pair $i$ and $j$, which leads to:
    * $F(w_{i}^{T}\tilde{w}_{k}) = P_{ik}$
* Doing further simplifications and optimisations (refer to the paper), we get the cost function (a toy sketch of this cost appears at the end of this summary):
    * $J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_{i}^{T}\tilde{w}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij}\right)^{2}$
* $f$ is a weighting function:
    * $f(x) = \min((x/x_{max})^{\alpha}, 1)$
    * Typical values are $x_{max} = 100$ and $\alpha = 3/4$.
* The $b$ are bias terms.

##### Complexity

* Depends on the number of non-zero elements in the input matrix.
* Upper-bounded by the square of the vocabulary size.
* Since for shallow window-based approaches the complexity depends on $|C|$ (the size of the corpus), tighter bounds are needed.
* By modelling the number of co-occurrences of words as a power-law function of frequency rank, the complexity can be shown to be proportional to $|C|^{0.8}$.

#### Evaluation

##### Tasks

* Word analogies
    * "a is to b as c is to ___?"
    * Both semantic and syntactic pairs
    * Find the word $d$ whose vector is closest to $w_{b} - w_{a} + w_{c}$ (using cosine similarity)
* Word similarity
* Named Entity Recognition

##### Datasets

* Wikipedia dumps - 2010 and 2014
* Gigaword5
* Combination of Gigaword5 and Wikipedia2014
* CommonCrawl
* The 400,000 most frequent words are considered from the corpus.

##### Hyperparameters

* Size of the context window.
* Whether to distinguish left context from right context.
* Decreasing weighting during counting: word pairs that are $d$ words apart contribute $1/d$ to the total co-occurrence count.
* $x_{max} = 100$
* $\alpha = 3/4$
* AdaGrad update

##### Models Compared With

* Singular Value Decomposition
* Continuous Bag-Of-Words
* Skip-Gram

##### Results

* GloVe outperforms all other models significantly.
* Diminishing returns for vectors larger than 200 dimensions.
* Small and asymmetric context windows (context window only to the left) work better for syntactic tasks.
* Long and symmetric context windows (context window to both sides) work better for semantic tasks.
* The syntactic task benefited from a larger corpus, though the semantic task performed better with Wikipedia instead of Gigaword5, probably due to the comprehensiveness of Wikipedia and the slightly outdated nature of Gigaword5.
* Word2vec’s performance decreases if the number of negative samples increases beyond about 10.
* For the same corpus, vocabulary, and window size, GloVe consistently achieves better results, faster.
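A toy NumPy sketch of the weighted least-squares cost given above, evaluated over the non-zero entries of a small random co-occurrence matrix. All names and values here are illustrative, not from the paper:

```python
import numpy as np

def weight(x, x_max=100, alpha=0.75):
    """GloVe weighting function f(x) = min((x / x_max)^alpha, 1)."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_cost(X, W, W_tilde, b, b_tilde):
    """J = sum over non-zero X_ij of f(X_ij) (w_i^T w~_j + b_i + b~_j - log X_ij)^2."""
    i, j = np.nonzero(X)
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    return np.sum(weight(X[i, j]) * (pred - np.log(X[i, j])) ** 2)

# Toy example: 5-word vocabulary, 10-dimensional word vectors
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(5, 5)).astype(float)       # co-occurrence counts
W, W_tilde = rng.normal(size=(5, 10)), rng.normal(size=(5, 10))
b, b_tilde = rng.normal(size=5), rng.normal(size=5)
print(glove_cost(X, W, W_tilde, b, b_tilde))
```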