# Object detection system overview
1. takes an input image,
2. extracts around 2000 bottom-up region proposals,
3. computes features for each proposal using a large convolutional neural network (CNN), and then
4. classifies each region using class-specific linear SVMs.
* R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010.
* On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat, which had the previous best result of 24.3%.
## Two challenges in object detection
1. localization problem
2. labeling the data
1. The localization problem:
* One approach frames localization as a regression problem; this yields a mAP of 30.5% on VOC 2007, compared to the 58.5% achieved by R-CNN.
* An alternative is to build a sliding-window detector. However, units high up in a network with five convolutional layers have very large receptive fields (195x195 pixels) and strides (32x32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.
2. Labeling the data:
* The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning.
* Instead, R-CNN uses supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL).
* fine-tuning for detection improves mAP performance by 8 percentage points.
* Stochastic gradient descent via backpropagation is effective for training convolutional neural networks (CNNs).
## Object detection with R-CNN
This system consists of three modules:
* The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector.
* The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third module is a set of class-specific linear SVMs.
1. Region proposals
* Related work generates proposals in various ways; one approach detects mitotic cells by applying a CNN to regularly spaced square crops, which is a special case of region proposals.
* R-CNN uses the selective search method in fast mode (design goals: capture all scales, diversification, fast to compute).
* The time spent computing region proposals and features is 13 s/image on a GPU or 53 s/image on a CPU.
2. Feature extraction
* extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of the CNN
* Features are computed by forward propagating a mean-subtracted 227x227 RGB image through five convolutional layers and two fully connected layers.
* Regardless of a candidate region's size or aspect ratio, all pixels in a tight bounding box around it are warped to the required size.
* The feature matrix is typically 2000x4096
3. Test-time detection
* At test time, run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments).
* warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class.
* Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold.
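A rough NumPy sketch of this per-class scoring and suppression step (the helper names and box format are illustrative assumptions, and a fixed threshold stands in for the learned one):

```python
import numpy as np

def iou(box, boxes):
    """IoU overlap between one box and an array of boxes ([x1, y1, x2, y2])."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def greedy_nms(boxes, scores, iou_thresh=0.3):
    """Keep boxes in decreasing score order, rejecting any box that
    overlaps an already-kept box by more than iou_thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        order = order[1:][iou(boxes[i], boxes[order[1:]]) <= iou_thresh]
    return keep

# Scoring: with `features` the ~2000x4096 matrix and (W, b) the stacked
# per-class SVM weights/biases, scores = features @ W + b gives one score
# per (proposal, class); greedy_nms is then run independently per class.
```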
1. Supervised pre-training
* pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding box labels are not available for this data)
2. Domain-specific fine-tuning
* The CNN parameters are trained with stochastic gradient descent (SGD) using only warped region proposals, with a learning rate of 0.001.
3. Object category classifiers
* Regions are labeled for SVM training using an intersection-over-union (IoU) overlap threshold of 0.3: proposals whose IoU with every ground-truth box of a class is below 0.3 are negatives for that class, and the ground-truth boxes themselves are the positives.
* Once features are extracted and training labels are applied, we optimize one linear SVM per class.
* The standard hard negative mining method is adopted, since the training data is too large to fit in memory.
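A minimal sketch of the labeling rule above (box format and helper names are assumptions; proposals between the negative threshold and a ground-truth box are simply ignored):

```python
def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def svm_label(proposal, gt_boxes, thresh=0.3):
    """Label one proposal for one class's SVM: -1 = negative, 0 = ignored.
    Positives come from the ground-truth boxes themselves, not proposals."""
    best = max((iou(proposal, g) for g in gt_boxes), default=0.0)
    return -1 if best < thresh else 0
```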
### Results on PASCAL VOC 2010-12
1. VOC 2010
* Compared against four strong baselines: SegDPM, DPM, UVA, and Regionlets.
* R-CNN achieves a large improvement in mAP, from 35.1% to 53.7%, while also being much faster.
2. ILSVRC2013 detection
* ran R-CNN on the 200-class ILSVRC2013 detection dataset
* R-CNN achieves a mAP of 31.4%
#### Performance layer-by-layer, without fine-tuning
1. The pool5 layer
* The max-pooled output of the network’s fifth and final convolutional layer.
* The pool5 feature map is 6x6x256 = 9216-dimensional.
* Each pool5 unit has a receptive field of 195x195 pixels in the original 227x227-pixel input.
2. Layer fc6
* Fully connected to pool5.
* It multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector), adds a vector of biases, and applies half-wave rectification.
3. Layer fc7
* It is implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification
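A minimal NumPy sketch of the fc6/fc7 computations just described (the random weights are placeholders; the shapes follow the text):

```python
import numpy as np

pool5 = np.random.rand(6, 6, 256).reshape(-1)        # 6x6x256 -> 9216 vector
W6, b6 = np.random.randn(4096, 9216), np.zeros(4096)
W7, b7 = np.random.randn(4096, 4096), np.zeros(4096)

relu = lambda v: np.maximum(v, 0)                    # half-wave rectification
fc6 = relu(W6 @ pool5 + b6)                          # 4096-dim
fc7 = relu(W7 @ fc6 + b7)                            # final 4096-dim feature
```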
#### Performance layer-by-layer, with fine-tuning
* The CNN’s parameters are fine-tuned on PASCAL.
* Fine-tuning increases mAP by 8.0 percentage points to 54.2%.
### Network architectures
* A 16-layer deep network, consisting of 13 layers of 3x3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet.
* R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%.
* The drawback is compute time: forward passes through O-Net are substantially slower than through T-Net.
1. The ILSVRC2013 detection dataset
* The dataset is split into three sets: train (395,918 images), val (20,121 images), and test (40,152 images).
#### CNN features for segmentation.
* full R-CNN: The first strategy (full) ignores the region’s shape and computes CNN features directly on the warped window; a caveat is that two regions might have very similar bounding boxes while having very little overlap.
* fg R-CNN: the second strategy (fg) computes CNN features only on a region’s foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction.
* full+fg R-CNN: The third strategy (full+fg) simply concatenates the full and fg features
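A small sketch of the fg strategy's masking trick (the array names and per-channel mean values are illustrative; the real mean comes from the training data):

```python
import numpy as np

def fg_input(warped, mask, mean=np.array([104.0, 117.0, 123.0])):
    """warped: HxWx3 crop; mask: HxW boolean foreground mask.
    Background pixels are set to the mean image, so they become
    exactly zero after mean subtraction."""
    filled = np.where(mask[..., None], warped, mean)
    return filled - mean
```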
The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach for image generation produces images that can’t be distinguished from the training data.
#### What is DRAW:
The Deep Recurrent Attentive Writer (DRAW) model has two differences with respect to other variational auto-encoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region updated by the decoder.
#### What do we gain?
The resulting images are greatly improved by allowing a conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the “Where to look?” problem.
#### What follows?
A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder, although this might be less useful since we are already restricting the input of the network.
* As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way.
* The attention model is fully differentiable.
* I think a better exposition of the attention mechanism would improve this paper.
This paper combines two ideas. The first is stochastic gradient Langevin dynamics (SGLD), an efficient Bayesian learning method for large datasets that allows one to efficiently sample from the posterior over the parameters of a model (e.g. a deep neural network). In short, SGLD is stochastic (minibatch) gradient descent where Gaussian noise is added to the gradients before each update; each update thus yields a sample from the SGLD sampler. To make a prediction for a new data point, a number of previous parameter values are combined into an ensemble, which effectively corresponds to a Monte Carlo estimate of the posterior predictive distribution of the model.
The second idea is distillation, or dark knowledge, which in short is the idea of training a smaller model (student) to replicate the behavior and performance of a much larger model (teacher), essentially by training the student to match the outputs of the teacher.
The observation made in this paper is that the step of creating an ensemble of several models (e.g. deep networks) can be expensive, especially if many samples are used and/or if each model is large. Thus, they propose to approximate the output of that ensemble by training a single network to predict the output of the ensemble. Ultimately, this is done by having the student predict the output of the teacher corresponding to the model with the last parameter value sampled by SGLD.
Interestingly, this process can be operated in an online fashion, where one alternates between sampling from SGLD (i.e. performing a noisy SGD step on the teacher model) and performing a distillation update (i.e. updating the student model, given the current teacher model). The end result is a student model whose outputs should be calibrated to the Bayesian predictive distribution.
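A toy sketch of this online alternation, with a 1-D linear model standing in for the deep network (step sizes, the omitted prior term, and the reduction of distillation to matching point predictions are all simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1-D linear regression, true slope 2.0.
X = rng.normal(size=(256, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=256)

theta_t = np.zeros(1)  # teacher parameters (SGLD produces samples of these)
theta_s = np.zeros(1)  # student parameters (distilled approximation)
lr_t, lr_s, batch = 1e-3, 1e-2, 32

for step in range(5000):
    idx = rng.integers(0, len(X), size=batch)
    xb, yb = X[idx], y[idx]

    # SGLD step on the teacher: minibatch gradient of the log-likelihood
    # (rescaled to the full dataset, prior omitted for brevity) plus
    # Gaussian noise with variance equal to the step size.
    grad = (len(X) / batch) * xb.T @ (yb - xb @ theta_t)
    theta_t += 0.5 * lr_t * grad + rng.normal(scale=np.sqrt(lr_t), size=1)

    # Distillation step: nudge the student's predictions toward the
    # current teacher sample's predictions; over many steps this tracks
    # a Monte Carlo estimate of the posterior predictive mean.
    resid = xb @ theta_t - xb @ theta_s
    theta_s += lr_s * (xb.T @ resid) / batch

print(theta_s)  # ~2.0, close to the posterior mean over the slope
```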
First published: 2016/05/20 (3 years ago) Abstract: We describe Swapout, a new stochastic training method, that outperforms
ResNets of identical network structure yielding impressive results on CIFAR-10
and CIFAR-100. Swapout samples from a rich set of architectures including
dropout, stochastic depth and residual architectures as special cases. When
viewed as a regularization method swapout not only inhibits co-adaptation of
units in a layer, similar to dropout, but also across network layers. We
conjecture that swapout achieves strong regularization by implicitly tying the
parameters across layers. When viewed as an ensemble training method, it
samples a much richer set of architectures than existing methods such as
dropout or stochastic depth. We propose a parameterization that reveals
connections to existing architectures and suggests a much richer set of
architectures to be explored. We show that our formulation suggests an
efficient training method and validate our conclusions on CIFAR-10 and
CIFAR-100 matching state of the art accuracy. Remarkably, our 32 layer wider
model performs similar to a 1001 layer ResNet model.
This paper presents Swapout, a simple dropout method applied to Residual Networks (ResNets). In a ResNet, a layer $Y$ is computed from the previous layer $X$ as
$Y = X + F(X)$
where $F(X)$ is essentially the composition of a few convolutional layers. Swapout simply applies dropout separately on both terms of a layer's equation:
$Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$
where $\Theta_1$ and $\Theta_2$ are independent dropout masks for each term.
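A minimal NumPy sketch of this training-time sampling (shapes and keep-probabilities are illustrative):

```python
import numpy as np

def swapout(x, fx, p1=0.5, p2=0.5, rng=np.random.default_rng()):
    """Y = Theta1 * X + Theta2 * F(X) with independent Bernoulli masks.
    Per unit: (1,0) -> identity, (1,1) -> residual unit, (0,1) -> plain
    feed-forward unit, (0,0) -> dropped, which is how dropout and
    stochastic depth arise as special cases."""
    theta1 = (rng.random(x.shape) < p1).astype(x.dtype)
    theta2 = (rng.random(x.shape) < p2).astype(x.dtype)
    return theta1 * x + theta2 * fx
```

At test time, the masks are either replaced by their expectations or several stochastic forward passes are averaged (see note 1 below).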
The paper shows that this form of dropout is at least as good as, and often superior to, other forms of dropout, including the recently proposed [stochastic depth dropout]. Much like in the stochastic depth paper, better performance is achieved by linearly increasing the dropout rate (from 0 to 0.5) from the first hidden layer to the last.
Beyond this main result, I also note the following empirical observations:
1. At test time, averaging the output layers of multiple dropout mask samples (referred to as stochastic inference) is better than replacing the masks by their expectation (deterministic inference), the latter being the usual standard.
2. Comparable performance is achieved by making the ResNet wider (e.g. 4 times) and with fewer layers (e.g. 32) than the original ResNet work with thin but very deep (more than 1000 layers) ResNets. This would confirm a similar observation from [this paper].
Overall, these are useful observations to be aware of for anyone wanting to use ResNets in practice.
First published: 2017/01/26 (2 years ago) Abstract: We introduce a new algorithm named WGAN, an alternative to traditional GAN
training. In this new model, we show that we can improve the stability of
learning, get rid of problems like mode collapse, and provide meaningful
learning curves useful for debugging and hyperparameter searches. Furthermore,
we show that the corresponding optimization problem is sound, and provide
extensive theoretical work highlighting the deep connections to other distances
between distributions.
This very new paper is currently receiving quite a bit of attention from the [community](https://www.reddit.com/r/MachineLearning/comments/5qxoaz/r_170107875_wasserstein_gan/).
The paper describes a new training approach, which solves the two major practical problems with current GAN training:
1) The training process comes with a meaningful loss, which can be used as a (soft) performance metric and helps with debugging, parameter tuning, and so on.
2) The training process does not suffer from the usual instability problems; in particular, mode collapse is reduced significantly.
On top of that, the paper comes with quite a bit of mathematical theory explaining why their approach works where other approaches have failed. This paper is a must-read for anyone interested in GANs.
First published: 2017/06/08 (2 years ago) Abstract: Deep Learning has revolutionized vision via convolutional neural networks
(CNNs) and natural language processing via recurrent neural networks (RNNs).
However, success stories of Deep Learning with standard feed-forward neural
networks (FNNs) are rare. FNNs that perform well are typically shallow and,
therefore cannot exploit many levels of abstract representations. We introduce
self-normalizing neural networks (SNNs) to enable high-level abstract
representations. While batch normalization requires explicit normalization,
neuron activations of SNNs automatically converge towards zero mean and unit
variance. The activation function of SNNs are "scaled exponential linear units"
(SELUs), which induce self-normalizing properties. Using the Banach fixed-point
theorem, we prove that activations close to zero mean and unit variance that
are propagated through many network layers will converge towards zero mean and
unit variance -- even under the presence of noise and perturbations. This
convergence property of SNNs allows to (1) train deep networks with many
layers, (2) employ strong regularization, and (3) to make learning highly
robust. Furthermore, for activations not close to unit variance, we prove an
upper and lower bound on the variance, thus, vanishing and exploding gradients
are impossible. We compared SNNs on (a) 121 tasks from the UCI machine learning
repository, on (b) drug discovery benchmarks, and on (c) astronomy tasks with
standard FNNs and other machine learning methods such as random forests and
support vector machines. SNNs significantly outperformed all competing FNN
methods at 121 UCI tasks, outperformed all competing methods at the Tox21
dataset, and set a new record at an astronomy data set. The winning SNN
architectures are often very deep. Implementations are available at:
_Objective:_ Design a feed-forward neural network (fully connected) that can be trained even with very deep architectures.
* _Dataset:_ [MNIST](yann.lecun.com/exdb/mnist/), [CIFAR10](https://www.cs.toronto.edu/%7Ekriz/cifar.html), [Tox21](https://tripod.nih.gov/tox21/challenge/) and [UCI tasks](https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits).
* _Code:_ [here](https://github.com/bioinf-jku/SNNs)
They introduce a new activation function, the Scaled Exponential Linear Unit (SELU), which has the nice property of making neuron activations converge to a fixed point with zero mean and unit variance.
They also derive upper and lower bounds on the variance and mean under very mild conditions, which basically means that there will be no exploding or vanishing gradients.
The activation function is:
[![screen shot 2017-06-14 at 11 38 27 am](https://user-images.githubusercontent.com/17261080/27125901-1a4f7276-50f6-11e7-857d-ebad1ac94789.png)](https://user-images.githubusercontent.com/17261080/27125901-1a4f7276-50f6-11e7-857d-ebad1ac94789.png)
With specific values of alpha and lambda to ensure the previous properties. A NumPy version of the implementation is:

```python
import numpy as np

def selu(x):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale * np.where(x >= 0.0, x, alpha * np.exp(x) - alpha)
```
They also introduce a new dropout (alpha-dropout) to compensate for the fact that [![screen shot 2017-06-14 at 11 44 42 am](https://user-images.githubusercontent.com/17261080/27126174-e67d212c-50f6-11e7-8952-acad98b850be.png)](https://user-images.githubusercontent.com/17261080/27126174-e67d212c-50f6-11e7-8952-acad98b850be.png)
Batch norm becomes obsolete, and they are also able to train deeper architectures. This becomes a good choice to replace shallow architectures where random forests or SVMs used to give the best results. They outperform most other techniques on small datasets.
[![screen shot 2017-06-14 at 11 36 30 am](https://user-images.githubusercontent.com/17261080/27125798-bd04c256-50f5-11e7-8a74-b3b6a3fe82ee.png)](https://user-images.githubusercontent.com/17261080/27125798-bd04c256-50f5-11e7-8a74-b3b6a3fe82ee.png)
Might become a new standard for fully-connected activations in the future.
This paper describes how to apply the idea of batch normalization (BN) successfully to recurrent neural networks, specifically to LSTM networks. The technique involves the following 3 ideas:
**1) Careful initialization of the BN scaling parameter.** While standard practice is to initialize it to 1 (to have unit variance), they show that this situation creates problems with the gradient flow through time, which vanishes quickly. A value around 0.1 (used in the experiments) preserves gradient flow much better.
**2) Separate BN for the "hidden-to-hidden" pre-activation and for the "input-to-hidden" pre-activation.** In other words, 2 separate BN operators are applied to each contribution to the pre-activation, before summing and passing through the tanh and sigmoid non-linearities.
**3) Use of the largest time-step's BN statistics for longer test-time sequences.** Indeed, one issue with applying BN to RNNs is that if the input sequences have varying length, and if one uses per-time-step mean/variance statistics in the BN transformation (which is the natural thing to do), it hasn't been clear how to deal with the last time steps of longer sequences seen at test time, for which BN has no statistics from the training set. The paper shows evidence that the pre-activation statistics tend to gradually converge to stationary values over time steps, which supports the idea of simply using the training set's last time step statistics.
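A minimal NumPy sketch of ideas 1) and 2) for a single time step (the shapes, gate ordering, and omission of the BN shift parameter are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def bn(z, gamma, eps=1e-5):
    """Batch norm over the batch axis; in the paper, statistics are
    estimated separately for each time step."""
    return gamma * (z - z.mean(0)) / np.sqrt(z.var(0) + eps)

B, D, H = 32, 10, 20                       # batch, input, hidden sizes
x_t = rng.normal(size=(B, D))
h_prev = rng.normal(size=(B, H))
c_prev = np.zeros((B, H))
W_x = rng.normal(size=(D, 4 * H))          # i, f, o, g gates stacked
W_h = rng.normal(size=(H, 4 * H))
b = np.zeros(4 * H)

# Ideas 1) and 2): separate BN per contribution, scaling gamma = 0.1.
pre = bn(x_t @ W_x, gamma=0.1) + bn(h_prev @ W_h, gamma=0.1) + b
i, f, o, g = np.split(pre, 4, axis=1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
h = sigmoid(o) * np.tanh(c)  # the paper also batch-normalizes c here
```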
Among these ideas, I believe the most impactful is 1). The paper mentions towards the end that improper initialization of the BN scaling parameter probably explains previous failed attempts to apply BN to recurrent networks.
Experiments on 4 datasets confirm the method's success.
**My two cents**
This is an excellent development for LSTMs. BN has had an important impact on our success in training deep neural networks, and this approach might very well have a similar impact on the success of LSTMs in practice.
The paper is about squeezing the number of parameters in a convolutional neural network. The number of parameters in a convolutional layer is given by (number of input channels)$\times$(number of filters)$\times$(size of filter$\times$size of filter).
The paper proposes 2 strategies: (i) replace 3x3 filters with 1x1 filters and (ii) decrease the number of input channels. They assume the budget of filters is given, i.e., they do not tinker with the number of filters. Decreasing the number of parameters can reduce accuracy; to compensate, the authors propose to downsample late in the network.
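To make the parameter arithmetic concrete, a quick sketch with hypothetical layer sizes:

```python
def conv_params(in_channels, num_filters, k):
    """Parameters of a conv layer (biases ignored)."""
    return in_channels * num_filters * k * k

baseline = conv_params(256, 256, 3)   # 589,824
strategy1 = conv_params(256, 256, 1)  # 65,536: 3x3 -> 1x1 filters
strategy2 = conv_params(64, 256, 3)   # 147,456: fewer input channels
print(baseline, strategy1, strategy2)
```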
The results are quite impressive. Compared to AlexNet, they achieve a 50x reduction in model size while preserving accuracy. Their model can be further compressed with existing methods like Deep Compression, which are orthogonal to this paper's approach, giving a total reduction of around 510x while still preserving the accuracy of AlexNet.
$\bf Question$: The impact on running times (especially the feed-forward phase, which may be more typical on embedded devices) is not clear to me. Is it certain to be reduced as well, or at least be *no worse* than the baseline models?
This paper introduces a neural network architecture that is deeper and wider, yet optimized for computational efficiency by approximating the expected sparse structure (following from Arora et al.'s work) using readily available dense blocks. An ensemble of 7 models (all with the same architecture but different image sampling) achieved the top spot in the classification task at ILSVRC2014.
"Their main result states that if the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs."
- A more generalized exploration of the NIN architecture,
called the Inception module.
- 1x1 convolutions to capture dense information clusters
- 3x3 and 5x5 convolutions to capture more spatially spread-out clusters
- Ratio of 3x3 and 5x5 to 1x1 convolutions increases as we go deeper,
  as features of higher abstraction are less spatially concentrated
- To avoid the blow-up of output channels caused by merging outputs
  of convolutional layers and pooling layers, they use 1x1 convolutions
  for dimensionality reduction (see the sketch after this list). This has
  the added benefit of another layer of non-linearity (and thus increased
  discriminative capability).
- Multiple intermediate layers are tied to the objective function. Since
  features produced by intermediate layers of a deep network are supposed
  to be very discriminative, they attach auxiliary classifiers to
  intermediate layers, which also strengthens the gradient signal passing
  through them during back-propagation.
- During training, they do a weighted sum of this loss with the total loss
of the network.
- At test time, these auxiliary networks are discarded.
- Auxiliary classifier architecture: average pooling, 1x1 convolution (for
  dimensionality reduction), dropout, linear layer with softmax.
- Excellent results on ILSVRC2014.
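For concreteness, a minimal PyTorch sketch of an Inception-style module as described above (channel counts are illustrative, not GoogLeNet's actual configuration, and ReLUs are omitted for brevity):

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)          # dense 1x1
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 32, 1),       # 1x1 reduce
                                nn.Conv2d(32, 64, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),       # 1x1 reduce
                                nn.Conv2d(16, 32, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))       # pool proj
    def forward(self, x):
        # Concatenate all branches along the channel axis.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], 1)

out = Inception(192)(torch.randn(1, 192, 28, 28))  # -> (1, 192, 28, 28)
```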
## Weaknesses / Notes
- Even though the authors try to explain some of the intuition, most of
the design decisions seem arbitrary.
* They describe a variation of convolutions that have a differently structured receptive field.
* They argue that their variation works better for dense prediction, i.e. for predicting values for every pixel in an image (e.g. coloring, segmentation, upscaling).
* One can imagine the input into a convolutional layer as a 3D grid. Each cell is a "pixel" generated by a filter.
* Normal convolutions compute their output per cell as a weighted sum of the input cells in a dense area. I.e. all input cells are right next to each other.
* In dilated convolutions, the cells are not right next to each other. E.g. 2-dilated convolutions skip 1 cell between each input cell, 3-dilated convolutions skip 2 cells etc. (Similar to striding.)
* Normal convolutions are simply 1-dilated convolutions (skipping 0 cells).
* One can use a 1-dilated convolution and then a 2-dilated convolution. The receptive field of the second convolution will then be 7x7 instead of the usual 5x5 due to the spacing.
* Increasing the dilation factor by 2 per layer (1, 2, 4, 8, ...) leads to an exponential increase in the receptive field size, while every cell in the receptive field still takes part in the computation of at least one convolution.
* They had problems with badly performing networks, which they fixed using an identity initialization for the weights. (Sounds like just using residual connections would have been easier.)
![Receptive field](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Multi-Scale_Context_Aggregation_by_Dilated_Convolutions__receptive.png?raw=true "Receptive field")
*Receptive fields of a 1-dilated convolution (1st image), followed by a 2-dilated conv. (2nd image), followed by a 4-dilated conv. (3rd image). The blue color indicates the receptive field size (notice the exponential increase in size). Stronger blue colors mean that the value has been used in more different convolutions.*
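A few lines to reproduce the exponential receptive-field growth illustrated above (3x3 kernels with the dilation factor doubling per layer):

```python
rf = 1
for layer, dilation in enumerate([1, 2, 4, 8], start=1):
    rf += 2 * dilation  # a dilated 3x3 kernel extends `dilation` cells per side
    print(f"layer {layer}: dilation {dilation}, receptive field {rf}x{rf}")
# layer 1: 3x3, layer 2: 7x7, layer 3: 15x15, layer 4: 31x31
```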
* They took a VGG net, removed the pooling layers and replaced the convolutions with dilated ones (weights can be kept).
* They then used the network to segment images.
* Their results were significantly better than previous methods.
* They also added another network with more dilated convolutions in front of the VGG one, again improving the results.
![Segmentation performance](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Multi-Scale_Context_Aggregation_by_Dilated_Convolutions__segmentation.png?raw=true "Segmentation performance")
*Their performance on a segmentation task compared to two competing methods. They only used VGG16 without pooling layers and with convolutions replaced by dilated convolutions.*