**Dropout for layers** sums it up pretty well. The authors build on the idea of [deep residual networks](http://arxiv.org/abs/1512.03385), using identity functions to randomly skip entire layers during training. The main advantages:

* Training speed-ups by about 25%
* Huge networks without overfitting

## Evaluation

* [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html): 4.91% error ([SotA](https://martin-thoma.com/sota/#image-classification): 2.72%). Training time: ~15h
* [CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html): 24.58% error ([SotA](https://martin-thoma.com/sota/#image-classification): 17.18%). Training time: < 16h
* [SVHN](http://ufldl.stanford.edu/housenumbers/): 1.75% error ([SotA](https://martin-thoma.com/sota/#image-classification): 1.59%). Trained for 50 epochs, beginning with a learning rate of 0.1, divided by 10 after epochs 30 and 35. Training time: < 26h
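A minimal PyTorch sketch of the layer-dropping idea: during training each residual block is skipped entirely with some probability (reducing it to the identity), and at test time its branch is scaled by the survival probability. The two-convolution branch and `survival_prob=0.8` are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block that is randomly dropped during training."""
    def __init__(self, channels, survival_prob=0.8):
        super().__init__()
        self.survival_prob = survival_prob
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        if self.training:
            # Skip the residual branch entirely with probability 1 - p:
            # the block becomes the identity, shortening the network.
            if torch.rand(1).item() > self.survival_prob:
                return x
            return x + self.residual(x)
        # At test time, scale the branch by its survival probability.
        return x + self.survival_prob * self.residual(x)
```

Since a fraction of the blocks is inactive in every training pass, the expected depth of the network is shorter, which is where the training speed-up comes from.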
---
This is follow-up work to the ResNets paper. It studies the propagation formulations behind the connections of deep residual networks and performs ablation experiments. A residual block can be represented with the equations $y_l = h(x_l) + F(x_l, W_l)$ and $x_{l+1} = f(y_l)$, where $x_l$ is the input to the $l$-th unit and $x_{l+1}$ is the output of the $l$-th unit. In the original ResNets paper, $h(x_l) = x_l$, $f$ is ReLU, and $F$ consists of 2-3 convolutional layers (bottleneck architecture) with BN and ReLU in between. In this paper, they propose a residual block with both $h(x)$ and $f(x)$ as identity mappings, which trains faster and performs better than their earlier baseline.

Main contributions:

- Identity skip connections work much better than the other multiplicative interactions that they experiment with:
    - Scaling ($h(x) = \lambda x$): Gradients can explode or vanish depending on whether the modulating scalar $\lambda$ is greater or less than 1.
    - Gating ($1 - g(x)$ for the skip connection and $g(x)$ for the function $F$, where $g(x)$ is a 1x1 convolutional layer): For gradients to propagate freely, $g(x)$ should approach 1, but then $F$ gets suppressed, which is suboptimal. This is similar to highway networks.
    - Gating (shortcut-only): Setting high biases pushes the initial $g(x)$ towards an identity mapping, and the test error is much closer to the baseline.
    - 1x1 convolutional shortcut: These work well for shallower networks (~34 layers), but the training error becomes high for deeper networks, probably because they impede gradient propagation.
- Experiments on activations:
    - BN after addition messes up information flow and performs considerably worse.
    - ReLU before addition forces the signal to be non-negative, so the signal is monotonically increasing, while ideally a residual function should be free to take values in $(-\infty, \infty)$.
    - BN + ReLU pre-activation works best (see the code sketch following this summary). This also prevents overfitting, due to BN's regularizing effect: input signals to all weight layers are normalized.

## Strengths

- Thorough set of experiments showing that identity shortcut connections are easiest for the network to learn. The activation of any deeper unit can then be written as the sum of the activation of a shallower unit and a residual function, which implies that gradients can be directly propagated to shallower units. This is in contrast to usual feedforward networks, where gradients are essentially a series of matrix-vector products that may vanish as networks grow deeper.
- Improved accuracies over their previous ResNets paper.

## Weaknesses / Notes

- Residual units are useful and share the same core idea that worked in LSTM units. Even though stacked non-linear layers are capable of asymptotically approximating any arbitrary function, it is clear from recent work that residual functions are much easier to approximate than the complete function. The [latest Inception paper](http://arxiv.org/abs/1602.07261) also reports that training is accelerated and performance is improved by using identity skip connections across Inception modules.
- It seems like the degradation problem, which serves as motivation for residual units, exists in the first place for non-idempotent activation functions such as sigmoid and hyperbolic tan. This merits further investigation, especially with recent work on function-preserving transformations such as [Network Morphism](http://arxiv.org/abs/1603.01670), which expands the Net2Net idea to sigmoid and tanh by using parameterized activations initialized to identity mappings.
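A minimal PyTorch sketch of the proposed pre-activation block, where BN and ReLU come before each convolution so that both $h$ and $f$ remain identity mappings (the channel count and kernel sizes here are illustrative assumptions):

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> conv, twice.
    Both the shortcut h(x) = x and the merge f(y) = y are identities."""
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # x_{l+1} = x_l + F(x_l): nothing (no ReLU, no BN) sits on the
        # addition path, so signals and gradients propagate directly.
        return x + self.branch(x)
```

Unrolling the recursion gives $x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)$ for any deeper unit $L$, which is exactly the direct-propagation property the ablations test.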
---
* They describe a variation of convolutions that has a differently structured receptive field.
* They argue that their variation works better for dense prediction, i.e. for predicting values for every pixel in an image (e.g. coloring, segmentation, upscaling).

### How

* One can imagine the input into a convolutional layer as a 3d-grid. Each cell is a "pixel" generated by a filter.
* Normal convolutions compute their output per cell as a weighted sum of the input cells in a dense area, i.e. all input cells are right next to each other.
* In dilated convolutions, the cells are not right next to each other. E.g. 2-dilated convolutions skip 1 cell between each input cell, 3-dilated convolutions skip 2 cells, etc. (Similar to striding.)
* Normal convolutions are simply 1-dilated convolutions (skipping 0 cells).
* One can use a 1-dilated convolution and then a 2-dilated convolution. The receptive field of the second convolution will then be 7x7 instead of the usual 5x5, due to the spacing.
* Increasing the dilation factor by a factor of 2 per layer (1, 2, 4, 8, ...) leads to an exponential increase in the receptive field size, while every cell in the receptive field is still part of the computation of at least one convolution (see the sketch after this summary).
* They had problems with badly performing networks, which they fixed using an identity initialization for the weights. (Sounds like just using residual connections would have been easier.)

![Receptive field](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Multi-Scale_Context_Aggregation_by_Dilated_Convolutions__receptive.png?raw=true "Receptive field")

*Receptive fields of a 1-dilated convolution (1st image), followed by a 2-dilated conv. (2nd image), followed by a 4-dilated conv. (3rd image). The blue color indicates the receptive field size (notice the exponential increase in size). Stronger blue colors mean that the value has been used in more different convolutions.*

### Results

* They took a VGG net, removed the pooling layers and replaced the convolutions with dilated ones (the weights can be kept).
* They then used the network to segment images.
* Their results were significantly better than those of previous methods.
* They also added another network with more dilated convolutions in front of the VGG one, again improving the results.

![Segmentation performance](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Multi-Scale_Context_Aggregation_by_Dilated_Convolutions__segmentation.png?raw=true "Segmentation performance")

*Their performance on a segmentation task compared to two competing methods. They only used VGG16 without pooling layers and with convolutions replaced by dilated convolutions.*
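A small PyTorch sketch of the exponential receptive-field growth from stacking 3x3 convolutions with dilation factors 1, 2, 4 (the channel counts are arbitrary assumptions):

```python
import torch
import torch.nn as nn

# Stacking 3x3 convolutions with dilations 1, 2, 4: the receptive field
# grows exponentially (3 -> 7 -> 15) while the weight count grows linearly.
net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, dilation=1, padding=1),
    nn.Conv2d(8, 8, kernel_size=3, dilation=2, padding=2),
    nn.Conv2d(8, 8, kernel_size=3, dilation=4, padding=4),
)

# A 3x3 conv with dilation d widens the receptive field by 2*d per layer.
rf = 1
for d in (1, 2, 4):
    rf += 2 * d
print(rf)  # 15

x = torch.randn(1, 1, 32, 32)
print(net(x).shape)  # padding keeps the spatial size: (1, 8, 32, 32)
```

Because the padding matches each dilation, the spatial resolution never shrinks, which is what makes this a drop-in replacement for pooling in dense prediction.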
---
#### Introduction

* The paper demonstrates how simple CNNs, built on top of word embeddings, can be used for sentence classification tasks.
* [Link to the paper](https://arxiv.org/abs/1408.5882)
* [Implementation](https://github.com/shagunsodhani/CNN-Sentence-Classifier)

#### Architecture

* Pad input sentences so that they are all of the same length.
* Map the words in the padded sentence to word embeddings (either randomly initialized or initialized from word2vec) to obtain a matrix representing the sentence.
* Apply convolution layers with multiple filter widths and feature maps.
* Apply a max-over-time pooling operation over each feature map.
* Concatenate the pooled features from the different filters and feed them to a fully-connected layer with softmax activation (see the sketch at the end of this summary).
* The softmax outputs a probability distribution over the labels.
* Use dropout for regularisation.

#### Hyperparameters

* ReLU activation for convolution layers.
* Filter windows of 3, 4, 5 with 100 feature maps each.
* Dropout - 0.5
* Gradient clipping at 3
* Batch size - 50
* Adadelta update rule.

#### Variants

* CNN-rand
    * Randomly initialized word vectors.
* CNN-static
    * Uses pre-trained vectors from word2vec and does not update the word vectors.
* CNN-non-static
    * Same as CNN-static but updates word vectors during training.
* CNN-multichannel
    * Uses two sets of word vectors (channels).
    * One set is updated and the other is not.

#### Datasets

* Sentiment analysis datasets for Movie Reviews, Customer Reviews etc.
* Classification data for questions.
* Maximum number of classes for any dataset - 6

#### Strengths

* Good results on benchmarks despite being a simple architecture.
* Word vectors obtained by the non-static channel have more meaningful representations.

#### Weaknesses

* Small datasets with few labels.
* Results are not very detailed or exhaustive.
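A minimal PyTorch sketch of this architecture. The filter widths (3, 4, 5), 100 feature maps each, and 0.5 dropout follow the hyperparameters above; the rest (e.g. `embed_dim=300`, a single non-multichannel embedding) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """CNN for sentence classification: parallel convolutions over word
    embeddings, max-over-time pooling, dropout, then a softmax classifier."""
    def __init__(self, vocab_size, num_classes, embed_dim=300,
                 filter_widths=(3, 4, 5), feature_maps=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One Conv1d per filter width, each producing `feature_maps` maps.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, feature_maps, w) for w in filter_widths
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(feature_maps * len(filter_widths), num_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        x = self.embedding(tokens).transpose(1, 2)  # -> (batch, embed, seq)
        # Max-over-time pooling keeps the strongest response per feature map.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(features)  # logits; softmax is applied by the loss
```

Padded sentences just need to be at least as long as the widest filter (5) for all the convolutions to produce valid outputs.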
---
This paper models object detection as a regression problem for bounding boxes and object class probabilities, with a single pass through the CNN. The main contribution is the idea of dividing the image into a 7x7 grid and having each cell predict a distribution over class labels as well as a bounding box for the object whose center falls into it. It's much faster than R-CNN and Fast R-CNN, as the additional step of extracting region proposals has been removed.

## Strengths

- Works in real time. The base model runs at 45fps and a faster version goes up to 150fps, and they claim that it's more than twice as fast as other works on real-time detection.
- End-to-end model; localization and classification errors can be jointly optimized.
- YOLO makes more localization errors and fewer background mistakes than Fast R-CNN, so using YOLO to eliminate false background detections from Fast R-CNN results in a ~3% mAP gain (with little added computation time, since YOLO is much faster than Fast R-CNN).

## Weaknesses / Notes

- Results fall short of the state of the art: 57.9% vs. 70.4% mAP (Faster R-CNN).
- Performs worse at detecting small objects, as at most one object per grid cell can be detected.
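As a toy sketch of the grid idea, here is how a YOLO-style output tensor can be decoded into per-class scores. A random tensor stands in for real network output; S=7, B=2 boxes per cell, and C=20 classes follow the paper's Pascal VOC configuration, and non-max suppression is omitted.

```python
import torch

S, B, C = 7, 2, 20                   # grid size, boxes per cell, classes
pred = torch.rand(S, S, B * 5 + C)   # stand-in for the network's output

# Each cell predicts B boxes (x, y, w, h, confidence) plus C class probs.
boxes = pred[..., :B * 5].reshape(S, S, B, 5)
class_probs = pred[..., B * 5:]      # P(class | object), one set per cell

# Class-specific score = box confidence * conditional class probability.
scores = boxes[..., 4:5] * class_probs.unsqueeze(2)  # (S, S, B, C)
best_per_class = scores.view(-1, C).max(dim=0).values
print(best_per_class)  # highest score per class over all cells and boxes
```

Since the class distribution is shared per cell rather than per box, at most one object per grid cell can be recovered, which is exactly the small-object weakness noted above.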