Very Deep Convolutional Networks for Large-Scale Image Recognition on ShortScience.org

Very Deep Convolutional Networks for Large-Scale Image Recognition
Simonyan, Karen and Zisserman, Andrew
- 2014 via Local Bibsonomy
Keywords: deep-learning, VGG

Summary by Tiago Vinhoza 6 years ago

#### Goal:
+ Train deep convolutional neural networks with small convolutional filters to classify images into 1000 different categories.

#### Dataset
+ ImageNet Large-Scale Visual Recognition Challenge (ILSVRC): subset of ImageNet
    + 1.3 million training images, 50000 validation images, 100000 test images (as reported in the paper).
    + 1000 categories.

#### Architecture:
+ Convolutional layers followed by fully-connected layers and 1000-way softmax at the output.

![Configurations](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Simonyan2015_architectures.png?raw=true "Convnet configurations")

+ Convolutional Layers
    + Convolutional filter: 3x3, stride = 1. 
    + 'Same' convolution, padding = 1.
    + Width (number of channels) of the convolutional layers starts at 64 and doubles after each max-pooling layer until it reaches 512.
+ Max Pooling: 2x2 window, stride = 2
+ Activation function: ReLU
+ Number of parameters:
![Number of parameters](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Simonyan2015_parameters.png?raw=true "Number of parameters")
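
The parameter count for one configuration can be reproduced from the layer rules above. The following is a minimal sketch, assuming configuration D (16 weight layers), a 224x224 RGB input, biases on every layer, and the usual 4096-4096-1000 fully-connected head; `CONFIG_D` and `count_parameters` are illustrative names, not part of the paper or of any library.

```python
# Hypothetical helper: count weights and biases for VGG configuration D.
# Integers are output channel counts of 3x3 conv layers; "M" is a 2x2 max-pool.
CONFIG_D = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
            512, 512, 512, "M", 512, 512, 512, "M"]


def count_parameters(config, in_channels=3, input_size=224,
                     fc_sizes=(4096, 4096, 1000)):
    """Return the total number of weights and biases (assumes biases everywhere)."""
    total = 0
    channels, size = in_channels, input_size
    for item in config:
        if item == "M":
            size //= 2  # 2x2 max-pooling with stride 2 halves the resolution
        else:
            total += 3 * 3 * channels * item + item  # 3x3 kernel weights + biases
            channels = item
    features = channels * size * size  # flattened input to the first FC layer
    for width in fc_sizes:
        total += features * width + width
        features = width
    return total


print(count_parameters(CONFIG_D))  # 138357544, i.e. about 138 million
```

Most of the total comes from the first fully-connected layer (512 x 7 x 7 inputs to 4096 units), not from the convolutional stack.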

#### Discussion:

+ A stack of three 3x3 convolutional layers (without max-pooling in between) has the same effective receptive field as a single 7x7 convolutional layer. Why is it better?
    + Three non-linearities instead of just one.
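    + Reduced number of parameters. An n x n convolutional layer with C input and C output channels has (nC)^2 weights (ignoring biases), so three stacked 3x3 layers need 27C^2 weights against 49C^2 for one 7x7 layer.

Architecture | n | C | # of parameters
------|-------|----|---
One 7x7 layer | 7 | 64 | 49*4096 = 200704
Three stacked 3x3 layers | 3 | 64 | 3*9*4096 = 110592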

+ The 1x1 convolution layers in configuration C aim to increase the non-linearity of the decision function without affecting the receptive fields of the convolutional layers.
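
The receptive-field and parameter arithmetic above can be checked with a few lines of Python. This is a sketch under the same assumptions as the discussion (stride 1, C input and C output channels in every layer, biases ignored); the function names are made up for illustration.

```python
def stack_receptive_field(num_layers, kernel=3):
    """Effective receptive field of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def stack_weights(num_layers, kernel, channels):
    """Weights (no biases) in a stack of kernel x kernel layers with C in/out channels."""
    return num_layers * kernel * kernel * channels * channels


C = 64
print(stack_receptive_field(3, 3))  # 7
print(stack_weights(1, 3, C))       # 36864 per 3x3 layer
print(stack_weights(3, 3, C))       # 110592 = 27 * C^2 for the whole stack
print(stack_weights(1, 7, C))       # 200704 = 49 * C^2
```

A 1x1 layer corresponds to `kernel=1`, for which the receptive field does not grow, which is why configuration C can add non-linearities without changing it.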

#### Methodology:

+ Training
    + Optimize the multinomial logistic regression cost function.
    + Mini-batch gradient descent with momentum.
        + Mini batch size = 256, Momentum = 0.9, Weight decay = 0.0005
    + Initial learning rate: 0.01
        + Divided by 10 when the validation set accuracy stopped improving.
        + Decreased 3 times. Learning stopped after 370K iterations (74 epochs).
    + Weight initialization:
        + Configuration A was trained with random initialization of weights.
        + For the other configurations, the first four convolutional layers and the last three fully-connected layers were initialized with the weights of configuration A. The other layers were randomly initialized.
        + Random initialization: weights are sampled from a zero-mean normal distribution with 0.01 variance. Biases are initialized with zero.
+ Reduce Overfitting:
    + Data Augmentation: followed [Krizhevsky2012](https://github.com/tiagotvv/ml-papers/blob/master/convolutional/ImageNet_Classification_with_Deep_Convolutional_Neural_Networks.md) principles with random horizontal flips and random shifts in RGB levels.
    + Dropout regularization for the first two fully-connected layers - p(keep) = 0.5
+ Image Resolution:
    + Models were trained at two fixed scales S=256 and S=384.
    + Multi-scale training (randomly sampling S): minimum=256, maximum=512.
        + Can be seen as training set augmentation by scale jittering.
    + At test time, test scale Q is not necessarily equal to training scale S.
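
The optimizer settings above can be illustrated on a single scalar weight. This is a sketch, assuming the common formulation in which the weight-decay (L2) term is added to the gradient before the momentum update; the paper does not spell out the update rule, and both function names are hypothetical.

```python
def sgd_momentum_step(weight, velocity, grad, lr,
                      momentum=0.9, weight_decay=5e-4):
    """One update for a single scalar weight (L2 penalty folded into the gradient)."""
    velocity = momentum * velocity - lr * (grad + weight_decay * weight)
    return weight + velocity, velocity


def decay_on_plateau(lr, val_accuracies, factor=10.0):
    """Divide lr by `factor` if the latest validation accuracy is not a new best."""
    if len(val_accuracies) >= 2 and val_accuracies[-1] <= max(val_accuracies[:-1]):
        return lr / factor
    return lr


# Example with the hyperparameters from the notes (initial learning rate 0.01).
w, v = sgd_momentum_step(weight=1.0, velocity=0.0, grad=0.5, lr=0.01)
lr = decay_on_plateau(0.01, [0.60, 0.65, 0.65])  # accuracy stalled, so lr shrinks
print(w, v, lr)
```

In the paper the learning rate was decreased three times in total, so a rule like `decay_on_plateau` would fire three times over the 74 epochs.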

#### Results

+ Implementation derived from C++ Caffe toolbox.
+ Training and evaluation on multiple GPUs: the paper reports that training a single net took 2-3 weeks on a system with four NVIDIA Titan Black GPUs.

+ Single scale evaluation: 
    + Fixed training scale: Q=S.
    + Jittered training scale: Q=0.5(S_min + S_max).
    + Local Response Normalization did not improve results.

Configuration | S | Q | top-1 error (%) | top-5 error (%)
:--------------:|:---:|---|:-----------------:|:---------------: 
A | 256 | 256 | 29.6 | 10.4  
A-LRN | 256 | 256 | 29.7 | 10.5
B | 256 | 256 | 28.7 | 9.9
C | 256 | 256 | 28.1 | 9.4 
  | 384 | 384 | 28.1 | 9.3
  | [256;512] | 384 | 27.3 | 8.8 
D | 256 | 256 | 27.0 | 8.8
  | 384 | 384 | 26.8 | 8.7
  | [256;512] | 384 | 25.6 | 8.1 
E | 256 | 256 | 27.3 | 9.0 
  | 384 | 384 | 26.9 | 8.7
  | [256;512] | 384 | **25.5** | **8.0** 

+ Multi-scale evaluation: 
    + Fixed training scale: Q={S-32,S,S+32}. 
    + Jittered training scale: Q={S_min, 0.5(S_min + S_max), S_max}. 

Configuration | S | Q | top-1 error (%) | top-5 error (%)
:--------------:|:---:|---|:-----------------:|:---------------: 
B | 256 | 224,256,288 | 28.2 | 9.6
C | 256 | 224,256,288 | 27.7 | 9.2 
  | 384 | 352,384,416 | 27.8 | 9.2
  | [256;512] | 256,384,512 | 26.3 | 8.2 
D | 256 | 224,256,288 | 26.6 | 8.6
  | 384 | 352,384,416 | 26.5 | 8.6
  | [256;512] | 256,384,512 | **24.8** | **7.5** 
E | 256 | 224,256,288 | 26.9 | 8.7 
  | 384 | 352,384,416 | 26.7 | 8.6
  | [256;512] | 256,384,512 | **24.8** | **7.5** 
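
The Q column in both tables follows directly from the training scale. A small sketch of that rule, assuming a fixed S is given as an integer and a jittered S as a `(S_min, S_max)` tuple; the function names are illustrative only.

```python
def single_test_scale(train_scale):
    """Q for single-scale evaluation."""
    if isinstance(train_scale, tuple):
        s_min, s_max = train_scale
        return (s_min + s_max) // 2
    return train_scale


def multi_test_scales(train_scale):
    """Q values for multi-scale evaluation."""
    if isinstance(train_scale, tuple):
        s_min, s_max = train_scale
        return [s_min, (s_min + s_max) // 2, s_max]
    return [train_scale - 32, train_scale, train_scale + 32]


print(single_test_scale((256, 512)))  # 384
print(multi_test_scales(256))         # [224, 256, 288]
print(multi_test_scales(384))         # [352, 384, 416]
print(multi_test_scales((256, 512)))  # [256, 384, 512]
```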

+ Dense versus multi-crop evaluation
    + Dense evaluation: fully connected layers are converted to convolutional layers at test time. Scores are obtained for the full uncropped image and its horizontally flipped version and then averaged.
    + Multi-crop evaluation: average of scores obtained by passing multiple crops of the test image through the convolutional network. 
    + Combining multi-crop and dense evaluation gives the best results, probably because the two treat convolution boundary conditions differently.

Configuration | Method | top-1 error (%) | top-5 error (%)
:--------------:|:---:|:-----------------:|:---------------: 
D | dense | 24.8 | 7.5
  | multi-crop | 24.6 | 7.5
  | multi-crop & dense | **24.4** | **7.2** 
E | dense | 24.8 | 7.5
  | multi-crop | 24.6 | 7.4
  | multi-crop & dense | **24.4** | **7.1** 
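
To make dense evaluation concrete, the sketch below computes the size of the class score map produced when the first fully-connected layer is treated as a 7x7 convolution and the other two as 1x1 convolutions, and then averages that map spatially into one score vector. It assumes five 2x2 poolings and 'same' 3x3 convolutions as in the architecture notes; the function names are hypothetical.

```python
def dense_score_map_size(height, width, num_pools=5, fc_kernel=7):
    """Spatial size of the class score map for an uncropped input image."""
    for _ in range(num_pools):
        height //= 2  # each 2x2, stride-2 max-pool halves the resolution
        width //= 2
    # The first FC layer acts as an unpadded fc_kernel x fc_kernel convolution.
    return height - fc_kernel + 1, width - fc_kernel + 1


def spatial_average(score_map):
    """score_map[y][x] is a list of class scores; return the per-class mean."""
    cells = [scores for row in score_map for scores in row]
    num_classes = len(cells[0])
    return [sum(c[k] for c in cells) / len(cells) for k in range(num_classes)]


print(dense_score_map_size(224, 224))  # (1, 1): same as the training crop
print(dense_score_map_size(384, 512))  # (6, 10)
print(spatial_average([[[1.0, 3.0], [3.0, 5.0]]]))  # [2.0, 4.0]
```

For a 224x224 input the map collapses to a single position, which is the ordinary fully-connected forward pass.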
 

+ Comparison with State of the art solutions:

    + VGG (2 nets) = ensemble of 2 models trained using configurations D and E.
    + VGG (7 nets) = ensemble of 7 different models trained using configurations C, D, E.

![Results](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Simonyan2015_results.png?raw=true "Results")
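
The ensembles combine models by averaging their soft-max class posteriors. A minimal sketch with toy three-class posteriors follows; the numbers are invented for illustration and are not results from the paper, and the function names are hypothetical.

```python
def average_posteriors(model_posteriors):
    """Average per-class probabilities over a list of models."""
    num_models = len(model_posteriors)
    num_classes = len(model_posteriors[0])
    return [sum(p[k] for p in model_posteriors) / num_models
            for k in range(num_classes)]


def top_k(posterior, k=5):
    """Indices of the k most probable classes, most probable first."""
    order = sorted(range(len(posterior)), key=lambda i: posterior[i], reverse=True)
    return order[:k]


# Toy posteriors standing in for the outputs of two nets (e.g. D and E).
net_d = [0.6, 0.3, 0.1]
net_e = [0.2, 0.7, 0.1]
ensemble = average_posteriors([net_d, net_e])
print(top_k(ensemble, k=2))  # [1, 0]
```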

Summary by Abhishek Das 6 years ago

This paper modifies the standard convolutional network architecture
by increasing the depth and using smaller filters, and combines this
with data augmentation and a number of engineering tricks. An ensemble
of the resulting networks achieves second place in the classification
task and first place in the localization task at ILSVRC2014.

Main contributions:

- Experiments with architectures with different depths from 11 to
19 weight layers.
- Changes in architecture
    - Smaller convolution filters
    - 1x1 convolutions: linear transformation of input channels
    followed by a non-linearity, increases discriminative capability
    of decision function.
- Varying image scales
    - During training, the image is rescaled to set the length of the shortest side
    to S and then 224x224 crops are taken.
        - Fixed S; S=256 and S=384
        - Multi-scale; Randomly sampled S from [256,512]
        - This can be interpreted as a kind of data augmentation by scale jittering,
        where a single model is trained to recognize objects over a wide range of scales.
    - Single scale evaluation: At test time, Q=S for fixed S and Q=0.5(S_min + S_max)
    for jittered S.
    - Multi-scale evaluation: At test time, Q={S-32,S,S+32} for fixed S and Q={S_min,
    0.5(S_min + S_max), S_max} for jittered S. Resulting class posteriors are averaged.
    This performs the best.
- Dense v/s multi-crop evaluation
    - In dense evaluation, the fully connected layers are converted to convolutional
    layers at test time, and the uncropped image is passed through the fully convolutional net
    to get dense class scores. Scores are averaged for the uncropped image and its
    flip to obtain the final fixed-width class posteriors.
    - This is compared against taking multiple crops of the test image and averaging scores
    obtained by passing each of these through the CNN.
    - Multi-crop evaluation works slightly better than dense evaluation, but the methods
    are somewhat complementary as averaging scores from both did better than each of them
    individually. The authors hypothesize that this is probably because of the different
    boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured.
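
The training-time scale jittering described in the list above can be sketched as follows: sample S, rescale so that the shortest side equals S, then pick a random 224x224 crop and a random horizontal flip. This only computes sizes and crop coordinates (no image library is involved), and `sample_training_crop` and its parameters are hypothetical names. Fixed-scale training corresponds to `s_min == s_max`.

```python
import random


def sample_training_crop(height, width, s_min=256, s_max=512, crop=224, rng=random):
    """Return (S, resized (h, w), crop box (top, left, h, w), flip flag)."""
    s = rng.randint(s_min, s_max)  # training scale S, inclusive on both ends
    ratio = s / min(height, width)  # isotropic rescale factor
    new_h = max(s, round(height * ratio))  # max() guards against rounding below S
    new_w = max(s, round(width * ratio))
    top = rng.randint(0, new_h - crop)
    left = rng.randint(0, new_w - crop)
    flip = rng.random() < 0.5  # random horizontal flip
    return s, (new_h, new_w), (top, left, crop, crop), flip


rng = random.Random(0)  # seeded only to make the example repeatable
print(sample_training_crop(480, 640, rng=rng))
print(sample_training_crop(480, 640, s_min=256, s_max=256, rng=rng))  # fixed S=256
```

Because the crop size stays at 224 while S varies, a small S shows the network most of the image and a large S shows only a small part of it.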

## Strengths

- Thoughtful design of network architectures and experiments to study the effect of
depth, LRN, 1x1 convolutions, pre-initialization of weights, image scales,
and dense v/s multi-crop evaluations.

## Weaknesses / Notes

- No analysis of how much time these networks take to train.
- It is interesting how the authors trained a deeper model (D,E)
by initializing initial and final layer parameters with those from
a shallower model (A).
- It would be interesting to visualize and see the representations
learnt by three stacked 3x3 conv layers and one 7x7 conv layer, and
maybe compare their receptive fields.
- They mention that performance saturates with depth while going
from D to E, but there should have been a more formal characterization
of why that happens (deeper is usually better, yes? no?).
- The ensemble consists of just 2 nets, yet performs really well.
