ImageNet classification with deep convolutional neural networks

Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E. (2012)

Paper summary by tiagotvv

#### Goal:
+ Train a deep convolutional neural network to classify 1.2 million images into 1000 different categories.
#### Convolutional Neural Networks:
+ Make strong and mostly correct assumptions about the nature of images (stationarity of statistics, locality of pixel dependencies).
+ Far fewer connections and parameters than fully connected networks with similarly sized layers, so they are easier to train.
+ ImageNet: 15 million labeled high-resolution images from 22000 categories. Labeled manually using Amazon Mechanical Turk.
+ ImageNet Large-Scale Visual Recognition Challenge (ILSVRC): subset of ImageNet
    + 1.2 million training images, 50,000 validation images, 150,000 test images.
    + 1000 categories.
+ Variable-resolution images:
    + Images downsampled to a fixed resolution of 256 x 256.
+ 8 layers: 5 convolutional and 3 fully-connected, 1000-way softmax at the output.
+ ReLU activation function: trains several times faster than tanh units.
    + Faster learning has a large influence on the performance of large models trained on large datasets.
+ Training on Multiple GPUs
+ Local Response Normalization
    + mimics a form of lateral inhibition found in real neurons.
    + applied after ReLU in the 1st and 2nd convolutional layers.
    + reduces top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
+ Overlapping pooling
    + Neighborhood z = 3 and stride s = 2.
    + Max-pooling follows the 1st and 2nd convolutional layers (after response normalization) and the 5th convolutional layer.
+ Reducing Overfitting
    + Data Augmentation
        + Generate image translations and horizontal reflections.
        + Alter the intensities of RGB channels.
    + Dropout
        + Used in the first two fully-connected layers, p(keep) = 0.5
+ Stochastic Gradient Descent, batch size = 128, momentum = 0.9, weight decay = 0.0005
+ Weights initialized from Gaussian distribution with mean = 0 and standard deviation = 0.01
+ Biases in the 2nd, 4th, and 5th convolutional layers and in the fully-connected hidden layers initialized to 1. This accelerates early learning by feeding the ReLUs positive inputs from the start.
+ Biases in the remaining layers initialized to 0.
+ Learning rate ($\epsilon$)
    + Equal for all layers.
    + Adjusted manually (divided by 10 when the validation error stopped decreasing).
    + Initialized at 0.01 and reduced 3 times during training.
![Update equations](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_update.png?raw=true "Update equations")
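
The update rule can be written out from the hyperparameters listed above (momentum 0.9, weight decay 0.0005, learning rate $\epsilon$). The following is a minimal sketch for a single scalar weight; the function name `sgd_step` and the scalar formulation are illustrative assumptions, not the authors' code.

```python
def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and weight decay for a scalar weight.

    w:    current weight
    v:    current momentum (velocity) term
    grad: gradient of the loss w.r.t. w, averaged over the mini-batch
    """
    # The weight-decay term is scaled by the learning rate, like the gradient term.
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    return w_next, v_next


# Toy usage: one step from w = 1.0 with zero initial velocity.
w, v = sgd_step(w=1.0, v=0.0, grad=0.5)
# v is about -0.005005 (decay contributes -0.000005, the gradient -0.005),
# so w moves to about 0.994995.
```

In a real network the same update is applied element-wise to every weight tensor, with `grad` averaged over the 128 examples in the batch.
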
+ Trained for roughly 90 epochs (5-6 days on two NVIDIA GTX 580 3GB GPUs).
+ Results on ILSVRC-2010 images
+ Baselines: sparse coding and Fisher vectors

Model | Top-1 error | Top-5 error
--- | --- | ---
Sparse Coding | 47.1% | 28.2%
SIFT + FVs | 45.7% | 25.7%
CNN | 37.5% | 17.0%
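
The top-1 and top-5 figures are error rates: the fraction of images whose true label is not among the 1 or 5 highest-scoring classes. A minimal sketch of that metric follows; the function name `top_k_error` and the toy 3-class scores are assumptions for illustration only.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k top-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy data: 3 images, 3 classes (the paper's setting has 1000 classes and k = 1 or 5).
scores = [
    [0.1, 0.6, 0.3],  # true class 1 is ranked first
    [0.5, 0.2, 0.3],  # true class 2 is ranked second
    [0.2, 0.7, 0.1],  # true class 2 is ranked last
]
labels = [1, 2, 2]

top1 = top_k_error(scores, labels, k=1)  # 2 of 3 images missed
top2 = top_k_error(scores, labels, k=2)  # 1 of 3 images missed
```
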
+ Results on ILSVRC-2012

Model | Top-1 error (val) | Top-5 error (val) | Top-5 error (test)
--- | --- | --- | ---
SIFT + FVs | -- | -- | 26.2%
1 CNN | 40.7% | 18.2% | --
5 CNNs | 38.1% | 16.4% | 16.4%
1 CNN* | 39.0% | 16.6% | --
7 CNNs* | 36.7% | 15.4% | 15.3%

CNN* denotes convolutional neural networks pre-trained on the ImageNet 2011 Fall release and fine-tuned on the ILSVRC-2012 training data.
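
The multi-network rows ("5 CNNs", "7 CNNs*") combine several models. A common way to do this, and as far as I recall the one used in the paper, is to average the class probabilities predicted by each network. The sketch below assumes each model already outputs a softmax vector; `average_predictions` and the toy numbers are hypothetical.

```python
def average_predictions(per_model_probs):
    """Average the class-probability vectors that several models give for one image."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [sum(p[c] for p in per_model_probs) / n_models for c in range(n_classes)]


# Toy example: three models, three classes.
probs = [
    [0.6, 0.3, 0.1],  # model A prefers class 0
    [0.2, 0.5, 0.3],  # model B prefers class 1
    [0.3, 0.5, 0.2],  # model C prefers class 1
]
avg = average_predictions(probs)
predicted = max(range(len(avg)), key=lambda c: avg[c])  # class 1 wins after averaging
```
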
+ Qualitative assessment
+ Convolutional kernels showed *specialization*
![Kernels](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_weights.png?raw=true "Convolutional kernels from 1st layer")
+ Most of top-5 labels were reasonable
+ Image similarity based on the feature activations induced at the last fully connected layer:
![Qualitative Assessment](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_qualitative.png?raw=true "Qualitative assessment")
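
The similarity search can be sketched as a nearest-neighbour lookup over feature vectors taken from the last fully connected hidden layer. The sketch assumes Euclidean distance as the similarity measure and uses made-up 3-dimensional vectors in place of real activations; `nearest_images` is a hypothetical helper.

```python
import math


def nearest_images(query, gallery, k=2):
    """Return the indices of the k gallery vectors closest to `query` (Euclidean distance)."""
    dists = [(math.dist(query, feat), i) for i, feat in enumerate(gallery)]
    return [i for _, i in sorted(dists)[:k]]


# Toy feature vectors standing in for network activations.
gallery = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [1.0, 0.0, 0.2],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]
neighbours = nearest_images(query, gallery, k=2)  # indices 1 and 2 are closest
```

Two images count as similar here when their activation vectors are close, even if the images differ a lot at the pixel level.
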
+ Most of the design choices in the paper were driven by experimental results rather than by theory.
This paper is about Convolutional Neural Networks for Computer Vision. It was the first breakthrough in the ImageNet classification challenge: it reports results on LSVRC-2010 (1000 classes), and a variant of the model won ILSVRC-2012.
ReLU, which was not widely used before, was a key aspect. The paper also used Dropout in the first two fully-connected layers.
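
Both components are small enough to sketch directly. The code below assumes the original (non-inverted) dropout formulation, in which units are zeroed with probability 1 - p(keep) during training and activations are scaled by p(keep) at test time; the function names and list-based activations are illustrative only.

```python
import random


def relu(x):
    """Rectified linear unit: passes positive inputs, clamps negatives to zero."""
    return max(0.0, x)


def dropout(activations, p_keep=0.5, training=True, rng=random):
    """Dropout on a list of activations.

    Training: each activation is kept with probability p_keep, otherwise zeroed.
    Test:     all activations are kept but scaled by p_keep.
    """
    if training:
        return [a if rng.random() < p_keep else 0.0 for a in activations]
    return [a * p_keep for a in activations]


hidden = [relu(x) for x in [-2.0, 3.0, 0.5, -0.1]]   # [0.0, 3.0, 0.5, 0.0]
train_out = dropout(hidden, rng=random.Random(0))    # random subset zeroed
test_out = dropout(hidden, training=False)           # [0.0, 1.5, 0.25, 0.0]
```
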
## Training details
* Momentum of 0.9
* Learning rate of $\varepsilon$ (initialized at 0.01)
* Weight decay of 0.0005 (it enters the update scaled by the learning rate, as $0.0005 \cdot \varepsilon \cdot w$).
* Batch size of 128
* The training took 5 to 6 days on two NVIDIA GTX 580 3GB GPUs.
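
The learning rate in the paper was lowered by hand, dividing by 10 whenever the validation error stopped improving. The sketch below automates that heuristic as a rough approximation; the `patience` parameter and the function `step_learning_rate` are assumptions introduced here, not part of the paper.

```python
def step_learning_rate(lr, val_errors, patience=2, factor=10.0):
    """Divide lr by `factor` if the last `patience` epochs did not beat the earlier best.

    val_errors: validation error after each epoch so far, oldest first.
    """
    if len(val_errors) <= patience:
        return lr  # not enough history yet
    best_before = min(val_errors[:-patience])
    recent_best = min(val_errors[-patience:])
    if recent_best >= best_before:
        return lr / factor
    return lr


lr = 0.01
lr_plateau = step_learning_rate(lr, [0.50, 0.45, 0.46, 0.45])    # stalled: lr / 10
lr_improving = step_learning_rate(lr, [0.50, 0.45, 0.44, 0.43])  # still improving: unchanged
```

Starting from 0.01, three such reductions bring the learning rate down to about 0.00001.
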
## See also
* [Stanford presentation](http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf)