Imagenet classification with deep convolutional neural networks
Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E. (2012)
Paper summary by tiagotvv

#### Goal:
+ Train a deep convolutional neural network to classify 1.2 million images into 1000 different categories.
#### Convolutional Neural Networks:
+ Make strong and correct assumptions about the nature of the images (stationarity, pixel dependencies).
+ Far fewer connections and parameters than fully connected neural networks with similarly sized layers, so they are easier to train.
+ ImageNet: 15 million labeled high-resolution images from 22000 categories. Labeled manually using Amazon Mechanical Turk.
+ ImageNet Large-Scale Visual Recognition Challenge (ILSVRC): subset of ImageNet
    + 1.2 million training images, 50000 validation images, 150000 test images.
    + 1000 categories
+ Variable resolution images:
    + Images downsampled to a fixed resolution of 256 x 256.
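The downsampling step can be illustrated with a little arithmetic. The sketch below assumes the procedure as I read it in the paper (rescale so the shorter side is 256, then take the central 256 x 256 patch); the summary itself only says the images are downsampled. The function name and return layout are hypothetical, and only the crop geometry is computed, not the pixel resampling.

```python
def resize_and_center_crop_box(width, height, target=256):
    """Return (resized_w, resized_h, left, top) for a shorter-side rescale
    followed by a central target x target crop.

    Assumption: rounding to the nearest pixel; the paper does not specify it.
    """
    # Scale factor that maps the shorter side onto `target`.
    scale = target / min(width, height)
    # max() guards against floating-point rounding just below `target`.
    resized_w = max(target, round(width * scale))
    resized_h = max(target, round(height * scale))
    # Offsets of the central crop inside the resized image.
    left = (resized_w - target) // 2
    top = (resized_h - target) // 2
    return resized_w, resized_h, left, top


if __name__ == "__main__":
    # A 500 x 375 image is rescaled to 341 x 256, then cropped from column 42.
    print(resize_and_center_crop_box(500, 375))
```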
+ 8 layers: 5 convolutional and 3 fully-connected, 1000-way softmax at the output.
+ ReLU activation function: networks with ReLUs train several times faster than equivalent networks with tanh units.
    + Faster learning has a strong influence on the performance of large models trained on large datasets.
+ Training on Multiple GPUs
+ Local Response Normalization
    + mimics a form of lateral inhibition found in real neurons.
    + applied after ReLU in the 1st and 2nd convolutional layers.
    + reduces top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
+ Overlapping pooling
    + Neighborhood z = 3 and stride s = 2 (s < z, so adjacent pooling windows overlap).
    + Max-pooling is applied after the 1st and 2nd convolutional layers (following response normalization) and after the 5th convolutional layer.
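The three per-layer operations listed above (ReLU, response normalization across channels, overlapping max-pooling) can be sketched in plain Python. This is an illustration, not the authors' GPU code. The normalization constants k = 2, n = 5, alpha = 1e-4, beta = 0.75 are the values I recall from the paper and do not appear in this summary; treating n/2 as integer division is my assumption, and all function names are hypothetical.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def local_response_norm(acts, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize each channel by the squared activity of its n neighbours.

    `acts` holds the activations of all channels at ONE spatial position.
    Assumption: the neighbourhood half-width is n // 2.
    """
    num = len(acts)
    out = []
    for i, a in enumerate(acts):
        lo = max(0, i - n // 2)
        hi = min(num - 1, i + n // 2)
        s = sum(acts[j] ** 2 for j in range(lo, hi + 1))
        out.append(a / (k + alpha * s) ** beta)
    return out


def max_pool2d(fmap, z=3, s=2):
    """Max-pool a 2-D feature map with z x z windows spaced s apart.

    With s < z (here 3 and 2) neighbouring windows share a row or column.
    """
    h, w = len(fmap), len(fmap[0])
    out_h = (h - z) // s + 1
    out_w = (w - z) // s + 1
    return [
        [
            max(fmap[r * s + dr][c * s + dc] for dr in range(z) for dc in range(z))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]
```

With z = 3 and s = 2 the output side length is (side - 3) // 2 + 1, so, for example, a 55 x 55 map pools down to 27 x 27.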
+ Reducing Overfitting
    + Data Augmentation
        + Generate image translations and horizontal reflections.
        + Alter the intensities of RGB channels.
    + Dropout
        + Used in the first two fully-connected layers, p(keep) = 0.5
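A minimal sketch of the first augmentation and of dropout, under stated assumptions: the 224 x 224 crop size and the test-time scaling of activations by 0.5 are how I recall the paper, not something this summary states. The RGB intensity alteration is left out. Images are nested lists, and the function names are hypothetical.

```python
import random


def random_crop_and_flip(image, crop=224, rng=random):
    """Take a random crop x crop patch and mirror it with probability 0.5.

    `image` is a list of rows; each row is a list of pixels.
    """
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal reflection
    return patch


def dropout(activations, p_keep=0.5, training=True, rng=random):
    """Zero each activation with probability 1 - p_keep while training.

    Assumption: at test time all units are kept and scaled by p_keep.
    """
    if training:
        return [a if rng.random() < p_keep else 0.0 for a in activations]
    return [a * p_keep for a in activations]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.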
+ Stochastic Gradient Descent, batch size = 128, momentum = 0.9, weight decay = 0.0005
+ Weights initialized from a Gaussian distribution with mean = 0 and standard deviation = 0.01
+ Biases in the 2nd, 4th, and 5th convolutional layers and in the fully-connected hidden layers initialized to 1. This accelerates early learning because the ReLUs are fed positive inputs from the start.
+ Biases in the remaining layers initialized to 0.
+ Learning rate ($\epsilon$)
    + Equal for all layers
    + Adjusted manually (divided by 10 when validation error stopped decreasing).
    + Initialized at 0.01 and reduced 3 times during training.
![Update equations](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_update.png?raw=true "Update equations")
+ Trained for 90 epochs (5-6 days on two NVIDIA GTX 580 3GB GPUs).
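The update rule and the manual learning-rate schedule can be sketched as follows. The form used here is the paper's update rule as I recall it: v ← 0.9 v − 0.0005 ε w − ε g, then w ← w + v, where g is the gradient averaged over the batch. Weights are flat lists of floats, and the function names are hypothetical.

```python
def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD step with momentum and weight decay.

    w, v, grad: lists of floats (weights, momentum buffer, batch-averaged gradient).
    Returns the updated (w, v) without modifying the inputs.
    """
    v_next = [
        momentum * vi - weight_decay * lr * wi - lr * gi
        for wi, vi, gi in zip(w, v, grad)
    ]
    w_next = [wi + vi for wi, vi in zip(w, v_next)]
    return w_next, v_next


def learning_rate(num_reductions, initial=0.01, factor=10.0):
    """Learning rate after dividing by `factor` a given number of times."""
    return initial / factor ** num_reductions


if __name__ == "__main__":
    w, v = sgd_step([1.0], [0.0], [0.5])
    print(w, v)
    # Rates after 0, 1, 2 and 3 manual reductions.
    print([learning_rate(i) for i in range(4)])
```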
+ Results on ILSVRC-2010 images
+ Baselines: sparse coding and Fisher vectors

Model | Top-1 error | Top-5 error
--- | --- | ---
Sparse Coding | 47.1% | 28.2%
SIFT + FVs | 45.7% | 25.7%
CNN | 37.5% | 17.0%

+ Results on ILSVRC-2012

Model | Top-1 error (val) | Top-5 error (val) | Top-5 error (test)
--- | --- | --- | ---
SIFT + FVs | -- | -- | 26.2%
1 CNN | 40.7% | 18.2% | --
5 CNNs | 38.1% | 16.4% | 16.4%
1 CNN* | 39.0% | 16.6% | --
7 CNNs* | 36.7% | 15.4% | 15.3%

Models marked with * are convolutional neural networks pretrained on the ImageNet 2011 Fall release and fine-tuned on the ILSVRC-2012 training data.
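For reference, the top-1 and top-5 error rates in the tables count an image as an error when the true label is not among the k highest-scoring classes. A small sketch of that metric follows; the function name and the list-of-lists score layout are assumptions made for illustration.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k top-scoring classes.

    scores: one list of per-class scores per example.
    labels: the true class index of each example.
    """
    errors = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            errors += 1
    return errors / len(labels)


if __name__ == "__main__":
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    labels = [1, 1, 0]
    print(top_k_error(scores, labels, k=1))  # two of three examples are wrong
    print(top_k_error(scores, labels, k=2))  # one of three examples is wrong
```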
+ Qualitative assessment
+ Convolutional kernels showed *specialization*
![Kernels](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_weights.png?raw=true "Convolutional kernels from 1st layer")
+ Most of the top-5 labels were reasonable
+ Image similarity based on the feature activations induced at the last 4096-dimensional fully connected hidden layer:
![Qualitative Assessment](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_qualitative.png?raw=true "Qualitative assessment")
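The similarity assessment can be read as a nearest-neighbour search in feature space: two images count as similar when their activation vectors are close. The sketch below assumes Euclidean distance, which is how I recall the paper measuring it. The function name is hypothetical, and the tiny 2-D vectors stand in for real feature activations.

```python
import math


def nearest_images(query, gallery, k=5):
    """Indices of the k gallery feature vectors closest to `query`.

    Distance is Euclidean; results are ordered from nearest to farthest.
    """
    ranked = sorted(range(len(gallery)), key=lambda i: math.dist(query, gallery[i]))
    return ranked[:k]


if __name__ == "__main__":
    gallery = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
    # Distances from the origin are 5, 1 and 2, so indices 1 and 2 come first.
    print(nearest_images([0.0, 0.0], gallery, k=2))
```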
+ Most of the choices made in the paper were based on experimental results; there is little theory behind them.
Deep convolutional neural networks (DCNNs) have been a popular model for image classification over the last few years. This paper proposes a DCNN structure, also known as AlexNet, for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). To train AlexNet, which has 60 million parameters, the paper uses Rectified Linear Units (ReLUs) and multiple GPUs to accelerate training. It also reports that local response normalization and overlapping pooling reduce the error rate. To prevent overfitting, the authors use data augmentation and apply dropout in the fully connected layers.
The following figure shows the architecture of AlexNet. It contains five convolutional and three fully connected layers. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow the first and second response-normalization layers and the fifth convolutional layer.