# Object detection system overview
1. takes an input image,
2. extracts around 2000 bottom-up region proposals,
3. computes features for each proposal using a large convolutional neural network (CNN), and then
4. classifies each region using class-specific linear SVMs.
* R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010.
* On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat, which had the previous best result of 24.3%.
## Two challenges in object detection
1. localization problem
2. labeling the data
1. The localization problem:
* One approach frames localization as a regression problem; this yields a mAP of 30.5% on VOC 2007, compared to the 58.5% achieved by R-CNN.
* An alternative is to build a sliding-window detector. However, units high up in a network with five convolutional layers have very large receptive fields (195x195 pixels) and strides (32x32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.
2. Labeling the data:
* The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning.
* Instead, R-CNN uses supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL).
* fine-tuning for detection improves mAP performance by 8 percentage points.
* Stochastic gradient descent via backpropagation is effective for training convolutional neural networks (CNNs).
## Object detection with R-CNN
This system consists of three modules:
* The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector.
* The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third module is a set of class-specific linear SVMs.
1. Region proposals
* Related work generates proposals in various ways; one approach detects mitotic cells by applying a CNN to regularly spaced square crops, which is a special case of region proposals.
* R-CNN uses the selective search method in fast mode (design goals: capture all scales, diversification, fast to compute).
* The time spent computing region proposals and features is 13 s/image on a GPU or 53 s/image on a CPU.
2. Feature extraction
* extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of the CNN
* Features are computed by forward propagating a mean-subtracted 227x227 RGB image through five convolutional layers and two fully connected layers.
* Regardless of a candidate region's size or aspect ratio, all pixels in a tight bounding box around it are warped to the required size.
* The feature matrix is typically 2000x4096
3. Test-time detection
* At test time, run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments).
* warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class.
* Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold.
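A rough NumPy sketch of this per-class scoring and suppression step (the helper names and box format are illustrative assumptions, and a fixed threshold stands in for the learned one):

```python
import numpy as np

def iou(box, boxes):
    """IoU overlap between one box and an array of boxes ([x1, y1, x2, y2])."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def greedy_nms(boxes, scores, iou_thresh=0.3):
    """Keep boxes in decreasing score order, rejecting any box that
    overlaps an already-kept box by more than iou_thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        order = order[1:][iou(boxes[i], boxes[order[1:]]) <= iou_thresh]
    return keep

# Scoring: with `features` the ~2000x4096 matrix and (W, b) the stacked
# per-class SVM weights/biases, scores = features @ W + b gives one score
# per (proposal, class); greedy_nms is then run independently per class.
```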
1. Supervised pre-training
* pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding box labels are not available for this data)
2. Domain-specific fine-tuning
* The CNN parameters are trained with stochastic gradient descent (SGD) using only warped region proposals, with a learning rate of 0.001.
3. Object category classifiers
* Regions are labeled for SVM training using an intersection-over-union (IoU) overlap threshold of 0.3: proposals whose IoU with every ground-truth box of a class is below 0.3 are negatives for that class, and the ground-truth boxes themselves are the positives.
* Once features are extracted and training labels are applied, we optimize one linear SVM per class.
* The standard hard negative mining method is adopted, since the training data is too large to fit in memory.
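A minimal sketch of the labeling rule above (box format and helper names are assumptions; proposals between the negative threshold and a ground-truth box are simply ignored):

```python
def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def svm_label(proposal, gt_boxes, thresh=0.3):
    """Label one proposal for one class's SVM: -1 = negative, 0 = ignored.
    Positives come from the ground-truth boxes themselves, not proposals."""
    best = max((iou(proposal, g) for g in gt_boxes), default=0.0)
    return -1 if best < thresh else 0
```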
### Results on PASCAL VOC 2010-12
1. VOC 2010
* Compared against four strong baselines: SegDPM, DPM, UVA, and Regionlets.
* R-CNN achieves a large improvement in mAP, from 35.1% to 53.7%, while also being much faster.
2. ILSVRC2013 detection
* ran R-CNN on the 200-class ILSVRC2013 detection dataset
* R-CNN achieves a mAP of 31.4%
#### Performance layer-by-layer, without fine-tuning
1. The pool5 layer
* The max-pooled output of the network’s fifth and final convolutional layer.
* The pool5 feature map is 6x6x256 = 9216-dimensional.
* Each pool5 unit has a receptive field of 195x195 pixels in the original 227x227-pixel input.
2. Layer fc6
* Fully connected to pool5.
* It multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector), adds a vector of biases, and applies half-wave rectification.
3. Layer fc7
* It is implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification
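A minimal NumPy sketch of the fc6/fc7 computations just described (the random weights are placeholders; the shapes follow the text):

```python
import numpy as np

pool5 = np.random.rand(6, 6, 256).reshape(-1)        # 6x6x256 -> 9216 vector
W6, b6 = np.random.randn(4096, 9216), np.zeros(4096)
W7, b7 = np.random.randn(4096, 4096), np.zeros(4096)

relu = lambda v: np.maximum(v, 0)                    # half-wave rectification
fc6 = relu(W6 @ pool5 + b6)                          # 4096-dim
fc7 = relu(W7 @ fc6 + b7)                            # final 4096-dim feature
```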
#### Performance layer-by-layer, with fine-tuning
* The CNN’s parameters are fine-tuned on PASCAL.
* Fine-tuning increases mAP by 8.0 percentage points to 54.2%.
### Network architectures
* A 16-layer deep network, consisting of 13 layers of 3x3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet.
* R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%.
* The drawback is compute time: forward passes through O-Net are substantially slower than through T-Net.
1. The ILSVRC2013 detection dataset
* The dataset is split into three sets: train (395,918 images), val (20,121 images), and test (40,152 images).
#### CNN features for segmentation.
* full R-CNN: The first strategy (full) ignores the region’s shape and computes CNN features directly on the warped window; a caveat is that two regions might have very similar bounding boxes while having very little overlap.
* fg R-CNN: the second strategy (fg) computes CNN features only on a region’s foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction.
* full+fg R-CNN: The third strategy (full+fg) simply concatenates the full and fg features
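A small sketch of the fg strategy's masking trick (the array names and per-channel mean values are illustrative; the real mean comes from the training data):

```python
import numpy as np

def fg_input(warped, mask, mean=np.array([104.0, 117.0, 123.0])):
    """warped: HxWx3 crop; mask: HxW boolean foreground mask.
    Background pixels are set to the mean image, so they become
    exactly zero after mean subtraction."""
    filled = np.where(mask[..., None], warped, mean)
    return filled - mean
```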
The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach for image generation produces images that can’t be distinguished from the training data.
#### What is DRAW:
The Deep Recurrent Attentive Writer (DRAW) model has two differences with respect to other variational auto-encoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region updated by the decoder.
#### What do we gain?
The resulting images are greatly improved by allowing a conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the “Where to look?” problem.
#### What follows?
A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder, although this might be less useful since we are already restricting the input of the network.
* As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way.
* The attention model is fully differentiable.
* I think a better exposition of the attention mechanism would improve this paper.
This paper combines two ideas. The first is stochastic gradient Langevin dynamics (SGLD), an efficient Bayesian learning method for large datasets that allows one to efficiently sample from the posterior over the parameters of a model (e.g. a deep neural network). In short, SGLD is stochastic (minibatch) gradient descent where Gaussian noise is added to the gradients before each update; each update thus yields a sample from the SGLD sampler. To make a prediction for a new data point, a number of previous parameter values are combined into an ensemble, which effectively corresponds to a Monte Carlo estimate of the posterior predictive distribution of the model.
The second idea is distillation, or dark knowledge, which in short is the idea of training a smaller model (student) to replicate the behavior and performance of a much larger model (teacher), essentially by training the student to match the outputs of the teacher.
The observation made in this paper is that the step of creating an ensemble of several models (e.g. deep networks) can be expensive, especially if many samples are used and/or if each model is large. Thus, they propose to approximate the output of that ensemble by training a single network to predict the output of the ensemble. Ultimately, this is done by having the student predict the output of the teacher corresponding to the model with the last parameter value sampled by SGLD.
Interestingly, this process can be operated in an online fashion, where one alternates between sampling from SGLD (i.e. performing a noisy SGD step on the teacher model) and performing a distillation update (i.e. updating the student model, given the current teacher model). The end result is a student model whose outputs should be calibrated to the Bayesian predictive distribution.
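A toy sketch of this online alternation, with a 1-D linear model standing in for the deep network (step sizes, the omitted prior term, and the reduction of distillation to matching point predictions are all simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1-D linear regression, true slope 2.0.
X = rng.normal(size=(256, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=256)

theta_t = np.zeros(1)  # teacher parameters (SGLD produces samples of these)
theta_s = np.zeros(1)  # student parameters (distilled approximation)
lr_t, lr_s, batch = 1e-3, 1e-2, 32

for step in range(5000):
    idx = rng.integers(0, len(X), size=batch)
    xb, yb = X[idx], y[idx]

    # SGLD step on the teacher: minibatch gradient of the log-likelihood
    # (rescaled to the full dataset, prior omitted for brevity) plus
    # Gaussian noise with variance equal to the step size.
    grad = (len(X) / batch) * xb.T @ (yb - xb @ theta_t)
    theta_t += 0.5 * lr_t * grad + rng.normal(scale=np.sqrt(lr_t), size=1)

    # Distillation step: nudge the student's predictions toward the
    # current teacher sample's predictions; over many steps this tracks
    # a Monte Carlo estimate of the posterior predictive mean.
    resid = xb @ theta_t - xb @ theta_s
    theta_s += lr_s * (xb.T @ resid) / batch

print(theta_s)  # ~2.0, close to the posterior mean over the slope
```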
First published: 2016/05/20 (3 years ago) Abstract: We describe Swapout, a new stochastic training method, that outperforms
ResNets of identical network structure yielding impressive results on CIFAR-10
and CIFAR-100. Swapout samples from a rich set of architectures including
dropout, stochastic depth and residual architectures as special cases. When
viewed as a regularization method swapout not only inhibits co-adaptation of
units in a layer, similar to dropout, but also across network layers. We
conjecture that swapout achieves strong regularization by implicitly tying the
parameters across layers. When viewed as an ensemble training method, it
samples a much richer set of architectures than existing methods such as
dropout or stochastic depth. We propose a parameterization that reveals
connections to existing architectures and suggests a much richer set of
architectures to be explored. We show that our formulation suggests an
efficient training method and validate our conclusions on CIFAR-10 and
CIFAR-100 matching state of the art accuracy. Remarkably, our 32 layer wider
model performs similar to a 1001 layer ResNet model.
This paper presents Swapout, a simple dropout method applied to Residual Networks (ResNets). In a ResNet, a layer $Y$ is computed from the previous layer $X$ as
$Y = X + F(X)$
where $F(X)$ is essentially the composition of a few convolutional layers. Swapout simply applies dropout separately on both terms of a layer's equation:
$Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$
where $\Theta_1$ and $\Theta_2$ are independent dropout masks for each term.
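A minimal NumPy sketch of this training-time sampling (shapes and keep-probabilities are illustrative):

```python
import numpy as np

def swapout(x, fx, p1=0.5, p2=0.5, rng=np.random.default_rng()):
    """Y = Theta1 * X + Theta2 * F(X) with independent Bernoulli masks.
    Per unit: (1,0) -> identity, (1,1) -> residual unit, (0,1) -> plain
    feed-forward unit, (0,0) -> dropped, which is how dropout and
    stochastic depth arise as special cases."""
    theta1 = (rng.random(x.shape) < p1).astype(x.dtype)
    theta2 = (rng.random(x.shape) < p2).astype(x.dtype)
    return theta1 * x + theta2 * fx
```

At test time, the masks are either replaced by their expectations or several stochastic forward passes are averaged (see note 1 below).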
The paper shows that this form of dropout is at least as good as, and often superior to, other forms of dropout, including the recently proposed [stochastic depth dropout]. Much like in the stochastic depth paper, better performance is achieved by linearly increasing the dropout rate (from 0 to 0.5) from the first hidden layer to the last.
Beyond this main result, I also note the following empirical observations:
1. At test time, averaging the output layers of multiple dropout mask samples (referred to as stochastic inference) is better than replacing the masks by their expectation (deterministic inference), the latter being the usual standard.
2. Comparable performance is achieved by making the ResNet wider (e.g. 4 times) and with fewer layers (e.g. 32) than the original ResNet work with thin but very deep (more than 1000 layers) ResNets. This would confirm a similar observation from [this paper].
Overall, these are useful observations to be aware of for anyone wanting to use ResNets in practice.
First published: 2017/01/26 (2 years ago) Abstract: We introduce a new algorithm named WGAN, an alternative to traditional GAN
training. In this new model, we show that we can improve the stability of
learning, get rid of problems like mode collapse, and provide meaningful
learning curves useful for debugging and hyperparameter searches. Furthermore,
we show that the corresponding optimization problem is sound, and provide
extensive theoretical work highlighting the deep connections to other distances
between distributions.
This very new paper is currently receiving quite a bit of attention from the [community](https://www.reddit.com/r/MachineLearning/comments/5qxoaz/r_170107875_wasserstein_gan/).
The paper describes a new training approach, which solves the two major practical problems with current GAN training:
1) The training process comes with a meaningful loss, which can be used as a (soft) performance metric and helps with debugging, parameter tuning, and so on.
2) The training process does not suffer from the usual instability problems; in particular, mode collapse is reduced significantly.
On top of that, the paper comes with quite a bit of mathematical theory explaining why their approach works where other approaches have failed. This paper is a must-read for anyone interested in GANs.
First published: 2017/06/08 (2 years ago) Abstract: Deep Learning has revolutionized vision via convolutional neural networks
(CNNs) and natural language processing via recurrent neural networks (RNNs).
However, success stories of Deep Learning with standard feed-forward neural
networks (FNNs) are rare. FNNs that perform well are typically shallow and,
therefore cannot exploit many levels of abstract representations. We introduce
self-normalizing neural networks (SNNs) to enable high-level abstract
representations. While batch normalization requires explicit normalization,
neuron activations of SNNs automatically converge towards zero mean and unit
variance. The activation function of SNNs are "scaled exponential linear units"
(SELUs), which induce self-normalizing properties. Using the Banach fixed-point
theorem, we prove that activations close to zero mean and unit variance that
are propagated through many network layers will converge towards zero mean and
unit variance -- even under the presence of noise and perturbations. This
convergence property of SNNs allows to (1) train deep networks with many
layers, (2) employ strong regularization, and (3) to make learning highly
robust. Furthermore, for activations not close to unit variance, we prove an
upper and lower bound on the variance, thus, vanishing and exploding gradients
are impossible. We compared SNNs on (a) 121 tasks from the UCI machine learning
repository, on (b) drug discovery benchmarks, and on (c) astronomy tasks with
standard FNNs and other machine learning methods such as random forests and
support vector machines. SNNs significantly outperformed all competing FNN
methods at 121 UCI tasks, outperformed all competing methods at the Tox21
dataset, and set a new record at an astronomy data set. The winning SNN
architectures are often very deep. Implementations are available at:
_Objective:_ Design a feed-forward neural network (fully connected) that can be trained even with very deep architectures.
* _Dataset:_ [MNIST](yann.lecun.com/exdb/mnist/), [CIFAR10](https://www.cs.toronto.edu/%7Ekriz/cifar.html), [Tox21](https://tripod.nih.gov/tox21/challenge/) and [UCI tasks](https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits).
* _Code:_ [here](https://github.com/bioinf-jku/SNNs)
They introduce a new activation function, the Scaled Exponential Linear Unit (SELU), which has the nice property of making neuron activations converge to a fixed point with zero mean and unit variance.
They also derive upper and lower bounds on the variance and mean under very mild conditions, which basically means that there will be no exploding or vanishing gradients.
The activation function is:
[![screen shot 2017-06-14 at 11 38 27 am](https://user-images.githubusercontent.com/17261080/27125901-1a4f7276-50f6-11e7-857d-ebad1ac94789.png)](https://user-images.githubusercontent.com/17261080/27125901-1a4f7276-50f6-11e7-857d-ebad1ac94789.png)
With specific values of alpha and lambda to ensure the previous properties. A NumPy version of the implementation is:

```python
import numpy as np

def selu(x):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale * np.where(x >= 0.0, x, alpha * np.exp(x) - alpha)
```
They also introduce a new dropout (alpha-dropout) to compensate for the fact that [![screen shot 2017-06-14 at 11 44 42 am](https://user-images.githubusercontent.com/17261080/27126174-e67d212c-50f6-11e7-8952-acad98b850be.png)](https://user-images.githubusercontent.com/17261080/27126174-e67d212c-50f6-11e7-8952-acad98b850be.png)
Batch norm becomes obsolete, and they are also able to train deeper architectures. This becomes a good choice to replace shallow architectures where random forests or SVMs used to give the best results. They outperform most other techniques on small datasets.
[![screen shot 2017-06-14 at 11 36 30 am](https://user-images.githubusercontent.com/17261080/27125798-bd04c256-50f5-11e7-8a74-b3b6a3fe82ee.png)](https://user-images.githubusercontent.com/17261080/27125798-bd04c256-50f5-11e7-8a74-b3b6a3fe82ee.png)
Might become a new standard for fully-connected activations in the future.
This paper describes how to apply the idea of batch normalization (BN) successfully to recurrent neural networks, specifically to LSTM networks. The technique involves the following 3 ideas:
**1) Careful initialization of the BN scaling parameter.** While standard practice is to initialize it to 1 (to have unit variance), they show that this situation creates problems with the gradient flow through time, which vanishes quickly. A value around 0.1 (used in the experiments) preserves gradient flow much better.
**2) Separate BN for the "hidden-to-hidden" pre-activation and for the "input-to-hidden" pre-activation.** In other words, 2 separate BN operators are applied to each contribution to the pre-activation, before summing and passing through the tanh and sigmoid non-linearities.
**3) Use of the largest time-step's BN statistics for longer test-time sequences.** Indeed, one issue with applying BN to RNNs is that if the input sequences have varying length, and if one uses per-time-step mean/variance statistics in the BN transformation (which is the natural thing to do), it hasn't been clear how to deal with the last time steps of longer sequences seen at test time, for which BN has no statistics from the training set. The paper shows evidence that the pre-activation statistics tend to gradually converge to stationary values over time steps, which supports the idea of simply using the training set's last time step statistics.
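A minimal NumPy sketch of ideas 1) and 2) for a single time step (the shapes, gate ordering, and omission of the BN shift parameter are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def bn(z, gamma, eps=1e-5):
    """Batch norm over the batch axis; in the paper, statistics are
    estimated separately for each time step."""
    return gamma * (z - z.mean(0)) / np.sqrt(z.var(0) + eps)

B, D, H = 32, 10, 20                       # batch, input, hidden sizes
x_t = rng.normal(size=(B, D))
h_prev = rng.normal(size=(B, H))
c_prev = np.zeros((B, H))
W_x = rng.normal(size=(D, 4 * H))          # i, f, o, g gates stacked
W_h = rng.normal(size=(H, 4 * H))
b = np.zeros(4 * H)

# Ideas 1) and 2): separate BN per contribution, scaling gamma = 0.1.
pre = bn(x_t @ W_x, gamma=0.1) + bn(h_prev @ W_h, gamma=0.1) + b
i, f, o, g = np.split(pre, 4, axis=1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
h = sigmoid(o) * np.tanh(c)  # the paper also batch-normalizes c here
```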
Among these ideas, I believe the most impactful is 1). The paper mentions towards the end that improper initialization of the BN scaling parameter probably explains previous failed attempts to apply BN to recurrent networks.
Experiments on 4 datasets confirm the method's success.
**My two cents**
This is an excellent development for LSTMs. BN has had an important impact on our success in training deep neural networks, and this approach might very well have a similar impact on the success of LSTMs in practice.
The paper is about squeezing the number of parameters in a convolutional neural network. The number of parameters in a convolutional layer is given by (number of input channels)$\times$(number of filters)$\times$(size of filter$\times$size of filter).
The paper proposes 2 strategies: (i) replace 3x3 filters with 1x1 filters and (ii) decrease the number of input channels. They assume the budget of filters is given, i.e., they do not tinker with the number of filters. Decreasing the number of parameters can reduce accuracy; to compensate, the authors propose to downsample late in the network.
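To make the parameter arithmetic concrete, a quick sketch with hypothetical layer sizes:

```python
def conv_params(in_channels, num_filters, k):
    """Parameters of a conv layer (biases ignored)."""
    return in_channels * num_filters * k * k

baseline = conv_params(256, 256, 3)   # 589,824
strategy1 = conv_params(256, 256, 1)  # 65,536: 3x3 -> 1x1 filters
strategy2 = conv_params(64, 256, 3)   # 147,456: fewer input channels
print(baseline, strategy1, strategy2)
```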
The results are quite impressive. Compared to AlexNet, they achieve a 50x reduction in model size while preserving accuracy. Their model can be further compressed with existing methods like Deep Compression, which are orthogonal to this paper's approach, giving a total reduction of around 510x while still preserving the accuracy of AlexNet.
$\bf Question$: The impact on running times (especially the feed-forward phase, which may be more typical on embedded devices) is not clear to me. Is it certain to be reduced as well, or at least be *no worse* than the baseline models?
This paper introduces a neural network architecture that is deeper and wider, yet optimized for computational efficiency by approximating the expected sparse structure (following from Arora et al.'s work) using readily available dense blocks. An ensemble of 7 models (all with the same architecture but different image sampling) achieved the top spot in the classification task at ILSVRC2014.
"Their main result states that if the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs."
- A more generalized exploration of the NIN architecture,
called the Inception module.
- 1x1 convolutions to capture dense information clusters
- 3x3 and 5x5 convolutions to capture more spatially spread-out clusters
- Ratio of 3x3 and 5x5 to 1x1 convolutions increases as we go deeper,
  as features of higher abstraction are less spatially concentrated
- To avoid the blow-up of output channels caused by merging outputs
  of convolutional layers and pooling layers, they use 1x1 convolutions
  for dimensionality reduction (see the sketch after this list). This has
  the added benefit of another layer of non-linearity (and thus increased
  discriminative capability).
- Multiple intermediate layers are tied to the objective function. Since
  features produced by intermediate layers of a deep network are supposed
  to be very discriminative, they attach auxiliary classifiers to
  intermediate layers, which also strengthens the gradient signal passing
  through them during back-propagation.
- During training, they do a weighted sum of this loss with the total loss
of the network.
- At test time, these auxiliary networks are discarded.
- Auxiliary classifier architecture: average pooling, 1x1 convolution (for
  dimensionality reduction), dropout, linear layer with softmax.
- Excellent results on ILSVRC2014.
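For concreteness, a minimal PyTorch sketch of an Inception-style module as described above (channel counts are illustrative, not GoogLeNet's actual configuration, and ReLUs are omitted for brevity):

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)          # dense 1x1
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 32, 1),       # 1x1 reduce
                                nn.Conv2d(32, 64, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),       # 1x1 reduce
                                nn.Conv2d(16, 32, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))       # pool proj
    def forward(self, x):
        # Concatenate all branches along the channel axis.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], 1)

out = Inception(192)(torch.randn(1, 192, 28, 28))  # -> (1, 192, 28, 28)
```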
## Weaknesses / Notes
- Even though the authors try to explain some of the intuition, most of
the design decisions seem arbitrary.
* They describe a variation of convolutions that have a differently structured receptive field.
* They argue that their variation works better for dense prediction, i.e. for predicting values for every pixel in an image (e.g. coloring, segmentation, upscaling).
* One can imagine the input into a convolutional layer as a 3D grid. Each cell is a "pixel" generated by a filter.
* Normal convolutions compute their output per cell as a weighted sum of the input cells in a dense area. I.e. all input cells are right next to each other.
* In dilated convolutions, the cells are not right next to each other. E.g. 2-dilated convolutions skip 1 cell between each input cell, 3-dilated convolutions skip 2 cells etc. (Similar to striding.)
* Normal convolutions are simply 1-dilated convolutions (skipping 0 cells).
* One can use a 1-dilated convolution and then a 2-dilated convolution. The receptive field of the second convolution will then be 7x7 instead of the usual 5x5 due to the spacing.
* Increasing the dilation factor by 2 per layer (1, 2, 4, 8, ...) leads to an exponential increase in the receptive field size, while every cell in the receptive field still takes part in the computation of at least one convolution.
* They had problems with badly performing networks, which they fixed using an identity initialization for the weights. (Sounds like just using residual connections would have been easier.)
![Receptive field](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Multi-Scale_Context_Aggregation_by_Dilated_Convolutions__receptive.png?raw=true "Receptive field")
*Receptive fields of a 1-dilated convolution (1st image), followed by a 2-dilated conv. (2nd image), followed by a 4-dilated conv. (3rd image). The blue color indicates the receptive field size (notice the exponential increase in size). Stronger blue colors mean that the value has been used in more different convolutions.*
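A few lines to reproduce the exponential receptive-field growth illustrated above (3x3 kernels with the dilation factor doubling per layer):

```python
rf = 1
for layer, dilation in enumerate([1, 2, 4, 8], start=1):
    rf += 2 * dilation  # a dilated 3x3 kernel extends `dilation` cells per side
    print(f"layer {layer}: dilation {dilation}, receptive field {rf}x{rf}")
# layer 1: 3x3, layer 2: 7x7, layer 3: 15x15, layer 4: 31x31
```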
* They took a VGG net, removed the pooling layers and replaced the convolutions with dilated ones (weights can be kept).
* They then used the network to segment images.
* Their results were significantly better than previous methods.
* They also added another network with more dilated convolutions in front of the VGG one, again improving the results.
![Segmentation performance](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Multi-Scale_Context_Aggregation_by_Dilated_Convolutions__segmentation.png?raw=true "Segmentation performance")
*Their performance on a segmentation task compared to two competing methods. They only used VGG16 without pooling layers and with convolutions replaced by dilated convolutions.*