Visualizing and Understanding Convolutional Networks
Paper summary

The main contribution of this paper is a new way to analyze CNNs by (a) visualizing intermediate learned features and (b) occlusion sensitivity analysis.

## Analysis techniques

### Visualization

A multi-layer deconvolutional network (deconvnet) is used to project the feature activations back into pixel space, showing what input pattern originally caused a given activation in the feature maps. A deconvnet is attached to each layer $L_i$: given the output of $L_i$, it reconstructs an approximation of the input feature map of $L_i$. The deconvnet is not trained separately; it reuses the transposed filters of the already trained network. This is repeated layer by layer until the input image is reached.

The deconvnet has a special **unpooling layer**: each max-pooling layer records where every maximum came from in a set of switch variables, and unpooling uses these switches to put each activation back at its original location.

### Occlusion sensitivity analysis

* Occlude(I, x, y): Put a gray square centered at $(x, y)$ over a part of the image $I$. Run the classifier.
* Create an image like this:
    * Run Occlude(I, x, y) for all $(x, y)$ (possibly with a stride)
    * At $(x, y)$, either ...
        * (d) ... place a pixel which color-encodes the probability of the correct class
        * (e) ... place a pixel which color-encodes the most probable class

The occlusion figure in the Zeiler & Fergus paper visualizes this well. If the dog's face is occluded, the probability of the correct class drops a lot. At the same positions, the most likely class suddenly is "tennis ball" and no longer "Pomeranian".

See also LIME.

## How visualization helped to construct ZF-Net

* "The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies" -> Lower the filter size from $11 \times 11$ to $7 \times 7$
* "the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions" -> Lower the stride from 4 to 2
* The occlusion analysis boosts confidence that the network relies on the object itself rather than on the surrounding context.
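The occlusion sensitivity procedure described earlier in this summary can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the image is assumed to be a small grayscale grid stored as nested lists, and `occlude`, `occlusion_maps`, and `toy_classifier` are hypothetical names introduced here. The toy classifier stands in for a real CNN and simply checks whether the center pixel is still bright.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a grayscale image stored as rows of intensities in [0, 1].
Image = List[List[float]]


def occlude(image: Image, cx: int, cy: int, size: int = 3, gray: float = 0.5) -> Image:
    """Return a copy of `image` with a gray square of side `size` centered at (cx, cy)."""
    height, width = len(image), len(image[0])
    top, left = cy - size // 2, cx - size // 2
    occluded = [row[:] for row in image]  # copy so the input stays untouched
    for y in range(max(0, top), min(height, top + size)):
        for x in range(max(0, left), min(width, left + size)):
            occluded[y][x] = gray
    return occluded


def occlusion_maps(
    image: Image,
    classify: Callable[[Image], Sequence[float]],
    true_class: int,
    size: int = 3,
    stride: int = 1,
) -> Tuple[List[List[float]], List[List[int]]]:
    """Slide the occluder over the image and record two maps.

    prob_map holds the probability of the correct class (variant (d)),
    class_map holds the index of the most probable class (variant (e)).
    """
    prob_map, class_map = [], []
    for cy in range(0, len(image), stride):
        prob_row, class_row = [], []
        for cx in range(0, len(image[0]), stride):
            probs = classify(occlude(image, cx, cy, size))
            prob_row.append(probs[true_class])
            class_row.append(max(range(len(probs)), key=lambda k: probs[k]))
        prob_map.append(prob_row)
        class_map.append(class_row)
    return prob_map, class_map


def toy_classifier(image: Image) -> List[float]:
    """Hypothetical stand-in for a CNN: class 0 is likely only if the center pixel is bright."""
    p = 0.9 if image[2][2] > 0.75 else 0.2
    return [p, 1.0 - p]


# A 5x5 image whose only informative pixel is the bright center.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0
prob_map, class_map = occlusion_maps(image, toy_classifier, true_class=0)
```

In this toy setup the probability of class 0 drops exactly at the positions where the gray square covers the center pixel, which mirrors the effect of covering the dog's face. With a real network, `classify` would wrap a forward pass that returns class probabilities, and a larger `stride` would trade map resolution for fewer forward passes.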
## ZF-Net

Zeiler and Fergus also created a new network for ImageNet. The network consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations and max pooling layers.

Training setup:

* **Preprocessing**: Resize smallest dimension to 256, per-pixel mean subtraction per channel, crop $224\text{px} \times 224\text{px}$ region
* **Optimization**: Mini-batch SGD, learning rate $= 10^{-2}$, momentum $= 0.9$, 70 epochs
* **Resources**: Training took around 12 days on a single GTX580 GPU

The network was evaluated on

* ImageNet 2012: 14.8% top-5 error
* Caltech-101: $86.5\% \pm 0.5$ accuracy (pretrained on ImageNet)
* Caltech-256: $74.2\% \pm 0.3$ accuracy (pretrained on ImageNet)

## Minor errors

* typo: "goes give" (also: something went wrong with the link there - the whole block is a link)
Visualizing and Understanding Convolutional Networks
Zeiler, Matthew D. and Fergus, Rob
arXiv e-Print archive - 2013 via Bibsonomy
Keywords: cnn, deeplearning
