#### Introduction

* The paper presents gradient-computation-based techniques to visualise image classification models.
* [Link to the paper](https://arxiv.org/abs/1312.6034)

#### Experimental Setup

* A single deep ConvNet trained on the ILSVRC-2013 dataset (1.2M training images, 1000 classes).
* Weight layer configuration: conv64 - conv256 - conv256 - conv256 - conv256 - full4096 - full4096 - full1000.

#### Class Model Visualisation

* Given a learnt ConvNet and a class of interest, start with the zero image and optimise by backpropagating with respect to the input image (keeping the ConvNet weights fixed).
* Add the mean image (of the training set) to the resulting image.
* The paper uses unnormalised class scores so that optimisation focuses on increasing the score of the target class rather than on decreasing the scores of other classes.

#### Image-Specific Class Saliency Visualisation

* Given an image, a class of interest, and a trained ConvNet, rank the pixels of the input image by their influence on the class score.
* The derivative of the class score with respect to the image gives an estimate of the importance of each pixel for the class.
* The magnitude of the derivative also indicates how much each pixel needs to change to improve the class score.

##### Class Saliency Extraction

* Compute the derivative of the class score with respect to the input image.
* This yields one saliency map per colour channel.
* To obtain a single saliency map, take the maximum magnitude of the derivative across colour channels.

##### Weakly Supervised Object Localisation

* The saliency map for an image provides a rough encoding of the location of the object of the class of interest.
* Given an image and its saliency map, an object segmentation mask can be computed using GraphCut colour segmentation.
* Colour continuity cues are needed because saliency maps may capture only the most dominant part of the object in the image.
* This weakly supervised approach achieves 46.4% top-5 error on the test set of ILSVRC-2013.

#### Relation to Deconvolutional Networks

* DeconvNet-based reconstruction of the $n^{th}$ layer input is similar to computing the gradient of the visualised neuron activity $f$ with respect to the input layer.
* One difference is in how ReLU neurons are treated:
    * In a DeconvNet, the sign indicator (for the derivative of the ReLU) is computed on the output reconstruction, while in this paper it is computed on the layer input.
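The class model visualisation procedure above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: a linear class scorer stands in for the ConvNet (with a real network, the gradient would come from backpropagation), and the learning rate and L2 coefficient are arbitrary illustrative values.

```python
import numpy as np

# Assumed setup: a linear scorer S_c(I) = w_c . I as a stand-in for a ConvNet.
rng = np.random.default_rng(0)
num_pixels = 28 * 28
w_c = rng.normal(size=num_pixels)  # weights of the class of interest (fixed)

def score(image):
    """Unnormalised class score (no softmax, as in the paper)."""
    return w_c @ image

image = np.zeros(num_pixels)       # start from the zero image
lr, weight_decay = 0.1, 0.01       # illustrative hyperparameters
for _ in range(100):
    grad = w_c                     # dS_c/dI; trivial for the linear scorer
    # Gradient ascent on the input with L2 regularisation, weights untouched.
    image += lr * (grad - weight_decay * image)
```

After optimisation, `image` aligns with `w_c`, so the class score is strictly higher than at the zero-image starting point; for display, the paper adds the training-set mean image back.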
This paper presents methods for visualizing the behaviour of an object recognition convolutional neural network. The first method generates a "canonical image" for a given class that the network can recognize. The second generates a saliency map for a given input image and specified class, illustrating which pixels of the image most influence that class's score. This map can be used to seed a graph-cut segmentation and localize objects of that class in the input image. Finally, a connection is established between the saliency map method and the work of Zeiler and Fergus on using deconvolutions to visualize deep networks.
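The saliency map construction can be sketched as follows. The per-channel gradients here are random placeholders for what backpropagation would return, and the thresholding at the end is only a simple proxy for seeding a graph-cut foreground model; shapes and the quantile value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
h, w = 8, 8
# Placeholder for dS_c/dI: one gradient map per colour channel (3, h, w).
grad = rng.normal(size=(3, h, w))

# Saliency = maximum absolute derivative across colour channels,
# giving a single (h, w) map with one value per pixel.
saliency = np.abs(grad).max(axis=0)

# High-saliency pixels can seed the foreground model of a GraphCut-style
# segmentation; a quantile threshold is a crude stand-in for that step.
threshold = np.quantile(saliency, 0.95)
foreground_seed = saliency >= threshold
```

Collapsing channels with a max (rather than a sum) keeps a pixel salient if any one of its colour channels strongly affects the class score.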
This paper attempts to understand the representations learnt by deep convolutional neural networks by introducing two interpretable visualization techniques.

Main contributions:

* Class model visualizations - These are obtained by numerical optimization in the input space to maximize the class score. Gradients are calculated w.r.t. the input and are used to update the input image (initialized with the zero image), while the weights are kept fixed to those obtained from training.
* Image-specific saliency map visualizations - These are approximated using the same gradient as before (gradient of the class score w.r.t. the input). The absolute pixel-wise max across channels produces the saliency map.
* Relation between DeconvNet and optimization-based visualizations - Visualizations using a DeconvNet are the same as gradient-based methods except at the ReLU: in regular backprop, gradients flow through a ReLU only at units with positive input activations, whereas in a DeconvNet the sign indicator is computed on the positive output reconstructions.

## Strengths

* The visualization techniques are simple ideas and the results are interpretable. They show that the method proposed by Erhan et al. in an unsupervised setting is useful for CNNs trained in a supervised manner as well.
* The image-specific class saliency can be interpreted as identifying the pixels that need to change the least to have the maximum impact on the classification score.
* The relation between DeconvNet visualizations and optimization-based visualizations is insightful.

## Weaknesses / Notes

* The reasoning behind initializing with the zero image and using L2 regularization in the class model visualizations is missing.
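The ReLU difference between the two methods is easy to state in code. The sketch below uses illustrative values: for a ReLU layer `y = max(x, 0)` receiving an upstream signal `g`, backprop gates `g` by the sign of the layer *input*, while a DeconvNet gates it by the sign of the upstream (output-side) signal itself.

```python
import numpy as np

x = np.array([-1.0,  2.0, -3.0, 4.0])  # layer input (illustrative values)
g = np.array([ 0.5, -1.0,  2.0, 3.0])  # upstream gradient / reconstruction

# Regular backprop (this paper): pass g where the layer INPUT was positive.
backprop_grad = g * (x > 0)            # -> [0.0, -1.0, 0.0, 3.0]

# DeconvNet: pass g where the upstream (output-side) signal is positive.
deconvnet_grad = g * (g > 0)           # -> [0.5, 0.0, 2.0, 3.0]
```

Note that the DeconvNet variant ignores which units actually fired on the forward pass, while backprop ignores the sign of the incoming signal; guided backpropagation (later work) combines both masks.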