Welcome to ShortScience.org! |
[link]
The prediction gradient is just $\frac{\partial \mathbf{y}}{\partial w}$ where $\mathbf{y}$ is the output before the loss function. |
[link]
Hein and Andriushchenko give a intuitive bound on the robustness of neural networks based on the local Lipschitz constant. With robustness, the authors refer a small $\epsilon$-ball around each sample; this ball is supposed to describe the region where the neural network predicts a constant class. This means that adversarial examples have to compute changes large enough to leave these robust areas. Larger $\epsilon$-balls imply higher robustness to adversarial examples. When considering a single example $x$, and a classifier $f = (f_1, \ldots, f_K)^T$ (i.e. in a multi-class setting), the bound can be stated as follows. For $q$ and $p$ such that $\frac{1}{q} + \frac{1}{p} = 1$ and $c$ being the class predicted for $x$, the it holds $x = \arg\max_j f_j(x + \delta)$ for all $\delta$ with $\|\delta\|_p \leq \max_{R > 0}\min \left\{\min_{j \neq c} \frac{f_c(x) – f_j(x)}{\max_{y \in B_p(x, R)} \|\nabla f_c(y) - \nabla f_j(y)\|_q}, R\right\}$. Here, $B_p(x, R)$ describes the $R$-ball around $x$ measured using the $p$-norm. Based on the local Lipschitz constant (in the denominator), the bound essentially measures how far we can deviate from the sample $x$ (measured in the $p$-norm) until $f_j(x) > f_c(x)$ for some $j \neq c$. The higher the local Lipschitz constant, the smaller deviations are allowed, i.e. adversarial examples are easier to find. Note that the bound also depends on the confidence, i.e. the edge $f_c(x)$ has in comparison to all other $f_j(x)$. In the remaining paper, the authors also provide bounds for simple classifiers including linear classifiers, kernel methods and two-layer perceptrons (i.e. one hidden layer). For the latter, they also propose a new type of regularization called cross-Lipschitz regularization: $P(f) = \frac{1}{nK^2} \sum_{i = 1}^n \sum_{l,m = 1}^K \|\nabla f_l(x_i) - \nabla f_m(x_i)\|_2^2$. This regularization term is intended to reduce the Lipschitz constant locally around training examples. They show experimental results using this regularization on MNIST and CIFAR, see the paper for details. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
# Object detection system overview. https://i.imgur.com/vd2YUy3.png 1. takes an input image, 2. extracts around 2000 bottom-up region proposals, 3. computes features for each proposal using a large convolutional neural network (CNN), and then 4. classifies each region using class-specific linear SVMs. * R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. * On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat , which had the previous best result at 24.3%. ## There is a 2 challenges faced in object detection 1. localization problem 2. labeling the data 1 localization problem : * One approach frames localization as a regression problem. they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method. * An alternative is to build a sliding-window detector. considered adopting a sliding-window approach increases the number of convolutional layers to 5, have very large receptive fields (195 x 195 pixels) and strides (32x32 pixels) in the input image, which makes precise localization within the sliding-window paradigm. 2 labeling the data: * The conventional solution to this problem is to use unsupervised pre-training, followed by supervise fine-tuning * supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain specific fine-tuning on a small dataset (PASCAL), * fine-tuning for detection improves mAP performance by 8 percentage points. * Stochastic gradient descent via back propagation was used to effective for training convolutional neural networks (CNNs) ## Object detection with R-CNN This system consists of three modules * The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. * The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. * The third module is a set of class specific linear SVMs. Module design 1 Region proposals * which detect mitotic cells by applying a CNN to regularly-spaced square crops. * use selective search method in fast mode (Capture All Scales, Diversification, Fast to Compute). * the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) 2 Feature extraction. * extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of the CNN * Features are computed by forward propagating a mean-subtracted 227x227 RGB image through five convolutional layers and two fully connected layers. * warp all pixels in a tight bounding box around it to the required size * The feature matrix is typically 2000x4096 3 Test time detection * At test time, run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments). * warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class. * Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over union (IoU) overlap with a higher scoring selected region larger than a learned threshold. ## Training 1 Supervised pre-training: * pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding box labels are not available for this data) 2 Domain-specific fine-tuning. * use the stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals with learning rate of 0.001. 3 Object category classifiers. * use intersection-over union (IoU) overlap threshold method to label a region with The overlap threshold of 0.3. * Once features are extracted and training labels are applied, we optimize one linear SVM per class. * adopt the standard hard negative mining method to fit large training data in memory. ### Results on PASCAL VOC 201012 1 VOC 2010 * compared against four strong baselines including SegDPM, DPM, UVA, Regionlets. * Achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster https://i.imgur.com/0dGX9b7.png 2 ILSVRC2013 detection. * ran R-CNN on the 200-class ILSVRC2013 detection dataset * R-CNN achieves a mAP of 31.4% https://i.imgur.com/GFbULx3.png #### Performance layer-by-layer, without fine-tuning 1 pool5 layer * which is the max pooled output of the network’s fifth and final convolutional layer. *The pool5 feature map is 6 x6 x 256 = 9216 dimensional * each pool5 unit has a receptive field of 195x195 pixels in the original 227x227 pixel input 2 Layer fc6 * fully connected to pool5 * it multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases 3 Layer fc7 * It is implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification #### Performance layer-by-layer, with fine-tuning * CNN’s parameters fine-tuned on PASCAL. * fine-tuning increases mAP by 8.0 % points to 54.2% ### Network architectures * 16-layer deep network, consisting of 13 layers of 3 _ 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet. * RCNN with O-Net substantially outperforms R-CNN with TNet, increasing mAP from 58.5% to 66.0% * drawback in terms of compute time, with in terms of compute time, with than T-Net. 1 The ILSVRC2013 detection dataset * dataset is split into three sets: train (395,918), val (20,121), and test (40,152) #### CNN features for segmentation. * full R-CNN: The first strategy (full) ignores the re region’s shape and computes CNN features directly on the warped window. Two regions might have very similar bounding boxes while having very little overlap. * fg R-CNN: the second strategy (fg) computes CNN features only on a region’s foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction. * full+fg R-CNN: The third strategy (full+fg) simply concatenates the full and fg features https://i.imgur.com/n1bhmKo.png
1 Comments
|
[link]
Rozsa et al. describe an adersarial attack against OpenMax [1] by directly targeting the logits. Specifically, they assume a network using OpenMax instead of a SoftMax layer to compute the final class probabilities. OpenMax allows “open-set” networks by also allowing to reject input samples. By directly targeting the logits of the trained network, i.e. iteratively pushing the logits in a target direction, it does not matter whether SoftMax or OpenMax layers are used on top, the network can be fooled in both cases. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Ross and Doshi-Velez propose input gradient regularization to improve robustness and interpretability of neural networks. As the discussion of interpretability is quite limited in the paper, the main contribution is an extensive evaluation of input gradient regularization against adversarial examples – in comparison to defenses such as distillation or adversarial training. Specifically, input regularization as proposed in [1] is used: $\arg\min_\theta H(y,\hat{y}) + \lambda \|\nabla_x H(y,\hat{y})\|_2^2$ where $\theta$ are the network’s parameters, $x$ its input and $\hat{y}$ the predicted output. Here, $H$ might be a cross-entropy loss. It also becomes apparent why this regularization was originally called double-backpropagation because the second derivative is necessary during training. In experiments, the authors show that the proposed regularization is superior to many other defenses including distillation and adversarial training. Unfortunately, the comparison does not include other “regularization” techniques to improve robustness – such as Lipschitz regularization. This makes the comparison less interpretable, especially as the combination of input gradient regularization and adversarial training performs best (suggesting that adversarial training is a meaningful defense, as well). Still, I recommend a closer look on the experiments. For example, the authors also study the input gradients of defended models, leading to some interesting conclusions. [1] H. Drucket, Y. LeCun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 1992. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |