Welcome to ShortScience.org! |
Oh et al. propose two different approaches for whitening black box neural networks, i.e. predicting details of their internals such as architecture or training procedure. In particular, they consider attributes regarding architecture (activation function, dropout, max pooling, kernel size of convolutional layers, number of convolutional/fully connected layers etc.), attributes concerning optimization (batch size and optimization algorithm) and attributes regarding the data (data split and size). To create a dataset of models, they trained roughly 11k models on MNIST; they ensured that these models reach at least 98% accuracy on the validation set, and they also consider ensembles. For predicting model attributes, they propose two metamodels, called kennen-o and kennen-i, see Figure 1. Kennen-o takes as input the black box's predictions (i.e. final probability distributions) on a set of $100$ query inputs and tries to directly learn the attributes using an MLP of two fully connected layers. Kennen-i instead crafts a single input for which the black box's output reveals a specific model attribute. An example for kennen-i is shown in Figure 2. In experiments, they demonstrate that both metamodels are able to predict model attributes significantly better than chance. For details, I refer to the paper. https://i.imgur.com/YbFuniu.png Figure 1: Illustration of the two proposed approaches, kennen-o (top) and kennen-i (bottom). https://i.imgur.com/ZXj22zG.png Figure 2: Illustration of the images created by kennen-i to classify different attributes. See the paper for details. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
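To make the kennen-o input concrete, here is a minimal sketch of how the query outputs could be turned into a feature vector and passed through a two-layer MLP. It is not the authors' implementation: `toy_black_box`, the 4-dimensional queries, the hidden size of 16, the ReLU between the layers and the untrained random weights are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def query_features(black_box, queries):
    """Concatenate the black box's output distributions over all queries."""
    features = []
    for q in queries:
        features.extend(black_box(q))
    return features


def dense(x, weights, bias):
    """Fully connected layer: weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def metamodel(features, params):
    """Two fully connected layers (ReLU in between is an assumption), softmax on top."""
    hidden = [max(0.0, h) for h in dense(features, params["w1"], params["b1"])]
    return softmax(dense(hidden, params["w2"], params["b2"]))


def random_params(n_in, n_hidden, n_out, rng):
    """Untrained placeholder weights; a real metamodel would be trained."""
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"w1": mat(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "w2": mat(n_out, n_hidden), "b2": [0.0] * n_out}


def toy_black_box(x):
    """Hypothetical stand-in for an MNIST classifier: returns a 10-way distribution."""
    return softmax([sum(x) * (k + 1) / 10.0 for k in range(10)])


rng = random.Random(0)
queries = [[rng.random() for _ in range(4)] for _ in range(100)]  # 100 query inputs
features = query_features(toy_black_box, queries)                 # 100 * 10 values
params = random_params(len(features), 16, 3, rng)                 # 3 candidate attribute values
attribute_probs = metamodel(features, params)
```

In practice the metamodel would be trained on the outputs of many models with known attributes, and one output head per attribute would be needed.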
This paper introduces a CNN-based segmentation of an object that a user specifies with four extreme points (the left-most, right-most, top and bottom pixels of the object, which also imply a bounding box). Interestingly, a related work has shown that clicking extreme points is about 5 times faster than drawing a bounding box. https://i.imgur.com/9GJvf17.png The extreme points serve several purposes in this work. First, they are used as a bounding box to crop the object of interest. Second, they are used to create a heatmap with activations in the regions of the extreme points. The heatmap is created by placing a 2D Gaussian centered at each of the extreme points. This heatmap is matched to the size of the resized crop (i.e. 512x512) and is concatenated with the original RGB channels of the crop. The concatenated input with 4 channels is fed to the network, which is a ResNet-101 with the FC and last two maxpool layers removed. In order to maintain the same receptive field, atrous convolution is used. The pyramid scene parsing module from PSPNet is used to aggregate global context. The network is trained with a standard cross-entropy loss weighted by a normalization factor (i.e. the frequency of each class in the dataset). How does it compare to the "Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++" paper in terms of accuracy? Specifically, if the polygon is wrong, it is easy to correct the points on the polygon that are wrong. However, it is unclear how to obtain the preferred segmentation when the object of interest is not segmented properly no matter how many (more than four) extreme points are selected. |
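The input construction described above can be sketched as follows. This is an illustrative sketch rather than the paper's code: the function names, the value of `sigma`, the pixel-wise maximum used to combine overlapping Gaussians and the small 64x64 crop (instead of 512x512) are assumptions.

```python
import math


def extreme_point_heatmap(height, width, points, sigma=10.0):
    """Return a height x width heatmap with a 2D Gaussian at each (x, y) point.

    Overlapping Gaussians are combined with a pixel-wise maximum, which is an
    assumption of this sketch.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(height):
            for x in range(width):
                d2 = (x - px) ** 2 + (y - py) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                if value > heatmap[y][x]:
                    heatmap[y][x] = value
    return heatmap


def bounding_box(points):
    """Tight box (x_min, y_min, x_max, y_max) implied by the extreme points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def add_heatmap_channel(rgb, heatmap):
    """Concatenate the heatmap as a fourth channel; rgb[y][x] is an (r, g, b) tuple."""
    return [[tuple(pixel) + (heatmap[y][x],) for x, pixel in enumerate(row)]
            for y, row in enumerate(rgb)]


# Left-most, right-most, top-most and bottom-most clicks on a 64x64 crop.
points = [(0, 30), (63, 34), (28, 0), (35, 63)]
heatmap = extreme_point_heatmap(64, 64, points, sigma=5.0)
rgb = [[(0.5, 0.5, 0.5) for _ in range(64)] for _ in range(64)]
net_input = add_heatmap_channel(rgb, heatmap)  # 64 x 64 x 4
```

The heatmap peaks at 1.0 on each click and decays to nearly zero in the interior of the crop, so the fourth channel tells the network where the object boundary touches the box.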
Ross and Doshi-Velez propose input gradient regularization to improve robustness and interpretability of neural networks. As the discussion of interpretability is quite limited in the paper, the main contribution is an extensive evaluation of input gradient regularization against adversarial examples, in comparison to defenses such as distillation or adversarial training. Specifically, input gradient regularization as proposed in [1] is used: $\arg\min_\theta H(y,\hat{y}) + \lambda \|\nabla_x H(y,\hat{y})\|_2^2$ where $\theta$ are the network’s parameters, $x$ its input and $\hat{y}$ the predicted output. Here, $H$ might be a cross-entropy loss. It also becomes apparent why this regularization was originally called double backpropagation: second derivatives are necessary during training. In experiments, the authors show that the proposed regularization is superior to many other defenses including distillation and adversarial training. Unfortunately, the comparison does not include other “regularization” techniques to improve robustness, such as Lipschitz regularization. This makes the comparison less interpretable, especially as the combination of input gradient regularization and adversarial training performs best (suggesting that adversarial training is a meaningful defense as well). Still, I recommend a closer look at the experiments. For example, the authors also study the input gradients of defended models, leading to some interesting conclusions. [1] H. Drucker, Y. LeCun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 1992. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
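As a small worked instance of the objective above, consider binary logistic regression, where the input gradient of the cross-entropy has the closed form $\nabla_x H = (p - y)\,w$ with $p = \sigma(w^\top x + b)$. The sketch below evaluates the regularized loss for this special case; the function name and the numbers are illustrative assumptions, and a deep network would obtain the input gradient by backpropagation instead.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def regularized_loss(w, b, x, y, lam):
    """Cross-entropy plus lam times the squared L2 norm of its input gradient."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    cross_entropy = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # For logistic regression, dH/dx = (p - y) * w in closed form.
    input_grad = [(p - y) * wi for wi in w]
    penalty = sum(g * g for g in input_grad)
    return cross_entropy + lam * penalty, input_grad


w, b = [2.0, -1.0], 0.5      # hypothetical model parameters
x, y = [0.3, 0.7], 1         # one training example with label 1
plain, _ = regularized_loss(w, b, x, y, lam=0.0)
total, grad_x = regularized_loss(w, b, x, y, lam=0.1)
```

Minimizing `total` with respect to `w` and `b` requires differentiating `grad_x` again, which is where the second derivatives mentioned above come from.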
Ulyanov et al. utilize untrained neural networks as regularizer/prior for various image restoration tasks such as denoising, inpainting and super-resolution. In particular, the standard formulation of such tasks, i.e. $x^\ast = \arg\min_x E(x; x_0) + R(x)$ where $x_0$ is the input image and $E$ a task-dependent data term, is rephrased as follows: $\theta^\ast = \arg\min_\theta E(f_\theta(z); x_0)$ and $x^\ast = f_{\theta^\ast}(z)$ for a fixed but random $z$. Here, the regularizer $R$ is essentially replaced by an untrained neural network $f_\theta$, usually in the form of a convolutional encoder. The authors argue that the regularizer is effectively $R(x) = 0$ if the image can be generated by the encoder from the fixed code $z$ and $R(x) = \infty$ if not. However, this argument does not necessarily provide any insight into why this approach works (as demonstrated in the paper). A main question addressed in the paper is why the network $f_\theta$ can be used as a prior, given the assumption that high-capacity networks can essentially fit any image (including random noise). In my opinion, the authors do not give a convincing answer to this question. Essentially, they argue that random noise is just harder to fit (i.e. it takes longer). Therefore, limiting the number of iterations is enough as regularization. Personally, I would argue that this observation is mainly due to prior knowledge put into the encoder architecture and the idea that natural images (or any images with some structure) are more easily embedded into low-dimensional latent spaces than fully i.i.d. random noise. They provide experiments on a range of tasks including denoising, image inpainting, super-resolution and neural network “inversion”. Figure 1 shows some results for image inpainting that I found quite convincing. For the remaining experiments I refer to the paper. https://i.imgur.com/BVQsaup.png Figure 1: Qualitative results for image inpainting.
Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
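The reparameterization $\theta^\ast = \arg\min_\theta E(f_\theta(z); x_0)$, $x^\ast = f_{\theta^\ast}(z)$ from the summary above can be illustrated with a deliberately tiny toy problem. Everything here is an assumption made for illustration: the "network" is an affine map with two parameters rather than a convolutional network, the code $z$ is a fixed ramp rather than random, the "image" is a 1D signal, and the inpainting data term is taken to be a squared error restricted to observed pixels.

```python
def generator(theta, z):
    """Toy stand-in for f_theta: an affine map applied to each code entry."""
    return [theta[0] + theta[1] * zi for zi in z]


def masked_energy(x, x0, mask):
    """Data term E(x; x0): mean squared error over observed pixels only."""
    observed = sum(mask)
    return sum(m * (a - b) ** 2 for a, b, m in zip(x, x0, mask)) / observed


def fit(z, x0, mask, steps=2000, lr=0.1):
    """Gradient descent on theta with a fixed step budget; returns theta."""
    theta = [0.0, 0.0]
    observed = sum(mask)
    for _ in range(steps):
        x = generator(theta, z)
        residual = [m * (a - b) for a, b, m in zip(x, x0, mask)]
        grad0 = 2.0 * sum(residual) / observed
        grad1 = 2.0 * sum(r * zi for r, zi in zip(residual, z)) / observed
        theta = [theta[0] - lr * grad0, theta[1] - lr * grad1]
    return theta


n = 11
z = [i / (n - 1) for i in range(n)]                     # fixed code (a ramp in this toy)
clean = [1.0 + 2.0 * zi for zi in z]                    # signal the toy generator can represent
mask = [0 if i in (4, 5, 6) else 1 for i in range(n)]   # pixels 4..6 are missing
x0 = [c if m else 0.0 for c, m in zip(clean, mask)]     # corrupted observation

theta_star = fit(z, x0, mask)
restored = generator(theta_star, z)                     # x* = f_{theta*}(z)
```

Only the observed pixels enter the energy, yet the missing ones are filled in because the parameterization itself restricts which signals can be produced. In this toy the restriction comes from having only two parameters; it does not reproduce the paper's setting, where the network is high-capacity and the number of iterations is limited.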
Liu et al. propose randomizing neural networks, implicitly learning an ensemble of models, to defend against adversarial attacks. In particular, they introduce Gaussian noise layers before regular convolutional layers. The noise can be seen as an additional parameter of the model. During training, noise is randomly added. During testing, the model is evaluated on a single test input using multiple random noise vectors; this essentially corresponds to an ensemble of different models (parameterized by the different noise vectors). Mathematically, the authors provide two interesting interpretations. First, they argue that training essentially minimizes an upper bound of the (noisy) inference loss. Second, they show that their approach is equivalent to Lipschitz regularization [1]. [1] M. Hein, M. Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. ArXiv:1705.08475, 2017. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
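The test-time procedure described above, evaluating one input under several noise draws and combining the results, can be sketched as follows. This is a toy illustration and not the authors' architecture: the two-class linear "network", the noise level `sigma`, the number of samples and the choice to average class probabilities are assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_forward(x, weights, sigma, rng):
    """One stochastic pass: add Gaussian noise to the input of a linear layer."""
    noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
    logits = [sum(w * v for w, v in zip(row, noisy)) for row in weights]
    return softmax(logits)


def ensemble_predict(x, weights, sigma, n_samples, rng):
    """Average class probabilities over n_samples independent noise draws."""
    total = [0.0] * len(weights)
    for _ in range(n_samples):
        probs = noisy_forward(x, weights, sigma, rng)
        total = [t + p for t, p in zip(total, probs)]
    return [t / n_samples for t in total]


weights = [[1.0, -1.0], [-1.0, 1.0]]   # hypothetical 2-class linear "network"
x = [2.0, 0.0]
probs = ensemble_predict(x, weights, sigma=0.1, n_samples=20, rng=random.Random(0))
```

Each noise draw defines a slightly different model, so the average behaves like an ensemble prediction without storing more than one set of weights.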