![]() |
Welcome to ShortScience.org! |
![]() ![]() ![]() |
[link]
Miyato et al. propose distributional smoothing (or virtual adversarial training) as defense against adversarial examples. However, I think that both terms do not give a good intuition of what is actually done. Essentially, a regularization term is introduced. Letting $p(y|x,\theta)$ be the learned model, the regularizer is expressed as $\text{KL}(p(y|x,\theta)|p(y|x+r,\theta)$ where $r$ is the perturbation that maximizes the Kullback-Leibler divergence above, i.e. $r = \arg\max_r \{\text{KL}(p(y|x,\theta)|p(y|x+r,\theta) | \|r\|_2 \leq \epsilon\}$ with hyper-parameter $\epsilon$. Essentially, the regularizer is supposed to “simulate” adversarial training – thus, the method is also called virtual adversarial training. The discussed implementation, however, is somewhat cumbersome. In particular, $r$ cannot be computed using first-order methods as the gradient of $\text{KL}$ is $0$ for $r = 0$. So a second-order method is used – for which the Hessian needs to be approximated and the corresponding eigenvectors need to be computed. For me it is unclear why $r$ cannot be initialized randomly to solve this issue … Then, the derivative of the regularizer needs to be computed during training. Here, the authors make several simplifications (such as fixing $\theta$ in the first part of the Kullback-Leibler divergence and ignoring the derivative of $r$ w.r.t. $\theta$). Overall, however, I like the idea of “virtual” adversarial training as it avoids the need of explicitly using attacks during training to craft adversarial examples. Then, the trained model is often robust against the chosen attacks, but new adversarial examples can be found easily through novel attacks. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). ![]() |
[link]
This paper describes how to find local interpretable model-agnostic explanations (LIME) why a black-box model $m_B$ came to a classification decision for one sample $x$. The key idea is to evaluate many more samples around $x$ (local) and fit an interpretable model $m_I$ to it. The way of sampling and the kind of interpretable model depends on the problem domain. For computer vision / image classification, the image $x$ is divided into superpixels. Single super-pixels are made black, the new image $x'$ is evaluated $p' = m_B(x')$. This is done multiple times. The paper is also explained in [this YouTube video](https://www.youtube.com/watch?v=KP7-JtFMLo4) by Marco Tulio Ribeiro. A very similar idea is already in the [Zeiler & Fergus paper](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13#martinthoma). ## Follow-up Paper * June 2016: [Model-Agnostic Interpretability of Machine Learning](https://arxiv.org/abs/1606.05386) * November 2016: * [Nothing Else Matters: Model-Agnostic Explanations By Identifying Prediction Invariance](https://arxiv.org/abs/1611.05817) * [An unexpected unity among methods for interpreting model predictions](https://arxiv.org/abs/1611.07478) ![]() |
[link]
This paper performs a comparitive study of recent advances in deep learning with human-like learning from a cognitive science point of view. Since natural intelligence is still the best form of intelligence, the authors list a core set of ingredients required to build machines that reason like humans. - Cognitive capabilities present from childhood in humans. - Intuitive physics; for example, a sense of plausibility of object trajectories, affordances. - Intuitive psychology; for example, goals and beliefs. - Learning as rapid model-building (and not just pattern recognition). - Based on compositionality and learning-to-learn. - Humans learn by inferring a general schema to describe goals, object types and interactions. This enables learning from few examples. - Humans also learn richer conceptual models. - Indicator: variety of functions supported by these models: classification, prediction, explanation, communication, action, imagination and composition. - Models should hence have strong inductive biases and domain knowledge built into them; structural sharing of concepts by compositional reuse of primitives. - Use of both model-free and model-based learning. - Model-free, fast selection of actions in simple associative learning and discriminative tasks. - Model-based learning when a causal model has been built to plan future actions or maximize rewards. - Selective attention, augmented working memory, and experience replay are low-level promising trends in deep learning inspired from cognitive psychology. - Need for higher-level aforementioned ingredients. ![]() |
[link]
### Main Idea: It basically tunes the hyper-parameters of the neural network architecture using reinforcement learning. The reward signal is taken as evaluation on the validation set. The method is policy gradient as the cost function is non-differentiable. ### Method: #### i. Actions: 1. There is controller RNN which predicts some hyper-parameters of the layer conditioned on the previous predictions. This prediction is just a one-hot vector based on the previous hyper-parameter chosen. At the start for the first prediction - this vector is just all zeros. 2. Once the network is generated completely, it is trained for a fixed number of epochs, the reward signal is calculated based on the evaluation on the validation set. #### ii. Training: 1. Nothing fancy in the reinforcement learning approach simple policy gradients. 2. Baseline is added to reduce the variance. 3. It takes 2-3 weeks to train it over 800 GPUs!! #### iii. Results: 1. Use it to generate CNNs and LSTM cells. Close to state-of-art results with generated architectures. ### Possible new directions: 1. Use better techniques RL techniques like TRPO, PPO etc. 2. Right now, they generate fixed length architecture. Their reason is for variable length architectures, it is difficult to determine how much time each architecture is trained. Smaller networks are easier to train. Thus, somehow determine training time as function of the learning capacity of the network. Code: They haven't released the code yet. I tried to simulate it in torch.(https://github.com/abhigenie92/nn_search) ![]() |
[link]
Pereyra et al. propose an entropy regularizer for penalizing over-confident predictions of deep neural networks. Specifically, given the predicted distribution $p_\theta(y_i|x)$ for labels $y_i$ and network parameters $\theta$, a regularizer $-\beta \max(0, \Gamma – H(p_\theta(y|x))$ is added to the learning objective. Here, $H$ denotes the entropy and $\beta$, $\Gamma$ are hyper-parameters allowing to weight and limit the regularizers influence. In experiments, this regularizer showed slightly improved performance on MNIST and Cifar-10. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). ![]() |
[link]
Nayebi and Ganguli propose saturating neural networks as defense against adversarial examples. The main observation driving this paper can be stated as follows: Neural networks are essentially based on linear sums of neurons (e.g. fully connected layers, convolutiona layers) which are then activated; by injecting a small amount of noise per neuron it is possible to shift the final sum by large values, thereby propagating the noisy through the network and fooling the network into misclassifying an example. To prevent the impact of these adversarial examples, the network should be trained in a manner to drive many neurons into a saturated regime – noisy will, so the argument, have less impact then. The authors also give a biological motivation, which I won't go into detail here. Letting $\psi$ be the used activation function, e.g. sigmoid or ReLU, a regularizer is added to drive neurons into saturation. In particular, a penalty $\lambda \sum_l \sum_i \psi_c(h_i^l)$ is added to the loss. Here, $l$ indexes the layer and $i$ the unit in the layer; $h_i^l$ then describes the input to the non-linearity computed for unit $i$ in layer $l$. $\psi_c$ is the complementary function defined as $\psi_c(z) = \inf_{z': \psi'(z') = 0} |z – z'|$ It defines the distance of the point $z$ to the nearest saturated point $z'$ where $\psi'(z') = 0$. For ReLU activations, the complementary function is the ReLU function itself; for sigmoid activations, the complementary function is $\sigma_c(z) = |\sigma(z)(1 - \sigma(z))|$. In experiments, Nayebi and Ganguli show that training with the additional penalty yields networks with higher robustness against adversarial examples compared to adversarial training (i.e. training on adversarial examples). They also provide some insight, showing e.g. the activation and weight distribution of layers illustrating that neurons are indeed saturated in large parts. For details, see the paper. I also want to point to a comment on the paper written by Brendel and Bethge [1] questioning the effectiveness of the proposed defense strategy. They discuss a variant of the fast sign gradient method (FSGM) with stabilized gradients which is able to fool saturated networks. [1] W. Brendel, M. Behtge. Comment on “Biologically inspired protection of deep networks from adversarial attacks”, https://arxiv.org/abs/1704.01547. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). ![]() |
[link]
This paper introduces a new AI task - Embodied Question Answering. The goal of this task for an agent is to be able to answer the question by observing the environment through a single egocentric RGB camera while being able to navigate inside the environment. The agent has 4 natural modules: https://i.imgur.com/6Mjidsk.png 1. **Vision**. 224x224 RGB images are processed by CNN to produce a fixed-size representation. This CNN is pretrained on pixel-to-pixel tasks such as RGB reconstruction, semantic segmentation, and depth estimation. 2. **Language**. Questions are encoded with 2-layer LSTMs with 128-d hidden states. Separate question encoders are used for the navigation and answering modules to capture important words for each module. 3. **Navigation** is composed of a planner (forward, left, right, and stop actions) and a controller that executes planner selected action for a variable number of times. The planner is LSTM taking hidden state, image representation, question, and previous action. Contrary, a controller is an MLP with 1 hidden layer which takes planner's hidden state, action from the planner, and image representation to execute an action or pass the lead back to the planner. 4. **Answering** module computes an image-question similarity of the last 5 frames via a dot product between image features (passed through an fc-layer to align with question features) and question encoding. This similarity is converted to attention weights via a softmax, and the attention-weighted image features are combined with the question features and passed through an answer classifier. Visually this process is shown in the figure below. https://i.imgur.com/LeZlSZx.png [Successful results](https://www.youtube.com/watch?v=gVj-TeIJfrk) as well as [failure cases](https://www.youtube.com/watch?v=4zH8cz2VlEg) are provided. Generally, this is very promising work which literally just scratches the surface of what is possible. There are several constraints which can be mitigated to push this field to more general outcomes. For example, use more general environments with more realistic graphics and broader set of questions and answers. ![]() |
[link]
This paper presents a per-frame image-to-image translation system enabling copying of a motion of a person from a source video to a target person. For example, a source video might be a professional dancer performing complicated moves, while the target person is you. By utilizing this approach, it is possible to generate a video of you dancing as a professional. Check the authors' [video](https://www.youtube.com/watch?v=PCBTZh41Ris) for the visual explanation. **Data preparation** The authors have manually recorded high-resolution video ( at 120fps ) of a person performing various random moves. The video is further decomposed to frames, and person's pose keypoints (body joints, hands, face) are extracted for each frame. These keypoints are further connected to form a person stick figure. In practice, pose estimation is performed by open source project [OpenPose](https://github.com/CMU-Perceptual-Computing-Lab/openpose). **Training** https://i.imgur.com/VZCXZMa.png Once the data is prepared the training is performed in two stages: 1. **Training pix2pixHD model with temporal smoothing**. The core model is an original [pix2pixHD](https://tcwang0509.github.io/pix2pixHD/)[1] model with temporal smoothing. Specifically, if we were to use vanilla pix2pixHD, the input to the model would be a stick person image, and the target is the person's image corresponding to the pose. The network's objective would be $min_{G} (Loss1 + Loss2 + Loss3)$, where: - $Loss1 = max_{D_1, D_2, D_3} \sum_{k=1,2,3} \alpha_{GAN}(G, D_k)$ is adverserial loss; - $Loss2 = \lambda_{FM} \sum_{k=1,2,3} \alpha_{FM}(G,D_k)$ is feature matching loss; - $Loss3 = \lambda_{VGG}\alpha_{VGG}(G(x),y)]$ is VGG perceptual loss. However, this objective does not account for the fact that we want to generate video composed of frames that are temporally coherent. The authors propose to ensure *temporal smoothing* between adjacent frames by including pose, corresponding image, and generated image from the previous step (zero image for the first frame) as shown in the figure below: https://i.imgur.com/0NSeBVt.png Since the generated output $G(x_t; G(x_{t-1}))$ at time step $t$ is now conditioned on the previously generated frame $G(x_{t-1})$ as well as current stick image $x_t$, better temporal consistency is ensured. Consequently, the discriminator is now trying to determine both correct generation, as well as temporal consitency for a fake sequence $[x_{t-1}; x_t; G(x_{t-1}), G(x_t)]$. 2. **Training FaceGAN model**. https://i.imgur.com/mV1xuMi.png In order to improve face generation, the authors propose to use specialized FaceGAN. In practice, this is another smaller pix2pixHD model (with a global generator only, instead of local+global) which is fed with a cropped face area of a stick image and cropped face area of a corresponding generated image (from previous step 1) and aims to generate a residual which is added to the previously generated full image. **Testing** During testing, we extract frames from the input video, obtain pose stick image for each frame, normalize the stick pose image and feed it to pix2pixHD (with temporal consistency) and, further, to FaceGAN to produce final generated image with improved face features. Normalization is needed to capture possible pose variation between a source and a target input video. **Remarks** While this method produces a visually appealing result, it is not perfect. The are several reasons for being so: 1. *Quality of a pose stick image*: if the pose detector "misses" the keypoint, the generator might have difficulties to generate a properly rendered image; 2. *Motion blur*: motion blur causes pose detector to miss keypoints; 3. *Severe scale change*: if source person is very far, keypoint detector might fail to detect proper keypoints. Among video rendering challenges, the authors mention self-occlusion, cloth texture generation, video jittering (training-test motion mismatch). References: [1] "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs" ![]()
2 Comments
|
[link]
Gowal et al. propose interval bound propagation to obtain certified robustness against adversarial examples. In particular, given a neural network consisting of linear layers and monotonic increasing activation functions, a set of allowed perturbations is propagated to obtain upper and lower bounds at each layer. These lead to bounds on the logits of the network; these are used to verify whether the network changes its prediction on the allowed perturbations. Specifically, Gowal et al. consider an $L_\infty$ ball around input examples; the initial bounds are, thus, $\underline{z}_0 = x - \epsilon$ and $\overline{z}_0 = x + \epsilon$. For each layer, the bounds are defined as $\underline{z}_{k,i} = \min_{\underline{z}_{k – 1} \leq z_{k – 1} \leq \overline{z}_{k-1}} e_i^T h_k(z_{k – 1})$ and the analogous maximization problem for the upper bound; here, $h$ denotes the applied layer. For Linear layers and monotonic activation functions, this is easy to solve, as shown in the paper. Moreover, computing these bounds is very efficient, only needing roughly two times the computation of one forward pass. During training, a combination of a clean loss and adversarial loss is used: $\kappa l(z_K, y) + (1 - \kappa) l(\hat{z}_K, y)$ where $z_K$ are the logits of the input $x$, and $\hat{z}_K$ are the adversarial logits computed as $\hat{Z}_{K,y’} = \begin{cases} \overline{z}_{K,y’} & \text{if } y’ \neq y\\\underline{z}_{K,y} & \text{otherwise}\end{cases}$ Both $\epsilon$ and $\kappa$ are annealed during training. In experiments, it is shown that this method results in quite tight bounds on robustness. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). ![]() |
[link]
The main idea in this paper is to use the agent's ability to predict observations at the next step as a measure of how much exploration of that action should be encouraged. This prediction is based on a deep architecture, specifically a deep autoencoder representation of observations, and accuracy of prediction is measured at the level of that learned, deep representation. Exploration is encourage by increasing the reward whenever the models prediction of the representation at the next time step is bad. #### My two cents I'm not sure how novel this idea is in RL, but at the very least it's interesting that it was explored the way it was here, with deep learning. As a non-expert in RL, I certainly enjoyed reading the paper. Also, this implements nicely an idea that just seems like common sense, as an exploration strategy for an agent: actions that merit exploration are those that yield results that are unexpected to you. It will be interesting to see if this general approach will be able to exploit upcoming progress in the development of better generative deep learning models, an area that is currently very active. ![]() |