Welcome to ShortScience.org! 
[link]
Carlini and Wagner propose three novel attacks for generating adversarial examples and show that defensive distillation is not effective against them. In particular, they devise attacks for the three commonly used distance measures $L_0$, $L_2$ and $L_\infty$, which measure the deviation of the adversarial perturbation from the original test sample. Starting with the targeted objective $\min_\delta d(x, x + \delta)$ s.t. $f(x + \delta) = t$ and $x+\delta \in [0,1]^n$, they consider up to 7 different surrogate objectives to express the constraint $f(x + \delta) = t$. Here, $f$ is the neural network to attack, $t$ is the target class and $\delta$ denotes the perturbation. This leads to the formulation $\min_\delta \|\delta\|_p + c L(x + \delta)$ s.t. $x + \delta \in [0,1]^n$, where $L$ is the surrogate loss and $c$ weights it against the size of the perturbation. After extensive evaluation, the loss $L$ is taken to be $L(x') = \max(\max\{Z(x')_i : i\neq t\} - Z(x')_t, -\kappa)$ where $x' = x + \delta$ and $Z(x')_i$ refers to the logit for class $i$; $\kappa$ is a constant ($=0$ in their experiments) that can be used to control the confidence of the adversarial example. In practice, the box constraint $[0,1]^n$ is encoded through a change of variable by expressing $\delta$ in terms of the hyperbolic tangent, see the paper for details. Carlini and Wagner then discuss the detailed attacks for all three norms, i.e. $L_0$, $L_2$ and $L_\infty$, where the first and the last are discussed in more detail because they are non-differentiable.
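To make the surrogate loss and the tanh change of variable concrete, here is a minimal sketch in plain Python. It is not the authors' code: the function names, the toy `model` callable and the use of the squared $L_2$ distance are assumptions made for illustration, and the optimisation loop itself is left out.

```python
import math

def cw_loss(logits, target, kappa=0.0):
    """Surrogate loss max(max_{i != t} Z_i - Z_t, -kappa) on a list of logits."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)

def to_box(w):
    """Change of variable: map an unconstrained w to a value in [0, 1]."""
    return 0.5 * (math.tanh(w) + 1.0)

def objective(w, x, c, target, model, kappa=0.0):
    """Squared L2 distance plus c times the surrogate loss.

    The adversarial input is x_adv = to_box(w), so the box constraint
    holds by construction and w can be optimised without constraints.
    `model` is a hypothetical callable returning a list of logits.
    """
    x_adv = [to_box(wi) for wi in w]
    dist = sum((a - b) ** 2 for a, b in zip(x_adv, x))
    return dist + c * cw_loss(model(x_adv), target, kappa)
```

The loss is positive while some other class still has a larger logit than the target, and it bottoms out at $-\kappa$ once the target logit leads by at least $\kappa$.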
[link]
Adversarial examples are datapoints that are designed to fool a classifier. For example, we can take an image that is classified correctly using a neural network, then backprop through the model to find which changes we need to make in order for it to be classified as something else. And these changes can be quite small, such that a human would hardly notice a difference. https://i.imgur.com/pkK570X.png Examples of adversarial images. In this paper, they show that much of this property holds even when the images are fed into the classifier from the real world – after being photographed with a cell phone camera. While the accuracy goes from 85.3% to 36.3% when adversarial modifications are applied on the source images, the performance still drops from 79.8% to 36.4% when the images are photographed. They also propose two modifications to the process of generating adversarial images – making it into a more gradual iterative process, and optimising for a specific adversarial class. 
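The "more gradual iterative process" can be sketched as repeated small signed-gradient steps that are clipped after every step. This is only an illustration under assumptions: `grad_fn` is a hypothetical callable returning the gradient of the loss with respect to the input, and the names `eps` (maximum per-pixel change) and `alpha` (step size) are chosen here, not taken from the summary.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def iterative_sign_attack(x, grad_fn, eps, alpha, steps):
    """Take small signed-gradient steps, clipping after each one.

    x       : original input as a list of pixel values in [0, 1]
    grad_fn : hypothetical callable giving d(loss)/d(input) at a point
    """
    x_adv = list(x)
    for _ in range(steps):
        g = grad_fn(x_adv)
        for i in range(len(x_adv)):
            v = x_adv[i] + alpha * sign(g[i])
            v = min(max(v, x[i] - eps), x[i] + eps)  # stay within eps of the original
            x_adv[i] = min(max(v, 0.0), 1.0)         # stay a valid pixel value
    return x_adv
```

Ascending the loss of the true class gives an untargeted attack. For the variant that optimises for a specific adversarial class, one would instead step in the direction that decreases the loss of that target class.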
[link]
#### Introduction

* The paper presents gradient-computation-based techniques to visualise image classification models.
* [Link to the paper](https://arxiv.org/abs/1312.6034)

#### Experimental Setup

* Single deep ConvNet trained on the ILSVRC-2013 dataset (1.2M training images and 1000 classes).
* Weight layer configuration is: conv64-conv256-conv256-conv256-conv256-full4096-full4096-full1000.

#### Class Model Visualisation

* Given a learnt ConvNet and a class (of interest), start with the zero image and perform optimisation by back-propagating with respect to the input image (keeping the ConvNet weights constant).
* Add the mean image (for the training set) to the resulting image.
* The paper used unnormalised class scores so that optimisation focuses on increasing the score of the target class and not on decreasing the score of other classes.

#### Image-Specific Class Saliency Visualisation

* Given an image, a class of interest, and a trained ConvNet, rank the pixels of the input image based on their influence on the class score.
* The derivative of the class score with respect to the image gives an estimate of the importance of different pixels for the class.
* The magnitude of the derivative also indicates which pixels need to be changed the least to affect the class score the most.

##### Class Saliency Extraction

* Find the derivative of the class score with respect to the input image.
* This results in one saliency map per colour channel.
* To obtain a single saliency map, take the maximum magnitude of the derivative across all colour channels.

##### Weakly Supervised Object Localisation

* The saliency map for an image provides a rough encoding of the location of the object of the class of interest.
* Given an image and its saliency map, an object segmentation map can be computed using GraphCut colour segmentation.
* Colour continuity cues are needed as saliency maps might capture only the most dominant part of the object in the image.
* This weakly supervised approach achieves 46.4% top-5 error on the test set of ILSVRC-2013.

#### Relation to Deconvolutional Networks

* DeconvNet-based reconstruction of the $n^{th}$ layer input is similar to computing the gradient of the visualised neuron activity $f$ with respect to the input layer.
* One difference is in the way ReLU neurons are treated:
    * In DeconvNet, the sign indicator (for the derivative of ReLU) is computed on the output reconstruction, while in this paper the sign indicator is computed on the layer input.
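The class saliency extraction step reduces the per-channel gradient to a single map by taking the largest magnitude over the colour channels. A minimal sketch, assuming the gradient has already been computed and is stored as nested lists indexed `grad[channel][row][col]` (this layout and the function name are assumptions):

```python
def saliency_map(grad):
    """Collapse a per-channel gradient into one saliency map.

    grad[c][i][j] is the derivative of the class score with respect to
    channel c of pixel (i, j). Each output entry is the maximum absolute
    derivative across channels at that pixel.
    """
    channels = len(grad)
    height, width = len(grad[0]), len(grad[0][0])
    return [
        [max(abs(grad[c][i][j]) for c in range(channels)) for j in range(width)]
        for i in range(height)
    ]
```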
[link]
https://i.imgur.com/JJFljWo.png

This paper follows in a recent tradition of results out of Samsung: in the wake of StyleGAN’s very impressive generated images, it uses a lot of similar architectural elements, combined with meta-learning and a new discriminator framework, to generate convincing “talking head” animations based on a small number of frames of a person’s face. Previously, models that generated artificial face videos could only do so by training on a large number of frames of each individual speaker that they wanted to simulate. This system instead is able to generate video in a few-shot way, where it only needs one or two frames of a new speaker to do convincing generation.

The structure of talking head video generation as a problem relies on the idea of “landmarks,” an explicit parametrization of where the nose, the eyes, the lips, and the head are oriented in a given shot. The model is trained to be able to generate frames of a specified person (based on an input frame), and in a specific pose (based on an input landmark set).

While the visual quality of the simulated video generated here is quite stunning, the most centrally impressive fact about this paper is that generation was only conditioned on a few frames of each target person. This is accomplished through a combination of meta-learning (as an overall training procedure/regime) and adaptive instance normalization, a way of dynamically parametrizing models that was earlier used in the StyleGAN paper. Meta-learning works by doing simulated few-shot training iterations, where a model is trained for a small number of steps on a given “task” (where here a task is a given target face), and then optimized on the meta-level to be able to get good test set error rates across many such target faces.

https://i.imgur.com/RIkO1am.png

The mechanics of how this meta-learning approach actually works are quite interesting: largely a new application of existing techniques, but with some extensions and innovations worked in.

- A convolutional model produces an embedding given an input image and a pose. An average embedding is calculated by averaging over different frames, with the hope of capturing information about the video in a pose-independent way. This embedding, along with a goal set of landmarks (i.e. the desired facial expression of your simulation), is used to parametrize the generator. The generated image is then passed to a discriminator, which is asked to determine whether it looks like it came from the sequence belonging to the target face, and whether it looks like it corresponds to the target pose.
- Adaptive instance normalization works by having certain parameters of the network (typically, per the name, post-normalization rescaling values) that are dependent on the properties of some input data instance. This works by training a network to produce an embedding vector of the image, and then multiplying that embedding by per-layer, per-filter projection matrices to obtain new parameters. This is in particular a reasonable thing to do in the context of conditional GANs, where you want to have parameters of your generator be conditioned on the content of the image you’re trying to simulate.
- This model structure gives you a natural way to do few-shot generation: you can train your embedding network, your generator, and your projection matrices over a large dataset, where they’ve hopefully learned how to compress information from any given target image and generate convincing frames based on it. You can then just pass in your new target image, have it transformed into an embedding, and have that embedding contain information the rest of the network can work with.
- This model uses a relatively new (~mid 2018) formulation of a conditional GAN, called the projection discriminator. I don’t have time to fully explain this here, but at a high level, it frames the problem of a discriminator determining whether a generated image corresponds to a given conditioning class by projecting both the class and the image into vectors, and calculating a similarity-esque dot product.
- During few-shot application of this model, it can get impressively good performance without even training on the new target face at all, simply by projecting the target face into an embedding, and updating the target-specific network parameters that way. However, they do get better performance if they fine-tune to a specific person, which they do by treating the embedding-projection parameters as an initialization, and then taking a few steps of gradient descent from there.
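As a rough illustration of the adaptive instance normalization step described above, the sketch below normalizes each filter's activations and then rescales and shifts them with values obtained by projecting the embedding. It is a toy version under assumptions: the names `P_scale` and `P_shift`, the flattened feature maps and the single shared embedding are all hypothetical, and the paper's actual layer layout may differ.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def adain(features, embedding, P_scale, P_shift, eps=1e-5):
    """Adaptive instance normalization over per-filter activation maps.

    features[k] : flattened activations of filter k for one image
    P_scale     : hypothetical projection matrix, one row per filter,
                  mapping the embedding to a rescaling value
    P_shift     : same, mapping the embedding to a shift value
    """
    scales = matvec(P_scale, embedding)
    shifts = matvec(P_shift, embedding)
    out = []
    for fmap, scale, shift in zip(features, scales, shifts):
        mean = sum(fmap) / len(fmap)
        var = sum((v - mean) ** 2 for v in fmap) / len(fmap)
        std = math.sqrt(var + eps)
        out.append([scale * (v - mean) / std + shift for v in fmap])
    return out
```

Because the scales and shifts are functions of the embedding, a new target face changes the generator's behaviour without any of the projection matrices being retrained.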
[link]
#### Introduction

* Presents WikiQA, a publicly available set of question and sentence pairs for open-domain question answering.
* [Link to the paper](https://www.microsoft.com/enus/research/publication/wikiqaachallengedatasetforopendomainquestionanswering/)

#### Dataset

* 3047 questions sampled from Bing query logs.
* Each question is associated with a Wikipedia page.
* All sentences in the summary paragraph of the page become the candidate answers.
* Only about one third of the questions have a correct answer in the candidate answer set.
* Solutions crowdsourced through an MTurk-like platform.
* Answer sentences are associated with *answer phrases* (shortest substring of a sentence that answers the question), though this annotation is not used in the experiments reported by the paper.

#### Other Datasets

* [QASent dataset](http://homes.cs.washington.edu/~nasmith/papers/wang+smith+mitamura.emnlp07.pdf)
    * Uses questions from the TREC-QA dataset (questions from both query logs and human editors) and selects sentences which share at least one non-stopword with the question.
    * Lexical overlap makes the QA task easier.
    * Does not support evaluating *answer triggering* (detecting if the correct answer even exists in the candidate sentences).

#### Experiments

##### Baseline Systems

* **Word Count** - Counts the number of non-stopwords common to question and answer sentences.
* **Weighted Word Count** - Re-weights word counts by the IDF values of the question words.
* **[LCLR](https://www.microsoft.com/enus/research/publication/questionansweringusingenhancedlexicalsemanticmodels/)** - Uses rich lexical semantic features like WordNet and vector-space lexical semantic models.
* **Paragraph Vectors** - Considers cosine similarity between question vector and sentence vector.
* **Convolutional Neural Network (CNN)** - Bigram CNN model with average pooling.
* **PV-Cnt** and **CNN-Cnt** - Logistic regression classifier combining PV (and CNN) models and Word Count models.

##### Metrics

* MAP and MRR for the answer selection problem.
* Precision, recall and F1 scores for the answer triggering problem.

#### Observations

* CNN-Cnt outperforms all other models on both tasks.
* Three additional features, namely the length of the question (QLen), the length of the sentence (SLen), and the class of the question (QClass), are added to track question hardness and sentence comprehensiveness.
* Adding QLen improves performance significantly, while adding SLen (QClass) improves (degrades) performance marginally.
* For the same model, the performance on the WikiQA dataset is inferior to that on the QASent dataset.
* Note: The dataset is very small for training end-to-end networks.
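The Word Count baseline and the reciprocal rank behind MRR are simple enough to sketch directly. The stopword list below is a tiny illustrative one, whitespace tokenisation is an assumption, and counting distinct shared words (rather than every repeated token) is one possible reading of the baseline, not necessarily the paper's.

```python
# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "is", "in", "what", "who", "to"}

def word_count_score(question, sentence):
    """Number of distinct non-stopwords shared by question and sentence."""
    q_words = {w for w in question.lower().split() if w not in STOPWORDS}
    s_words = {w for w in sentence.lower().split() if w not in STOPWORDS}
    return len(q_words & s_words)

def reciprocal_rank(scores, labels):
    """1 / rank of the first correct candidate when sorted by score.

    labels[i] is 1 if candidate i answers the question, else 0.
    Returns 0.0 when no candidate is correct.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0
```

MRR is then the mean of `reciprocal_rank` over all questions that have at least one correct candidate.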
[link]
This paper describes how to find local interpretable model-agnostic explanations (LIME) for why a black-box model $m_B$ came to a classification decision for one sample $x$. The key idea is to evaluate many more samples around $x$ (local) and fit an interpretable model $m_I$ to them. The way of sampling and the kind of interpretable model depend on the problem domain.

For computer vision / image classification, the image $x$ is divided into superpixels. Single superpixels are made black and the new image $x'$ is evaluated as $p' = m_B(x')$. This is done multiple times.

The paper is also explained in [this YouTube video](https://www.youtube.com/watch?v=KP7JtFMLo4) by Marco Tulio Ribeiro.

A very similar idea is already in the [Zeiler & Fergus paper](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13#martinthoma).

## Follow-up Papers

* June 2016: [Model-Agnostic Interpretability of Machine Learning](https://arxiv.org/abs/1606.05386)
* November 2016:
    * [Nothing Else Matters: Model-Agnostic Explanations By Identifying Prediction Invariance](https://arxiv.org/abs/1611.05817)
    * [An unexpected unity among methods for interpreting model predictions](https://arxiv.org/abs/1611.07478)
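A heavily simplified sketch of the "perturb superpixels, query the black box, fit an interpretable model" loop is given below. It is not the reference LIME implementation: it enumerates every on/off combination of a handful of superpixels instead of sampling, and it omits the proximity weighting of samples. The `black_box` callable, which maps a mask (1 = superpixel kept, 0 = blacked out) to a class probability, is a hypothetical stand-in for $m_B$.

```python
from itertools import product

def local_linear_explanation(black_box, n_superpixels):
    """Fit an unweighted linear surrogate over all superpixel masks.

    Because every on/off combination is enumerated, the masks form a
    balanced design, and the least-squares weight of superpixel i equals
    mean(prediction | i kept) - mean(prediction | i blacked out).
    """
    masks = list(product([0, 1], repeat=n_superpixels))
    preds = [black_box(mask) for mask in masks]
    weights = []
    for i in range(n_superpixels):
        kept = [p for m, p in zip(masks, preds) if m[i] == 1]
        blacked = [p for m, p in zip(masks, preds) if m[i] == 0]
        weights.append(sum(kept) / len(kept) - sum(blacked) / len(blacked))
    return weights
```

Superpixels with large positive weights are the ones whose presence pushes the black-box prediction towards the class. Full enumeration is only feasible for a few superpixels, which is why sampling is used in practice.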
[link]
This reinforcement learning paper starts with the constraints imposed by an engineering problem (the need to scale up learning problems to operate across many GPUs) and ends up, as a result, needing to solve an algorithmic problem along with it. In order to massively scale up their training to be able to train multiple problem domains in a single model, the authors of this paper implemented a system whereby many “worker” nodes execute trajectories (series of actions, states, and rewards) and then send those trajectories back to a “learner” node, which calculates gradients and updates a central policy model. However, because these updates are queued up to be incorporated into the central learner, it can frequently happen that the policy that was used to collect the trajectories is a few steps behind the policy on the central learner to which its gradients will be applied (since other workers have updated the learner since this worker last got a policy download). This results in a need to modify the policy network model design accordingly.

IMPALA (Importance Weighted Actor-Learner Architecture) uses an “Actor Critic” model design, which means you learn both a policy function and a value function. The policy function’s job is to choose which actions to take at a given state, by making some higher probability than others. The value function’s job is to estimate the reward from a given state onward, if a certain policy p is followed. The value function is used to calculate the “advantage” of each action at a given state, by taking the reward you receive through action a (and the reward you expect in the future), and subtracting out the value function for that state, which represents the average future reward you’d get if you just sampled randomly from the policy from that point onward. The policy network is then updated to prioritize actions which are higher-advantage.
If you’re on-policy, you can calculate a value function without needing to explicitly calculate the probabilities of each action, because, by definition, if you take actions according to your policy probabilities, then you’re sampling each action with a weight proportional to its probability. However, if your actions are sampled off-policy, you need to correct for this, typically by calculating an “importance sampling” ratio that multiplies each action’s contribution by its probability under the desired policy divided by its probability under the policy used for sampling. This cancels out the implicit probability under the sampling policy, and leaves you with your actions scaled in proportion to their probability under the policy you’re actually updating.

IMPALA shares the basic structure of this solution, but with a few additional parameters to dynamically trade off between the bias and variance of the model. The first parameter, rho, controls how much bias you allow into your model, where bias here comes from your model not being fully corrected to “pretend” that you were sampling from the policy to which gradients are being applied. The trade-off is that, with full correction, if your policies are far apart you might downweight actions so aggressively that you don’t get a strong enough signal to learn quickly; if you limit the correction, however, the policy you learn might be statistically biased. Rho does this by weighting each value function update by:

https://i.imgur.com/4jKVhCe.png

where rho-bar is a hyperparameter. If rho-bar is high, then we allow stronger weighting effects, whereas if it’s low, we put a cap on those weights. The other parameter is c, and instead of weighting each value function update based on policy drift at that state, it weights each timestep based on how likely or unlikely the action taken at that timestep was under the true policy.
https://i.imgur.com/8wCcAoE.png

Timesteps that are much likelier under the true policy are upweighted, and, once again, we use a hyperparameter, c-bar, to put a cap on the amount of allowed upweighting. Where the prior parameter controlled how much bias there was in the policy we learn, this parameter helps control the variance: the higher c-bar, the higher the variance in the updates used to train the model, and the longer it’ll take to converge.
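The two capped weights can be put together in a short sketch of the value targets. The formulas in the linked images are not reproduced in the text, so this follows the V-trace recursion as I understand it from the IMPALA paper and should be read as an assumption-laden illustration: argument names are hypothetical, inputs are plain Python lists for one trajectory, and `bootstrap` is the value estimate for the state after the last step.

```python
def vtrace_targets(rewards, values, bootstrap, pi_probs, mu_probs,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Value targets with truncated importance weights.

    pi_probs[t] : probability of the taken action under the learner policy
    mu_probs[t] : probability of the taken action under the worker policy
    rho_bar caps the weight on each temporal-difference error,
    c_bar caps how strongly later corrections are propagated backwards.
    """
    n = len(rewards)
    next_values = values[1:] + [bootstrap]
    targets = [0.0] * n
    carry = 0.0  # target minus value estimate at the following timestep
    for t in reversed(range(n)):
        ratio = pi_probs[t] / mu_probs[t]
        rho = min(rho_bar, ratio)
        c = min(c_bar, ratio)
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        carry = delta + gamma * c * carry
        targets[t] = values[t] + carry
    return targets
```

When the two policies agree and both caps are 1, every ratio is 1 and the targets reduce to ordinary n-step returns, which is a handy sanity check.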
[link]
TLDR; The authors propose Progressive Neural Networks (ProgNN), a new way to do transfer learning without forgetting prior knowledge (as is done in fine-tuning). ProgNNs train a neural network on task 1, freeze the parameters, and then train a new network on task 2 while introducing lateral connections and adapter functions from network 1 to network 2. This process can be repeated with further columns (networks). The authors evaluate ProgNNs on 3 RL tasks and find that they outperform fine-tuning-based approaches.

#### Key Points

- Fine-tuning is a destructive process that forgets previous knowledge. We don't want that.
- Layer h_k in network 3 gets additional lateral connections from layers h_(k-1) in network 2 and network 1. Parameters of those connections are learned, but network 2 and network 1 are frozen during training of network 3.
- Downside: # of parameters grows quadratically with the number of tasks. The paper discusses some approaches to address the problem, but not sure how well these work in practice.
- Metric: AUC (average score per episode during training) as opposed to final score. Transfer score = relative performance compared with single net baseline.
- Authors use Average Perturbation Sensitivity (APS) and Average Fisher Sensitivity (AFS) to analyze which features/layers from previous networks are actually used in the newly trained network.
- Experiment 1: Variations of Pong game. Baseline that fine-tunes only the final layer fails to learn. ProgNN beats other baselines and APS shows reuse of knowledge.
- Experiment 2: Different Atari games. ProgNets result in positive transfer 8/12 times, negative transfer 2/12 times. Negative transfer may be a result of optimization problems. Fine-tuning final layers fails again. ProgNN beats other approaches.
- Experiment 3: Labyrinth, 3D Maze. Pretty much the same result as the other experiments.

#### Notes

- It seems like the assumption is that layer k always wants to transfer knowledge from layer (k-1). But why is that true? Networks are trained on different tasks, so the layer representations, or even the numbers of layers, may be completely different. And once you introduce lateral connections from all layers to all other layers, the approach no longer scales.
- Old tasks cannot learn from new tasks. Unlike humans.
- Gating or residuals for lateral connections could make sense to allow the network to "easily" reuse previously learned knowledge.
- Why use the AUC metric? I also would've liked to see the final score. Maybe there's a good reason for this, but the paper doesn't explain.
- Scary that fine-tuning only the final layer fails in most experiments. That's a very commonly used approach in non-RL domains.
- Someone should try this on non-RL tasks.
- What happens to training time and optimization difficulty as you add more columns? Seems prohibitively expensive.
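The lateral-connection idea from the key points can be sketched as a single layer of a new column that adds contributions from the previous layer of each frozen column. This is a bare-bones illustration: the adapter functions mentioned in the TLDR are left out, biases are omitted, and the names `W`, `U` and `progressive_layer` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, v) for v in vector]

def progressive_layer(W, h_prev_own, laterals):
    """Compute layer k of the newest column.

    W          : weights from this column's own layer k-1 (trainable)
    h_prev_own : this column's activations at layer k-1
    laterals   : list of (U, h_prev) pairs, one per frozen earlier column,
                 where U is a trainable lateral matrix and h_prev is that
                 column's (fixed) activation at layer k-1
    """
    pre = matvec(W, h_prev_own)
    for U, h_prev in laterals:
        pre = [a + b for a, b in zip(pre, matvec(U, h_prev))]
    return relu(pre)
```

Each new column needs one lateral matrix per earlier column and layer, which is where the quadratic growth in parameters comes from.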
[link]
#### Introduction

* The paper explores the domain of conditional image generation by adopting and improving the PixelCNN architecture.
* [Link to the paper](https://arxiv.org/abs/1606.05328)

#### Based on PixelRNN and PixelCNN

* Models the image pixel by pixel by decomposing the joint image distribution as a product of conditionals.
* PixelRNN uses two-dimensional LSTMs while PixelCNN uses convolutional networks.
* PixelRNN gives better results but PixelCNN is faster to train.

#### Gated PixelCNN

* PixelRNN outperforms PixelCNN due to its larger receptive field and because it contains multiplicative units (LSTM gates), which allow modelling more complex interactions.
* To account for these, deeper models and gated activation units (equation 2 in the [paper](https://arxiv.org/abs/1606.05328)) can be used respectively.
* Masked convolutions can lead to blind spots in the receptive fields.
* These can be removed by combining 2 convolutional network stacks:
    * Horizontal stack - conditions on the current row.
    * Vertical stack - conditions on all rows above the current row.
* Every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack.
* Residual connections are used in the horizontal stack and not in the vertical stack (as they did not seem to improve results in the initial settings).

#### Conditional PixelCNN

* Model the conditional distribution of the image, given a high-level description of the image, represented using the latent vector h (equation 4 in the [paper](https://arxiv.org/abs/1606.05328)).
* This conditioning does not depend on the location of the pixel in the image.
* To consider the location as well, map h to a spatial representation $s = m(h)$ (equation 5 in the [paper](https://arxiv.org/abs/1606.05328)).

#### PixelCNN Auto-Encoders

* Start with a traditional auto-encoder architecture, replace the deconvolutional decoder with a PixelCNN, and train the network end-to-end.

#### Experiments

* For unconditional modelling, Gated PixelCNN either outperforms PixelRNN or performs almost as well and takes much less time to train.
* In the case of conditioning on ImageNet classes, the log likelihood measure did not improve a lot but the visual quality of the generated samples was significantly improved.
* The paper also includes sample images generated by conditioning on human portraits and by training a PixelCNN auto-encoder on ImageNet patches.
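The gated activation unit multiplies a tanh "feature" branch by a sigmoid "gate" branch, and the conditional variant adds a term derived from h to both branches before the non-linearities. A minimal sketch, assuming the masked convolutions have already been applied so the function only sees pre-activation values (the function and argument names are hypothetical):

```python
import math

def gated_activation(feature_pre, gate_pre, cond_feature=None, cond_gate=None):
    """Element-wise tanh(f) * sigmoid(g) on lists of pre-activations.

    feature_pre, gate_pre   : outputs of the two convolution branches
    cond_feature, cond_gate : optional per-element conditioning terms
                              (projections of the latent vector h) that
                              are added before the non-linearities
    """
    n = len(feature_pre)
    cond_feature = cond_feature or [0.0] * n
    cond_gate = cond_gate or [0.0] * n
    out = []
    for f, g, cf, cg in zip(feature_pre, gate_pre, cond_feature, cond_gate):
        gate = 1.0 / (1.0 + math.exp(-(g + cg)))
        out.append(math.tanh(f + cf) * gate)
    return out
```

When the conditioning term is the same at every pixel, it acts like a class-dependent bias; using the spatial map $s = m(h)$ instead would supply a different term per location.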
[link]
Everyone has been thinking about how to apply GANs to discrete sequence data for the past year or so. This paper presents the model that I would guess most people thought of as the first-thing-to-try:

1. Build a recurrent generator model which samples from its softmax outputs at each timestep.
2. Pass sampled sequences to a recurrent discriminator model which distinguishes between sampled sequences and real-data sequences.
3. Train the discriminator under the standard GAN loss.
4. Train the generator with a REINFORCE (policy gradient) objective, where each trajectory is assigned a single episodic reward: the score assigned to the generated sequence by the discriminator.

Sounds hacky, right? We're learning a generator with a high-variance model-free reinforcement learning algorithm, in a very seriously non-stationary environment. (Here the "environment" is a discriminator being jointly learned with the generator.)

There's just one trick in this paper on top of that setup: for non-terminal states, the reward is defined as the *expectation* of the discriminator score after stochastically generating from that state forward. To restate using standard (somewhat sloppy) RL syntax, in different terms than the paper: (under stochastic sequential policy $\pi$, with current state $s_t$, trajectory $\tau_{1:T}$ and discriminator $D(\tau)$)

$$r_t = \mathbb E_{\tau_{t+1:T} \sim \pi(s_t)} \left[ D(\tau_{1:T}) \right]$$

The rewards are estimated via Monte Carlo, i.e., just take the mean of $N$ rollouts from each intermediate state. They claim this helps to reduce variance. That makes intuitive sense, but I don't see any results in the paper demonstrating the effect of varying $N$.

Yep, so it turns out that this sort of works.. with a big caveat:

## The big caveat

Graph from appendix:

![](https://www.dropbox.com/s/5fqh6my63sgv5y4/Bildschirmfoto%2020160927%20um%2021.34.44.png?raw=1)

SeqGANs don't work without supervised pretraining. Makes sense: with a cold start, the generator just samples a bunch of nonsense and the discriminator overfits. Both the generator and discriminator are pretrained on supervised data in this paper (see Algorithm 1).

I think it must be possible to overcome this with the proper training tricks and enough sweat. But it's probably more worth our time to address the fundamental problem here of developing better RL for structured prediction tasks.
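The Monte Carlo estimate of the intermediate reward $r_t$ can be sketched in a few lines. Everything here is a stand-in: `policy` is a hypothetical callable that samples the next token given the sequence so far, `discriminator` is a hypothetical callable scoring a complete sequence, and the random generator is passed in explicitly so rollouts are reproducible.

```python
import random

def rollout_reward(prefix, policy, discriminator, seq_len, n_rollouts, rng):
    """Estimate r_t as the mean discriminator score over N rollouts.

    prefix        : tokens generated so far (the current state s_t)
    policy        : callable (sequence, rng) -> next token
    discriminator : callable (full sequence) -> score
    """
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < seq_len:
            seq.append(policy(seq, rng))  # complete the sequence stochastically
        total += discriminator(seq)
    return total / n_rollouts

# Usage with toy stand-ins: a coin-flip policy and a score that counts ones.
toy_policy = lambda seq, rng: rng.choice([0, 1])
toy_score = lambda seq: sum(seq) / len(seq)
estimate = rollout_reward([1, 1], toy_policy, toy_score, 6, 16, random.Random(0))
```

For a complete sequence the loop body never samples, so the estimate collapses to the plain discriminator score, matching the single episodic reward in step 4.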
