ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Szegedy, Christian and Ioffe, Sergey and Vanhoucke, Vincent
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Open Review 7 years ago

This paper presents a combination of the inception architecture
with residual networks. This is done by adding a shortcut connection
to each inception module. This can alternatively be seen as a resnet where
the 2 conv layers are replaced by a (slightly modified) inception module.
The paper (claims to) provide results against the hypothesis that adding residual
connections improves training, rather increasing the model size is what makes the difference.

arxiv.org
scholar.google.com

Recurrent Batch Normalization
Cooijmans, Tim and Ballas, Nicolas and Laurent, César and Courville, Aaron
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 7 years ago

This paper describes how to apply the idea of batch normalization (BN) successfully to recurrent neural networks, specifically to LSTM networks. The technique involves the 3 following ideas:

**1) Careful initialization of the BN scaling parameter.** While standard practice is to initialize it to 1 (to have unit variance), they show that this situation creates problems with the gradient flow through time, which vanishes quickly. A value around 0.1 (used in the experiments) preserves gradient flow much better.

**2) Separate BN for the "hiddens to hiddens pre-activation and for the "inputs to hiddens" pre-activation.** In other words, 2 separate BN operators are applied on each contributions to the pre-activation, before summing and passing through the tanh and sigmoid non-linearities.

**3) Use of largest time-step BN statistics for longer test-time sequences.** Indeed, one issue with applying BN to RNNs is that if the input sequences have varying length, and if one uses per-time-step mean/variance statistics in the BN transformation (which is the natural thing to do), it hasn't been clear how do deal with the last time steps of longer sequences seen at test time, for which BN has no statistics from the training set. The paper shows evidence that the pre-activation statistics tend to gradually converge to stationary values over time steps, which supports the idea of simply using the training set's last time step statistics.

Among these ideas, I believe the most impactful idea is 1). The papers mentions towards the end that improper initialization of the BN scaling parameter probably explains previous failed attempts to apply BN to recurrent networks.

Experiments on 4 datasets confirms the method's success.

**My two cents**

This is an excellent development for LSTMs. BN has had an important impact on our success in training deep neural networks, and this approach might very well have a similar impact on the success of LSTMs in practice.

arxiv.org
scholar.google.com

The Limitations of Deep Learning in Adversarial Settings
Nicolas Papernot and Patrick McDaniel and Somesh Jha and Matt Fredrikson and Z. Berkay Celik and Ananthram Swami
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CR, cs.LG, cs.NE, stat.ML
more

[link] Summary by David Stutz 5 years ago

Papernot et al. Introduce a novel attack on deep networks based on so-called adversarial saliency maps that are computed independently of a loss. Specifically, they consider – for a given network $F(X)$ – the forward derivative

$\nabla F = \frac{\partial F}{\partial X} = \left[\frac{\partial F_j(X)}{\partial x_i}\right]_{i,j}$.

Essentially, this is the regular derivative of $F$ with respect to its input; Papernot et al. seem to refer to is as “forward” derivative as it stands in contrast with regular backpropagation where the derivative of the loss with respect to the parameters is considered. They define an adversarial saliency map by considering

$S(X, t)_i = \begin{cases}0 & \text{ if } \frac{\partial F_t(X)}{\partial X_i} < 0 \text{ or } \sum_{j\neq t} \frac{\partial F_j(X)}{\partial X_i} > 0\\ \left(\frac{\partial F_t(X)}{\partial X_i}\right) \left| \sum_{j \neq t} \frac{\partial F_j(X)}{\partial X_i}\right| & \text{ otherwise}\end{cases}$

where $t$ is the target class of the attack. The intuition of this definition is the following: The partial derivative of $F_t$ with respect to $X$ at location $i$ indicates how $X_i$ can be changed in order to increase $F_t$ (which is the goal). At the same time, $F_j$ for all $t \neq j$ is supposed to decrease for the targeted attack, this is implemented using the second (absolute) term. If, at a specific feature $X_i$, not increase of $X_i$ will lead to an increase of $F_t$, or an increase will also lead to an increase in the other $F_j$, the saliency map is zero – indicating that feature $i$ is useless. Note that here, only increases in $X_i$ are considered; Papernot et al. have a analogous formulation for considering decreases of $X_i$.
Based on the concept of adversarial saliency maps, a simple attack is implemented as illustrated in Algorithm 1. In particular, the feature $X_i$ for which the saliency map $S(X, t)$ is maximized is chosen and increased by a fixed amount until the network $F$ changes the label to $t$ or a maximum perturbation is reached (in which case the attack fails).

https://i.imgur.com/PvJv9yS.png
Algorithm 1: The proposed algorithm for generating adversarial examples, see text for details.

In experiments on MNIST they show the effectiveness of the proposed attack. Additionally, they attempt to quantify the robustness (called “hardness”) of specific classes. In particular, they show that some classes are harder to attack than others. To this end they derive the so-called adversarial distance

$A(X, t) = 1 - \frac{1}{M}\sum_i 1_{[S(X, t)_i > 0]}$

which counts the number of features in the adversarial saliency map that are greater than zero (i.e. can be perturbed during the attack in Algorithm 1). Personally, I find this “hardness” measure quite interesting because it is independent of a specific loss, but directly takes statistics of the learned model into account.

Also see this summary on [davidstutz.de](https://davidstutz.de/category/reading/).

arxiv.org
scholar.google.com

Gated Graph Sequence Neural Networks
Li, Yujia and Tarlow, Daniel and Brockschmidt, Marc and Zemel, Richard S.
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 8 years ago

This paper presents a feed-forward neural network architecture for processing graphs as inputs, inspired from previous work on Graph Neural Networks.

In brief, the architecture of the GG-NN corresponds to $T$ steps of GRU-like (gated recurrent units) updates, where T is a hyper-parameter. At each step, a vector representation is computed for all nodes in the graph, where a node's representation at step t is computed from the representation of nodes at step $t-1$. Specifically, the representation of a node will be updated based on the representation of its neighbors in the graph. Incoming and outgoing edges in the graph are treated differently by the neural network, by using different parameter matrices for each. Moreover, if edges have labels, separate parameters can be learned for the different types of edges (meaning that edge labels determine the configuration of parameter sharing in the model). Finally, GG-NNs can incorporate node-level attributes, by using them in the initialization (time step 0) of the nodes' representations.

GG-NNs can be used to perform a variety of tasks on graphs. The per-node representations can be used to make per-node predictions by feeding them to a neural network (shared across nodes). A graph-level predictor can also be obtained using a soft attention architecture, where per-node outputs are used as scores into a softmax in order to pool the representations across the graph, and feed this graph-level representation to a neural network. The attention mechanism can be conditioned on a "question" (e.g. on a task to predict the shortest path in a graph, the question would be the identity of the beginning and end nodes of the path to find), which is fed to the node scorer of the soft attention mechanism. Moreover, the authors describe how to chain GG-NNs to go beyond predicting individual labels and predict sequences.

Experiments on several datasets are presented. These include tasks where a single output is required (on a few bAbI tasks) as well as tasks where a sequential output is required, such as outputting the shortest path or the Eulerian circuit of a graph. Moreover, experiments on a much more complex and interesting program verification task are presented.

arxiv.org
arxiv-vanity.com
scholar.google.com

Intriguing properties of neural networks
Christian Szegedy and Wojciech Zaremba and Ilya Sutskever and Joan Bruna and Dumitru Erhan and Ian Goodfellow and Rob Fergus
arXiv e-Print archive - 2013 via Local arXiv
Keywords: cs.CV, cs.LG, cs.NE
more

[link] Summary by David Stutz 5 years ago

Szegedy et al. were (to the best of my knowledge) the first to describe the phenomen of adversarial examples as researched today. Specifically, they described the main objective in order to obtain adversarial examples as

$\arg\min_r \|r\|_2$ s.t. $f(x+r)=l$ and $x+r$ being a valid image

where $f$ is the neural network and $l$ the target class (i.e. targeted adversarial example). In the paper, they originally headlined the section by “blind spots in neural networks”. While they give some explanation and provide experiments, also introducing the notion of transferability of adversarial examples and an idea of adversarial examples used as regularization during training, many questions are left open. The given conclusion, that these adversarial examples are highly unlikely and that these examples lie dense within regular training examples are controversial in the literature.