ShortScience.org - Making Science Accessible!


Training Very Deep Networks
Srivastava, Rupesh Kumar and Greff, Klaus and Schmidhuber, Jürgen
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 7 years ago

TLDR; The authors propose "Highway Networks", which use gates (inspired by LSTMs) to determine how much of a layer's activations to transform and how much to pass through unchanged. Highway Networks can be used with any kind of activation function, including recurrent and convolutional units, and can be trained using plain SGD. The gating mechanism allows highway networks with tens or hundreds of layers to be trained efficiently. The authors show that highway networks with fewer parameters achieve results competitive with the state of the art on the MNIST and CIFAR tasks. Gate outputs vary significantly with the input examples, demonstrating that the network does not just learn a "fixed structure" but dynamically routes data depending on the specific example.

Datasets used: MNIST, CIFAR-10, CIFAR-100

#### Key Takeaways

- Apply LSTM-like gating to network layers, using a transform gate T and a carry gate C.
- The gating forces the layer inputs/outputs to be of the same size. We can use additional plain layers for dimensionality transformations.
- Bias weights of the transform gates should be initialized to negative values (-1, -2, -3, etc) to initially force the networks to pass through information and learn long-term dependencies.
- HWN does not learn a fixed structure (same gate outputs), but dynamic routing based on current input.
- In complex data sets each layer makes an important contribution, which is shown by lesioning (setting to pass-through) individual layers.
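
The gating described in the takeaways above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' code. It assumes the carry gate is tied to the transform gate as C = 1 - T, a tanh nonlinearity for the transform H, and small Gaussian weights. The names `highway_layer`, `init_layer`, and `gate_bias` are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, x):
    """Return W x + b for a list-of-rows matrix W."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def highway_layer(x, W_h, b_h, W_t, b_t):
    """Compute y = H(x) * T(x) + x * (1 - T(x)), assuming C = 1 - T."""
    h = [math.tanh(v) for v in affine(W_h, b_h, x)]  # candidate transform H(x)
    t = [sigmoid(v) for v in affine(W_t, b_t, x)]    # transform gate T(x)
    return [h_i * t_i + x_i * (1.0 - t_i) for h_i, t_i, x_i in zip(h, t, x)]


def init_layer(dim, gate_bias=-2.0, seed=0):
    """Square weights (input and output sizes must match) and a negative gate bias."""
    rng = random.Random(seed)

    def mat():
        return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    return mat(), [0.0] * dim, mat(), [gate_bias] * dim


x = [0.5, -1.0, 2.0]
y = highway_layer(x, *init_layer(len(x), gate_bias=-2.0))
```

With a negative `gate_bias` the transform gate starts mostly closed, so a freshly initialized layer passes most of its input through unchanged. The more negative the bias, the closer `y` is to `x`.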

#### Notes / Questions

- It seems the authors did not use dropout in their experiments. I wonder how the two play together. Is dropout less effective for highway networks because the gates already learn efficient paths?
- If we see that certain gate outputs have low variance across examples, can we "prune" the network into a fixed structure to make it more efficient (for production deployments)?

arxiv.org
arxiv-vanity.com
scholar.google.com

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Alec Radford and Luke Metz and Soumith Chintala
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.CV

[link] Summary by Shagun Sodhani 7 years ago

# Deep Convolutional Generative Adversarial Nets

## Introduction

* The paper presents Deep Convolutional Generative Adversarial Nets (DCGAN), an architecturally constrained variant of the GAN.
* [Link to the paper](https://arxiv.org/abs/1511.06434)

## Benefits

* Stable to train
* Very useful to learn unsupervised image representations.

## Model

* GANs are difficult to scale using CNNs.
* The paper proposes the following changes to GANs:
    * Replace any pooling layers with strided convolutions (for the discriminator) and fractional-strided convolutions (for the generator).
    * Remove fully connected hidden layers.
    * Use batch normalisation in both the generator (all layers except the output layer) and the discriminator (all layers except the input layer).
    * Use LeakyReLU in all layers of the discriminator.
    * Use ReLU activation in all layers of the generator (except the output layer, which uses Tanh).
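
To see how strided and fractional-strided convolutions take over the resizing work of pooling layers, the spatial sizes can be traced with the standard output-size formulas. The sketch below is illustrative only. The kernel size 4, stride 2, padding 1, the 4x4 starting grid, and the 64x64 image size are assumptions chosen so that each layer exactly doubles or halves the resolution; they are not taken from the summary.

```python
def conv_out(size, kernel, stride, pad):
    """Output size of a strided convolution (no dilation)."""
    return (size + 2 * pad - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, pad):
    """Output size of a fractional-strided (transposed) convolution."""
    return (size - 1) * stride - 2 * pad + kernel


# Assumed settings, for illustration only.
KERNEL, STRIDE, PAD = 4, 2, 1

# Generator: upsample a 4x4 grid with four fractional-strided convolutions.
generator_sizes = [4]
for _ in range(4):
    generator_sizes.append(
        conv_transpose_out(generator_sizes[-1], KERNEL, STRIDE, PAD))

# Discriminator: downsample a 64x64 image with four strided convolutions.
discriminator_sizes = [64]
for _ in range(4):
    discriminator_sizes.append(
        conv_out(discriminator_sizes[-1], KERNEL, STRIDE, PAD))

print(generator_sizes)      # [4, 8, 16, 32, 64]
print(discriminator_sizes)  # [64, 32, 16, 8, 4]
```

Under these assumptions the generator and discriminator mirror each other, and no pooling or unpooling layer is needed at any stage.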

## Datasets

* Large-Scale Scene Understanding.
* Imagenet-1K.
* Faces dataset.

## Hyperparameters

* Minibatch SGD with minibatch size of 128.
* Weights initialized from a zero-centered Normal distribution with standard deviation 0.02.
* Adam Optimizer
* Slope of leak = 0.2 for LeakyReLU.
* Learning rate = 0.0002, β1 = 0.5
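
As a rough illustration of what these settings mean in practice, the sketch below writes them out as constants alongside a scalar LeakyReLU, the weight initializer, and a single Adam update. It is a pure-Python toy rather than a training loop. The values `beta2=0.999` and `eps=1e-8` are common Adam defaults assumed here, not values from the summary, and all function names are hypothetical.

```python
import math
import random

BATCH_SIZE = 128
INIT_STD = 0.02
LEAK = 0.2
LEARNING_RATE = 0.0002
BETA1 = 0.5


def leaky_relu(v, slope=LEAK):
    """Identity for positive inputs, a small linear slope for negative ones."""
    return v if v > 0 else slope * v


def init_weights(n, std=INIT_STD, seed=0):
    """Draw n weights from a zero-centered Normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]


def adam_step(param, grad, m, v, t, lr=LEARNING_RATE, beta1=BETA1,
              beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

For example, `leaky_relu(-1.0)` returns `-0.2`, and the very first Adam step moves a parameter by roughly the learning rate regardless of the gradient's magnitude.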

## Observations

* Large-Scale Scene Understanding data
    * Demonstrates that the model scales with more data and higher-resolution generation.
    * The model is unlikely to have memorized training images, given the low learning rate and minibatch SGD.
* Classifying CIFAR-10 dataset
    * Features
        * Train on Imagenet-1K and test on CIFAR-10.
        * Max pool the discriminator's convolutional features (from all layers) to get 4x4 spatial grids.
        * Flatten and concatenate to get a 28672-dimensional vector.
    * Linear L2-SVM classifier trained over the feature vector.
    * 82.8% accuracy, outperforms K-means (80.6%).
* Street View House Number Classifier
    * Similar pipeline as CIFAR-10.
    * 22.48% test error.
* The paper contains many examples of images generated by final and intermediate layers of the network.
* Interpolating in the latent space does not show sharp transitions, indicating that the network did not memorize images.
* DCGAN can learn an interesting hierarchy of features.
* The network seems to have some success in disentangling scene representation from object representation.
* Vector arithmetic can be performed on the Z vectors corresponding to the face samples to get results like `smiling woman - normal woman + normal man = smiling man` visually.
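
The vector arithmetic in the last observation can be made concrete with a toy example. The sketch below assumes each concept is represented by the average of a few exemplar Z vectors and uses made-up 2-dimensional vectors in place of real latent codes. In practice the resulting vector would be fed to the trained generator, which is not modelled here. All names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_arithmetic(smiling_women, neutral_women, neutral_men):
    """Return mean(smiling woman) - mean(neutral woman) + mean(neutral man)."""
    a = mean_vector(smiling_women)
    b = mean_vector(neutral_women)
    c = mean_vector(neutral_men)
    return [a_i - b_i + c_i for a_i, b_i, c_i in zip(a, b, c)]


# Toy latent space: dimension 0 encodes gender, dimension 1 encodes smiling.
smiling_women = [[1.0, 1.0], [1.0, 0.8], [1.0, 1.2]]
neutral_women = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
neutral_men = [[-1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]

z_smiling_man = concept_arithmetic(smiling_women, neutral_women, neutral_men)
```

In this toy space the result lands near `[-1.0, 1.0]`, the point that combines the "man" direction with the "smiling" direction.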


On orthogonality and learning recurrent networks with long term dependencies
Eugene Vorontsov and Chiheb Trabelsi and Samuel Kadoury and Chris Pal
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, cs.NE

[link] Summary by Joseph Paul Cohen 6 years ago

Here is a video overview: https://www.youtube.com/watch?v=t-fow6GJepQ

Here is an image of the poster: https://i.imgur.com/Ti9btj9.png


Certifying Some Distributional Robustness with Principled Adversarial Training
Aman Sinha and Hongseok Namkoong and John Duchi
arXiv e-Print archive - 2017 via Local arXiv
Keywords: stat.ML, cs.LG

[link] Summary by David Stutz 5 years ago

Sinha et al. introduce a variant of adversarial training based on distributionally robust optimization. I strongly recommend reading the paper to understand the theoretical framework it introduces. The authors also provide guarantees on the obtained adversarial loss and show experimentally that this guarantee is a realistic indicator. The adversarial training variant itself follows the general strategy of training on adversarially perturbed training samples in a min-max framework. In each iteration, an attacker crafts adversarial examples on which the network is trained. In a nutshell, their approach differs from previous ones (apart from the theoretical framework) in the attacker used. Specifically, their attacker optimizes

$\arg\max_z l(\theta, z) - \gamma \|z - z^t\|_p^2$

where $z^t$ is a training sample chosen randomly during training.
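
A toy instance may help make the attacker's objective concrete. The sketch below is not the authors' implementation. It assumes a linear model with squared loss $l(\theta, z) = (\theta^\top x - y)^2$, perturbs only the input $x$ of $z = (x, y)$, uses $p = 2$, and picks $\gamma$ larger than $\|\theta\|_2^2$ so that the objective is concave in $x$ and plain gradient ascent converges. The function name `attack` and all numeric values are hypothetical.

```python
def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def attack(theta, x0, y, gamma, lr=0.05, steps=100):
    """Gradient ascent on (theta . x - y)^2 - gamma * ||x - x0||_2^2 over x."""
    x = list(x0)
    for _ in range(steps):
        residual = dot(theta, x) - y
        # Gradient of the loss term minus gradient of the distance penalty.
        grad = [2.0 * residual * t_i - 2.0 * gamma * (x_i - x0_i)
                for t_i, x_i, x0_i in zip(theta, x, x0)]
        x = [x_i + lr * g_i for x_i, g_i in zip(x, grad)]
    return x


theta = [1.0, 2.0]   # fixed model parameters
x0, y = [1.0, 1.0], 1.0  # training sample z^t = (x0, y)
x_adv = attack(theta, x0, y, gamma=10.0)
```

For this quadratic toy problem the maximizer has the closed form $x^\ast = x_0 + \frac{\theta^\top x_0 - y}{\gamma - \|\theta\|_2^2}\,\theta$, which here is $(1.4, 1.8)$. The penalty keeps the adversarial point close to the training sample, and a larger $\gamma$ shrinks the perturbation.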

On a side note, I also recommend reading the reviews of this paper: https://openreview.net/forum?id=Hk6kPgZA-

Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/).


Towards Reverse-Engineering Black-Box Neural Networks
Seong Joon Oh and Max Augustin and Bernt Schiele and Mario Fritz
arXiv e-Print archive - 2017 via Local arXiv
Keywords: stat.ML, cs.CR, cs.CV, cs.LG

[link] Summary by David Stutz 5 years ago

Oh et al. propose two different approaches for whitening black-box neural networks, i.e. predicting details of their internals such as architecture or training procedure. In particular, they consider attributes regarding architecture (activation function, dropout, max pooling, kernel size of convolutional layers, number of convolutional/fully connected layers, etc.), attributes concerning optimization (batch size and optimization algorithm), and attributes regarding the data (data split and size). To create a dataset of models, they trained roughly 11k models on MNIST; they ensured that these models have at least 98% accuracy on the validation set, and they also consider ensembles.

For predicting model attributes, they propose two models, called kennen-o and kennen-i, see Figure 1. Kennen-o takes as input a set of $100$ predictions of the black-box model (i.e. its final probability distributions) and tries to directly learn the attributes using an MLP with two fully connected layers. Kennen-i instead crafts a single input that allows reasoning about a specific model attribute. An example for kennen-i is shown in Figure 2. In experiments, they demonstrate that both models are able to predict model attributes significantly better than chance. For details, I refer to the paper.

https://i.imgur.com/YbFuniu.png
Figure 1: Illustration of the two proposed approaches, kennen-o (top) and kennen-i (bottom).

https://i.imgur.com/ZXj22zG.png
Figure 2: Illustration of the images created by kennen-i to classify different attributes. See the paper for details.
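
The data flow of kennen-o described above can be sketched as follows. This is only a shape-level illustration with untrained random weights. The black box is a hypothetical stand-in (a random linear softmax classifier on 8-dimensional queries rather than a trained MNIST model), the ReLU hidden layer and its size are assumptions, and only a single attribute with four possible values is predicted.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def random_matrix(rows, cols, rng, std=0.1):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def make_toy_black_box(in_dim, n_classes, rng):
    """Hypothetical stand-in for a trained classifier: query -> class probabilities."""
    W = random_matrix(n_classes, in_dim, rng)
    return lambda query: softmax(affine(W, [0.0] * n_classes, query))


def kennen_o_features(black_box, queries):
    """Concatenate the black box's output distributions over the fixed queries."""
    features = []
    for query in queries:
        features.extend(black_box(query))
    return features


def predict_attribute(features, W1, b1, W2, b2):
    """Two fully connected layers mapping query outputs to attribute probabilities."""
    hidden = [max(0.0, v) for v in affine(W1, b1, features)]
    return softmax(affine(W2, b2, hidden))


rng = random.Random(0)
N_QUERIES, IN_DIM, N_CLASSES, HIDDEN, N_VALUES = 100, 8, 10, 16, 4
queries = [[rng.uniform(0.0, 1.0) for _ in range(IN_DIM)] for _ in range(N_QUERIES)]
black_box = make_toy_black_box(IN_DIM, N_CLASSES, rng)

features = kennen_o_features(black_box, queries)  # 100 queries x 10 classes
W1, b1 = random_matrix(HIDDEN, len(features), rng), [0.0] * HIDDEN
W2, b2 = random_matrix(N_VALUES, HIDDEN, rng), [0.0] * N_VALUES
attribute_probs = predict_attribute(features, W1, b1, W2, b2)
```

In the actual setting the MLP weights would be trained on the outputs of many models with known attributes. Here they are random, so `attribute_probs` only demonstrates the input and output shapes.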

Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/).