[link]
TLDR; The authors propose a Contextual LSTM (CLSTM) model that appends a context vector to the input words when making predictions. They evaluate the model on Language Modeling, next-sentence selection, and next-topic prediction tasks, beating standard LSTM baselines.

#### Key Points

- The topic vector comes from an internal classifier system, i.e. it is supervised data. Topics could also be estimated using unsupervised techniques.
- The topic can be computed from the previous words of the current sentence (SentSegTopic), all words of the previous sentence (PrevSegTopic), or the current paragraph (ParaSegTopic). The best CLSTM uses all of them.
- English Wikipedia dataset: 1400M words train, 177M words validation, 178M words test. 129k vocabulary.
- When the current segment topic is present, the topic of the previous sentence doesn't matter.
- The authors couldn't compare to other models that incorporate topics because those don't scale to large-scale datasets.
- LSTMs form a long chain and the authors don't reset the hidden state at sentence boundaries. So a sentence has implicit access to the previous sentence's information, but explicitly modeling the topic still makes a difference.

#### Notes/Thoughts

- Increasing the number of hidden units seems to have a *much* larger impact on performance (perplexity) than adding the topic information. The simple word-based LSTM with more hidden units significantly outperforms the more complex CLSTM. This makes me question the practical usefulness of this model.
- IMO the comparisons are somewhat unfair, because by using an external classifier to obtain topic labels you are bringing in external data that the baseline models didn't have access to.
- What about using other unsupervised sentence embeddings as context vectors, e.g. seq2seq autoencoders or paragraph vectors (PV)?
- If the LSTM were perfect at modeling long-range dependencies we wouldn't need to feed extra topic vectors. What about residual connections?
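To make the core idea concrete (appending a context vector to each input word), here is a minimal PyTorch sketch. The class name, dimensions, and the simplification of using a single topic vector per sequence are my own assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ContextualLSTM(nn.Module):
    """Minimal sketch of the CLSTM idea: the topic/context vector is
    concatenated to every word embedding before it enters the LSTM."""

    def __init__(self, vocab_size, embed_dim, topic_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + topic_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, topic_vec):
        # word_ids: (batch, seq_len); topic_vec: (batch, topic_dim)
        emb = self.embed(word_ids)                          # (batch, seq_len, embed_dim)
        topics = topic_vec.unsqueeze(1).expand(-1, emb.size(1), -1)
        h, _ = self.lstm(torch.cat([emb, topics], dim=-1))  # append context at every step
        return self.out(h)                                  # next-word logits
```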
[link]
## **Keywords**

Progressive GAN, high-resolution generator

---

## **Summary**

1. **Introduction**
    1. **Goal of the paper**
        1. Generation of very high quality images by progressively increasing the size of the generator and discriminator.
        1. Improved training and stability of GANs.
        1. A new metric for evaluating GAN results.
        1. A high-quality version of the CELEBA-HQ dataset.
    1. **Previous Research**
        1. Generative methods help to produce new samples from higher-dimensional data distributions such as images.
        1. The common approaches for generative methods are:
            1. Autoregressive models: produce sharp images but are slow to evaluate, e.g. PixelCNN.
            1. Variational autoencoders: easy to train but produce blurry images.
            1. Generative adversarial networks: produce sharp images at small resolutions but are highly unstable.
1. **Method**
    1. **Basic GAN architecture**
        1. A GAN consists of two major parts:
            1. _Generator_: creates a sample image from a latent code that looks very close to the training images.
            1. _Discriminator_: trained to assess how close the sample image looks to the training images.
        1. To measure the overlap between the training and the generated distributions, several measures are used, such as the Jensen-Shannon divergence, the least-squares divergence, and the Wasserstein distance.
        1. Generating at larger resolutions causes problems because it becomes easier to tell the generated images apart from the training images, amplifying the gradient problem. Larger resolutions also require more memory, which can cause problems.
        1. A mechanism is also proposed to stop the generator from participating in the escalation that causes the mode-collapse problem.
    1. **Progressive growing of GANs**
        1. The primary method for the GAN training is to start from a low-resolution image and add extra layers at each step of the training process.
        1. Lower-resolution images are more stable to train on as they contain less class information; as the resolution increases, progressively finer details and features are added to the image.
        1. This leads to a smooth increase in image quality instead of the network having to learn a lot of detail in one single step.
    1. **Mini-batch separation**
        1. GANs tend to capture only a very small set of features from the image.
        1. "Minibatch discrimination" is used to generate a feature vector for each individual image along with one for the whole mini-batch of images.
1. **Conclusion**
    1. Higher-resolution images can be generated robustly and efficiently.
    1. The quality of the generated images is improved.
    1. Training time is reduced for comparable output quality and resolution.

---

## **Notes**

* Gradient problem: at higher resolutions it becomes easier to tell the difference between the training and the generated images [1], which amplifies the gradients passed to the generator. This is referred to as the gradient problem.
* Mode collapse: the generator is incapable of creating a large variety of samples and gets stuck.

## **Open research questions**

1. Improved methods for generating truly photorealistic images.
1. Improved semantic sensibility and improved understanding of the dataset.

## **References**

1. [https://blog.acolyer.org/2018/05/10/progressive-growing-of-gans-for-improved-quality-stability-and-variation/](https://blog.acolyer.org/2018/05/10/progressive-growing-of-gans-for-improved-quality-stability-and-variation/)
1. [https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b](https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b)
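As a rough illustration of the progressive-growing step, here is a minimal sketch of the fade-in used when a new, higher-resolution block is added. Names are my own; the actual implementation also involves equalized learning rates, pixelwise normalization, and a mirrored fade-in on the discriminator side.

```python
import torch
import torch.nn.functional as F

def grow_generator_output(low_res_rgb, new_block_rgb, alpha):
    """Blend the upsampled output of the previous (stable) resolution with
    the output of the newly added block:
        out = (1 - alpha) * upsample(old) + alpha * new
    alpha is ramped linearly from 0 to 1 during the transition phase, so the
    new layers are introduced smoothly instead of in one disruptive step.
    """
    upsampled = F.interpolate(low_res_rgb, scale_factor=2, mode="nearest")
    return (1.0 - alpha) * upsampled + alpha * new_block_rgb
```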
[link]
This method is based on improving the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}

1. Where R-CNN had two different objective functions, Fast R-CNN combines the localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015} called the RoI pooling layer, which converts the features inside any region of interest into a fixed-size feature map, so that input images don't have to be rescaled to a fixed size before being fed to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop through the RoI pooling layer routes each output gradient back to the input position that was the argmax of its pooling sub-window.

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}
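A minimal NumPy sketch of the quoted RoI max pooling operation, under my own naming and a simplified feature-map layout; it is illustrative, not the paper's implementation.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """feature_map: (H_feat, W_feat, C) conv feature map.
    roi: (y0, x0, y1, x1) region of interest in feature-map coordinates.
    Returns a fixed-size (H, W, C) output regardless of the RoI's size."""
    H, W = output_size
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    out = np.zeros((H, W, feature_map.shape[2]), dtype=feature_map.dtype)
    for i in range(H):
        for j in range(W):
            # sub-window boundaries of approximate size h/H x w/W
            ys, ye = y0 + (i * h) // H, y0 + ((i + 1) * h + H - 1) // H
            xs, xe = x0 + (j * w) // W, x0 + ((j + 1) * w + W - 1) // W
            out[i, j] = feature_map[ys:ye, xs:xe].max(axis=(0, 1))
    return out
```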
[link]
The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach to image generation produces images that can't be distinguished from the training data.

#### What is DRAW:

The Deep Recurrent Attentive Writer (DRAW) model has two differences with respect to other variational auto-encoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region modified by the decoder.

#### What do we gain?

The resulting images are greatly improved by allowing conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the "Where to look?" problem.

#### What follows?

A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder, although this might be less useful since we are already restricting the input of the network.

#### Like:

* As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way.
* The attention model is fully differentiable.

#### Dislike:

* I think a better exposition of the attention mechanism would improve this paper.
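A minimal PyTorch sketch of the sequential (decoder-side) generation loop: the image is built up on a canvas over several steps. The layer names and sizes are illustrative, and the attention-parameterized write operation is replaced by a plain linear "write" for brevity; the real model also has an encoder RNN and a read operation.

```python
import torch
import torch.nn as nn

T, batch, z_dim, dec_dim, img_dim = 10, 8, 20, 256, 28 * 28
decoder = nn.LSTMCell(z_dim, dec_dim)
write = nn.Linear(dec_dim, img_dim)   # attention-less "write" for simplicity

canvas = torch.zeros(batch, img_dim)
h_dec = torch.zeros(batch, dec_dim)
c_dec = torch.zeros(batch, dec_dim)
for t in range(T):
    z_t = torch.randn(batch, z_dim)            # sample a latent at each step
    h_dec, c_dec = decoder(z_t, (h_dec, c_dec))
    canvas = canvas + write(h_dec)             # iteratively refine the canvas
image = torch.sigmoid(canvas)                  # final generated image
```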
[link]
disclaimer: I'm the first author of the paper

## TL;DR

We have made a lot of progress on catastrophic forgetting within the standard evaluation protocol, i.e. sequentially learning a stream of tasks and testing our models' capacity to remember them all. We think it's time for a new approach to Continual Learning (CL), coined OSAKA, which is more aligned with real-life applications of CL. It brings CL closer to Online Learning and Open-World Learning.

The main modifications we propose:

- bring CL closer to Online Learning, i.e. at test time the model is continually learning and evaluated on its online predictions
- it's fine to forget, as long as you can quickly remember (just like we humans do)
- we allow pre-training (because you wouldn't deploy an untrained CL system, right?), but at test time the model will have to quickly learn new out-of-distribution (OoD) tasks (because the world is full of surprises)
- the task distribution is actually a hidden Markov chain. This implies:
    - new and old tasks can re-occur (just like in real life). Better remember them quickly if you want a good total performance!
    - tasks have different lengths
    - task boundaries are unknown (task-agnostic setting)

### Bonus:

We provide a unifying framework explaining the space of machine learning settings {supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning} in case it was starting to get confusing :p

## Motivation

We imagine an agent, embedded or not, first pre-trained in a controlled environment and later deployed in the real world, where it faces new or unexpected situations. This scenario is relevant for many applications. For instance, in robotics, the agent is pre-trained in a factory and deployed in homes or in manufacturing plants, where it will need to adapt to new domains and maybe solve new tasks. Likewise, a virtual assistant can be pre-trained on static datasets and deployed in a user's life to fit its personal needs. Further motivations can be found in time series forecasting, e.g., market prediction, game playing, autonomous customer service, recommendation systems, and autonomous driving, to name a few.

In this scenario, we are interested in the cumulative performance of the agent throughout its lifetime, whereas standard CL reports the agent's final performance on all tasks at the end of its life. In order to succeed in this scenario, agents need the ability to learn new tasks as well as to quickly remember old ones.

## Unifying Framework

We propose a unifying framework explaining the space of machine learning settings {supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning} with meta-learning terminology.

https://i.imgur.com/U16kHXk.png

(easier to digest with the accompanying text)

## OSAKA

The main features of the evaluation framework are:

- task agnosticism
- pre-training is allowed, but OoD tasks appear at test time
- task revisiting
- controllable non-stationarity
- online evaluation

(see the paper for the motivations behind these features)

## Continual-MAML: an initial baseline

A simple extension of MAML that is better suited than previous methods to the proposed setting.

https://i.imgur.com/C86WUc8.png

Its features are:

- fast adaptation
- dynamic representation
- task boundary detection
- computational efficiency

## Experiments

We provide a suite of 3 benchmarks to test algorithms in the new setting. The first uses the Omniglot, MNIST, and FashionMNIST datasets. The second and third use the Synbols (Lacoste et al. 2018) and TieredImageNet datasets, respectively.

The first set of experiments shows that the baseline outperforms previous approaches, i.e., supervised learning, meta learning, continual learning, meta-continual learning, and continual-meta learning, in the new setting.

https://i.imgur.com/IQ1WYTp.png

The second and third experiments lead us to similar conclusions.

code: https://github.com/ElementAI/osaka
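A very rough sketch of what an online loop in the spirit of Continual-MAML could look like. This is my own simplification, not the authors' code: a slow copy of the weights plays the role of the meta-learned initialization, a fast copy adapts online, and a loss spike acts as a cheap task-boundary detector; the consolidation step shown here is an assumption.

```python
import copy
import torch

def continual_maml_loop(model, stream, loss_fn, inner_lr=0.1, boundary_threshold=1.0):
    phi = copy.deepcopy(model)                       # slow / consolidated weights
    theta = copy.deepcopy(phi)                       # fast weights, adapted online
    opt = torch.optim.SGD(theta.parameters(), lr=inner_lr)
    prev_loss = None

    for x, y in stream:                              # online, task-agnostic stream
        loss = loss_fn(theta(x), y)                  # evaluated on the online prediction
        if prev_loss is not None and loss.item() - prev_loss > boundary_threshold:
            # suspected task boundary: consolidate theta into phi, restart from phi
            for p_phi, p_theta in zip(phi.parameters(), theta.parameters()):
                p_phi.data.lerp_(p_theta.data, 0.5)  # crude consolidation (assumption)
            theta.load_state_dict(phi.state_dict())
            opt = torch.optim.SGD(theta.parameters(), lr=inner_lr)
            prev_loss = None                         # skip the update on the boundary step
            continue
        opt.zero_grad()
        loss.backward()
        opt.step()
        prev_loss = loss.item()
    return phi, theta
```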
[link]
Zhao et al. propose a generative adversarial network (GAN) based approach to generate meaningful and natural adversarial examples for images and text. With natural adversarial examples, the authors refer to meaningful changes in the image content instead of adding seemingly random/adversarial noise – as illustrated in Figure 1. These natural adversarial examples can be crafted by first learning a generative model of the data, e.g., using a GAN together with an inverter (similar to an encoder), see Figure 2. Then, given an image $x$ and its latent code $z$, adversarial examples $\tilde{z} = z + \delta$ can be found within the latent space. The hope is that these adversarial examples will correspond to meaningful, naturally looking adversarial examples in the image space.

https://i.imgur.com/XBhHJuY.png

Figure 1: Illustration of natural adversarial examples in comparison to regular, FGSM adversarial examples.

https://i.imgur.com/HT2StGI.png

Figure 2: Generative model (GAN) together with the required inverter.

In practice, e.g., on MNIST, any black-box classifier can be attacked by randomly sampling possible perturbations $\delta$ in the latent space (with increasing norm) until an adversarial perturbation is found. Here, the inverter from Figure 2 is trained on top of the critic of the GAN (although specific details are missing in the paper).

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
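A minimal sketch of the random latent-space search described above. The interface (`classifier`, `generator`, `inverter` as plain callables) and the hyperparameters are my own assumptions, not the authors' code.

```python
import numpy as np

def natural_adversarial_search(x, classifier, generator, inverter,
                               n_samples=100, delta_step=0.01, max_radius=1.0):
    """Sample perturbations of increasing norm in latent space until the
    black-box classifier's label flips; return the decoded natural example."""
    z = inverter(x)
    original_label = classifier(x)
    radius = delta_step
    while radius <= max_radius:
        for _ in range(n_samples):
            delta = np.random.randn(*z.shape)
            delta *= radius / np.linalg.norm(delta)   # sample on a sphere of given norm
            x_tilde = generator(z + delta)
            if classifier(x_tilde) != original_label:
                return x_tilde                        # natural adversarial example
        radius += delta_step                          # grow the search radius
    return None
```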
[link]
This paper models object detection as a regression problem for bounding boxes and object class probabilities with a single pass through the CNN. The main contribution is the idea of dividing the image into a 7x7 grid and having each cell predict a distribution over class labels as well as a bounding box for the object whose center falls into it. It's much faster than R-CNN and Fast R-CNN, as the additional step of extracting region proposals has been removed.

## Strengths

- Works in real time. The base model runs at 45fps and a faster version goes up to 150fps, and they claim that it's more than twice as fast as other works on real-time detection.
- End-to-end model; localization and classification errors can be jointly optimized.
- YOLO makes more localization errors and fewer background mistakes than Fast R-CNN, so using YOLO to eliminate false background detections from Fast R-CNN results in a ~3% mAP gain (without adding much computation time, since YOLO is much faster than Fast R-CNN).

## Weaknesses / Notes

- Results fall short of state-of-the-art: 57.9% vs. 70.4% mAP (Faster R-CNN).
- Performs worse at detecting small objects, as at most one object per grid cell can be detected.
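A small sketch of the grid parameterization, using the values from the paper (S=7 grid, B=2 boxes per cell, C=20 classes on PASCAL VOC); the helper function name is my own.

```python
# The network's final layer predicts a tensor of shape (S, S, B * 5 + C):
# for each cell, B boxes (x, y, w, h, confidence) plus C class probabilities.
S, B, C = 7, 2, 20
output_shape = (S, S, B * 5 + C)   # (7, 7, 30)

def responsible_cell(x_center, y_center, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell responsible for an object,
    i.e. the cell containing the object's center."""
    col = min(int(x_center / img_w * S), S - 1)
    row = min(int(y_center / img_h * S), S - 1)
    return row, col
```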
[link]
This paper explores the use of convolutional (PixelCNN) and recurrent units (PixelRNN) for modeling the distribution of images, in the framework of autoregressive distribution estimation. In this framework, the input distribution $p(x)$ is factorized into a product of conditionals $\prod_i p(x_i | x_{<i})$, where each pixel is conditioned on all previous pixels. Previous work has shown that very good models can be obtained by using a neural network parametrization of the conditionals (e.g. see our work on NADE \cite{journals/jmlr/LarochelleM11}). Moreover, unlike other approaches based on latent stochastic units that are directed or undirected, the autoregressive approach is able to compute log-probabilities tractably. So in this paper, by considering the specific case of $x$ being an image, they exploit the topology of pixels and investigate appropriate architectures for this.

Among the paper's contributions are:

1. They propose Diagonal BiLSTM units for the PixelRNN, which are efficient (thanks to the use of convolutions) while making it possible to, in effect, condition a pixel's distribution on all the pixels above it (see Figure 2 for an illustration).
2. They demonstrate that residual connections (a form of skip connections, from hidden layer $i-1$ to layer $i+1$) are very effective at learning very deep distribution estimators (they go as deep as 12 layers).
3. They show that it is possible to successfully model the distribution over pixel intensities (effectively an integer between 0 and 255) using a softmax of 256 units.
4. They propose a multi-scale extension of their model, which they apply to larger 64x64 images.

The experiments show that the PixelRNN model based on Diagonal BiLSTM units achieves state-of-the-art performance on the binarized MNIST benchmark, in terms of log-likelihood. They also report excellent log-likelihood on the CIFAR-10 dataset, comparing to previous work based on real-valued density models. Finally, they show that their model is able to generate high quality image samples.
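A minimal sketch of the masked convolution that enforces the autoregressive ordering in the PixelCNN variant, simplified to ignore the paper's special handling of the RGB channel ordering. Mask type 'A' (first layer) hides the center pixel so a pixel never sees itself; type 'B' (later layers) allows it; both hide everything below and to the right.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is masked so each output only depends on pixels
    above and to the left of (and, for type 'B', including) the current one."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kH, kW = self.weight.shape
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == 'B'):] = 0  # right of (and maybe incl.) center
        mask[kH // 2 + 1:, :] = 0                          # all rows below the center
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask                      # re-apply the mask before each conv
        return super().forward(x)

# usage sketch: first layer uses mask 'A', subsequent layers use mask 'B'
layer = MaskedConv2d('A', in_channels=1, out_channels=16, kernel_size=7, padding=3)
```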
[link]
How can we learn causal relationships that explain data? We can learn from non-stationary distributions. If we experiment with different factorizations of the relationships between variables, we can observe which ones provide better sample complexity when adapting to distributional shift and are therefore likely to be causal.

If we consider the variables $A$ and $B$, we can factor them in two ways:

$P(A,B) = P(A)P(B|A)$, representing a causal graph like $A \rightarrow B$

$P(A,B) = P(A|B)P(B)$, representing a causal graph like $A \leftarrow B$

The idea is that if we train a model with one of these structures, then when adapting to a new, shifted data distribution it will take longer to adapt if it does not have the correct inductive bias. For example, let's say that the true relationship is $A$ = Raining causes $B$ = Open Umbrella (and not vice versa). Changing the marginal probability of Raining (say because the weather changed) does not change the mechanism that relates $A$ and $B$ (captured by $P(B|A)$), but it will have an impact on the marginal $P(B)$. So after this distributional shift, the function that modeled $P(B|A)$ will not need to change because the relationship is the same; only the function that modeled $P(A)$ will need to change. Under the incorrect factorization $P(B)P(A|B)$, adaptation to the change will be slow because both $P(B)$ and $P(A|B)$ need to be modified to account for the change in $P(A)$ (due to Bayes' rule).

Here a difference in sample complexity can be observed when modeling the joint of the shifted distribution; $B \rightarrow A$ takes longer to adapt:

https://i.imgur.com/B9FEmA7.png

The idea is that the sample complexity of adapting to a new distribution of data is a heuristic that informs us which causal-graph inductive bias is correct. Experimentally this works, and they also observe that as models get more capacity, the difference between the models seems to grow.

This summary was written with the help of Yoshua Bengio.
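A toy sketch of the experiment described above, with binary variables and my own simplified parameterization (not the paper's exact setup): both factorizations are fit on the original distribution, then only $P(A)$ is shifted and the adaptation loss is tracked.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, steps, lr = 1000, 50, 0.1

def sample(p_a, n):
    """True mechanism: A ~ Bernoulli(p_a), B = A with prob 0.9 (P(B|A) fixed)."""
    a = (torch.rand(n) < p_a).long()
    flip = (torch.rand(n) < 0.1).long()
    return a, (a + flip) % 2

class Factorization(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.logit_first = torch.nn.Parameter(torch.zeros(1))  # P(first variable)
        self.logit_cond = torch.nn.Parameter(torch.zeros(2))   # P(second | first)

    def nll(self, first, second):
        lp_first = F.logsigmoid(self.logit_first) * first + F.logsigmoid(-self.logit_first) * (1 - first)
        logit = self.logit_cond[first]
        lp_second = F.logsigmoid(logit) * second + F.logsigmoid(-logit) * (1 - second)
        return -(lp_first + lp_second).mean()

def pretrain_then_adapt(order):
    model, losses = Factorization(), []
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    a, b = sample(p_a=0.5, n=10 * N)                 # training distribution
    first, second = (a, b) if order == "A->B" else (b, a)
    for _ in range(500):
        opt.zero_grad(); model.nll(first, second).backward(); opt.step()
    a, b = sample(p_a=0.9, n=N)                      # shifted: only P(A) changed
    first, second = (a, b) if order == "A->B" else (b, a)
    for _ in range(steps):
        loss = model.nll(first, second)
        opt.zero_grad(); loss.backward(); opt.step()
        losses.append(loss.item())
    return losses

# The A->B factorization typically recovers a low joint NLL in fewer steps than B->A.
print(pretrain_then_adapt("A->B")[:5])
print(pretrain_then_adapt("B->A")[:5])
```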
[link]
**Object detection** is the task of drawing one bounding box around each instance of the type of object one wants to detect. Typically, image classification is done before object detection. With neural networks, the usual procedure for object detection is to train a classification network and replace the last layer with a regression layer which essentially predicts pixel-wise whether the object is there or not. A bounding box inference algorithm is added at the end to make a consistent prediction (see [Deep Neural Networks for Object Detection](http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf)).

The paper introduces RPNs (Region Proposal Networks). They are trained end-to-end to generate region proposals, simultaneously regressing region bounds and objectness scores at each location on a regular grid. RPNs are one type of fully convolutional network: they take an image of any size as input and output a set of rectangular object proposals, each with an objectness score.

## See also

* [R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma)
* [Mask R-CNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17)
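A minimal PyTorch sketch of an RPN head, as described in the paper (a small conv slides over the backbone feature map; each location predicts, for each of k anchors, 2 objectness scores and 4 box-regression offsets). Channel sizes and names here are my own simplification.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(mid_channels, num_anchors * 2, kernel_size=1)  # object / not object
        self.reg = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)  # box deltas per anchor

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.cls(h), self.reg(h)   # objectness scores and proposals at every grid location

# usage sketch on a dummy VGG-style feature map
scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
```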