ShortScience.org - Making Science Accessible!

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto Kazawa and Keith Stevens et al.
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.AI, cs.LG

[link] Summary by Udibr 7 years ago

This is a very technical paper, and I only covered the items that interested me.
* Model
  * Encoder
    * 8 layers LSTM 
    * bi-directional only first encoder layer
    * top 4 layers add input to output (residual network)
  * Decoder
    * same as encoder except all layers are just forward direction
  * encoder state is not passed as a start point to Decoder state
  * Attention
    * energy computed using a NN with one hidden layer, as opposed to a dot product or the usual practice of no hidden layer and $\tanh$ activation at the output layer
    * computed from output of 1st decoder layer
    * pre-feed to all layers
* Training has two steps: ML and RL
  * ML (cross-entropy) training:
    * common wisdom, initialize all trainable parameters uniformly between [-0.04, 0.04]
    * clipping=5, batch=128
    * Adam (lr=2e-4) 60K steps followed by SGD (lr=.5 which is probably a typo!) 1.2M steps + 4x(lr/=2 200K steps)
    * 12 asynchronous machines, each with 8 GPUs (K80) across which the model is spread, for 6 days
    * [dropout](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZarembaSV14) 0.2-0.3 (higher for smaller datasets)
  * RL - [Reinforcement Learning](http://www.shortscience.org/paper?bibtexKey=journals/corr/RanzatoCAZ15) 
    * sequence score, $\text{GLEU} = r = \min(\text{precision}, \text{recall})$ computed on n-grams of size 1-4
    * mixed loss $\alpha \text{ML} + \text{RL}, \alpha =0.25$
    * mean $r$ computed from $m=15$ samples
    * SGD, 400K steps, 3 days, no dropout
* Prediction (i.e. Decoder)
  * beam search (3 beams)
  * A normalized score is computed for every beam that ended (died)
    * did not normalize the beam score by $\text{length}^\alpha$, $\alpha \in [0.6, 0.7]$
    * normalized with a similar formula in which 5 is added to the length and a coverage factor is added, which is the sum of the log of the attention weight of every input word (i.e. after summing over all output words)
    * Do a second pruning using normalized scores
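
The two scoring formulas in the notes above (the GLEU reward and the normalized beam score) can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The function names, the default `alpha` and `beta` values, the division by `5 + 1`, and the cap of each attention sum at 1.0 are assumptions based on how such a length/coverage normalization is usually written.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def gleu(hypothesis, reference, max_n=4):
    """Sentence-level score r = min(precision, recall) over n-gram matches."""
    hyp = ngram_counts(hypothesis, max_n)
    ref = ngram_counts(reference, max_n)
    if not hyp or not ref:
        return 0.0
    matches = sum((hyp & ref).values())  # clipped n-gram overlap
    return min(matches / sum(hyp.values()), matches / sum(ref.values()))


def length_penalty(length, alpha=0.6):
    """Assumed form: ((5 + length) / (5 + 1)) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def coverage_penalty(attention, beta=0.2):
    """attention[j][i] is the weight output step j puts on input word i.

    Sums the log of each input word's total attention. The cap at 1.0 is an
    assumption. Weights must be strictly positive (true for a softmax).
    """
    total = 0.0
    for i in range(len(attention[0])):
        column_sum = sum(row[i] for row in attention)
        total += math.log(min(column_sum, 1.0))
    return beta * total


def beam_score(log_prob, attention, alpha=0.6, beta=0.2):
    """Normalized score of a finished hypothesis with len(attention) tokens."""
    return (log_prob / length_penalty(len(attention), alpha)
            + coverage_penalty(attention, beta))
```

For example, `gleu("the cat sat".split(), "the cat".split())` has 3 matching n-grams out of 6 in the hypothesis and 3 in the reference, so precision is 0.5, recall is 1.0, and the score is 0.5.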

You Only Look Once: Unified, Real-Time Object Detection
Redmon, Joseph and Divvala, Santosh Kumar and Girshick, Ross B. and Farhadi, Ali
Conference on Computer Vision and Pattern Recognition - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Abhishek Das 6 years ago

This paper models object detection as a regression problem for bounding
boxes and object class probabilities with a single pass through the CNN. The
main contribution is the idea of dividing the image into a 7x7 grid, and having
each cell predict a distribution over class labels as well as a bounding box
for the object whose center falls into it. It's much faster than R-CNN and
Fast R-CNN, as the additional step of extracting region proposals has been
removed.
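
As a rough illustration of the grid idea, the sketch below finds which cell is responsible for an object from the pixel coordinates of its box center. The function names and the cell-relative offsets are illustrative assumptions, not code from the paper.

```python
def responsible_cell(cx, cy, img_w, img_h, grid=7):
    """Return (row, col) of the grid cell that contains the box center."""
    # Clamp so a center lying exactly on the right/bottom edge stays in range.
    col = min(int(cx / img_w * grid), grid - 1)
    row = min(int(cy / img_h * grid), grid - 1)
    return row, col


def cell_offsets(cx, cy, img_w, img_h, grid=7):
    """Box center expressed relative to its responsible cell (assumed encoding)."""
    row, col = responsible_cell(cx, cy, img_w, img_h, grid)
    return cx / img_w * grid - col, cy / img_h * grid - row
```

With a 448x448 image, a box centered at (224, 224) lands in cell (3, 3), the middle of the 7x7 grid, with offsets (0.5, 0.5).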

## Strengths

- Works in real time. The base model runs at 45fps and a faster version goes up to
155fps, and they claim that it's more than twice as fast as other works on
real-time detection.

- End-to-end model; Localization and classification errors can be jointly
optimized.

- YOLO makes more localization errors and fewer background mistakes than
Fast R-CNN, so using YOLO to eliminate false background detections from
Fast R-CNN results in ~3% mAP gain (with little added computation time, since
Fast R-CNN is much slower).

## Weaknesses / Notes

- Results fall short of state-of-the-art: 57.9% v/s 70.4% mAP (Faster R-CNN).

- Performs worse at detecting small objects, as at most one object per grid
cell can be detected.

Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets
Vincent, Pascal and de Brébisson, Alexandre and Bouthillier, Xavier
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 8 years ago

This paper presents a linear algebraic trick for computing both the value and the gradient update for a loss function that compares a very high-dimensional target with a (dense) output prediction. Most of the paper covers the specific case of the squared error loss, though the trick can also be applied to some other losses such as the so-called spherical softmax. One use case could be training autoencoders with the squared error on very high-dimensional but sparse inputs. While a naive implementation (i.e. what most people currently do) would scale in $O(Dd)$, where $D$ is the input dimensionality and $d$ the hidden layer dimensionality, they show that their trick allows the computation to scale in $O(d^2)$.
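
To get a feel for why the dependence on $D$ can disappear, here is a small sketch of the loss value only, under my own assumptions: the output is `W h` with a `D x d` matrix `W`, the target is stored as a dict of its non-zero entries, and the `d x d` matrix `Q = W^T W` is available. Expanding the square gives `h^T Q h - 2 h^T W^T y + y^T y`, which only touches `Q` and the rows of `W` at the non-zero targets. The sketch does not show how to keep `Q` up to date during training, which is the harder part of the trick, and the helper names are hypothetical.

```python
def naive_loss(W, h, y_sparse):
    """Squared error computed the usual way: O(D * d)."""
    loss = 0.0
    for k, row in enumerate(W):
        out = sum(w * x for w, x in zip(row, h))
        loss += (out - y_sparse.get(k, 0.0)) ** 2
    return loss


def gram(W):
    """Q = W^T W, a d x d matrix (computed once here, for illustration only)."""
    d = len(W[0])
    return [[sum(row[a] * row[b] for row in W) for b in range(d)]
            for a in range(d)]


def fast_loss(Q, W, h, y_sparse):
    """Same value from Q and the K non-zero targets: O(d^2 + K * d)."""
    d = len(h)
    quad = sum(h[a] * Q[a][b] * h[b] for a in range(d) for b in range(d))
    cross = sum(v * sum(w * x for w, x in zip(W[k], h))
                for k, v in y_sparse.items())
    return quad - 2.0 * cross + sum(v * v for v in y_sparse.values())


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # D = 4, d = 2
h = [1.0, 2.0]
y = {2: 3.0}  # only one non-zero target entry
```

On this toy input both `naive_loss(W, h, y)` and `fast_loss(gram(W), W, h, y)` evaluate to 5.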

Their experiments show that they can achieve speedup factors of over 500 on the CPU, and over 1500 on the GPU.

#### My two cents

This is a really neat, and frankly really surprising, mathematical contribution. I did not suspect getting rid of the dependence on $D$ in the complexity would actually be achievable, even for the "simpler" case of the squared error.

The jury is still out as to whether we can leverage the full power of this trick in practice. Indeed, the squared error over sparse targets isn't the most natural choice in most situations. The authors did try to use this trick in the context of a version of the neural network language model that uses the squared error instead of the negative log-softmax (or at least I think that's what was done... I couldn't confirm this with 100% confidence). They showed that good measures of word similarity (Simlex-999) could be achieved in this way, though using the hierarchical softmax actually achieves better performance in about the same time.

But as far as I'm concerned, that doesn't make the trick less impressive. It's still a neat piece of new knowledge to have about reconstruction errors. Also, the authors mention that it would be possible to adapt the trick to the so-called (negative log) spherical softmax, which is like the softmax but where the numerator is the square of the pre-activation, instead of the exponential. I hope someone tries this out in the future, as perhaps it could be key to making this trick a real game changer!

Understanding deep learning requires rethinking generalization
Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG

[link] Summary by Martin Thoma 7 years ago

This paper deals with the question of what and how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained.

When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels, as expected). $\Rightarrow$ Those architectures can simply brute-force memorize the training data.
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity and uniform stability are bad explanations for the generalization capabilities of neural networks.
* The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4.
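
The last point can be made concrete with a small sketch. It follows my reading of the section 4 construction, so treat the details as assumptions: project every sample onto a vector `a` ($d$ parameters), place one ReLU threshold `b[j]` just below each sorted projection ($n$ parameters), and solve a lower-triangular system for the output weights `w` ($n$ parameters). The function names are hypothetical.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def fit_two_layer(xs, ys, a):
    """Fit c(x) = sum_j w[j] * relu(<a, x> - b[j]) exactly to arbitrary labels.

    Assumes the projections <a, x_i> are pairwise distinct.
    """
    z = [dot(a, x) for x in xs]
    if len(set(z)) != len(z):
        raise ValueError("projections must be distinct; pick another a")
    order = sorted(range(len(xs)), key=lambda i: z[i])
    zs = [z[i] for i in order]
    # Interleave thresholds: b[0] < zs[0] < b[1] < zs[1] < ...
    b = [zs[0] - 1.0] + [(zs[j - 1] + zs[j]) / 2.0 for j in range(1, len(zs))]
    # relu(zs[i] - b[j]) is zero for j > i, so forward substitution suffices.
    w = []
    for i in range(len(zs)):
        acc = sum(w[j] * (zs[i] - b[j]) for j in range(i))
        w.append((ys[order[i]] - acc) / (zs[i] - b[i]))
    return b, w


def predict(x, a, b, w):
    z = dot(a, x)
    return sum(wj * max(z - bj, 0.0) for wj, bj in zip(w, b))


xs = [[0.0, 1.0], [2.0, 0.5], [1.0, 3.0], [-1.0, 0.2]]
ys = [1.0, -1.0, 1.0, -1.0]  # arbitrary labels
a = [1.0, 0.37]
b, w = fit_two_layer(xs, ys, a)
```

Here `len(a) + len(b) + len(w)` is $d + 2n = 10$, and `predict` reproduces every label in `ys` up to floating-point error.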

## What I learned

* Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels.
* We seem to have built models which work well on image data in general, not just on "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)

Evaluating the visualization of what a Deep Neural Network has learned
Samek, Wojciech and Binder, Alexander and Montavon, Grégoire and Bach, Sebastian and Müller, Klaus-Robert
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Qure.ai 7 years ago

Layer-wise Relevance Propagation (LRP) is a technique that the authors have used in multiple use cases (apart from this publication) to demonstrate the robustness and advantage of a *decomposition* method over other heatmap generation methods. Such heatmap generation methods are crucial for increasing the interpretability of deep learning models. Apart from LRP relevance, the authors also discuss quantitative ways to measure the accuracy of the generated heatmap.

### LRP & Alternatives

What is LRP?

LRP is a principled approach to decompose a classification decision into pixel-wise relevances indicating the contributions of a pixel to the overall classification score. The approach is derived from a layer-wise conservation principle, which forces the propagated quantity (e.g. evidence for a predicted class) to be preserved between neurons of two adjacent layers.

Denoting by $R_i^{(l)}$ the relevance associated with the $i$th neuron of layer $l$ and by $R_j^{(l+1)}$ the relevance associated with the $j$th neuron in the next layer, the conservation principle requires that

![](https://i.imgur.com/GQxrnCT.png)

where $R_i^{(l)}$ is given as
![](https://i.imgur.com/FD7AAfF.png)

where $z_{ij}$ is the activation of the $j$th neuron because of input from the $i$th neuron.

As per the authors, this is not necessarily the only relevance function which is conserved. The intuition behind using such a function is that lower-layer neurons that mostly contribute to the activation of the higher-layer neuron receive a larger share of the relevance $R_j$ of the neuron $j$.

A downside of this propagation rule (at least if *epsilon* = 0) is that the denominator may tend to zero if lower-level contributions to neuron j cancel each other out. The numerical instability can be overcome by setting *epsilon* > 0. However, in that case the conservation idea is relaxed in order to gain better numerical properties. To conserve relevance, the rule can instead be formulated as a sum of positive and negative activations
![](https://i.imgur.com/lo7f8AI.png)
such that *alpha* - *beta* = 1
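
A minimal sketch of the *epsilon* propagation rule for a single linear layer is given below. The exact formula lives in the images above, so this follows the commonly used form and should be read as an assumption: the contribution of input neuron i to output neuron j is taken to be `activations[i] * weights[i][j]`, and *epsilon* is added to the denominator with the sign of the total.

```python
def lrp_epsilon(activations, weights, bias, relevance_out, epsilon=0.01):
    """Redistribute relevance from layer l+1 back to layer l.

    weights[i][j] connects input neuron i to output neuron j.
    """
    n_in = len(activations)
    relevance_in = [0.0] * n_in
    for j, r_j in enumerate(relevance_out):
        z = [activations[i] * weights[i][j] for i in range(n_in)]
        z_j = sum(z) + bias[j]
        # The stabilizer keeps the denominator away from zero when
        # positive and negative contributions cancel out.
        denom = z_j + (epsilon if z_j >= 0 else -epsilon)
        for i in range(n_in):
            relevance_in[i] += z[i] / denom * r_j
    return relevance_in


acts = [1.0, 2.0]
wts = [[1.0, -1.0], [0.5, 2.0]]
r_exact = lrp_epsilon(acts, wts, [0.0, 0.0], [2.0, 3.0], epsilon=0.0)
r_stable = lrp_epsilon(acts, wts, [0.0, 0.0], [2.0, 3.0], epsilon=0.5)
```

With *epsilon* = 0 and zero biases the redistributed relevances sum to the same total as the incoming ones (5 here); with *epsilon* = 0.5 part of the relevance is absorbed, which is the relaxation of conservation mentioned above.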

#### Alternatives to LRP for heatmap

**Sensitivity measurement**

In such methods, the gradient of the output with respect to the input is used for generating the heatmap. This quantity measures how much small changes in the pixel value locally affect the network output.
##### Disadvantages
Given that most models use ReLU as the activation function, the gradient flows only through activations with positive output, which makes the backward mapping discontinuous and consequently strongly local. The same applies to max-pooling, where gradients only flow through the neurons with maximum intensity in a local neighbourhood.

Also, given that most of these methods use the absolute impact on the prediction caused by changes in pixel intensities, the information about whether the pixel intensity was evidence for or against the class is lost.
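
Both disadvantages show up even in a toy example. The sketch below is only an illustration: `tiny_relu_net` is a made-up stand-in for a classifier score, and the gradient is approximated with central finite differences rather than backpropagation.

```python
def tiny_relu_net(x):
    """Hypothetical stand-in for a classifier score f(x)."""
    h1 = max(0.0, 2.0 * x[0] - 1.0 * x[1])
    h2 = max(0.0, -1.0 * x[0] + 0.5 * x[2])
    return 1.5 * h1 - 2.0 * h2


def sensitivity_map(f, x, step=1e-4):
    """Absolute partial derivative of f with respect to each input."""
    heat = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += step
        down[i] -= step
        heat.append(abs((f(up) - f(down)) / (2.0 * step)))
    return heat


heat = sensitivity_map(tiny_relu_net, [1.0, 0.5, 0.2])
```

At this input the map is approximately `[3.0, 1.5, 0.0]`. The third input gets no heat at all because its ReLU unit is inactive, and the second input looks important even though increasing it lowers the score, since the sign is discarded.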

**Deconvolutional Networks**

##### Disadvantages

Here the backward discontinuity problem of sensitivity-based methods is absent, hence global features can be captured. However, since the method only takes in activations from the final layer (which mostly learns the presence or absence of features), using this for generating heatmaps is likely to yield average maps that lack image-specific localisation effects.

LRP is able to counter these effects nicely because of the way it uses relevance.

#### Performance of heatmaps

A few concerns that the authors raise are:
- A heatmap is not a segmentation mask; on the contrary, missing evidence or the context may be very important for classification.
- Salient features represent average explanations of what distinguishes one image category from another. For individual images these explanations may be meaningless or even wrong. For instance, salient features for the class ‘bicycle’ may be the wheels and the handlebar. However, in some images a bicycle may be partly occluded so that these parts of a bike are not visible. In these images salient features fail to explain the classifier’s decision (which still may be correct).

The authors propose a novel method (MoRF - *Most Relevant First*) of objectively quantifying the quality of a heatmap. The details of the measure are best obtained from the paper. To give an idea, the most reliable method should ideally rank the most relevant regions in the same order even if small perturbations in pixel intensities are observed (in non-relevant areas).

The quantity of interest in this case is the area over the MoRF perturbation curve (AOPC).
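
A rough sketch of how such an AOPC value could be computed for a single input is shown below. It rests on several assumptions of mine: the input is a flat list of features, a perturbation replaces one feature with a fixed `baseline` value (the paper perturbs image regions instead), and the result is the average drop in the score over the perturbation steps. The function and parameter names are hypothetical.

```python
def aopc(f, x, relevance, steps, baseline=0.0):
    """Area over the MoRF perturbation curve for one input.

    f: scoring function, x: list of features,
    relevance: heatmap value per feature, steps: number of perturbations.
    """
    if steps > len(x):
        raise ValueError("steps cannot exceed the number of features")
    # Most Relevant First: perturb features in order of decreasing relevance.
    order = sorted(range(len(x)), key=lambda i: relevance[i], reverse=True)
    perturbed = list(x)
    f0 = f(perturbed)
    total = 0.0  # the k = 0 term is f0 - f0 = 0
    for k in range(1, steps + 1):
        perturbed[order[k - 1]] = baseline
        total += f0 - f(perturbed)
    return total / (steps + 1)


def score(x):
    """Toy linear score in which the first feature matters most."""
    return 3.0 * x[0] + 1.0 * x[1] + 0.0 * x[2]


good = aopc(score, [1.0, 1.0, 1.0], relevance=[3.0, 1.0, 0.0], steps=2)
bad = aopc(score, [1.0, 1.0, 1.0], relevance=[0.0, 1.0, 3.0], steps=2)
```

A heatmap that ranks the truly important features first makes the score drop faster, so `good` (7/3) is larger than `bad` (1/3).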

#### Observation

Most of the sensitivity-based methods answer the question *what change would make the image more or less belong to the category car*, which isn't really the classifier's question. LRP aims to answer the real classifier question: *what speaks for the presence of a car in the image*.

The image below is a good example of how LRP can denoise heatmaps generated on the basis of sensitivity.

![](https://i.imgur.com/Sq0b5yg.png)