ShortScience.org - Making Science Accessible!

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto Kazawa and Keith Stevens et al.
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.AI, cs.LG

[link] Summary by Udibr 7 years ago

This is a very technical paper, and I only covered the items that interested me.
* Model
  * Encoder
    * 8-layer LSTM
    * only the first encoder layer is bi-directional
    * top 4 layers add input to output (residual network)
  * Decoder
    * same as encoder except all layers are just forward direction
  * encoder state is not passed as a start point to Decoder state
  * Attention
    * energy computed using an NN with one hidden layer, as opposed to a dot product or the usual practice of no hidden layer and a $\tanh$ activation at the output layer
    * computed from output of 1st decoder layer
    * pre-feed to all layers
* Training has two steps: ML and RL
  * ML (cross-entropy) training:
    * common wisdom, initialize all trainable parameters uniformly between [-0.04, 0.04]
    * clipping=5, batch=128
    * Adam (lr=2e-4) 60K steps followed by SGD (lr=.5 which is probably a typo!) 1.2M steps + 4x(lr/=2 200K steps)
    * 12 async machines, each with 8 GPUs (K80) across which the model is spread, for 6 days
    * [dropout](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZarembaSV14) 0.2-0.3 (higher for smaller datasets)
  * RL - [Reinforcement Learning](http://www.shortscience.org/paper?bibtexKey=journals/corr/RanzatoCAZ15) 
    * sequence score, $\text{GLEU} = r = \min(\text{precision}, \text{recall})$ computed on n-grams of size 1-4
    * mixed loss $\alpha \text{ML} + \text{RL}, \alpha =0.25$
    * mean $r$ computed from $m=15$ samples
    * SGD, 400K steps, 3 days, no dropout
* Prediction (i.e. Decoder)
  * beam search (3 beams)
  * A normalized score is computed for every beam that ended (died)
    * did not normalize beam score by $\text{beam_length}^\alpha , \alpha \in [0.6-0.7]$
    * normalized with a similar formula in which 5 is added to the length and a coverage factor is added, which is the sum of the logs of the attention weight of every input word (i.e. after summing over all output words)
    * Do a second pruning using normalized scores
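
The two scoring rules in these notes, the GLEU reward and the normalized beam score, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the length penalty is written as $((5 + |Y|)/(5 + 1))^\alpha$, the coverage term clips each source word's summed attention at 1 before taking the log, and the default values of `alpha` and `beta` are placeholders rather than the settings used in the paper.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count every n-gram of size 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def gleu(output, target, max_n=4):
    """Sentence-level GLEU: min(precision, recall) over 1..max_n-grams."""
    out_counts = ngram_counts(output, max_n)
    tgt_counts = ngram_counts(target, max_n)
    if not out_counts or not tgt_counts:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    matches = sum((out_counts & tgt_counts).values())
    precision = matches / sum(out_counts.values())
    recall = matches / sum(tgt_counts.values())
    return min(precision, recall)


def length_penalty(length, alpha=0.65):
    """Length normalizer with 5 added to the length (alpha is a placeholder)."""
    return ((5.0 + length) / 6.0) ** alpha


def coverage_penalty(attention, beta=0.2):
    """beta * sum over source words of log(min(total attention, 1)).

    attention[j][i] is the weight output step j puts on source word i.
    Assumes every source word receives strictly positive attention.
    """
    total = 0.0
    for i in range(len(attention[0])):
        mass = sum(step[i] for step in attention)
        total += math.log(min(mass, 1.0))
    return beta * total


def normalized_score(log_prob, attention, alpha=0.65, beta=0.2):
    """Score used to rank finished hypotheses: log P / lp + cp."""
    return log_prob / length_penalty(len(attention), alpha) + coverage_penalty(attention, beta)
```

For example, `gleu(["a", "b", "c"], ["a", "b"])` has 3 matching n-grams out of 6 in the output and 3 in the target, so precision is 0.5, recall is 1.0, and the score is 0.5.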

You Only Look Once: Unified, Real-Time Object Detection
Redmon, Joseph and Divvala, Santosh Kumar and Girshick, Ross B. and Farhadi, Ali
Conference on Computer Vision and Pattern Recognition - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Abhishek Das 6 years ago

This paper models object detection as a regression problem for bounding
boxes and object class probabilities with a single pass through the CNN. The
main contribution is the idea of dividing the image into a 7x7 grid, and having
each cell predict a distribution over class labels as well as a bounding box
for the object whose center falls into it. It's much faster than R-CNN and
Fast R-CNN, as the additional step of extracting region proposals has been
removed.
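
A minimal sketch of the grid assignment described above, assuming pixel-space box centers and a square 7x7 grid. The function names are hypothetical, and the encoding (center offsets relative to the cell, width and height relative to the whole image) is an illustration rather than the authors' code.

```python
def responsible_cell(cx, cy, img_w, img_h, grid=7):
    """Return (row, col) of the grid cell that contains the box center."""
    # Clamp so a center lying exactly on the right/bottom edge stays in range.
    col = min(int(cx / img_w * grid), grid - 1)
    row = min(int(cy / img_h * grid), grid - 1)
    return row, col


def encode_box(cx, cy, w, h, img_w, img_h, grid=7):
    """Encode a box as (row, col, x_offset, y_offset, w_norm, h_norm)."""
    row, col = responsible_cell(cx, cy, img_w, img_h, grid)
    x_offset = cx / img_w * grid - col  # position of the center inside its cell
    y_offset = cy / img_h * grid - row
    return row, col, x_offset, y_offset, w / img_w, h / img_h
```

Two objects whose centers land in the same cell map to the same `(row, col)`, which is the collision behind the small-object weakness noted in this summary.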

## Strengths

- Works real-time. Base model runs at 45fps and a faster version goes up to
150fps, and they claim that it's more than twice as fast as other works on
real-time detection.

- End-to-end model; Localization and classification errors can be jointly
optimized.

- YOLO makes more localization errors and fewer background mistakes than
Fast R-CNN, so using YOLO to eliminate false background detections from
Fast R-CNN results in ~3% mAP gain (at little extra computational cost, as
YOLO is much faster than Fast R-CNN).

## Weaknesses / Notes

- Results fall short of state-of-the-art: 57.9% vs. 70.4% mAP (Faster R-CNN).

- Performs worse at detecting small objects, as at most one object per grid
cell can be detected.

Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets
Vincent, Pascal and de Brébisson, Alexandre and Bouthillier, Xavier
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 8 years ago

This paper presents a linear algebraic trick for computing both the value and the gradient update for a loss function that compares a very high-dimensional target with a (dense) output prediction. Most of the paper covers the specific case of the squared error loss, though the trick can also be applied to some other losses such as the so-called spherical softmax. One use case could be training autoencoders with the squared error on very high-dimensional but sparse inputs. While a naive implementation (i.e. what most people currently do) would scale in $O(Dd)$, where $D$ is the input dimensionality and $d$ the hidden layer dimensionality, they show that their trick allows scaling in $O(d^2)$.

Their experiments show that they can achieve speedup factors of over 500 on the CPU, and over 1500 on the GPU.
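
To see why the dependence on $D$ can disappear for the loss value, note that $\|Wh - y\|^2 = h^\top Q h - 2\, y^\top W h + y^\top y$ with $Q = W^\top W$ a $d \times d$ matrix. If $Q$ is kept available and $y$ has only $K$ non-zeros, the right-hand side costs $O(d^2 + Kd)$. The sketch below checks this identity in plain Python. The function names are hypothetical, and it recomputes $Q$ from scratch for the demonstration; keeping $Q$ up to date cheaply while $W$ is being updated is the part that needs the paper's bookkeeping and is not shown here.

```python
def gram(W):
    """Q = W^T W for a D x d matrix W (computed once here, for the demo only)."""
    d = len(W[0])
    return [[sum(row[i] * row[j] for row in W) for j in range(d)] for i in range(d)]


def naive_loss(W, h, target):
    """||W h - y||^2 by forming all D outputs: O(D d)."""
    loss = 0.0
    for k, row in enumerate(W):
        out = sum(w * x for w, x in zip(row, h))
        loss += (out - target.get(k, 0.0)) ** 2
    return loss


def sparse_loss(Q, W, h, target):
    """Same value using Q and only the rows of W where y is non-zero."""
    d = len(h)
    Qh = [sum(Q[i][j] * h[j] for j in range(d)) for i in range(d)]
    quad = sum(h[i] * Qh[i] for i in range(d))           # h^T Q h, O(d^2)
    cross = sum(v * sum(W[k][j] * h[j] for j in range(d))
                for k, v in target.items())              # y^T W h, O(K d)
    yy = sum(v * v for v in target.values())             # y^T y, O(K)
    return quad - 2.0 * cross + yy


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # D = 4, d = 2
h = [1.0, 2.0]
target = {2: 3.0}                                      # sparse y: one non-zero
```

With these toy values both `naive_loss(W, h, target)` and `sparse_loss(gram(W), W, h, target)` evaluate to 5.0.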

#### My two cents

This is a really neat, and frankly really surprising, mathematical contribution. I did not suspect getting rid of the dependence on $D$ in the complexity would actually be achievable, even for the "simpler" case of the squared error.

The jury is still out as to whether we can leverage the full power of this trick in practice. Indeed, the squared error over sparse targets isn't the most natural choice in most situations. The authors did try to use this trick in the context of a version of the neural network language model that uses the squared error instead of the negative log-softmax (or at least I think that's what was done... I couldn't confirm this with 100% confidence). They showed that good measures of word similarity (Simlex-999) could be achieved in this way, though using the hierarchical softmax actually achieves better performance in about the same time.

But as far as I'm concerned, that doesn't make the trick less impressive. It's still a neat piece of new knowledge to have about reconstruction errors. Also, the authors mention that it would be possible to adapt the trick to the so-called (negative log) spherical softmax, which is like the softmax but where the numerator is the square of the pre-activation, instead of the exponential. I hope someone tries this out in the future, as perhaps it could be key to making this trick a real game changer!

Second-Order Adversarial Attack and Certifiable Robustness
Li, Bai and Chen, Changyou and Wang, Wenlin and Carin, Lawrence
arXiv e-Print archive - 2018 via Local Bibsonomy
Keywords: dblp

[link] Summary by David Stutz 4 years ago

Li et al. propose an adversarial attack motivated by second-order optimization and use input randomization as a defense. Based on a Taylor expansion, the optimal adversarial perturbation should be aligned with the dominant eigenvector of the Hessian matrix of the loss. As the eigenvectors of the Hessian cannot be computed efficiently, the authors propose an approximation; this is mainly based on evaluating the gradient under Gaussian noise. The gradient is then normalized before taking a projected gradient step. As a defense, the authors inject random noise on the input (clean example or adversarial example) and compute the average prediction over multiple iterations.
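
The attack step and the defense described above could look roughly like the following. This is a loose sketch, not the authors' algorithm: averaging the gradient over several Gaussian-perturbed copies of the input, normalizing by the L2 norm, and projecting onto an L-infinity ball are all assumptions made for illustration, and `grad_fn` and `predict_fn` are hypothetical callables supplied by the user.

```python
import math
import random


def noisy_gradient(x, grad_fn, sigma, n_samples, rng):
    """Average grad_fn over Gaussian-perturbed copies of x."""
    avg = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        for i, g in enumerate(grad_fn(noisy)):
            avg[i] += g / n_samples
    return avg


def attack_step(x, x_clean, grad_fn, eps, step_size, sigma, n_samples, rng):
    """One normalized ascent step, projected back to within eps of x_clean."""
    g = noisy_gradient(x, grad_fn, sigma, n_samples, rng)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(x)
    stepped = [xi + step_size * gi / norm for xi, gi in zip(x, g)]
    # Projection onto the L-infinity ball around the clean input (assumed norm).
    return [min(max(si, ci - eps), ci + eps) for si, ci in zip(stepped, x_clean)]


def averaged_prediction(x, predict_fn, sigma, n_samples, rng):
    """Defense: average the model's output over noisy copies of the input."""
    total = None
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        probs = predict_fn(noisy)
        if total is None:
            total = [0.0] * len(probs)
        for i, p in enumerate(probs):
            total[i] += p / n_samples
    return total
```

On a toy loss such as $\|z\|^2$, whose gradient is $2z$, repeated calls to `attack_step` push the input away from the origin while keeping every coordinate within `eps` of the clean example.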

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

Cooperative Inverse Reinforcement Learning
Hadfield-Menell, Dylan and Dragan, Anca and Abbeel, Pieter and Russell, Stuart J.
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Patrick Emami 7 years ago

In the future, AI and people will work together; hence, we must concern ourselves with ensuring that AI will have interests aligned with our own. 
The authors suggest that it is in our best interests to find a solution to the "value-alignment problem". As recently pointed out by Ian Goodfellow, however,
[this may not always be a good idea](https://www.quora.com/When-do-you-expect-AI-safety-to-become-a-serious-issue).

Cooperative Inverse Reinforcement Learning (CIRL) is a formulation of a cooperative, partial information game between a human and a robot. Both share a reward 
function, but the robot does not initially know what it is. One of the key departures from classical Inverse Reinforcement Learning
is that the teacher, which in this case is the human, is not assumed to act optimally. Rather, it is shown that sub-optimal actions
on the part of the human can result in the robot learning a better reward function. The structure of the CIRL formulation is such that it should encourage the 
human to not attempt to teach by demonstration in a way that greedily maximizes immediate reward. Rather, the human learns how to "best respond" to the robot.

CIRL can be formulated as a dec-POMDP, and reduced to a single-agent POMDP. The authors solved a 2D navigation task with CIRL to demonstrate the inferiority of having the human follow a "demonstration-by-expert" policy as opposed to a "best-response" policy.
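
In the single-agent POMDP view, the robot's state of knowledge is a belief over the unknown reward parameter, updated as it watches the human act. The sketch below shows one such Bayesian update. It is an illustration only: the function name, the discrete set of candidate reward parameters, and the example human policy are hypothetical and not taken from the paper.

```python
def update_belief(belief, human_action, human_policy):
    """Bayes update of the robot's belief over reward parameters.

    belief: dict mapping each candidate theta to its probability.
    human_policy: function theta -> dict of action probabilities, i.e. how
    the robot assumes the human acts when the true reward parameter is theta.
    """
    posterior = {
        theta: p * human_policy(theta).get(human_action, 0.0)
        for theta, p in belief.items()
    }
    z = sum(posterior.values())
    if z == 0.0:
        # The action is impossible under every theta: keep the prior.
        return dict(belief)
    return {theta: p / z for theta, p in posterior.items()}


def example_policy(theta):
    """Hypothetical human who mostly moves toward the rewarded side."""
    if theta == "reward_left":
        return {"left": 0.8, "right": 0.2}
    return {"left": 0.2, "right": 0.8}


prior = {"reward_left": 0.5, "reward_right": 0.5}
posterior = update_belief(prior, "left", example_policy)
```

Here, seeing the human move left raises the robot's belief in `"reward_left"` from 0.5 to about 0.8. A "best-response" human would pick actions with this update in mind, choosing demonstrations that are informative rather than greedily reward-maximizing.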