Raza Habib's profile - ShortScience.org

dx.doi.org
sci-hub
scholar.google.com

Training Deep and Recurrent Networks with Hessian-Free Optimization
Martens, James and Sutskever, Ilya
Springer Neural Networks: Tricks of the Trade (2nd ed.) - 2012 via Local Bibsonomy
Keywords: dblp

[link] Summary by Raza Habib 7 years ago

## Very Short Summary
The authors introduce a number of modifications to traditional hessian-free optimisation that makes the method work better for neural networks. The modifications are:
* Use the Generalised Gauss Newton Matrix (GGN) rather than the Hessian.
* Damp the GGN so that $G' = G + \lambda I$ and adjust $\lambda$ using levenberg-marquardt heuristic.
* Use an efficient recursion to calculate the GGN.
* Initialise each round of conjugated gradients with the final vector of the previous iteration.
* A new simpler termination criterion for CG. Terminate CG when the relative decrease in the objective falls below some threshold.
* Back-tracking of the CG solution. ie you store intermediate solutions to CG and only update if the new CG solution actually decreases the over all problem objective.

## Less Short Summary

### Hessian Free Optimisation in General
Hessian free optimisation is used when one wishes to optimise some objective $f(\theta)$ using second order methods but inversion or even computation of the Hessian is intractable or infeasible. The method is an iterative method and at each iteration, we take a second order approximation to the objective. i.e at iterantion n, we take a second order taylor expansion of $f$ to get:

$M^n(\theta) = f(\theta^n) + \nabla_{\theta}^Tf(\theta^n)(\theta - \theta^n) + (\theta - \theta^n)^TH(\theta - \theta^n) $

Where $H$ is the hessian matrix. If we minimise this second order approximation with respect to $\theta$ we would find that that $\theta^{n+1} = H^{-1}(-\nabla_{\theta}^Tf(\theta^n))$. However, inverting $H$ is usually not possible for even moderately sized neural networks.

There does however exist an efficient algorithm for calculating hessian vector products $Hv$ for any $v$. The insight of hessian-free optimisation is that one can solve linear problems of the form $Hx = v$ using only hessian vector products via the linear conjugated gradients algorithm. You therefore avoid the need to ever actually compute either the Hessian or its inverse.

To run vanilla hessian free all you need to do at each iteration is:

1) Calculate the gradient vector using standard backprop.

2) Calculate $H\theta$ product using an efficient recursion.

3) calculate the next update $\theta^{n+1} = ConjugatedGradients(H, -\nabla_{\theta}^Tf(\theta^n))$

The main contribution of this paper is to take the above algorithm and make the changes outlined in the very short summary.

## Take aways
Hessian-Free optimisation was perhaps the best method at the time of publication. Recently it seems that first order methods using per-parameter learning rates like ADAM or even learning-to-learn can outperform Hessian-Free. This is primarily because of the increased cost per iteration of Hessian Free. However it still seems that using curvature information if its available is beneficial though expensive.

More resent second order curvature appoximations like Kroeniker Factored Approximate Curvature (KFAC) and Kroeniker Factored Recursive Approximation (KFRA) are cheaper ways to achieve the same benefit.

arxiv.org
arxiv-vanity.com
scholar.google.com

On orthogonality and learning recurrent networks with long term dependencies
Eugene Vorontsov and Chiheb Trabelsi and Samuel Kadoury and Chris Pal
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, cs.NE
more

[link] Summary by Raza Habib 7 years ago

## Headline Summary
The problem of vanishing or exploding gradients when training recurrent neural networks can be partially overcome by enforcing an orthogonality constraint on the weight matrices between consecutive hidden states. This paper explores the trade-off between enforced orthogonality and the rate of learning by slowly increasing the extent to which the weight matrices are allowed to deviate from orthogonality.  They find that a hard orthogonality constraint can be too restrictive and perseveres gradients at the cost of slowing learning.

### Why orthogonal matrices are likely to help with exploding gradients

For a  neural network  with N hidden layers we can write the gradient of the loss wrt to the weight matrix between hidden states $W_i$ as: 

$ \frac{\partial L}{\partial W_i} =  \frac{\partial a_i}{\partial w_i} \prod_j^N (\frac{\partial h_i}{\partial a_i} \frac{\partial a_i}{\partial h_i}) \frac{\partial L}{\partial a_{N+1}} = \frac{\partial a_i}{\partial w_i} \prod_j^N (D_i W_{j+1}) \frac{\partial L}{\partial a_{N+1}}$

where L is the loss function and D is the jacobian of the activation function. It's in this long chain of products that the gradient tends to either vanish or explode. This problem is particularly bad in recurrent nets where the chain is the length of the unrolled computation graph.

The norm of the derivative $\|\frac{\partial h_i}{\partial a_i} \frac{\partial a_i}{\partial h_i} \|= \|D_i W_{j+1}\| \leq  \|D_i \| \| W_{j+1}\| \leq \lambda_D \lambda_W$ 

where $\lambda_{D/W}$ is the largest singular value of the given matrix. Its clear that if $\lambda$ is less than or greater than 1, that the norm of the gradient will grow or decay exponentially with length of the chain of products given above.

Constraining the weight matrices to be orthogonal ensures that the singular values are 1 and so perseveres the gradient norm.

### How to enforce the orthogonality constraints

The set of all orthogonal matrices from a continuos manifold known as the stiefel manifold. One way to ensure that the weight matrices remain orthogonal is to initialise them to be orthogonal and then to perform gradient optimisation on this manifold. The orthogonality of the matrix can be preserved by a Caley transformation outlined in the paper. 

In order to allow the weight  matrix to deviate slightly from the stiefel manifold, the authors took the SVD of $W = USV^T $ and parameterised the singular values. They used geodesic gradient descent to maintain the orthogonality of $U$ and $V$. The singular values were restricted to between $ 1 ^+_- m$ where  $m$ is a margin by parameterisation with a sigmoid.

### Experiment Summary

They experiment on a number of synthetic and real data-sets with elman and factorial RNNs. The synthetic tasks are the hochrieter memory and addition tasks, classifying sequential mnist and the real task is a language modelling task using the Penn Tree Bank dataset.

In the synthetic memory tasks they found that pure othogonalisation hindered performance but that a small margin away from orthogonality was better than no orthogonal constraint.

On the mnist tasks, the LSTM outperformed all other versions of simple RNNs but the simple RNNs with orthogonal constraints did quite well. Orthonal initialisation outperformed identity and glorot initialisation.

On the language modelling task with short sentences, orthogonal constraints were detrimental but on the longer sentences beneficial. They did not compare to LSTM. They also found that orthogonal initialisation was almost as good as orthogonal constraints during learning suggesting that preserving gradient flow is most important early in learning.

### My Two Cents
This paper explores an interesting approach to learning long term dependencies even with very simple RNNs. However, as theoretically appealing as the results may be, they still fail to match the memory capacities of an LSTM so for the time being it doesn't seem that there are any major practical take aways.

arxiv.org
scholar.google.com

An Empirical Analysis of Deep Network Loss Surfaces
Im, Daniel Jiwoong and Tao, Michael and Branson, Kristin
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Raza Habib 7 years ago

### Brief Summary:

Does what it says on the tin. The authors examine the loss surface of the Network-In-Network and VGG neural networks near the critical points found by 5 different stochastic optimisation algorithms. These algorithms are Adam, Adadelta, SGD, SGD with momentum and a new Runge-Kutta based optimiser. They examine the loss function by interpolating along lines between the initial and final parameter values.

They find that the different algorithms find qualitatively genuinely different types of minima. They find that the basin around the minima of ADAM is bigger than those for other optimisation algorithms. They find also that all of the minima are similarly good.

arxiv.org
arxiv-vanity.com
scholar.google.com

Generative Temporal Models with Memory
Mevlana Gemici and Chia-Chun Hung and Adam Santoro and Greg Wayne and Shakir Mohamed and Danilo J. Rezende and David Amos and Timothy Lillicrap
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, cs.NE
more

[link] Summary by Raza Habib 7 years ago

#### Very Brief Summary:

This paper combines stochastic variational inference with memory-augmented recurrent neural networks. The authors test 4 variants of their models against the Variational Recurrent Neural Network on 7 artificial tasks requiring long term memory. The reported log-likelihood lower bound is not obviously improved by the new models on all tasks but is slightly better on tasks requiring high capacity memory.

#### Slightly Less Brief Summary:

The authors propose a general class of generative models for time-series data with both deterministic and stochastic latents. The deterministic latents, $h_t$, evolve as a recurrent net with augmented memory and the stochastic latents, $z_t$ are gaussians whose mean and variance are a deterministic function of $h_t$. The observations at each time-step $x_t$ are also gaussians whose mean and variance are parametrised by a function of $h_{<t}, x_{<t}$.

#### Generative Temporal Models without Augmented Memory:

The family of generative temporal models is fairly broad and includes kalman filters, non-linear dynamical systems, hidden-markov models and switching state-space models. More recent non-linear models such as the variational RNN are most similar to the new models in this paper. In general all of the mentioned temporal models can be written as:

$P_\theta(x_{\leq T}, z_{\leq T} ) = \prod_t P_\theta(x_t | f_x(z_{\leq t}, x_{\leq t}))P_\theta(z_t | f_z(z_{\leq t}, x_{\leq t}))$

The differences between models then come from the the exact forms of $f_x$ and $f_z$ with most models making strong conditional independence assumptions and/or having linear dependence. For example in a Gaussian State Space model both $f_x$ and $f_z$ are linear, the latents form a first order Markov chain and the observations $x_t$ are conditionally independent of everything given $z_t$.

In the Variational Recurrent Neural Net (VRNN) an additional deterministic latent variable $h_t$ is introduced and at each time-step $x_t$ is the output of a VAE whose prior $z_t$ is conditioned on $h_t$. $h_t$ evolves as an RNN.

#### Types of Model with Augmented Memory:

This paper follows the same strategy as the VRNN but adds more structure to the underlying recurrent neural net. The authors motivate this by saying that the VRNN "scales poorly when higher capacity storage is required".

* "Introspective" Model: In the first augmented memory model, the deterministic latent M_t is simply a concatenation of the last $L$ latent stochastic variables $z_t$. A soft method of attention over the latent memory is used to generate a "memory context" vector at each time step. The observed output $x_t$ is a gaussian with mean and variance parameterised by the "memory context' and the stochastic latent $z_t$. Because this model does not learn to write to memory it is faster to train.

* In the later models the memory read and write operations are the same as those in the neural turing machine or differentiable neural computer.

#### My Two Cents:

In some senses this paper feels fairly inevitable since VAE's have already been married with RNNs and so it's a small leap to add augmented memory.

The actual read write operations introduced in the "introspective" model feel a little hacky and unprincipled.

The actual images generated are quite impressive.

I'd like to see how these kind of models do on language generation tasks and wether they can be adapted for question answering.

Raza Habib

sciscore: 2.8