Training Deep and Recurrent Networks with Hessian-Free Optimization on ShortScience.org


Training Deep and Recurrent Networks with Hessian-Free Optimization
Martens, James and Sutskever, Ilya
Springer Neural Networks: Tricks of the Trade (2nd ed.) - 2012 via Local Bibsonomy
Keywords: dblp

Summary by Raza Habib 7 years ago

## Very Short Summary
The authors introduce a number of modifications to traditional Hessian-free optimisation that make the method work better for neural networks. The modifications are:
* Use the Generalised Gauss-Newton matrix (GGN) rather than the Hessian.
* Damp the GGN so that $G' = G + \lambda I$ and adjust $\lambda$ using the Levenberg-Marquardt heuristic.
* Use an efficient recursion to calculate GGN-vector products, so the GGN itself is never formed explicitly.
* Initialise each round of conjugate gradients (CG) with the final solution vector of the previous round.
* A new, simpler termination criterion for CG: terminate CG when the relative decrease in the quadratic objective it is minimising falls below some threshold.
* Back-tracking of the CG solution, i.e. you store intermediate solutions to CG and only update if the new CG solution actually decreases the overall problem objective.
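
As a rough illustration of the damping bullet, the following is a minimal sketch of a Levenberg-Marquardt style update for $\lambda$. The function name, its arguments, and the thresholds (1/4 and 3/4) and scaling factors (3/2 and 2/3) are assumptions chosen for illustration; they are values commonly quoted for this heuristic, not something stated in this summary.

```python
def update_damping(lam, f_old, f_new, model_decrease, boost=1.5, drop=2.0 / 3.0):
    """Adjust the damping strength lam from how well the quadratic model predicted reality.

    f_old, f_new:   true objective before and after the proposed step.
    model_decrease: reduction predicted by the quadratic model, assumed positive.
    boost, drop:    illustrative scaling factors (assumptions, not from the summary).
    """
    rho = (f_old - f_new) / model_decrease  # reduction ratio: actual / predicted
    if rho < 0.25:
        return lam * boost  # model was too optimistic: damp more
    if rho > 0.75:
        return lam * drop   # model was accurate: damp less
    return lam              # otherwise leave lam unchanged


# Example: the model predicted a decrease of 1.0 but the objective fell by only 0.1.
new_lam = update_damping(lam=1.0, f_old=10.0, f_new=9.9, model_decrease=1.0)
```

A larger $\lambda$ makes $G + \lambda I$ closer to a scaled identity, so the step shrinks towards a small gradient-descent step when the quadratic model cannot be trusted.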

## Less Short Summary

### Hessian Free Optimisation in General
Hessian-free optimisation is used when one wishes to optimise some objective $f(\theta)$ using second-order methods but inversion or even computation of the Hessian is intractable or infeasible. The method is iterative, and at each iteration we take a second-order approximation to the objective, i.e. at iteration $n$ we take a second-order Taylor expansion of $f$ around $\theta^n$ to get:

$M^n(\theta) = f(\theta^n) + \nabla_{\theta}^Tf(\theta^n)(\theta - \theta^n) + \frac{1}{2}(\theta - \theta^n)^TH(\theta - \theta^n) $

where $H$ is the Hessian matrix evaluated at $\theta^n$. Setting the gradient of this approximation with respect to $\theta$ to zero gives $H(\theta - \theta^n) = -\nabla_{\theta}f(\theta^n)$, so (when $H$ is positive definite) the minimiser is $\theta^{n+1} = \theta^n - H^{-1}\nabla_{\theta}f(\theta^n)$. However, forming and inverting $H$ is usually not possible for even moderately sized neural networks.

There does, however, exist an efficient algorithm for calculating Hessian-vector products $Hv$ for any $v$. The insight of Hessian-free optimisation is that one can solve linear systems of the form $Hx = v$ using only Hessian-vector products via the linear conjugate gradient (CG) algorithm. You therefore avoid the need to ever actually compute either the Hessian or its inverse.
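
To make the "only matrix-vector products" point concrete, here is a minimal sketch of linear CG in NumPy. The function name `conjugate_gradient`, its parameters, and the small test matrix `A` are assumptions made for illustration; the solver only ever calls `matvec`, so the matrix could equally be an implicit Hessian or GGN.

```python
import numpy as np


def conjugate_gradient(matvec, b, x0=None, max_iters=50, tol=1e-10):
    """Approximately solve A x = b for symmetric positive definite A,
    touching A only through matvec(v) = A @ v."""
    x = np.zeros_like(b, dtype=float) if x0 is None else np.array(x0, dtype=float)
    r = b - matvec(x)  # residual of the linear system
    p = r.copy()       # current search direction
    rs = r @ r         # squared residual norm
    for _ in range(max_iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)          # exact minimiser along direction p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p      # next direction, conjugate to the previous ones
        rs = rs_new
    return x


# Toy usage: A is explicit here only so that the result can be checked.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: A @ v, b)
```

The optional `x0` argument is where a warm start, such as the solution from the previous round of CG, would be passed in.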

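To run vanilla Hessian-free optimisation, all you need to do at each iteration is:

1) Calculate the gradient vector using standard backprop.

2) Calculate Hessian-vector products $Hv$ using an efficient recursion, one for each vector $v$ that CG requests.

3) Calculate the next iterate $\theta^{n+1} = \theta^n + \text{ConjugateGradients}(H, -\nabla_{\theta}f(\theta^n))$, where CG approximately solves $H\delta = -\nabla_{\theta}f(\theta^n)$ for the step $\delta$.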
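
The three steps can be sketched end to end on a toy problem. Everything below is an illustrative assumption rather than the paper's setup: the small convex objective `f`, its hand-written gradient (standing in for backprop), and in particular `hvp`, which approximates $Hv$ by a central difference of gradients instead of the exact recursion mentioned above.

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])


def f(theta):
    # Toy convex objective (an assumption for this sketch).
    return 0.5 * theta @ A @ theta - b @ theta + 0.25 * np.sum(theta ** 4)


def grad(theta):
    # Step 1: gradient of f, written by hand in place of backprop.
    return A @ theta - b + theta ** 3


def hvp(theta, v, eps=1e-5):
    # Step 2: Hessian-vector product via central differences of the gradient.
    # This is a stand-in for an exact Hv recursion; H is never formed.
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    s = eps / norm
    return (grad(theta + s * v) - grad(theta - s * v)) / (2.0 * s)


def cg(matvec, rhs, iters=10, tol=1e-16):
    # Linear CG started from zero, using only matrix-vector products.
    x = np.zeros_like(rhs)
    r = rhs.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


theta = np.zeros(2)
for n in range(8):
    g = grad(theta)
    # Step 3: solve H delta = -g with CG, then move to the next iterate.
    delta = cg(lambda v: hvp(theta, v), -g)
    theta = theta + delta
```

This vanilla loop has no damping, backtracking, or warm starts, and it relies on the toy Hessian being positive definite everywhere; those gaps are what the paper's modifications address.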

The main contribution of this paper is to take the above algorithm and make the changes outlined in the very short summary.

## Takeaways
Hessian-free optimisation was perhaps the best method at the time of publication. More recently, it seems that first-order methods using per-parameter learning rates, like Adam, or even learning-to-learn approaches can outperform Hessian-free. This is primarily because of the increased cost per iteration of Hessian-free. However, it still seems that using curvature information, if it is available, is beneficial though expensive.

More recent second-order curvature approximations like Kronecker-Factored Approximate Curvature (KFAC) and Kronecker Factored Recursive Approximation (KFRA) are cheaper ways to achieve the same benefit.
