Variance Reduction for Stochastic Gradient Optimization. Wang, Chong; Chen, Xi; Smola, Alexander J.; Xing, Eric P. 2013.
Paper summary (nips, reviews). The authors propose to accelerate stochastic gradient optimization by reducing the variance of the noisy gradient estimate via the "control variate" trick, a standard variance reduction technique for Monte Carlo simulation (explained in [3], for example). A control variate is a vector that is, one hopes, highly correlated with the noisy gradient but whose expectation is easy to compute. Standard convergence rates for stochastic gradient optimization depend on the variance of the gradient estimates, so a variance reduction technique should yield faster convergence. The authors construct control variates from Taylor approximations of the gradient estimate, both for the optimization problem arising in regularized logistic regression and for MAP estimation in the latent Dirichlet allocation (LDA) model. For logistic regression on the covtype dataset, they compare constant step-size SGD with and without variance reduction, arguing that variance reduction permits larger step-sizes without the estimates becoming too noisy, and thus yields faster empirical convergence. For LDA, they compare the adaptive step-size stochastic optimization method of [10] with and without variance reduction, showing faster convergence in held-out test log-likelihood on three large corpora.
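The control-variate idea for logistic regression can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's exact construction: the per-example gradient is g_i = (sigmoid(x_i'w) - y_i) x_i, and the control variate h_i replaces the sigmoid with its first-order Taylor expansion around the mean activation, so that E[h] is computable in closed form from precomputed data moments. For simplicity the correction uses scaling a = 1 rather than the optimal scaling derived in the paper, and the dataset, dimensions, and current iterate w are all made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic logistic-regression data (hypothetical setup for illustration).
n, d = 5000, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = rng.normal(size=d) * 0.1  # current iterate, chosen arbitrarily

# Per-example gradients of the (unregularized) logistic loss:
#   g_i = (sigmoid(x_i^T w) - y_i) x_i
G = (sigmoid(X @ w) - y)[:, None] * X

# Control variate: first-order Taylor expansion of sigmoid(x^T w)
# around the mean activation z_bar = x_bar^T w.
x_bar = X.mean(axis=0)
z_bar = x_bar @ w
s = sigmoid(z_bar)
ds = s * (1.0 - s)                      # sigmoid'(z_bar)
H = (s + ds * (X @ w - z_bar) - y)[:, None] * X

# Its expectation over the data needs only precomputed moments
# E[x], E[x x^T], E[y x], so it is exact and cheap:
XX = X.T @ X / n
Eh = s * x_bar + ds * (XX @ w - z_bar * x_bar) - (y[:, None] * X).mean(axis=0)

# Variance-reduced estimator: g_i - (h_i - E[h]).
# Unbiased, since h_i - E[h] has zero mean over the data.
G_cv = G - (H - Eh)

var_plain = G.var(axis=0).sum()
var_cv = G_cv.var(axis=0).sum()
print(var_plain, var_cv)
```

Because the Taylor expansion tracks the true sigmoid closely near the mean activation, h_i is strongly correlated with g_i and the corrected estimates have markedly lower variance, while their mean (and hence the SGD update in expectation) is unchanged.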