Adam: A Method for Stochastic OptimizationAdam: A Method for Stochastic OptimizationKingma, Diederik P. and Ba, Jimmy2014
Paper summaryjoecohenAdam is like RMSProp with momentum. The (simplified) update [[Stanford CS231n]](https://cs231n.github.io/neural-networks-3/#ada) looks as follows:
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)
* They suggest a new stochastic optimization method, similar to the existing SGD, Adagrad or RMSProp.
* Stochastic optimization methods have to find parameters that minimize/maximize a stochastic function.
* A function is stochastic (non-deterministic), if the same set of parameters can generate different results. E.g. the loss of different mini-batches can differ, even when the parameters remain unchanged. Even for the same mini-batch the results can change due to e.g. dropout.
* Their method tends to converge faster to optimal parameters than the existing competitors.
* Their method can deal with non-stationary distributions (similar to e.g. SGD, Adadelta, RMSProp).
* Their method can deal with very sparse or noisy gradients (similar to e.g. Adagrad).
* Basic principle
* Standard SGD just updates the parameters based on `parameters = parameters - learningRate * gradient`.
* Adam operates similar to that, but adds more "cleverness" to the rule.
* It assumes that the gradient values have means and variances and tries to estimate these values.
* Recall here that the function to optimize is stochastic, so there is some randomness in the gradients.
* The mean is also called "the first moment".
* The variance is also called "the second (raw) moment".
* Then an update rule very similar to SGD would be `parameters = parameters - learningRate * means`.
* They instead use the update rule `parameters = parameters - learningRate * means/sqrt(variances)`.
* They call `means/sqrt(variances)` a 'Signal to Noise Ratio'.
* Basically, if the variance of a specific parameter's gradient is high, it is pretty unclear how it should be changend. So we choose a small step size in the update rule via `learningRate * mean/sqrt(highValue)`.
* If the variance is low, it is easier to predict how far to "move", so we choose a larger step size via `learningRate * mean/sqrt(lowValue)`.
* Exponential moving averages
* In order to approximate the mean and variance values you could simply save the last `T` gradients and then average the values.
* That however is a pretty bad idea, because it can lead to high memory demands (e.g. for millions of parameters in CNNs).
* A simple average also has the disadvantage, that it would completely ignore all gradients before `T` and weight all of the last `T` gradients identically. In reality, you might want to give more weight to the last couple of gradients.
* Instead, they use an exponential moving average, which fixes both problems and simply updates the average at every timestep via the formula `avg = alpha * avg + (1 - alpha) * avg`.
* Let the gradient at timestep (batch) `t` be `g`, then we can approximate the mean and variance values using:
* `mean = beta1 * mean + (1 - beta1) * g`
* `variance = beta2 * variance + (1 - beta2) * g^2`.
* `beta1` and `beta2` are hyperparameters of the algorithm. Good values for them seem to be `beta1=0.9` and `beta2=0.999`.
* At the start of the algorithm, `mean` and `variance` are initialized to zero-vectors.
* Bias correction
* Initializing the `mean` and `variance` vectors to zero is an easy and logical step, but has the disadvantage that bias is introduced.
* E.g. at the first timestep, the mean of the gradient would be `mean = beta1 * 0 + (1 - beta1) * g`, with `beta1=0.9` then: `mean = 0.9 * g`. So `0.9g`, not `g`. Both the mean and the variance are biased (towards 0).
* This seems pretty harmless, but it can be shown that it lowers the convergence speed of the algorithm by quite a bit.
* So to fix this pretty they perform bias-corrections of the mean and the variance:
* `correctedMean = mean / (1-beta1^t)` (where `t` is the timestep).
* `correctedVariance = variance / (1-beta2^t)`.
* Both formulas are applied at every timestep after the exponential moving averages (they do not influence the next timestep).