by **Diederik P. Kingma & Jimmy Lei Ba**

ArXiv, 2015

Adam is a stochastic gradient descent algorithm based on estimation of 1st and 2nd-order moments. The algorithm estimates 1st-order moment (the gradient mean) and 2nd-order moment (element-wise squared gradient) of the gradient using exponential moving average, and corrects its bias. The final weight update is proportional to learning rate times 1st-order moment divided by the square root of 2nd-order moment.

The Adam algorithm proposed in this paper is closely related to another stochastic optimization algorithm called RMSprop with Momentum. RMSprop (unpublished, mentioned in G. E. Hinton’s *Neural Network for Machine Learning* course on Coursera [1]) is a variant of SGD that divides gradients in the current timestep by the square root of the exponentially averaged square of the gradient (2nd-order moment), and is essentially the mini-batch version of using only the sign of the gradient. RMSprop with Momentum (also unpublished) is a modified version of RMSprop, that divides exponentially averaged gradients (1st-order moment, or equivalently, gradient with momentum) by 2nd-order moment. The main difference between Adam and RMSprop with Momentum is that Adam has a bias correction step. Also, in the supplement for Convolutional Color Constancy by Jon Barron [2] introduces a Stochastic gradient descent variation that is similar to Adam plus an adaptive learning rate, where hyperparameters are re-parameterized by the exponential half-life instead of decay multipliers.

Adam takes 3 hyperparameters: the learning rate, the decay rate of 1st-order moment, and the decay rate of 2nd-order moment. It is shown in the paper that the final weight update is approximately bounded by the learning rate, thus making it relatively easier to choose the scale of the learning rate, especially in some scenarios where the region of optimal solution can be estimated in advance. The exponential moving average is biased towards zero when starting from zero, so it is divided by a term based on the decay rate to get an unbiased estimate.

In summary, each update of Adam involves the following steps:

- Compute the gradient and its element-wise square using the current parameters.
- Update the exponential moving average of the 1st-order moment and the 2nd-order moment.
- Compute an unbiased average of the 1st-order moment and 2nd-order moment.
- Compute weight update: 1st-order moment unbiased average divided by the square root of 2nd-order moment unbiased average (and scale by learning rate).
- Apply update to the weights.

In the implementation, the steps above are slightly simplified into fewer steps.

The paper provides convergence analysis and shows that the regret bound is $O(\sqrt{T})$ on a sequence of T convex functions, which is comparable to the best known bound for general convex online learning problems. Also, experiments using multi-class logistic regression and neural net classifiers show good convergence behavior. (However, these results are obtained on relatively small datasets such as MNIST and CIFAR-10 with relatively shallow models).

Since the main difference between Adam and RMSprop with Momentum is bias-correction, the paper also studies the effect of bias-correction. In Section 6.4, the paper empirically shows that bias-correction does matter in Adam, as it outperforms RMSprop with Momentum across different hyperparameter settings.

Finally, the paper also proposes AdaMax, replacing 2nd-order moment (L2-norm) with infinite-norm, but does not provide experimental results on AdaMax.

Adam has been used in some recent visual attention papers such as *DRAW* [3] and *Show, Attend and Tell* [4], and shows good results. However, the Adam paper does not provide results on large-scale deep learning tasks such as ImageNet classification, so it remains to determine whether Adam works well on such tasks.

Adam is implemented in Caffe [5], and has been used by several computer vision researchers at Berkeley, who anecdotally reported that one advantage of such solvers (Adam, RMSprop with momentum, etc.) is that in practice they require less tweaking of learning rate. In many scenarios, even constant learning rate works well. However, this may be subject to the training task.

**References:**

[1] https://www.coursera.org/course/neuralnets

[2] http://www.cs.berkeley.edu/~barron/BarronICCV2015_supp.pdf

[3] K. Gregor, A. Graves, and W. G. Com, “DRAW: A Recurrent Neural Network For Image Generation,” arXiv:1502.04623, 2014.

[4] K. Xu, A. Courville, R. S. Zemel, and Y. Bengio, “Show, Attend and Tell : Neural Image Caption Generation with Visual Attention,” arXiv:1502.03044, 2015.

[5] http://caffe.berkeleyvision.org/