In my previous article on Gradient Descent Optimizers, we discussed three types of Gradient Descent algorithms:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini Batch Gradient Descent
In this article, we will see some advanced versions of Gradient Descent which can be categorized as:
1. Momentum based (Nesterov Momentum)
2. Based on adaptive learning rate (Adagrad, Adadelta, RMSprop)
3. Combination of momentum and adaptive learning rate (Adam)
Let's first understand momentum.
Momentum
Momentum helps accelerate SGD in the relevant direction. So, it's a good idea to maintain a momentum term for every parameter. It has the following advantages (a minimal sketch of the update rule follows this list):
1. Helps escape local minima: As momentum builds up speed and hence increases the step size, the optimizer is less likely to get trapped in shallow local minima.
2. Faster convergence: Momentum speeds up convergence because the accumulated speed increases the step size along consistent gradient directions.
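To make this concrete, here is a minimal sketch of the classic momentum update in plain NumPy. The learning rate, momentum coefficient, and the toy objective f(w) = w² are illustrative choices, not values from the article.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, mu=0.9):
    """One SGD-with-momentum update (illustrative sketch).

    v is a velocity that accumulates an exponentially decaying sum of
    past gradients, so steps grow along consistent gradient directions.
    """
    v = mu * v - lr * grad   # build up speed from past and current gradients
    w = w + v                # move the parameters along the velocity
    return w, v

# Toy usage: minimize f(w) = w^2, whose gradient is 2w.
w, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=2 * w)
print(w)  # w has moved close to the minimum at 0
```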
Now, let's see some flavors of SGD.
1. Nesterov Momentum
Nesterov momentum uses the current momentum to approximate the next position, and then calculates the gradient with respect to that approximated position instead of the current position. This look-ahead prevents the optimizer from moving too fast and makes it more responsive, which noticeably improves the performance of SGD.
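A minimal sketch of the look-ahead step, using the same illustrative hyperparameters as before; grad_fn is a hypothetical helper that returns the gradient at a given point.

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.01, mu=0.9):
    """One Nesterov-momentum update (illustrative sketch)."""
    lookahead = w + mu * v    # approximate the next position using the current momentum
    g = grad_fn(lookahead)    # gradient w.r.t. the approximated position, not the current one
    v = mu * v - lr * g
    w = w + v
    return w, v

# Toy usage on f(w) = w^2, whose gradient is 2w.
w, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn=lambda x: 2 * x)
```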
2. Adagrad
Adagrad focuses mainly on an adaptive learning rate rather than momentum.
In standard SGD, the learning rate is constant, which means we move at the same speed regardless of the slope. That is impractical in real problems.
What if we know we should slow down or speed up? What if we know we should accelerate more in one direction and decelerate in another? That is not possible with standard SGD.
Adagrad keeps updating the learning rate instead of using a constant one. It accumulates the sum of the squares of all past gradients and uses that sum to normalize the learning rate, so the effective learning rate for each parameter becomes smaller or larger depending on how its past gradients behaved.
It adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.
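Here is a minimal per-parameter sketch of the Adagrad update; the learning rate and epsilon are common illustrative defaults, not prescribed values.

```python
import numpy as np

def adagrad_step(w, cache, grad, lr=0.01, eps=1e-8):
    """One Adagrad update (illustrative sketch).

    cache holds the running sum of squared gradients for each parameter;
    dividing by its square root gives every parameter its own effective
    learning rate: small for frequently updated parameters, larger for
    rarely updated ones.
    """
    cache = cache + grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```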
2A. AdaDelta and RMSprop
AdaDelta and RMSprop are extensions of Adagrad.
As discussed in the Adagrad section, Adagrad accumulates the sum of the squares of all past gradients and uses it to normalize the learning rate. Because this accumulated sum only grows, the effective learning rate keeps decreasing, until at some point learning almost stops.
To handle this issue, AdaDelta and RMSprop decay the accumulated past gradients so that only a portion of them is considered: instead of summing all past squared gradients, they keep an exponentially decaying moving average.
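A minimal RMSprop-style sketch of that idea: the only change from the Adagrad sketch above is replacing the ever-growing sum with an exponentially decaying moving average (rho and the other constants are illustrative defaults).

```python
import numpy as np

def rmsprop_step(w, avg_sq, grad, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update (illustrative sketch).

    avg_sq is an exponentially decaying moving average of squared
    gradients, so the effective learning rate no longer shrinks
    toward zero as training goes on.
    """
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq
```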
3. Adam
Adam is one of the most widely used Gradient Descent optimizers. It combines the strengths of both momentum and adaptive learning rates; in other words, Adam is RMSprop (or AdaDelta) with momentum. It maintains a momentum term and also normalizes the learning rate using a moving average of squared gradients.
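A minimal sketch of one Adam step, combining the two ideas above; beta1, beta2, and the other constants are the commonly cited defaults and are used here purely for illustration.

```python
import numpy as np

def adam_step(w, m, v, t, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (illustrative sketch).

    m is a moving average of gradients (the momentum part) and v is a
    moving average of squared gradients (the adaptive-learning-rate
    part); both are bias-corrected for the first few steps.
    """
    t = t + 1
    m = beta1 * m + (1 - beta1) * grad        # momentum term
    v = beta2 * v + (1 - beta2) * grad ** 2   # squared-gradient term
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v, t
```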
Conclusion: Most of the above Gradient Descent methods are already implemented in popular Deep Learning frameworks like TensorFlow, Keras, and Caffe. However, Adam is currently a recommended default, as it combines both momentum and adaptive learning rates.
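For example, in Keras the optimizers discussed above can be selected when compiling a model; the tiny model below is only an illustrative placeholder.

```python
import tensorflow as tf

# A placeholder model just to show where the optimizer plugs in.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Adam as the optimizer; the other optimizers discussed above are
# available under tf.keras.optimizers as well, e.g.
#   tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
#   tf.keras.optimizers.RMSprop(learning_rate=0.001)
#   tf.keras.optimizers.Adagrad(learning_rate=0.01)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")
```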
For more details on the above algorithms, I strongly recommend this and this article.