Deep Learning: Explaining Optimization (Gradient Descent, Momentum, RMSprop, Adam)

1- Introduction

In the simple case of a supervised image classification task, a Deep Neural Network (DNN) can be seen as a function that maps the pixels of an image (x) to the corresponding class label (y). The DNN, or mapping function, is parameterized by a set of parameters W (the neural network weights).

The loss function measures how well the predictions of the network (based on W) agree with the ground truth labels in the training data.

The loss function enables us to quantify the quality of any particular set of weights W.

Iterative refinement: In practice, W is initialized at random and then updated iteratively in the direction that decreases the loss.

Optimization is the process of finding the set of parameters W that minimizes the loss function.

2- Gradient Descent

The gradient is a generalization of slope for functions that take a vector of numbers as input. The gradient is a vector of slopes (derivatives) for each dimension in the input space.

The gradient represents the direction in which the function has the steepest rate of increase.
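
For intuition, the gradient can be approximated numerically with finite differences. Below is a minimal sketch; the toy function f is purely illustrative:

import numpy as np

def numerical_gradient(f, w, h=1e-5):
    # Approximate each partial derivative with a centered finite difference
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += h
        w_minus[i] -= h
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * h)
    return grad

f = lambda w: w[0] ** 2 + 3 * w[1] ** 2               # toy loss function
print(numerical_gradient(f, np.array([1.0, 2.0])))    # approximately [2., 12.]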

How do we find the W that minimizes the loss, starting from a random initialization of W? We have to find a direction in which to update W, and a step size for that update.

  • Direction: The gradient tells us the slope of the loss function along every dimension. Accordingly, we can update W in the negative direction of the gradient, since we wish our loss function to decrease, not increase.
  • Step size: The step size (the learning rate) is one of the most important hyperparameters in training DNNs. A small learning rate makes consistent but very slow progress; a large learning rate moves faster but can overshoot the minimum and end up with a higher loss.

Gradient Descent optimization loop:

  1. Load a random sample of the training data: a mini-batch
  2. Forward the mini-batch through the DNN and compute the loss
  3. Compute the gradient of the loss w.r.t. W: weights_grad
  4. Update the weights: W = W - step_size * weights_grad

Note: In real applications, the training data can have millions of examples, so computing the full loss over the entire training set for a single parameter update takes a lot of time. A very common and practical solution is to compute the gradient over mini-batches (samples) of the training data. The mini-batch size is another hyperparameter, usually chosen based on memory constraints, with typical values such as 32, 64, or 128.
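
Below is a minimal runnable sketch of this loop in NumPy, using a toy one-dimensional linear regression problem; the data, model, and hyperparameters are illustrative assumptions:

import numpy as np

# Toy data: y = 3 * x + noise; we want to learn the single weight W
X = np.random.rand(1000, 1)
Y = 3 * X + 0.1 * np.random.randn(1000, 1)

W = np.random.randn(1, 1)      # random initialization
step_size = 0.1                # learning rate
batch_size = 32

for it in range(1000):
    idx = np.random.choice(len(X), batch_size)                      # 1. sample a mini-batch
    x_batch, y_batch = X[idx], Y[idx]
    pred = x_batch @ W                                              # 2. forward pass
    loss = np.mean((pred - y_batch) ** 2)                           #    mean squared error
    weights_grad = 2 * x_batch.T @ (pred - y_batch) / batch_size    # 3. gradient w.r.t. W
    W = W - step_size * weights_grad                                # 4. parameter update

print(W)   # should end up close to 3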

3- Gradient descent with momentum

In the following figure, the optimization process is slow because of the many oscillations made while updating the parameters toward the minimum of the loss.

[Figure: gradient descent updates oscillating on their way toward the minimum of the loss]


The momentum algorithm usually learns (converges) faster than standard gradient descent. The main idea is to compute an exponentially weighted average (EWA) of the gradients and use these averaged values to update the weights.

Exponentially weighted averages:

Exponentially weighted averages (EWA) are used to smooth noisy data based on a parameter beta, where the smoothed values represent averages over approximately 1 / (1 - beta) of the most recent entries. For example, given some arbitrary noisy data, here is how we can use EWA:

import numpy as np
import matplotlib.pyplot as plt

# Arbitrary noisy data: a sine wave plus Gaussian noise
x = np.linspace(0.0, 1.0, num=1000) * 10
noise = np.random.normal(scale=0.3, size=len(x))
y = np.sin(x) + noise

# EWA with beta = 0.9 (averages over roughly the last 10 points)
beta = 0.9
v = 0.0
ewa90 = []
for t in y:
    v = beta * v + (1 - beta) * t
    ewa90.append(v)

# EWA with beta = 0.98 (averages over roughly the last 50 points)
beta = 0.98
v = 0.0
ewa98 = []
for t in y:
    v = beta * v + (1 - beta) * t
    ewa98.append(v)

plt.figure(figsize=(11, 7))
plt.plot(x, y, 'o', alpha=0.3, ms=4)
plt.plot(x, ewa90)
plt.plot(x, ewa98)
plt.xlabel('Time')
plt.ylabel('Values')
plt.legend(['noisy data', 'beta = 0.9', 'beta = 0.98'], loc='lower right')
plt.show()
[Figure: noisy data with the EWA curves for beta = 0.9 and beta = 0.98]

As shown, a larger beta gives a smoother curve; however, the averaged curve depends more on the past and lags further behind the data.
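
Note that because v starts at 0, the first few averaged values are biased toward zero. A common correction, which also appears in the Adam paper, is to divide by (1 - beta^t); a minimal sketch continuing the example above:

beta = 0.98
v = 0.0
ewa98_corrected = []
for t, value in enumerate(y, start=1):
    v = beta * v + (1 - beta) * value
    ewa98_corrected.append(v / (1 - beta ** t))   # bias-corrected average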

Gradient descent with momentum applies this averaging to the gradients:

vdW = 0
on iteration t:
    compute dW on current mini-batch
    vdW = beta * vdW + (1 - beta) * dW   # EWA of the gradients
    W = W - learning_rate * vdW

Exponentially weighted averages are useful for optimizing gradient descent because they smooth out noisy gradients: oscillations in the gradient directions cancel out, which gives a faster and smoother path toward the minimum.

With the momentum update, the parameter vector builds up velocity in any direction that has a consistent gradient.
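
A minimal runnable sketch comparing plain gradient descent with momentum on a toy elongated quadratic loss; the loss and hyperparameters are illustrative assumptions:

import numpy as np

def grad(w):
    # Gradient of the toy loss L(w) = w[0]**2 + 20 * w[1]**2 (steep along w[1])
    return np.array([2.0 * w[0], 40.0 * w[1]])

def train(use_momentum, steps=50, lr=0.06, beta=0.9):
    w = np.array([3.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        dW = grad(w)
        v = beta * v + (1 - beta) * dW if use_momentum else dW
        w = w - lr * v
    return w

print(train(use_momentum=False))   # diverges: the steep direction oscillates out of control
print(train(use_momentum=True))    # converging: the EWA damps the oscillations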


4- RMSprop

RMSprop (Root Mean Square prop) is a very effective, but currently unpublished, adaptive learning rate method. Amusingly, it is typically cited as slide 29 of Lecture 6 of Geoff Hinton’s Coursera class. The main idea is to normalize the parameter update step, element-wise.

sdW = 0
on iteration t:
    compute dW on current mini-batch
    sdW = beta * sdW + (1 - beta) * dW ^ 2   # squaring is element-wise
    W = W - learning_rate * dW / (sqrt(sdW) + epsilon)   # epsilon avoids division by zero

Intuitively, the normalization (dividing by sqrt(sdW)) slows down updates in directions with large, oscillating gradients (large sdW), and speeds up updates in directions with small but consistent gradients (small sdW).
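
A minimal runnable sketch of this update on the same kind of toy elongated quadratic loss as above; the hyperparameters are illustrative:

import numpy as np

def grad(w):
    # Gradient of the toy loss L(w) = w[0]**2 + 20 * w[1]**2
    return np.array([2.0 * w[0], 40.0 * w[1]])

w = np.array([3.0, 1.0])
sdW = np.zeros_like(w)
lr, beta, eps = 0.05, 0.9, 1e-8

for _ in range(100):
    dW = grad(w)
    sdW = beta * sdW + (1 - beta) * dW ** 2       # EWA of squared gradients
    w = w - lr * dW / (np.sqrt(sdW) + eps)        # per-dimension normalized step

print(w)   # both dimensions end up close to the minimum at (0, 0)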

5- Adam optimization algorithm

Adam (Adaptive Moment Estimation) and RMSprop are among the optimization algorithms that have worked very well with many different DNN architectures and tasks. Adam simply puts RMSprop and momentum together.

vdW = 0
sdW = 0
on iteration t:
    compute dW on current mini-batch
    vdW = (beta1 * vdW) + (1 - beta1) * dW  # momentum
    sdW = (beta2 * sdW) + (1 - beta2) * dW ^ 2  # RMSprop
    W = W - learning_rate * vdW / (sqrt(sdW) + epsilon)

Recommended values in the paper are epsilon = 1e-8, beta1 = 0.9, beta2 = 0.999.
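
The full algorithm in the paper also bias-corrects vdW and sdW (dividing them by 1 - beta1^t and 1 - beta2^t respectively), since both start at zero. A minimal runnable sketch on a toy quadratic loss; the loss and values below are illustrative:

import numpy as np

def grad(w):
    # Gradient of the toy loss L(w) = w[0]**2 + 20 * w[1]**2
    return np.array([2.0 * w[0], 40.0 * w[1]])

w = np.array([3.0, 1.0])
vdW = np.zeros_like(w)
sdW = np.zeros_like(w)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    dW = grad(w)
    vdW = beta1 * vdW + (1 - beta1) * dW          # momentum (first moment)
    sdW = beta2 * sdW + (1 - beta2) * dW ** 2     # RMSprop (second moment)
    vdW_hat = vdW / (1 - beta1 ** t)              # bias correction
    sdW_hat = sdW / (1 - beta2 ** t)
    w = w - lr * vdW_hat / (np.sqrt(sdW_hat) + eps)

print(w)   # approaches the minimum at (0, 0)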


In practice, Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSprop.



Best Regards
