Deep Learning: Explaining Optimization (Gradient Descent, Momentum, RMSprop, Adam)

1- Introduction

In the simple case of a supervised image classification task, a Deep Neural Network (DNN) can be seen as a function that maps the pixels of an image (x) to the corresponding class label (y). The DNN, or mapping function, is parameterized by a set of parameters W (the neural network weights).

The loss function measures how well the predictions of the network (based on W) agree with the ground truth labels in the training data.

The loss function enables us to quantify the quality of any particular set of weights W.

Iterative refinement: In practice, W is initialized at random and then updated iteratively in the direction that decreases the loss.

Optimization is the process of finding the set of parameters W that minimizes the loss function.

2- Gradient Descent

The gradient is a generalization of slope for functions that take a vector of numbers as input. The gradient is a vector of slopes (derivatives) for each dimension in the input space.

The gradient represents the direction in which the function has the steepest rate of increase.
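
For intuition, the gradient can be approximated numerically with finite differences. Below is a minimal sketch; the toy function f is purely illustrative:

import numpy as np

def numerical_gradient(f, w, h=1e-5):
    # Approximate each partial derivative with a centered finite difference
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += h
        w_minus[i] -= h
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * h)
    return grad

f = lambda w: w[0] ** 2 + 3 * w[1] ** 2               # toy loss function
print(numerical_gradient(f, np.array([1.0, 2.0])))    # approximately [2., 12.]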

How do we find the W that minimizes the loss, starting from a random initialization of W? We have to find a direction in which to update W, and a step size for that update.

  • Direction: The gradient tells us the slope of the loss function along every dimension. Accordingly, we can update W in the negative direction of the gradient, since we wish our loss function to decrease, not increase.
  • Step size: The step size (the learning rate) is one of the most important hyperparameters in training DNNs. A small learning rate makes consistent but very slow progress; a large learning rate moves faster but can overshoot the minimum and end up with a higher loss.

Gradient Descent optimization loop:

  1. Load a random sample of the training data: a mini-batch
  2. Forward the mini-batch through the DNN and compute the loss
  3. Compute the gradient of the loss w.r.t. W: weights_grad
  4. Update the weights: W = W - step_size * weights_grad

Note: In real applications, the training data can have millions of examples, so computing the full loss over the entire training set for a single parameter update takes a lot of time. A very common and practical solution is to compute the gradient over mini-batches (samples) of the training data. The mini-batch size is another hyperparameter, usually chosen based on memory constraints, with typical values such as 32, 64, or 128.
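
Below is a minimal runnable sketch of this loop in NumPy, using a toy one-dimensional linear regression problem; the data, model, and hyperparameters are illustrative assumptions:

import numpy as np

# Toy data: y = 3 * x + noise; we want to learn the single weight W
X = np.random.rand(1000, 1)
Y = 3 * X + 0.1 * np.random.randn(1000, 1)

W = np.random.randn(1, 1)      # random initialization
step_size = 0.1                # learning rate
batch_size = 32

for it in range(1000):
    idx = np.random.choice(len(X), batch_size)                      # 1. sample a mini-batch
    x_batch, y_batch = X[idx], Y[idx]
    pred = x_batch @ W                                              # 2. forward pass
    loss = np.mean((pred - y_batch) ** 2)                           #    mean squared error
    weights_grad = 2 * x_batch.T @ (pred - y_batch) / batch_size    # 3. gradient w.r.t. W
    W = W - step_size * weights_grad                                # 4. parameter update

print(W)   # should end up close to 3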

3- Gradient descent with momentum

In the following figure, the optimization process is slow because of the many oscillations made while updating the parameters toward the minimum of the loss.

[Figure: gradient descent updates oscillating on their way toward the minimum of the loss]


The momentum algorithm usually learns (converges) faster than standard gradient descent. The main idea is to compute an exponentially weighted average (EWA) of the gradients and use these averaged values to update the weights.

Exponentially weighted averages:

Exponentially weighted averages (EWA) are used to smooth noisy data based on a parameter beta, where the smoothed values represent averages over approximately 1 / (1 - beta) of the most recent entries. For example, given some arbitrary noisy data, here is how we can use EWA:

import numpy as np
import matplotlib.pyplot as plt

# Arbitrary noisy data: a sine wave plus Gaussian noise
x = np.linspace(0.0, 1.0, num=1000) * 10
noise = np.random.normal(scale=0.3, size=len(x))
y = np.sin(x) + noise

# EWA with beta = 0.9 (averages over roughly the last 10 points)
beta = 0.9
v = 0.0
ewa90 = []
for t in y:
    v = beta * v + (1 - beta) * t
    ewa90.append(v)

# EWA with beta = 0.98 (averages over roughly the last 50 points)
beta = 0.98
v = 0.0
ewa98 = []
for t in y:
    v = beta * v + (1 - beta) * t
    ewa98.append(v)

plt.figure(figsize=(11, 7))
plt.plot(x, y, 'o', alpha=0.3, ms=4)
plt.plot(x, ewa90)
plt.plot(x, ewa98)
plt.xlabel('Time')
plt.ylabel('Values')
plt.legend(['noisy data', 'beta = 0.9', 'beta = 0.98'], loc='lower right')
plt.show()
[Figure: noisy data with the EWA curves for beta = 0.9 and beta = 0.98]

As shown, a larger beta gives a smoother curve; however, the averaged curve depends more on the past and lags further behind the data.
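
Note that because v starts at 0, the first few averaged values are biased toward zero. A common correction, which also appears in the Adam paper, is to divide by (1 - beta^t); a minimal sketch continuing the example above:

beta = 0.98
v = 0.0
ewa98_corrected = []
for t, value in enumerate(y, start=1):
    v = beta * v + (1 - beta) * value
    ewa98_corrected.append(v / (1 - beta ** t))   # bias-corrected average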

Gradient descent with momentum applies this averaging to the gradients:

vdW = 0
on iteration t:
    compute dW on current mini-batch
    vdW = beta * vdW + (1 - beta) * dW   # EWA of the gradients
    W = W - learning_rate * vdW

Exponentially weighted averages are useful for optimizing gradient descent because they smooth out noisy gradients: oscillations in the gradient directions cancel out, which gives a faster and smoother path toward the minimum.

With the momentum update, the parameter vector builds up velocity in any direction that has a consistent gradient.
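
A minimal runnable sketch comparing plain gradient descent with momentum on a toy elongated quadratic loss; the loss and hyperparameters are illustrative assumptions:

import numpy as np

def grad(w):
    # Gradient of the toy loss L(w) = w[0]**2 + 20 * w[1]**2 (steep along w[1])
    return np.array([2.0 * w[0], 40.0 * w[1]])

def train(use_momentum, steps=50, lr=0.06, beta=0.9):
    w = np.array([3.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        dW = grad(w)
        v = beta * v + (1 - beta) * dW if use_momentum else dW
        w = w - lr * v
    return w

print(train(use_momentum=False))   # diverges: the steep direction oscillates out of control
print(train(use_momentum=True))    # converging: the EWA damps the oscillations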


4- RMSprop

RMSprop (Root Mean Square prop) is a very effective, but currently unpublished, adaptive learning rate method. Amusingly, it is typically cited as slide 29 of Lecture 6 of Geoff Hinton’s Coursera class. The main idea is to normalize the parameter update step, element-wise.

sdW = 0
on iteration t:
    compute dW on current mini-batch
    sdW = beta * sdW + (1 - beta) * dW ^ 2   # squaring is element-wise
    W = W - learning_rate * dW / (sqrt(sdW) + epsilon)   # epsilon avoids division by zero

Intuitively, the normalization (dividing by sqrt(sdW)) slows down updates in directions with large, oscillating gradients (large sdW), and speeds up updates in directions with small but consistent gradients (small sdW).
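
A minimal runnable sketch of this update on the same kind of toy elongated quadratic loss as above; the hyperparameters are illustrative:

import numpy as np

def grad(w):
    # Gradient of the toy loss L(w) = w[0]**2 + 20 * w[1]**2
    return np.array([2.0 * w[0], 40.0 * w[1]])

w = np.array([3.0, 1.0])
sdW = np.zeros_like(w)
lr, beta, eps = 0.05, 0.9, 1e-8

for _ in range(100):
    dW = grad(w)
    sdW = beta * sdW + (1 - beta) * dW ** 2       # EWA of squared gradients
    w = w - lr * dW / (np.sqrt(sdW) + eps)        # per-dimension normalized step

print(w)   # both dimensions end up close to the minimum at (0, 0)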

5- Adam optimization algorithm

Adam (Adaptive Moment Estimation) and RMSprop are among the optimization algorithms that have worked very well with many different DNN architectures and tasks. Adam simply puts RMSprop and momentum together.

vdW = 0
sdW = 0
on iteration t:
    compute dW on current mini-batch
    vdW = (beta1 * vdW) + (1 - beta1) * dW  # momentum
    sdW = (beta2 * sdW) + (1 - beta2) * dW ^ 2  # RMSprop
    W = W - learning_rate * vdW / (sqrt(sdW) + epsilon)

Recommended values in the paper are epsilon = 1e-8, beta1 = 0.9, beta2 = 0.999.
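
The full algorithm in the paper also bias-corrects vdW and sdW (dividing them by 1 - beta1^t and 1 - beta2^t respectively), since both start at zero. A minimal runnable sketch on a toy quadratic loss; the loss and values below are illustrative:

import numpy as np

def grad(w):
    # Gradient of the toy loss L(w) = w[0]**2 + 20 * w[1]**2
    return np.array([2.0 * w[0], 40.0 * w[1]])

w = np.array([3.0, 1.0])
vdW = np.zeros_like(w)
sdW = np.zeros_like(w)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    dW = grad(w)
    vdW = beta1 * vdW + (1 - beta1) * dW          # momentum (first moment)
    sdW = beta2 * sdW + (1 - beta2) * dW ** 2     # RMSprop (second moment)
    vdW_hat = vdW / (1 - beta1 ** t)              # bias correction
    sdW_hat = sdW / (1 - beta2 ** t)
    w = w - lr * vdW_hat / (np.sqrt(sdW_hat) + eps)

print(w)   # approaches the minimum at (0, 0)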


In practice, Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSprop.



Best Regards
