PyTorch: Gradient Descent, Stochastic Gradient Descent and Mini Batch Gradient Descent (Code included)

In this article we use PyTorch's automatic differentiation and dynamic computational graph to implement and evaluate different gradient descent methods.

PyTorch is an open source machine learning framework that accelerates the path from research to production. It is gaining popularity for its simplicity, ease of use, dynamic computational graph and efficient memory usage. PyTorch builds computational graphs dynamically as operations are executed and performs automatic differentiation on those graphs through its autograd engine.
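
As a quick illustration of autograd (a minimal sketch with made-up values, not part of the regression example that follows): the graph is built as the expression is evaluated, and backward() fills in the gradients.

import torch

w = torch.tensor(2.0, requires_grad=True)   # a parameter we want gradients for
inp = torch.tensor(3.0)
loss = (w * inp - 1.0) ** 2                 # the computational graph is built dynamically here
loss.backward()                             # autograd traverses the graph and computes gradients
print(w.grad)                               # d(loss)/dw = 2*(w*inp - 1)*inp = tensor(30.)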

Gradient Descent (GD) is an optimization method that updates the parameters of a model (e.g. a deep neural network) using the gradients of an objective function with respect to those parameters.

[Figure: a visualization of a saddle point in the optimization landscape. More information can be found in Stanford's CS231n course notes.]
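
Concretely, one gradient descent step moves every parameter a small distance against its gradient: theta <- theta - lr * dL/dtheta. The snippet below is a minimal hand-written sketch of what an optimizer's step() does; model, loss and learning_rate are placeholder names, not code from this article.

learning_rate = 0.01
loss.backward()                          # compute gradients of the loss w.r.t. all parameters
with torch.no_grad():                    # update parameters without tracking the update itself
    for param in model.parameters():
        param -= learning_rate * param.grad
model.zero_grad()                        # reset gradients before the next step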

Generally there are three main scenarios:

  1. Gradient Descent (GD)
  2. Stochastic Gradient Descent(SGD)
  3. Mini Batch Gradient Descent (Mini Batch GD)

Experimental Setup

In this article, a simple regression example is used to illustrate the difference between these scenarios. We generate some artificial data and train a neural network to approximate the function that describes the training data by optimizing its parameters.

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

# 100 points in [-1, 1], reshaped to (100, 1) so each row is one sample
x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)
# noisy quadratic target: y = x^2 + noise
y = x.pow(2) + 0.2*torch.rand(x.size())
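
To reproduce the scatter plot of the training data, a minimal plotting sketch (using the matplotlib import above):

plt.scatter(x.numpy(), y.numpy(), s=10)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Noisy quadratic training data')
plt.show()
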
class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)    # hidden layer
        self.predict = torch.nn.Linear(n_hidden, n_output)    # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))      # ReLU activation on the hidden layer
        x = self.predict(x)             # linear output layer (regression)
        return x

criterion = torch.nn.MSELoss()          # mean squared error loss
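
The criterion is the mean squared error, MSE = mean((prediction - target)^2). A quick sanity check with made-up values (not from the article's data):

pred = torch.tensor([1.0, 2.0])
target = torch.tensor([0.0, 0.0])
print(criterion(pred, target))   # ((1-0)^2 + (2-0)^2) / 2 = tensor(2.5000)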

Gradient Descent (GD):

GD in its original form uses the whole training dataset for every parameter update. This is the basic procedure, and it produces a smooth movement towards the low-cost region in the parameter space. However, in modern deep learning settings with very large datasets, it is usually infeasible to use all the data for a single parameter update.

net_all = Net(n_feature=1, n_hidden=10, n_output=1)
optimizer_all = torch.optim.SGD(net_all.parameters(), lr=0.2)

loss_list = []
for t in range(100):
    prediction = net_all(x)            # forward pass over the whole dataset
    loss = criterion(prediction, y)
    loss_list.append(loss.item())      # store the scalar value, not the graph
    optimizer_all.zero_grad()          # clear old gradients
    loss.backward()                    # backpropagate
    optimizer_all.step()               # one update per epoch, using all 100 samples
    print('\repoch: {}\tLoss = {:.3f}'.format(t, loss.item()), end="")
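
To visualize the convergence, the collected losses can be plotted. This is a minimal sketch, not the original screenshots; the same snippet works for the loss_list of the other two scenarios.

plt.plot(loss_list)
plt.xlabel('update step')
plt.ylabel('MSE loss')
plt.title('Gradient Descent (full batch)')
plt.show()
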
  • Very smooth convergence, but every single update requires a full pass over the data.

Stochastic Gradient Descent:

SGD represents the other extreme: it computes the gradient and updates the parameters for every single sample in the dataset. The intuition is that the gradient from one data point is a noisy estimate, but it is cheap to compute and still gives a general direction in which to move the parameters.

net_item = Net(n_feature=1, n_hidden=10, n_output=1)
optimizer_item = torch.optim.SGD(net_item.parameters(), lr=0.2)

loss_list = []
for t in range(100):
    for x_i, y_i in zip(x, y):         # iterate over individual samples
        pred_i = net_item(x_i)
        loss = criterion(pred_i, y_i)
        loss_list.append(loss.item())
        optimizer_item.zero_grad()
        loss.backward()
        optimizer_item.step()          # one update per sample
        print('\repoch: {}\tLoss = {:.3f}'.format(t, loss.item()), end="")
  • Very noisy convergence, because each update uses only a single data point.

Mini Batch Gradient Descent:

This is meant to combine the best of the two extremes. Instead of a single sample or the whole dataset, a small batch of the dataset is used for each parameter update. For a dataset of 100 samples and a batch size of 4, we get 25 batches, so the parameters are updated 25 times per epoch.

net_batch = Net(n_feature=1, n_hidden=10, n_output=1)
optimizer_batch = torch.optim.SGD(net_batch.parameters(), lr=0.2)

loss_list = []
batch_size = 16
n_batches = int(len(x) / batch_size)   # 100 // 16 = 6 full batches (the last 4 samples are dropped)
print(n_batches)
for epoch in range(100):
    for batch in range(n_batches):
        # slice the next batch out of the dataset
        batch_X = x[batch*batch_size:(batch+1)*batch_size]
        batch_y = y[batch*batch_size:(batch+1)*batch_size]
        prediction = net_batch(batch_X)
        loss = criterion(prediction, batch_y)
        loss_list.append(loss.item())

        optimizer_batch.zero_grad()
        loss.backward()
        optimizer_batch.step()         # one update per batch
        print('\repoch: {}\tbatch: {}\tLoss = {:.3f}'.format(epoch, batch, loss.item()), end="")
  • A practical compromise: reasonably smooth convergence, using a small batch of the data for each update.
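
As a side note, the manual slicing above can also be done with torch.utils.data.DataLoader, which handles batching and shuffling for us. The following is a sketch of an equivalent loop, not the code used for the results above:

from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(x, y)
loader = DataLoader(dataset, batch_size=16, shuffle=True)   # shuffling gives less correlated batches

for epoch in range(100):
    for batch_X, batch_y in loader:
        prediction = net_batch(batch_X)
        loss = criterion(prediction, batch_y)
        optimizer_batch.zero_grad()
        loss.backward()
        optimizer_batch.step()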


Best Regards

