PyTorch: Gradient Descent, Stochastic Gradient Descent and Mini Batch Gradient Descent (Code included)

Ibrahim Sobh - PhD

Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer

Published: November 9, 2019

In this article, we use PyTorch's automatic differentiation and dynamic computational graph to implement and evaluate different gradient descent methods.

PyTorch is an open source machine learning framework that accelerates the path from research to production. PyTorch is gaining popularity for its simplicity, ease of use, dynamic computational graph and efficient memory usage. PyTorch builds computational graphs dynamically, as the code runs, and performs automatic differentiation on those graphs (autograd).
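
As a minimal sketch of what automatic differentiation looks like in practice (the tensor `w` and the function `f` below are illustrative and not part of the article's experiment), autograd records operations as they execute, and `backward()` stores the resulting gradient in `.grad`:

```python
import torch

# Illustrative scalar parameter; requires_grad=True asks autograd to track it.
w = torch.tensor(3.0, requires_grad=True)

# The graph is built dynamically while these operations execute.
f = w ** 2 + 2 * w  # f(w) = w^2 + 2w

# Backpropagate: computes df/dw = 2w + 2 and stores it in w.grad.
f.backward()

grad_value = w.grad.item()
print(grad_value)  # df/dw at w = 3 is 2*3 + 2 = 8.0
```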

Gradient Descent (GD) is an optimization method that updates the parameters of a model (for example, a deep neural network) using the gradients of an objective function with respect to those parameters.

The figure above visualizes a saddle point in the optimization landscape. More information can be found in Stanford's cs231n course.

Generally there are three main scenarios:

Gradient Descent (GD)
Stochastic Gradient Descent (SGD)
Mini Batch Gradient Descent (Mini Batch GD)
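
The three scenarios differ only in how many samples contribute to each gradient estimate. In notation chosen here for illustration (parameters θ, learning rate η, per-sample loss ℓ_i, dataset size N, mini-batch B; none of these symbols appear in the original article), the updates can be written as:

```latex
\begin{aligned}
\text{GD:} \quad & \theta \leftarrow \theta - \eta \, \nabla_\theta \frac{1}{N} \sum_{i=1}^{N} \ell_i(\theta) \\
\text{SGD:} \quad & \theta \leftarrow \theta - \eta \, \nabla_\theta \, \ell_i(\theta) \quad \text{for a single sample } i \\
\text{Mini Batch GD:} \quad & \theta \leftarrow \theta - \eta \, \nabla_\theta \frac{1}{|B|} \sum_{i \in B} \ell_i(\theta) \quad \text{with } 1 < |B| < N
\end{aligned}
```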

Experimental Setup

In this article, a simple regression example is used to see the difference between these scenarios. We have some artificially generated data and want to train a neural network to approximate the function that describes the training data by optimizing the network's parameters.

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)  
y = x.pow(2) + 0.2*torch.rand(x.size())

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   
        self.predict = torch.nn.Linear(n_hidden, n_output)  

    def forward(self, x):
        x = F.relu(self.hidden(x))      
        x = self.predict(x)             
        return x

criterion = torch.nn.MSELoss()
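
Before training, it can help to confirm what the setup actually produces. The following is a small sanity-check sketch that repeats the data and model definitions so it runs on its own; the fixed seed and the names `net`, `n_params` and `out_shape` are assumptions added for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # assumption: fixed seed so the sketch is repeatable

# Same data recipe as the article: 100 points of y = x^2 plus uniform noise in [0, 0.2).
x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)
y = x.pow(2) + 0.2 * torch.rand(x.size())

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super().__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)
        self.predict = torch.nn.Linear(n_hidden, n_output)

    def forward(self, x):
        return self.predict(F.relu(self.hidden(x)))

net = Net(n_feature=1, n_hidden=10, n_output=1)
n_params = sum(p.numel() for p in net.parameters())
out_shape = tuple(net(x).shape)

print(tuple(x.shape), tuple(y.shape))  # (100, 1) (100, 1)
print(n_params)   # hidden: 10 weights + 10 biases; output: 10 weights + 1 bias = 31
print(out_shape)  # (100, 1)
```

With only 31 parameters and 100 samples, every scenario below trains in seconds, which makes the differences in convergence behavior easy to compare.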

Gradient Descent (GD):

GD in its original form uses the whole training set for each parameter update. This basic procedure produces a smooth movement toward the low-cost region of the parameter space. However, in modern deep learning settings, datasets are usually too large for a single parameter update over all the data to be practical.

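net_all = Net(n_feature=1, n_hidden=10, n_output=1)
# torch.optim.SGD applies the plain update rule; feeding it the full dataset makes this batch GD.
optimizer_all = torch.optim.SGD(net_all.parameters(), lr=0.2)

loss_list = []
for t in range(100):
    prediction = net_all(x)            # forward pass on all 100 samples
    loss = criterion(prediction, y)
    loss_list.append(loss.item())      # store a float, not the graph-attached tensor
    optimizer_all.zero_grad()          # clear gradients from the previous step
    loss.backward()                    # backpropagate
    optimizer_all.step()               # one parameter update per epoch
    print('\repoch: {}\tLoss =  {:.3f}'.format(t, loss.item()), end="")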

Convergence is very smooth, but every single update uses the entire dataset.
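
Note that the optimizer class is called `SGD` even here: without momentum or weight decay, `torch.optim.SGD` simply applies `p <- p - lr * grad`, and whether the result is GD, SGD or mini-batch GD depends only on how much data goes into each loss. The sketch below checks this on a small stand-in model (a single `Linear` layer and noise-free data, both assumptions made to keep the example short):

```python
import copy
import torch

torch.manual_seed(0)  # assumption: fixed seed for repeatability

# Illustrative stand-in model and data (not the article's Net).
model_a = torch.nn.Linear(1, 1)
model_b = copy.deepcopy(model_a)  # identical starting parameters
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = x.pow(2)
lr = 0.2
criterion = torch.nn.MSELoss()

# (a) One full-batch step through the optimizer.
opt = torch.optim.SGD(model_a.parameters(), lr=lr)
opt.zero_grad()
criterion(model_a(x), y).backward()
opt.step()

# (b) The same step written by hand: p <- p - lr * grad.
criterion(model_b(x), y).backward()
with torch.no_grad():
    for p in model_b.parameters():
        p -= lr * p.grad

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)  # True
```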

Stochastic Gradient Descent:

SGD represents the other extreme: it computes the gradient and makes an update for every single sample in the dataset. The intuition is that a gradient from one data point is not an accurate estimate, but it is cheap to compute and still gives a general direction in which to update the parameters.

net_item = Net(n_feature=1, n_hidden=10, n_output=1)
optimizer_item = torch.optim.SGD(net_item.parameters(), lr=0.2)

loss_list = []
avg_loss_list = []
for t in range(100):
    for x_i, y_i in zip(x, y):             # one sample at a time
        pred_i = net_item(x_i)
        loss = criterion(pred_i, y_i)
        loss_list.append(loss.item())      # store a float, not the graph-attached tensor
        optimizer_item.zero_grad()
        loss.backward()
        optimizer_item.step()              # 100 parameter updates per epoch
        print('\repoch: {}\tLoss =  {:.3f}'.format(t, loss.item()), end="")

Convergence is very noisy, because each update uses only one data point.
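
Per-sample losses are hard to read directly because of this noise. One way to summarize them, sketched below, is to average the losses over each epoch into `avg_loss_list`. The sketch also visits the samples in a fresh random order every epoch via `torch.randperm`, which is common practice for SGD but is an addition here, not something the loop above does. The smaller learning rate, the short run of 5 epochs and the `Sequential` model are likewise assumptions made to keep the example compact:

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed for repeatability

x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)
y = x.pow(2) + 0.2 * torch.rand(x.size())

# Compact stand-in for the article's Net (1 -> 10 -> 1 with ReLU).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 10), torch.nn.ReLU(), torch.nn.Linear(10, 1)
)
optimizer = torch.optim.SGD(net.parameters(), lr=0.05)  # assumption: lr chosen for this sketch
criterion = torch.nn.MSELoss()

avg_loss_list = []
n_epochs = 5  # assumption: a handful of epochs is enough for a sketch
for epoch in range(n_epochs):
    epoch_loss = 0.0
    for i in torch.randperm(len(x)):       # random sample order each epoch
        loss = criterion(net(x[i]), y[i])  # loss on a single sample
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    avg_loss_list.append(epoch_loss / len(x))  # mean per-sample loss this epoch

print(avg_loss_list)
```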

Mini Batch Gradient Descent:

This is meant to combine the best of the two extremes. Instead of a single sample or the whole dataset, a small batch of the dataset is used for each parameter update. For a dataset of 100 samples and a batch size of 4, there are 25 batches, so 25 updates occur per epoch.
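
The arithmetic generalizes to batch sizes that do not divide the dataset evenly. The helper below is a hypothetical illustration (the name `n_updates_per_epoch` and the `drop_last` flag are chosen here, not taken from the article):

```python
import math

def n_updates_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of parameter updates in one pass over the data."""
    if drop_last:
        return n_samples // batch_size        # incomplete final batch is skipped
    return math.ceil(n_samples / batch_size)  # incomplete final batch is kept

print(n_updates_per_epoch(100, 100))  # 1   (full-batch GD)
print(n_updates_per_epoch(100, 1))    # 100 (SGD)
print(n_updates_per_epoch(100, 4))    # 25  (mini-batch, size 4)
print(n_updates_per_epoch(100, 16))   # 7   (six batches of 16 plus one of 4)
print(n_updates_per_epoch(100, 16, drop_last=True))  # 6
```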

net_batch = Net(n_feature=1, n_hidden=10, n_output=1)
optimizer_batch = torch.optim.SGD(net_batch.parameters(), lr=0.2)
loss_list = []
avg_loss_list = []
batch_size = 16
n_epochs = 100
# Round up so the final, smaller batch (4 samples here) is not silently dropped.
n_batches = (len(x) + batch_size - 1) // batch_size
print(n_batches)
for epoch in range(n_epochs):
    for batch in range(n_batches):
        # Contiguous slices of the data; slicing past the end simply yields a shorter batch.
        start, end = batch * batch_size, (batch + 1) * batch_size
        batch_X, batch_y = x[start:end], y[start:end]
        prediction = net_batch(batch_X)
        loss = criterion(prediction, batch_y)
        loss_list.append(loss.item())      # store a float, not the graph-attached tensor

        optimizer_batch.zero_grad()
        loss.backward()
        optimizer_batch.step()             # one parameter update per batch
        print('\repoch: {}\tbatch: {}\tLoss =  {:.3f}'.format(epoch, batch, loss.item()), end="")

Convergence is practical: each update uses one batch of the data.
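
In everyday PyTorch code, batching and shuffling are usually delegated to `torch.utils.data.DataLoader` rather than written as manual slices. The following is a sketch of the same mini-batch loop in that style; the `Sequential` model, the 20-epoch run and the per-epoch loss averaging are assumptions made for brevity:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # assumption: fixed seed for repeatability

x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)
y = x.pow(2) + 0.2 * torch.rand(x.size())

# shuffle=True draws a new random partition into batches every epoch.
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

# Compact stand-in for the article's Net (1 -> 10 -> 1 with ReLU).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 10), torch.nn.ReLU(), torch.nn.Linear(10, 1)
)
optimizer = torch.optim.SGD(net.parameters(), lr=0.2)
criterion = torch.nn.MSELoss()

epoch_losses = []
for epoch in range(20):
    running = 0.0
    for batch_X, batch_y in loader:
        loss = criterion(net(batch_X), batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running += loss.item() * len(batch_X)  # weight by batch size
    epoch_losses.append(running / len(x))      # mean loss over the epoch

batch_sizes = [len(bx) for bx, _ in loader]
print(len(loader), batch_sizes)  # 7 batches: six of 16 samples and one of 4
```

By default the loader keeps the final incomplete batch; passing `drop_last=True` would skip it instead.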

Best Regards
