PyTorch: Gradient Descent, Stochastic Gradient Descent and Mini Batch Gradient Descent (Code included)
Ibrahim Sobh - PhD
Senior Expert of Artificial Intelligence, Valeo Group | LinkedIn Top Voice | Machine Learning | Deep Learning | Data Science | Computer Vision | NLP | Developer | Researcher | Lecturer
In this article we use PyTorch's automatic differentiation and dynamic computational graph to implement and evaluate different Gradient Descent methods.
PyTorch is an open source machine learning framework that accelerates the path from research to production. It is gaining popularity for its simplicity, ease of use, dynamic computational graph and efficient memory usage. PyTorch builds computational graphs dynamically and performs automatic differentiation on them (autograd).
Gradient Descent (GD) is an optimization method that updates the parameters of a model (for example, a deep neural network) using the gradients of an objective function with respect to those parameters.
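As a minimal sketch of that update, the snippet below runs plain gradient descent on a single parameter and lets autograd compute the gradient. The toy objective f(w) = (w - 3)^2, the learning rate and the step count are assumptions chosen for illustration and are not part of the experiments in this article.

```python
import torch

# Hypothetical toy objective: f(w) = (w - 3)^2, minimized at w = 3.
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1  # learning rate (assumed value for this sketch)

for step in range(100):
    loss = (w - 3.0) ** 2
    loss.backward()            # autograd fills w.grad with d(loss)/dw
    with torch.no_grad():      # the update itself must not be tracked
        w -= lr * w.grad       # w <- w - lr * gradient
    w.grad.zero_()             # gradients accumulate, so reset them

w_final = w.item()
print('w after 100 steps: {:.4f}'.format(w_final))
```

In the rest of the article, torch.optim.SGD performs this same kind of update for all of the network's parameters when step() is called.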
The figure above visualizes a saddle point in the optimization landscape. More information can be found in Stanford's cs231n course.
Generally there are three main scenarios:
- Gradient Descent (GD)
- Stochastic Gradient Descent (SGD)
- Mini Batch Gradient Descent (Mini Batch GD)
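One way to look at the three scenarios is that they differ only in how many samples feed each parameter update. The sketch below counts the updates per epoch for a dataset of 100 samples. The helper batch_slices is a hypothetical name introduced here for illustration, and the batch size of 4 is only an example value.

```python
def batch_slices(n_samples, batch_size):
    """Yield slice objects that cover n_samples in chunks of batch_size."""
    for start in range(0, n_samples, batch_size):
        yield slice(start, min(start + batch_size, n_samples))

n_samples = 100
updates_per_epoch = {
    "GD": len(list(batch_slices(n_samples, n_samples))),   # whole dataset per update
    "SGD": len(list(batch_slices(n_samples, 1))),          # one sample per update
    "Mini Batch GD": len(list(batch_slices(n_samples, 4))),  # 4 samples per update
}
print(updates_per_epoch)
```

Each slice can be used to index the input and target tensors directly, for example x[s] and y[s].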
Experimental Setup
In this article, a simple regression example is used to see the difference between these scenarios. Here we have some artificially generated data and want to train a neural network, by optimizing its parameters, to approximate a function that describes the training data.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)
y = x.pow(2) + 0.2*torch.rand(x.size())

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)
        self.predict = torch.nn.Linear(n_hidden, n_output)

    def forward(self, x):
        x = F.relu(self.hidden(x))
        x = self.predict(x)
        return x

criterion = torch.nn.MSELoss()
Gradient Descent (GD):
GD in its original form uses the whole training data to update the parameters. This is the basic procedure that produces a smooth movement to the low cost region in the parameter space. However, in modern deep learning settings, datasets are often too large for it to be practical to use the whole data for a single parameter update.
net_all = Net(n_feature=1, n_hidden=10, n_output=1)
optimizer_all = torch.optim.SGD(net_all.parameters(), lr=0.2)
loss_list = []

for t in range(100):
    # One update per epoch, computed from the whole dataset
    prediction = net_all(x)
    loss = criterion(prediction, y)
    loss_list.append(loss.item())  # store a plain float, not the tensor
    optimizer_all.zero_grad()
    loss.backward()
    optimizer_all.step()
    print('\repoch: {}\tLoss = {:.3f}'.format(t, loss.item()), end="")

- Very smooth convergence, but every update uses all of the data.
Stochastic Gradient Descent:
SGD represents the other extreme: it computes the gradients and makes an update for every single sample in the dataset. The intuition is that a gradient computed from only one data point is not accurate, but it is cheap to compute and gives us a general direction in which to update the parameters.
net_item = Net(n_feature=1, n_hidden=10, n_output=1)
optimizer_item = torch.optim.SGD(net_item.parameters(), lr=0.2)
loss_list = []
avg_loss_list = []

for t in range(100):
    # One update per sample, so 100 updates per epoch
    for x_i, y_i in zip(x, y):
        pred_i = net_item(x_i)
        loss = criterion(pred_i, y_i)
        loss_list.append(loss.item())  # store a plain float, not the tensor
        optimizer_item.zero_grad()
        loss.backward()
        optimizer_item.step()
    # Loss of the last sample seen in this epoch
    print('\repoch: {}\tLoss = {:.3f}'.format(t, loss.item()), end="")

- Very noisy convergence, because each update uses only one data point.
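The noise has a simple source: with a mean-reduced loss such as MSELoss, the full-data gradient is the average of the per-sample gradients, and each individual sample can deviate from that average considerably. The sketch below checks this on the same kind of data. The single Linear layer is a stand-in model and flat_grad is a hypothetical helper, both chosen here only to keep the example short.

```python
import torch

torch.manual_seed(0)
x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)
y = x.pow(2) + 0.2 * torch.rand(x.size())

model = torch.nn.Linear(1, 1)   # small stand-in model (assumption)
criterion = torch.nn.MSELoss()
params = list(model.parameters())

def flat_grad(inputs, targets):
    """Gradient of the loss w.r.t. all parameters, as one flat vector."""
    loss = criterion(model(inputs), targets)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

# Gradient from the whole dataset (what GD uses)
full_grad = flat_grad(x, y)

# One gradient per sample (what SGD uses, one at a time)
sample_grads = torch.stack(
    [flat_grad(x[i:i + 1], y[i:i + 1]) for i in range(len(x))]
)
mean_sample_grad = sample_grads.mean(dim=0)
spread = sample_grads.std(dim=0)   # how much single samples disagree

print('full gradient:        ', full_grad)
print('mean of sample grads: ', mean_sample_grad)
print('std of sample grads:  ', spread)
```

The first two printed vectors should agree up to floating point error, while the standard deviation shows how far any single-sample step can point away from the full-data direction.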
Mini Batch Gradient Descent:
This is meant to combine the best of the two extremes. Instead of a single sample or the whole dataset, a small batch of the dataset is used for each parameter update. For example, for a dataset of 100 samples and a batch size of 4, we have 25 batches. Hence, updates occur 25 times per epoch.
net_batch = Net(n_feature=1, n_hidden=10, n_output=1)
optimizer_batch = torch.optim.SGD(net_batch.parameters(), lr=0.2)
loss_list = []
avg_loss_list = []

batch_size = 16
# Integer division: 100 samples give 6 full batches, the last 4 samples are skipped
n_batches = int(len(x) / batch_size)
print(n_batches)

for epoch in range(100):  # same count as the original range(len(x)), since len(x) == 100
    for batch in range(n_batches):
        start, end = batch*batch_size, (batch+1)*batch_size
        batch_X, batch_y = x[start:end], y[start:end]
        prediction = net_batch(batch_X)
        loss = criterion(prediction, batch_y)
        loss_list.append(loss.item())  # store a plain float, not the tensor
        optimizer_batch.zero_grad()
        loss.backward()
        optimizer_batch.step()
        print('\repoch: {}\tbatch: {}\tLoss = {:.3f}'.format(epoch, batch, loss.item()), end="")

- Practical convergence, because each update uses one batch of the data.
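The loop above visits the batches in the same order every epoch, and with a batch size of 16 the integer division leaves the last 4 of the 100 samples unused. A common alternative, shown here only as a sketch and not as the method used in this article, is to let torch.utils.data.DataLoader handle batching, shuffling and the final partial batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)
y = x.pow(2) + 0.2 * torch.rand(x.size())

# shuffle=True draws a new sample order each epoch;
# drop_last defaults to False, so the final batch of 4 is kept.
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

n_batches = len(loader)
batch_sizes = [len(batch_X) for batch_X, batch_y in loader]
print(n_batches, batch_sizes)
```

Inside a training loop, iterating with `for batch_X, batch_y in loader:` would replace the manual slicing, with the rest of the update step unchanged.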
Best Regards