Mini Auto-Diff from Scratch - A Theory Behind
Image generated by Stable Diffusion.


An auto-differentiation engine is a fundamental part of PyTorch. Here, we are going to explain in detail how it works, using an easy-to-follow example.

This article is part of a tutorial on how to build a mini PyTorch, available on GitHub as Mini PyTorch.


Related Content

  1. Why Gradient Based Optimization is Used (good to read for motivation)
  2. Mini Auto-Diff from Scratch - An Implementation (how it is actually implemented)


Intro

We can think of PyTorch as a function optimization framework built on tensors. Tensors in PyTorch are nothing more than multidimensional arrays with the ability to be connected to each other in the form of a graph. With such an execution graph, we can propagate gradients backward.

In this story we will explain in depth how and why the graph is built, how derivatives are calculated and propagated backwards, and how optimization works.

Sample

Let us write a basic sample in PyTorch that uses the automatic differentiation and optimization mechanism: finding the minimum of a 1D quadratic function.

Minimization of y = 2x² + 5. (by author)
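A minimal PyTorch sketch of such a sample might look like this (names and exact values follow the description above, not necessarily the original code):

```python
import torch

# Parameter to optimize; requires_grad=True makes PyTorch trace it in the graph.
x = torch.tensor([10.0], requires_grad=True)
lr = 0.1  # learning rate

for _ in range(15):
    y = 2 * x ** 2 + 5        # objective: y = 2x^2 + 5
    y.backward()              # gradient calculation: fills x.grad
    with torch.no_grad():
        x -= lr * x.grad      # optimization step: x = x - <learning rate> * x.grad
    x.grad.zero_()            # clear the gradient so it does not accumulate

print(x)  # close to 0, the minimizer of the function
```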

So what is going on there?

The sample above has three important parts:

  1. objective to optimize - quadratic function
  2. gradient calculation
  3. optimization

Users of PyTorch will notice the three most important functions:

  1. backward()
  2. optim step() - written as value = value - <learning rate> * x.grad
  3. zero_() - commonly used as zero_grad

First, we set our parameter x (in our case it has a concrete value of 10). The tensor x has requires_grad set to True, as it is going to be optimized. Moreover, all operations that are used are from the torch namespace - we cannot mix 3rd-party operations with torch ones. The reason is that we have to trace the variable x to build a graph.


What kind of graph are we talking about?

We are talking about the auto-differentiation graph (autodiff).

Let us build an execution graph for the function f(x) = 2x² + 5 for some concrete value, e.g. x = 10. Each node will have two elements: a value and an operation, where '.' signifies no operation, as shown below:

Graph for the 1D quadratic function in the sample. (by author)

Top down: the input value is 10; the node has no operation. This number is an input to a node which carries the power-of-2 operation, displayed in the second row (100).

Next, in the third row the constant 2 enters into play and is multiplied with the previous node (100), giving the fourth row (200). Finally, the constant 5 is added, which creates the final node (205).
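In PyTorch itself this graph is recorded on the fly as operations execute; a small sketch of how the recorded nodes can be inspected (the printed names may vary between PyTorch versions):

```python
import torch

x = torch.tensor([10.0], requires_grad=True)
y = 2 * x ** 2 + 5

# Each intermediate result remembers the operation that produced it.
print(y.grad_fn)                 # e.g. <AddBackward0 ...>, the "+ 5" node
print(y.grad_fn.next_functions)  # its parents, e.g. a MulBackward0 node for "2 * ..."
```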


Going backwards - function 'backward'

In the previous tutorial we introduced the chain rule and showed it on a sample. According to the rule d · f′(x), where d is the derivative coming from a child node, we obtain the derivative value by traversing the directed graph backwards (from the bottom node). Let us show this by using the previous graph and extending each node with a third element (in red): a derivative value.
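Written out for a single node: if the node computes u = f(x) and the final output is y, the chain rule gives

$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \cdot f'(x) = d \cdot f'(x), \qquad d = \frac{\partial y}{\partial u}.$$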

Step 1

Graph for the 1D quadratic function in the sample - backward step 0. (by author)

The final node (205) always has its derivative value set to 1. The node has an addition operation, x + y. Taking partial derivatives w.r.t. x and y we get:
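$$\frac{\partial (x+y)}{\partial x} = 1, \qquad \frac{\partial (x+y)}{\partial y} = 1$$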

and using the chain rule the left parent node will have value 1 · 1 = 1. The same goes for the right parent node.

Step 2

Next, we have the multiplication operation x · y. Taking partial derivatives w.r.t. x and y respectively, we get:
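$$\frac{\partial (x \cdot y)}{\partial x} = y, \qquad \frac{\partial (x \cdot y)}{\partial y} = x$$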

Therefore, the parent node on the left receives the value d · y = 1 · 100 = 100 and the right parent receives d · x = 1 · 2 = 2.

Graph for the 1D quadratic function in the sample - backward step 1. (by author)

Step 3

The node in the second row has the power function, whose derivative is (x²)′ = 2x. Therefore, to its only parent the node sends d · (2x) = 2 · (2 · 10) = 40, which is the final result.

Graph for the 1D quadratic function in the sample - backward step 2. (by author)

As can be seen, by using partial derivatives of simple functions and applying the chain rule, we can differentiate complex functions automatically.
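As a quick sanity check, differentiating the whole function analytically gives the same number:

$$f(x) = 2x^2 + 5 \;\Rightarrow\; f'(x) = 4x, \qquad f'(10) = 40.$$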


Optimization step - function 'optimize'

After obtaining the gradient value (in our case 40 for x = 10), we can shift the parameter x towards the function minimum. We do that simply by moving in the opposite direction ('-' sign) of the gradient by a fraction (here 0.1) of its value.

Optimization of the 1D quadratic function in the sample. (by author)

The next value of x is 10 - 0.1 * 40 = 6.

After an optimization step we do not need the gradient anymore, and we do not want to accumulate values from previous steps, so we clear the parameter's gradient by calling zero_().
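The same manual update and gradient clearing can also be expressed with PyTorch's built-in SGD optimizer; a small sketch of that equivalence (not the article's sample, which does it by hand) might look like this:

```python
import torch

x = torch.tensor([10.0], requires_grad=True)
opt = torch.optim.SGD([x], lr=0.1)   # plays the role of the manual update

y = 2 * x ** 2 + 5
y.backward()      # x.grad is now 40
opt.step()        # x <- x - 0.1 * 40 = 6, same as the manual formula
opt.zero_grad()   # same role as calling zero_() on the gradient
```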

Finally, by repeating this procedure many times (in this case 15), we move our parameter x towards the function's minimum, in this case x = 0.


Remarks

We have shown, on a 1D function, how PyTorch performs optimization. In the next and last part we are going to write an actual implementation of an auto-diff engine.

All images/figures are originally made by the author, unless stated otherwise.

If you are working on something interesting, please do not hesitate to reach out to me.
