Mini Auto-Diff from Scratch - A Theory Behind
Image generated by Stable Diffusion.


An auto-differentiation engine is a fundamental part of PyTorch. Here, we are going to explain in detail how it works, using an easy-to-follow example.

This article is part of a tutorial on how to build a mini PyTorch, available on GitHub as Mini PyTorch.


Related Content

  1. Why Gradient Based Optimization is Used (good to read for motivation)
  2. Mini Auto-Diff from Scratch - An Implementation (how it is actually implemented)


Intro

We can think of PyTorch as a function optimization framework built on tensors. Tensors in PyTorch are nothing more than multidimensional arrays with the ability to be connected to each other in the form of a graph. With such an execution graph, we can propagate gradients backward.

In this story we will explain in depth how and why the graph is built, how derivatives are calculated and propagated backwards, and how optimization works.

Sample

Let us write a basic sample in PyTorch that uses the automatic differentiation and optimization mechanism: finding the minimum of a 1D quadratic function.

Minimization of y = 2x² + 5. (by author)
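A minimal PyTorch sketch of such a sample might look like this (names and exact values follow the description above, not necessarily the original code):

```python
import torch

# Parameter to optimize; requires_grad=True makes PyTorch trace it in the graph.
x = torch.tensor([10.0], requires_grad=True)
lr = 0.1  # learning rate

for _ in range(15):
    y = 2 * x ** 2 + 5        # objective: y = 2x^2 + 5
    y.backward()              # gradient calculation: fills x.grad
    with torch.no_grad():
        x -= lr * x.grad      # optimization step: x = x - <learning rate> * x.grad
    x.grad.zero_()            # clear the gradient so it does not accumulate

print(x)  # close to 0, the minimizer of the function
```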

So what is going on there?

The sample above has three important parts:

  1. objective to optimize - quadratic function
  2. gradient calculation
  3. optimization

Users of PyTorch will notice the three most important functions:

  1. backward()
  2. optim step() - written as value = value - <learning rate> * x.grad
  3. zero_() - commonly used as zero_grad

First, we set our parameter x (in our case it has a concrete value of 10). The tensor x has requires_grad set to True, as it is going to be optimized. Moreover, all operations that are used are from the torch namespace - we cannot mix 3rd-party operations with torch ones. The reason is that we have to trace the variable x to build a graph.


What kind of graph are we talking about?

We are talking about the auto-differentiation graph (autodiff).

Let us build an execution graph for the function f(x) = 2x² + 5 for some concrete value, e.g. x = 10. Each node will have two elements: a value and an operation, where '.' signifies no operation, as shown below:

Graph for the 1D quadratic function in the sample. (by author)

Top down: the input value is 10; the node has no operation. This number is an input to a node which carries the power-of-2 operation, displayed in the second row (100).

Next, in the third row the constant 2 enters into play and is multiplied with the previous node (100), giving the fourth row (200). Finally, the constant 5 is added, which creates the final node (205).
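In PyTorch itself this graph is recorded on the fly as operations execute; a small sketch of how the recorded nodes can be inspected (the printed names may vary between PyTorch versions):

```python
import torch

x = torch.tensor([10.0], requires_grad=True)
y = 2 * x ** 2 + 5

# Each intermediate result remembers the operation that produced it.
print(y.grad_fn)                 # e.g. <AddBackward0 ...>, the "+ 5" node
print(y.grad_fn.next_functions)  # its parents, e.g. a MulBackward0 node for "2 * ..."
```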


Going backwards - function 'backward'

In the previous tutorial we introduced the chain rule and showed it on a sample. According to the rule d · f′(x), where d is the derivative coming from a child node, we obtain the derivative value by traversing the directed graph backwards (from the bottom node). Let us show this by using the previous graph and extending each node with a third element (in red): a derivative value.
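Written out for a single node: if the node computes u = f(x) and the final output is y, the chain rule gives

$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \cdot f'(x) = d \cdot f'(x), \qquad d = \frac{\partial y}{\partial u}.$$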

Step 1

Graph for the 1D quadratic function in the sample - backward step 0. (by author)

The final node (205) always has its derivative value set to 1. The node has an addition operation, x + y. Taking partial derivatives w.r.t. x and y we get:
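$$\frac{\partial (x+y)}{\partial x} = 1, \qquad \frac{\partial (x+y)}{\partial y} = 1$$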

and using the chain rule the left parent node will have value 1 · 1 = 1. The same goes for the right parent node.

Step 2

Next, we have the multiplication operation x · y. Taking partial derivatives w.r.t. x and y respectively, we get:
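$$\frac{\partial (x \cdot y)}{\partial x} = y, \qquad \frac{\partial (x \cdot y)}{\partial y} = x$$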

Therefore, the parent node on the left receives the value d · y = 1 · 100 = 100 and the right parent receives d · x = 1 · 2 = 2.

Graph for the 1D quadratic function in the sample - backward step 1. (by author)

Step 3

The node in the second row has the power function, whose derivative is (x²)′ = 2x. Therefore, to its only parent the node sends d · (2x) = 2 · (2 · 10) = 40, which is the final result.

Graph for the 1D quadratic function in the sample - backward step 2. (by author)

As can be seen, by using partial derivatives of simple functions and applying the chain rule, we can differentiate complex functions automatically.
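As a quick sanity check, differentiating the whole function analytically gives the same number:

$$f(x) = 2x^2 + 5 \;\Rightarrow\; f'(x) = 4x, \qquad f'(10) = 40.$$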


Optimization step - function 'optimize'

After obtaining the gradient value (in our case 40 for x = 10), we can shift the parameter x towards the function minimum. We do that simply by moving in the opposite direction ('-' sign) of the gradient by a fraction (here 0.1) of its value.

Optimization of the 1D quadratic function in the sample. (by author)

The next value of x is 10 - 0.1 * 40 = 6.

After an optimization step we do not need the gradient anymore, and we do not want to accumulate values from previous steps, so we clear the parameter's gradient by calling zero_().
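The same manual update and gradient clearing can also be expressed with PyTorch's built-in SGD optimizer; a small sketch of that equivalence (not the article's sample, which does it by hand) might look like this:

```python
import torch

x = torch.tensor([10.0], requires_grad=True)
opt = torch.optim.SGD([x], lr=0.1)   # plays the role of the manual update

y = 2 * x ** 2 + 5
y.backward()      # x.grad is now 40
opt.step()        # x <- x - 0.1 * 40 = 6, same as the manual formula
opt.zero_grad()   # same role as calling zero_() on the gradient
```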

Finally, by repeating this procedure many times (in this case 15), we move our parameter x towards the function's minimum, in this case x = 0.


Remarks

We have shown, on a 1D function, how PyTorch performs optimization. In the next and last part we are going to write an actual implementation of an auto-diff engine.

All images/figures are originally made by the author, unless stated otherwise.

If you are working on something interesting, please do not hesitate to reach out to me.
