登录查看更多内容

Why Gradient Based Optimization is?Used

Darko Juri?

Data Scientist | Deep Learning | Computer Vision | LLM

发布日期: 2024年8月5日

+ 关注

This article is a segment of a tutorial on how to build a mini PyTorch - available on Github -Mini PyTorch.

Intro

Imagine you are on a 1D mountain and you want to go to a valley. You do not see anything because you are blindfolded. How would you do that?

You would step into one direction (e.g. left), and then, if you stepped onto higher ground, you would move in the opposite direction. Notice that even if you stepped onto a lower level you would still require to check the other side (e.g. right) to ensure the steepest path. In other words you would require two evaluations: the left and the right one.

Now imagine you are on a 2D plane with the same problem. How many evaluations would you have now? The answer is 4 (left, right, top, bottom).

Functions being optimized have high dimensions. So for N dimensional function, you would require N * 2 evaluations for a single movement in the right direction.

You would repeat that step multiple times until the local-minima is reached. If we do not want to miss a local minima, we would usually make smaller steps?—?by some factor.

We can formulate this as an algorithm:

1) Make a forward-backward step for each dimension.

2) Calculate absolute difference D between forward-backward step for each dimension.

3) Update current position: p = p?—?D * a where a is a factor which tells us how fast we are heading toward the minima (not to skip one).

Let us assume we have the following function: y = 2x2 + 5.

The following animation shows our current position in each step and how we approach the minima:

Minimization of 2x2+5 with start x = 10.

This is analogous to what a PyTorch does, where we would replace steps 1) and 2) with zero grad and backward functions and the step 3) with optimize function.

Why gradient based?approach

The current problem is the number of steps required to calculate the next movement.

The introduced algorithm (called hill climbing / decent) has to make 2N calculations where N is a space dimensionality. Each function we evaluate is treated as a black box, which means we have to evaluate the entire function each time.

If we relax conditions just a bit?—?specifically:

that is if we know what a function is (different from its evaluation on an entire space!) and
if the function is “smooth enough” (no discontinuities)

then we can calculate a direction where we move in a single step (for each dimension). That speeds up the process 2N times (because we do not treat our function as a black box anymore)!

Derivation of a?function

To be able calculate a direction in one step we have to express our function in such way. Another way said?—?we have to take a derivative. Let us introduce its definition formally. If we take a very small step and assuming that a function is smooth, then we have a formal definition of a derivation [2]:

Derivation of complex functions

If we take our function: y = 2x2 + 5. Its derivation is dy/dx = 2(2x1) + 0 = 4x, also shown in the graph below:

Top: function 2x2+5. Bottom: derivation of the function: 4x.

Here, we used multiple derivation rules [1] for multiplication, addition and power of 2 (which we derived in the previous paragraph using the definition of derivation).

In general, we will have complex functions and it is beneficial if we know how to derive simple elements; then, we also know how to derive the entire function (e.g. machine learning model with its loss). The rule that enables us to do that is called the chain rule and it can formally be written as:

Definition of chain rule for derivation.

It states that if we have a complex function we can substitute inner expressions with variables, derive the function and then its substituted parts and finally multiply the results. Notice that we can have multiple substitutions.

Sample

If we have the following function and we need to find its derivation w.r.t. x:

Then we can replace its inner part with u and get:

Then make a derivation of the function and its substitution:

And finally, multiply the derivatives:

In general, we will compute derivative of our most inner function and propagate it to outer functions, which will just multiply its derivative with the received one and continue the propagation.?

Remarks

Here, we delved into the most fundamental concept?—?derivation?—?exploring the idea behind it and why it is being used.

All images, if not explicitly written, are created and are the property of the author.

If you are working on something interesting please do not hesitate to reach me out.

References

[1] https://www.mathsisfun.com/calculus/derivatives-rules.html

[2] https://socratic.org/questions/what-is-the-formal-definition-of-a-derivative

Darko Juri?的更多文章

Mini Auto-Diff from Scratch?-?An Implementation

2024年8月5日

Mini Auto-Diff from Scratch?-?An Implementation

Here, we are going to implement an auto-diff engine from scratch! A theory behind this concept is covered in the…
Mini Auto-Diff from Scratch?-?A Theory?Behind

2024年8月5日

Mini Auto-Diff from Scratch?-?A Theory?Behind

An auto-differentiation engine is a fundamental part of PyTorch. Here, we are going to explain how it works in detail…

Related Content

Intro

Why gradient based?approach

Remarks

References

Darko Juri?的更多文章

Mini Auto-Diff from Scratch?-?An Implementation

Mini Auto-Diff from Scratch?-?A Theory?Behind