Why Gradient Based Optimization is?Used
This article is a segment of a tutorial on how to build a mini PyTorch - available on Github -Mini PyTorch.
Related Content
After you solidify knowledge about gradients, you can check:
Intro
Imagine you are on a 1D mountain and you want to go to a valley. You do not see anything because you are blindfolded. How would you do that?
You would step into one direction (e.g. left), and then, if you stepped onto higher ground, you would move in the opposite direction. Notice that even if you stepped onto a lower level you would still require to check the other side (e.g. right) to ensure the steepest path. In other words you would require two evaluations: the left and the right one.
Now imagine you are on a 2D plane with the same problem. How many evaluations would you have now? The answer is 4 (left, right, top, bottom).
Functions being optimized have high dimensions. So for N dimensional function, you would require N * 2 evaluations for a single movement in the right direction.
You would repeat that step multiple times until the local-minima is reached. If we do not want to miss a local minima, we would usually make smaller steps?—?by some factor.
We can formulate this as an algorithm:
1) Make a forward-backward step for each dimension.
2) Calculate absolute difference D between forward-backward step for each dimension.
3) Update current position: p = p?—?D * a where a is a factor which tells us how fast we are heading toward the minima (not to skip one).
Let us assume we have the following function: y = 2x2 + 5.
The following animation shows our current position in each step and how we approach the minima:
This is analogous to what a PyTorch does, where we would replace steps 1) and 2) with zero grad and backward functions and the step 3) with optimize function.
Why gradient based?approach
The current problem is the number of steps required to calculate the next movement.
The introduced algorithm (called hill climbing / decent) has to make 2N calculations where N is a space dimensionality. Each function we evaluate is treated as a black box, which means we have to evaluate the entire function each time.
If we relax conditions just a bit?—?specifically:
then we can calculate a direction where we move in a single step (for each dimension). That speeds up the process 2N times (because we do not treat our function as a black box anymore)!
Derivation of a?function
To be able calculate a direction in one step we have to express our function in such way. Another way said?—?we have to take a derivative. Let us introduce its definition formally. If we take a very small step and assuming that a function is smooth, then we have a formal definition of a derivation [2]:
Derivation of complex functions
If we take our function: y = 2x2 + 5. Its derivation is dy/dx = 2(2x1) + 0 = 4x, also shown in the graph below:
Here, we used multiple derivation rules [1] for multiplication, addition and power of 2 (which we derived in the previous paragraph using the definition of derivation).
In general, we will have complex functions and it is beneficial if we know how to derive simple elements; then, we also know how to derive the entire function (e.g. machine learning model with its loss). The rule that enables us to do that is called the chain rule and it can formally be written as:
It states that if we have a complex function we can substitute inner expressions with variables, derive the function and then its substituted parts and finally multiply the results. Notice that we can have multiple substitutions.
Sample
If we have the following function and we need to find its derivation w.r.t. x:
Then we can replace its inner part with u and get:
Then make a derivation of the function and its substitution:
And finally, multiply the derivatives:
In general, we will compute derivative of our most inner function and propagate it to outer functions, which will just multiply its derivative with the received one and continue the propagation.?
Remarks
Here, we delved into the most fundamental concept?—?derivation?—?exploring the idea behind it and why it is being used.
All images, if not explicitly written, are created and are the property of the author.
If you are working on something interesting please do not hesitate to reach me out.