Mini Auto-Diff from Scratch: An Implementation
Here, we are going to implement an auto-diff engine from scratch! The theory behind this concept is covered in the previous story, where we saw how an execution graph is built through an example, how back-propagation works, and how a simple 1D quadratic function is optimized. We implement those concepts here.
This article is a segment of a tutorial on how to build a mini PyTorch, available on GitHub as Mini PyTorch. The explanation corresponds to the code available under the dark-1 folder.
Related Content
The theory behind it is a must-have; please check the following:
Intro
The principle of auto-differentiation on a graph is the main characteristic of the PyTorch framework. It enables minimization of arbitrarily complex functions (e.g. a model with a loss function), as long as all parts are differentiable. Partial derivatives are computed automatically for us.
In this story, we implement this fundamental concept using a fairly simple 1D function optimization sample.
Implementation
Here we have an optimization loop for minimizing 2x² + 5. Since we are going to recreate all the functions that PyTorch uses, we use the namespace of our new, not-yet-existing framework, Dark. The rest is the same.
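The full listing lives in the repository; the following is only a rough sketch of such a loop, using the functions that are built later in this story (the starting value, learning rate and iteration count are illustrative choices, not taken from the original code):

```python
import dark  # the mini framework built in the rest of this story

x = dark.Parameter(3.0)             # the only variable we optimize; start value is arbitrary

for _ in range(50):
    y = dark.add(dark.mul(2, dark.pow(x, 2)), 5)   # y = 2x^2 + 5
    y.zero_grad()                   # clear gradients accumulated in previous iterations
    y.backward()                    # compute dy/dx automatically
    x.value = x.value - 0.1 * x.grad  # plain gradient-descent update

print(x.value, y.value)             # x approaches 0, y approaches the minimum value 5
```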
First, we set our parameter x. As x is the only variable that will be optimized, it is wrapped into a Parameter object; the reason is that we have to trace x in order to build a graph. Next, we construct the function using the math functions in the dark namespace, for the same reason. We can see that the code is nearly identical to our PyTorch version. Now let us create the building blocks for a graph.
Graph nodes
A graph consists of nodes. We will have three node types: Constant for constant values, which do not have a gradient, and Parameter for parameters that are optimized. Both node types inherit from the base class Node, which is also used as the internal graph node type for intermediate function results.
1. Node
Let us start with the base class Node. The class encapsulates two elements: the operation that produced it and its parent nodes. It also stores its value as well as its gradient _grad.
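The full class is in the repository; a minimal sketch, assuming the constructor receives the computed value, the producing operation and the parent nodes (the exact attribute and argument names here are illustrative), could look like this:

```python
class Node:
    """Base graph node; also used for intermediate results of the forward pass."""

    def __init__(self, value, operation=None, parents=()):
        self.value = value          # result computed in the forward pass
        self.operation = operation  # Operation that produced this node (None for leaf nodes)
        self.parents = parents      # input nodes of that operation
        self._grad = 0.0            # accumulated gradient of the output w.r.t. this value

    @property
    def grad(self):
        return self._grad

    @grad.setter
    def grad(self, value):
        self._grad = value
```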
Besides these values, the node object has two methods which we have seen and used before: zero_grad and backward.
The function zero_grad is a recursive function which sets the _grad variable to zero and repeats that procedure all the way to the root of the graph:
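Continuing the Node sketch above, zero_grad could look roughly like this (shown as a standalone function for brevity; in the class it is a method of Node):

```python
def zero_grad(self):
    """Node.zero_grad: reset this node's gradient and recurse through its parents."""
    self.grad = 0.0
    for parent in self.parents:
        parent.zero_grad()
```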
The function backward first calls a function that sorts the nodes by order of execution, bottom to top. The function topologicalsort is a recursive function that performs a depth-first search [2] and returns the visited nodes. There are two reasons we do not call differentiate on an unordered list: a node must receive the gradients from all of its children before it propagates its own gradient further, and a node shared by several operations must be processed only once.
Having the nodes ordered this way, we just need to execute the differentiation operation of a node, propagate the gradient(s) to its parent(s), and repeat the same procedure until the root of the graph is reached.
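A possible sketch of both pieces, assuming the Node class from above (the helper name topological_sort, the loop structure and the way Constant nodes are skipped are illustrative choices; backward is again shown as a standalone function, although it belongs inside Node):

```python
def topological_sort(node, visited=None, order=None):
    """Depth-first search [2]: every node is listed after all of its inputs."""
    if visited is None:
        visited, order = set(), []
    if id(node) not in visited:
        visited.add(id(node))
        for parent in node.parents:
            topological_sort(parent, visited, order)
        order.append(node)            # appended only once all of its inputs are in the list
    return order


def backward(self):
    """Node.backward: propagate gradients from the output towards the inputs."""
    order = topological_sort(self)
    self.grad = 1.0                       # seed: d(output)/d(output) = 1
    for node in reversed(order):          # walk from the output towards the inputs
        if node.operation is None:
            continue                      # leaves have nothing to differentiate
        xs = [p.value for p in node.parents]
        grads = node.operation.differentiate(xs, node.value, node.grad)
        for parent, grad in zip(node.parents, grads):
            if grad is None or parent.grad is None:
                continue                  # Constant nodes report None and receive no gradient
            parent.grad = parent.grad + grad   # accumulate: a parent may feed several children
```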
2. Constant
The Constant node does not have a gradient; therefore we override the gradient getter and setter and set the gradient to None.
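A minimal sketch of such a node, built on top of the Node class sketched above:

```python
class Constant(Node):
    """Wraps a plain value that takes part in the computation but is never optimized."""

    def __init__(self, value):
        super().__init__(value)

    @property
    def grad(self):
        return None      # a constant has no gradient

    @grad.setter
    def grad(self, value):
        pass             # any gradient written to a constant is silently discarded
```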
3. Parameter
The Parameter node is a node type created explicitly by the user for the parameter(s) that we want to optimize. Its entire implementation is below:
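The original listing is in the repository; in this sketch Parameter simply reuses everything from Node, its only role being to mark a leaf value as trainable:

```python
class Parameter(Node):
    """Leaf node created explicitly by the user for values that will be optimized."""

    def __init__(self, value):
        super().__init__(value)
```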
Node Operation
The Operation object is a member of a Node and has two important functions: apply and differentiate.
The function apply creates a Node, applies the operation and connects the node's inputs to the graph, which is built in this fashion. The function is called implicitly by the user via functions from the dark namespace (e.g. dark.add(…), which calls Add.apply(…)).
The function differentiate calculates the derivative for the node using the input(s) x, the node's value y calculated in the forward pass, and the gradient dldy propagated from its child(ren). The function is called by a node object when the backward function is invoked.
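A minimal sketch of such a base class, assuming the Node and Constant classes from above (promoting plain numbers to Constant inside apply is an illustrative choice, not necessarily how the repository does it):

```python
class Operation:
    """Base class for all operations; subclasses implement forward and backward."""

    @classmethod
    def apply(cls, *inputs):
        # promote plain numbers to Constant so that every input is a graph node
        inputs = [x if isinstance(x, Node) else Constant(x) for x in inputs]
        op = cls()
        value = op.forward(*[x.value for x in inputs])
        # the new node remembers the operation and its parents - this is how the graph grows
        return Node(value, operation=op, parents=inputs)

    def differentiate(self, xs, y, dldy):
        # xs: input values, y: value from the forward pass, dldy: gradient from the child(ren)
        grads = self.backward(dldy, y, *xs)
        return grads if isinstance(grads, tuple) else (grads,)

    def forward(self, *xs):
        raise NotImplementedError

    def backward(self, dldy, y, *xs):
        raise NotImplementedError
```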
A concrete implementation of the Operation class must implement two methods: forward and backward; samples are given below.
Operation samples
One concrete subclass of Operation is the power operation, which implements the forward and backward functions according to its mathematical definition. However, users call the operation through a function that wraps the creation and execution of the concrete operation, here dark.pow(…).
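A sketch of how such an operation and its wrapper could look, given the Operation base class sketched above (the class name Pow and the exact backward signature are assumptions):

```python
class Pow(Operation):
    def forward(self, x, n):
        return x ** n

    def backward(self, dldy, y, x, n):
        # d(x^n)/dx = n * x^(n-1); the exponent is treated as a constant here,
        # so no gradient is returned for it
        return dldy * n * x ** (n - 1), None


def pow(x, n):
    """User-facing wrapper, exposed as dark.pow(...)."""
    # deliberately shadows the built-in pow, since it lives inside the dark module
    return Pow.apply(x, n)
```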
Another sample is the multiplication function, which takes two arguments (a and b); see the sample below. In this case the backward function returns multiple elements, which are then propagated backwards by the node object that contains the operation.
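A sketch of the multiplication operation under the same assumptions:

```python
class Mul(Operation):
    def forward(self, a, b):
        return a * b

    def backward(self, dldy, y, a, b):
        # two inputs, two gradients: d(ab)/da = b and d(ab)/db = a
        return dldy * b, dldy * a


def mul(a, b):
    """User-facing wrapper, exposed as dark.mul(...)."""
    return Mul.apply(a, b)
```

The node's backward pass pairs the returned gradients with the corresponding parents and accumulates them, as shown in the backward sketch earlier.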
Remarks
We have created a full-fledged implementation of an auto-differentiation engine for scalars. Such an implementation can easily be extended to support tensors. However, even in its current form, it provides a foundation for other building blocks. Anyone who understands the idea and grasps the implementation behind it has an edge in learning new concepts built on top of it; that was the whole idea behind this two-part story.
All images / figures are originally made by the author, unless stated otherwise.
If you are working on something interesting, please do not hesitate to reach out to me.