Mini Auto-Diff from Scratch: An Implementation
Here, we are going to implement an auto-diff engine from scratch! The theory behind this concept is covered in the previous story, where we saw how an execution graph is built through an example, how back-propagation works, and how a simple 1D quadratic function is optimized. We implement those concepts here.
This article is a segment of a tutorial on how to build a mini PyTorch, available on GitHub as Mini PyTorch. The explanation corresponds to the code available under the dark-1 folder.
Related Content
The theory behind it is a must-have; please check the following:
Intro
The principle of auto-differentiation on a graph is the main characteristic of the PyTorch framework. It enables minimization of arbitrarily complex functions (e.g. a model with a loss function), as long as all parts are differentiable. Partial derivatives are computed automatically for us.
In this story, we implement this fundamental concept using a fairly simple 1D function optimization sample.
Implementation
Here we have an optimization loop for minimizing 2x² + 5. Since we are going to recreate all the functions that PyTorch uses, we use the namespace of our new, not-yet-existing framework, Dark. The rest is the same.
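The full listing lives in the repository; the following is only a rough sketch of such a loop, using the functions that are built later in this story (the starting value, learning rate and iteration count are illustrative choices, not taken from the original code):

```python
import dark  # the mini framework built in the rest of this story

x = dark.Parameter(3.0)             # the only variable we optimize; start value is arbitrary

for _ in range(50):
    y = dark.add(dark.mul(2, dark.pow(x, 2)), 5)   # y = 2x^2 + 5
    y.zero_grad()                   # clear gradients accumulated in previous iterations
    y.backward()                    # compute dy/dx automatically
    x.value = x.value - 0.1 * x.grad  # plain gradient-descent update

print(x.value, y.value)             # x approaches 0, y approaches the minimum value 5
```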
First, we set our parameter x. As x is the only variable that will be optimized, it is wrapped into a Parameter object; the reason is that we have to trace x in order to build a graph. Next, we construct the function using the math functions in the dark namespace, for the same reason. We can see that the code is nearly identical to our PyTorch version. Now let us create the building blocks for a graph.
Graph nodes
A graph consists of nodes. We will have three node types: Constant for constant values, which do not have a gradient, and Parameter for parameters that are optimized. Both node types inherit from the base class Node, which is also used as the internal graph node type for intermediate function results.
1. Node
Let us start with the base class Node. The class encapsulates two elements: the operation that produced it and its parent nodes. It also stores its value as well as its gradient _grad.
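The full class is in the repository; a minimal sketch, assuming the constructor receives the computed value, the producing operation and the parent nodes (the exact attribute and argument names here are illustrative), could look like this:

```python
class Node:
    """Base graph node; also used for intermediate results of the forward pass."""

    def __init__(self, value, operation=None, parents=()):
        self.value = value          # result computed in the forward pass
        self.operation = operation  # Operation that produced this node (None for leaf nodes)
        self.parents = parents      # input nodes of that operation
        self._grad = 0.0            # accumulated gradient of the output w.r.t. this value

    @property
    def grad(self):
        return self._grad

    @grad.setter
    def grad(self, value):
        self._grad = value
```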
Besides these values, the node object has two methods which we have seen and used before: zero_grad and backward.
The function zero_grad is a recursive function which sets the _grad variable to zero and repeats that procedure all the way to the root of the graph:
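Continuing the Node sketch above, zero_grad could look roughly like this (shown as a standalone function for brevity; in the class it is a method of Node):

```python
def zero_grad(self):
    """Node.zero_grad: reset this node's gradient and recurse through its parents."""
    self.grad = 0.0
    for parent in self.parents:
        parent.zero_grad()
```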
The function backward first calls a function that sorts the nodes by order of execution, bottom to top. The function topologicalsort is a recursive function that performs a depth-first search [2] and returns the visited nodes. There are two reasons we do not call differentiate on an unordered list: a node must receive the gradients from all of its children before it propagates its own gradient further, and a node shared by several operations must be processed only once.
Having the nodes ordered this way, we just need to execute the differentiation operation of a node, propagate the gradient(s) to its parent(s), and repeat the same procedure until the root of the graph is reached.
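A possible sketch of both pieces, assuming the Node class from above (the helper name topological_sort, the loop structure and the way Constant nodes are skipped are illustrative choices; backward is again shown as a standalone function, although it belongs inside Node):

```python
def topological_sort(node, visited=None, order=None):
    """Depth-first search [2]: every node is listed after all of its inputs."""
    if visited is None:
        visited, order = set(), []
    if id(node) not in visited:
        visited.add(id(node))
        for parent in node.parents:
            topological_sort(parent, visited, order)
        order.append(node)            # appended only once all of its inputs are in the list
    return order


def backward(self):
    """Node.backward: propagate gradients from the output towards the inputs."""
    order = topological_sort(self)
    self.grad = 1.0                       # seed: d(output)/d(output) = 1
    for node in reversed(order):          # walk from the output towards the inputs
        if node.operation is None:
            continue                      # leaves have nothing to differentiate
        xs = [p.value for p in node.parents]
        grads = node.operation.differentiate(xs, node.value, node.grad)
        for parent, grad in zip(node.parents, grads):
            if grad is None or parent.grad is None:
                continue                  # Constant nodes report None and receive no gradient
            parent.grad = parent.grad + grad   # accumulate: a parent may feed several children
```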
2. Constant
The Constant node does not have a gradient; therefore we override the gradient getter and setter and set the gradient to None.
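A minimal sketch of such a node, built on top of the Node class sketched above:

```python
class Constant(Node):
    """Wraps a plain value that takes part in the computation but is never optimized."""

    def __init__(self, value):
        super().__init__(value)

    @property
    def grad(self):
        return None      # a constant has no gradient

    @grad.setter
    def grad(self, value):
        pass             # any gradient written to a constant is silently discarded
```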
3. Parameter
The Parameter node is a node type created explicitly by the user for the parameter(s) that we want to optimize. Its entire implementation is below:
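The original listing is in the repository; in this sketch Parameter simply reuses everything from Node, its only role being to mark a leaf value as trainable:

```python
class Parameter(Node):
    """Leaf node created explicitly by the user for values that will be optimized."""

    def __init__(self, value):
        super().__init__(value)
```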
Node Operation
The Operation object is a member of a Node and has two important functions: apply and differentiate.
The function apply creates a Node, applies the operation and connects the node's inputs to the graph, which is built in this fashion. The function is called implicitly by the user via functions from the dark namespace (e.g. dark.add(…), which calls Add.apply(…)).
The function differentiate calculates the derivative for the node using the input(s) x, the node's value y calculated in the forward pass, and the gradient dldy propagated from its child(ren). The function is called by a node object when the backward function is invoked.
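A minimal sketch of such a base class, assuming the Node and Constant classes from above (promoting plain numbers to Constant inside apply is an illustrative choice, not necessarily how the repository does it):

```python
class Operation:
    """Base class for all operations; subclasses implement forward and backward."""

    @classmethod
    def apply(cls, *inputs):
        # promote plain numbers to Constant so that every input is a graph node
        inputs = [x if isinstance(x, Node) else Constant(x) for x in inputs]
        op = cls()
        value = op.forward(*[x.value for x in inputs])
        # the new node remembers the operation and its parents - this is how the graph grows
        return Node(value, operation=op, parents=inputs)

    def differentiate(self, xs, y, dldy):
        # xs: input values, y: value from the forward pass, dldy: gradient from the child(ren)
        grads = self.backward(dldy, y, *xs)
        return grads if isinstance(grads, tuple) else (grads,)

    def forward(self, *xs):
        raise NotImplementedError

    def backward(self, dldy, y, *xs):
        raise NotImplementedError
```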
A concrete implementation of the Operation class must implement two methods: forward and backward; samples are given below.
Operation samples
One concrete subclass of Operation is the power operation, which implements the forward and backward functions according to its mathematical definition. However, users call the operation through a function that wraps the creation and execution of the concrete operation, here dark.pow(…).
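A sketch of how such an operation and its wrapper could look, given the Operation base class sketched above (the class name Pow and the exact backward signature are assumptions):

```python
class Pow(Operation):
    def forward(self, x, n):
        return x ** n

    def backward(self, dldy, y, x, n):
        # d(x^n)/dx = n * x^(n-1); the exponent is treated as a constant here,
        # so no gradient is returned for it
        return dldy * n * x ** (n - 1), None


def pow(x, n):
    """User-facing wrapper, exposed as dark.pow(...)."""
    # deliberately shadows the built-in pow, since it lives inside the dark module
    return Pow.apply(x, n)
```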
Another sample is the multiplication function, which takes two arguments (a and b); see the sample below. In this case the backward function returns multiple elements, which are then propagated backwards by the node object that contains the operation.
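A sketch of the multiplication operation under the same assumptions:

```python
class Mul(Operation):
    def forward(self, a, b):
        return a * b

    def backward(self, dldy, y, a, b):
        # two inputs, two gradients: d(ab)/da = b and d(ab)/db = a
        return dldy * b, dldy * a


def mul(a, b):
    """User-facing wrapper, exposed as dark.mul(...)."""
    return Mul.apply(a, b)
```

The node's backward pass pairs the returned gradients with the corresponding parents and accumulates them, as shown in the backward sketch earlier.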
Remarks
We have created a full-fledged implementation of an auto-differentiation engine for scalars. Such an implementation can easily be extended to support tensors. However, even in its current form, it provides a foundation for other building blocks. Anyone who understands the idea and grasps the implementation behind it has an edge in learning new concepts built on top of it; that was the whole idea behind this two-part story.
All images / figures are originally made by the author, unless stated otherwise.
If you are working on something interesting, please do not hesitate to reach out to me.