Deep Dive: Building GPT from scratch - part 5
Miko Pawlikowski
learning from Andrej Karpathy
Hello and welcome back to the series on Starter AI. I’m Miko, this time writing from Tokyo.
Today we’re picking up where we left off last week: stabilising the neural network we implemented last time, using batch normalization, and learning some helpful visualizations in the process.
The roadmap
The goal of this series is to implement a GPT from scratch, and to actually understand everything needed to do that. We’re following Andrej’s Zero To Hero videos. If you missed a previous part, catch up here:
To follow along, subscribe to the newsletter at starterai.dev. You can also follow me on LinkedIn.
Generative language model - activations & gradients
Today’s lecture is called “Building makemore Part 3: Activations & Gradients, BatchNorm”, and it builds on where we left off last week.
Last time we covered building a multilayer perceptron (MLP), following the Bengio et al. 2003 MLP language model paper. Before we move on to more sophisticated networks, we’re spending today’s lecture building a deeper understanding of activations and gradients: how to develop an intuition for which numbers make sense and which don’t, and how to visualise them.
The lecture is in two parts. The first part covers initialisation and the Batch normalization paper. The second part restructures the code to look like PyTorch’s built-in modules, and teaches us how to visualise the different statistics and ratios using basic histograms, to better understand how well the training is going.
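To give a flavour of those visualizations, here’s a minimal sketch of the first one (my own code, not the lecture’s exact version): it stacks a few tanh layers with made-up sizes, plots the distribution of activations at each layer, and reports how saturated each layer is.

```python
import torch
import matplotlib.pyplot as plt

torch.manual_seed(42)

fan_in, n_hidden, n_layers = 30, 100, 5   # made-up sizes for illustration
x = torch.randn(32, fan_in)               # a fake batch of inputs

plt.figure(figsize=(12, 4))
acts = x
for i in range(n_layers):
    # unit Gaussian weights scaled by gain / sqrt(fan_in); gain is 5/3 for tanh
    W = torch.randn(acts.shape[1], n_hidden) * (5/3) / acts.shape[1]**0.5
    acts = torch.tanh(acts @ W)
    saturated = (acts.abs() > 0.97).float().mean()
    print(f"layer {i}: mean {acts.mean():+.2f}, std {acts.std():.2f}, saturated: {saturated:.2%}")
    hy, hx = torch.histogram(acts, density=True)
    plt.plot(hx[:-1], hy, label=f"layer {i}")
plt.legend()
plt.title("forward pass activation distribution")
plt.show()
```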
Only a few new concepts in this lecture.
Context
Fan-in (and fan-out) are the number of inputs to (and outputs from, respectively) a layer.
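For example, with a hypothetical linear layer taking 30 inputs and producing 200 outputs (note that nn.Linear stores its weight transposed, as (out_features, in_features)):

```python
import torch.nn as nn

# Hypothetical layer: each neuron sees 30 inputs (fan-in), there are 200 neurons (fan-out).
linear = nn.Linear(in_features=30, out_features=200)
print(linear.weight.shape)  # torch.Size([200, 30]); fan_in = 30, fan_out = 200
```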
Kaiming init paper - a paper discussing the behaviour of various squishing functions, in both the forward and backward passes. It’s implemented in torch.nn.init.kaiming_normal_ and it’s considered one of the most popular ways of initialising neural networks.
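Here’s a minimal sketch of the idea for a tanh layer, with made-up sizes: scale unit-Gaussian weights by gain / sqrt(fan_in) by hand, or let the built-in initialiser do the same job.

```python
import torch
import torch.nn as nn

fan_in, fan_out = 30, 200   # made-up sizes

# Manual version: unit Gaussian scaled by gain / sqrt(fan_in); gain is 5/3 for tanh.
W = torch.randn(fan_in, fan_out) * (5/3) / fan_in**0.5

# Built-in version doing the same job on an nn.Linear weight.
layer = nn.Linear(fan_in, fan_out)
nn.init.kaiming_normal_(layer.weight, nonlinearity='tanh')

# Both end up with a standard deviation of about (5/3) / sqrt(30) ≈ 0.30.
print(W.std().item(), layer.weight.std().item())
```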
Batch normalization paper. A technique for normalising the ranges of values flowing through a neural network, avoiding saturated activations and vanishing gradients, and stabilising the learning of the whole network. It takes out some of the heuristics and replaces them with formulas. The lecture covers this in detail.
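A rough sketch of the core operation, in the spirit of the lecture’s manual version (sizes and values are made up; torch.nn.BatchNorm1d does the same job and additionally tracks running statistics for inference):

```python
import torch

hpreact = torch.randn(32, 200) * 4 + 2   # made-up, badly scaled pre-activations

bngain = torch.ones(1, 200)    # learnable scale (gamma)
bnbias = torch.zeros(1, 200)   # learnable shift (beta)
eps = 1e-5

bnmean = hpreact.mean(0, keepdim=True)   # per-feature mean over the batch
bnvar = hpreact.var(0, keepdim=True)     # per-feature variance over the batch
hpreact = bngain * (hpreact - bnmean) / torch.sqrt(bnvar + eps) + bnbias

print(hpreact.mean().item(), hpreact.std().item())  # roughly 0 and 1
```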
Also, according to @leopetrini, the magical 5/3 gain comes from the average value of tanh²(x) where x is Gaussian: the gain is roughly 1 / sqrt(E[tanh²(x)]) ≈ 1.59, which 5/3 approximates.
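A quick numeric sanity check of that claim (my own sketch):

```python
import torch
from torch.nn import init

x = torch.randn(1_000_000)
mean_tanh_sq = torch.tanh(x).pow(2).mean()

print(mean_tanh_sq.item())               # ≈ 0.39
print((1 / mean_tanh_sq.sqrt()).item())  # ≈ 1.59
print(init.calculate_gain('tanh'))       # 5/3 ≈ 1.667
```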
Video + timestamps
Part 1
00:04:19 Fixing the initial loss, removing the hockey stick appearance of the graph
00:12:59 Tanh quirks & how to work around them
00:27:53 Initialising the network - “Kaiming init” paper
01:04:50 Real example: resnet50 walkthrough
Part 2
01:18:35 PyTorch-ifying the code
01:26:51 Viz #1: forward pass activations statistics
01:30:54 Viz #2: backward pass gradient statistics
01:36:15 Viz #3: parameter activation and gradient statistics
01:39:55 Viz #4: update:data ratio over time
01:46:04 Bringing back batchnorm, looking at the visualizations
01:51:34 Summary
Summary
I really liked this lecture - Andrej took a quick detour on our quest of making makemore to lift a little of the fog around initialisation, turning the whole process from very artisanal to more engineering-based.
We covered the Kaiming init paper as well as the Batch normalization paper, both of which make for far more predictable outcomes.
And plotting the different ratios and distributions to confirm things look reasonable makes me feel much better about the whole thing :)
What’s next
Next week, we’re following Andrej into another rabbit hole - that of backpropagation.
As always, subscribe to this newsletter at starterai.dev to get the next parts in your mailbox!
Share with a friend
If you like this series, please forward to a friend!
Feedback
How did you like it? Was it easy to follow? What should I change for next time?
Please reach out on LinkedIn and let me know!