
Learning LLMs at Yale SOM

My colleague K Sudhir and I are teaching a new course at the Yale School of Management on large language models like ChatGPT, Bard, Claude, and Llama. Our course is divided into two parts: theory and application. We just finished the theory part, the goal of which was to endow students, mostly MBAs, with a passable understanding of how LLMs are built. That's not to say our students could build such models on their own, but I think it's fair to say most of the students can now correctly describe how and why ChatGPT works.


We received numerous requests for our course materials, so we're sharing them here. That's trivial to do because there's nothing original in them! Instead of developing our own materials, we found the best materials we could online and assigned these to our students as pre-class reading. (Each class in the "theory" part began with an on-paper, offline quiz.) Finding the "best" material took some time. Having now completed this first part of the course, I'd say we're satisfied with our choices.


Below, you can find our day-by-day assigned reading. We're sharing these in the hope that they help you on your journey to understand LLMs. These materials assume you have some very basic understanding of calculus, linear algebra, and Python programming. I'm guessing most people meeting those prerequisites can come up to speed on LLMs in a month of study. (Of course, you can consume this material quickly, but understanding it takes a bit of time to mentally digest...at least it did for me.)


Day 1: The Basics of Deep Learning

LLMs are big neural networks and a neural network is basically just a bunch of matrices that we multiply together in fancy ways to make some prediction like "that image contains a cat" or "the next word in this sentence is 'squirrel'". During training, we run the neural network forward and backward a bunch of times until we get good values in these matrices: values that make correct predictions. You can kinda think of that like tuning the strings on some freakishly large guitar. We want the guitar to make the right notes and so we twist knobs until we get there.
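To make that "tuning the knobs" picture concrete, here's a minimal sketch (mine, not from the readings) of a tiny two-layer network trained with gradient descent on a made-up task. Every name and number in it is invented for illustration; the point is just that training is "multiply matrices, measure the error, nudge the matrices."

```python
import numpy as np

# A tiny one-hidden-layer network: the "knobs" are the entries of W1 and W2.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 8)) * 0.5   # input (2 features) -> hidden (8 units)
W2 = rng.normal(size=(8, 1)) * 0.5   # hidden -> one output

# Made-up task: predict y = x1 + x2 from random inputs.
X = rng.normal(size=(64, 2))
y = X.sum(axis=1, keepdims=True)

lr = 0.05
for step in range(500):
    # Forward pass: matrix multiplies plus a nonlinearity.
    h = np.tanh(X @ W1)
    pred = h @ W2
    err = pred - y
    loss = (err ** 2).mean()

    # Backward pass (the chain rule): how should each knob move to reduce the loss?
    d_pred = 2 * err / len(X)
    grad_W2 = h.T @ d_pred
    d_h = (d_pred @ W2.T) * (1 - h ** 2)   # derivative of tanh is 1 - tanh^2
    grad_W1 = X.T @ d_h

    # Twist the knobs a little in the direction that lowers the loss.
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

print(f"final loss: {loss:.4f}")   # should shrink as the knobs get tuned
```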

To begin, check out the IEEE article.

You don’t need to “read” that article so much as skim it and maybe keep it open while you watch the 3blue1brown videos below. Once you’re done with the 3blue1brown videos, you should have a perfect understanding of the IEEE article.

Please watch all four videos in order. It’s likely a good idea to take notes and, as Sanderson suggests, to “pause and ponder.” The final video is not super important for you. The first three are super important!

If you need a quick refresher on linear algebra, you can find one in these 3blue1brown videos. You'll likely just want videos 1, 3, and 4. If you need a refresher on the chain rule, you can find a lot of content on 3blue1brown and Khan Academy. Of course, you should also ask an LLM to teach you material you don't know!

Having consumed the above, you should:


Day 2: Recurrent Neural Networks

In our previous class, we saw how we could classify a single example, e.g., a single image of a handwritten digit. In this class, we'll alter that basic neural network to allow us to classify sequences, e.g., multiple digits in a row. Here are a few thought experiments to make this concept concrete.

  • Imagine we want to classify handwritten phone numbers. We know that certain area codes are more popular than others, so if we see “61”, there’s a high probability that the next digit is “7”, completing Boston’s “617.”
  • Imagine we want to classify the weather forecast given some data. If we’ve had a few days of sun in April, maybe rain is more likely soon.
  • Imagine we want to predict the next word in the sentence “I took a walk with my ____.” Clearly “sister” and “father” should be more probable than “refrigerator.”

To accomplish these tasks, we need to give our neural network some “memory” of what happened in the past. That’s exactly what recurrent neural networks do.
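If it helps to see that "memory" in code, here's a minimal sketch (mine, not from the assigned material) of a single recurrent step: the hidden state h is the network's memory, and each new input gets mixed into it. The weights here are random, so the output probabilities are meaningless until the network is trained; the point is only the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 16        # e.g., the digits 0-9 as our "words"

# The recurrent network's knobs: one matrix for the new input, one for the memory.
W_xh = rng.normal(size=(vocab_size, hidden_size)) * 0.1
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1
W_hy = rng.normal(size=(hidden_size, vocab_size)) * 0.1

def step(x_onehot, h_prev):
    """One tick of the RNN: blend the new input with the memory of the past."""
    h = np.tanh(x_onehot @ W_xh + h_prev @ W_hh)   # new hidden state (the "memory")
    logits = h @ W_hy                              # scores for the next symbol
    return logits, h

# Feed in the digits 6 then 1; the hidden state now carries "I have seen 6, 1".
h = np.zeros(hidden_size)
for digit in [6, 1]:
    x = np.eye(vocab_size)[digit]                  # one-hot encoding of the digit
    logits, h = step(x, h)

# After training, these probabilities would put high weight on 7 (Boston's 617).
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.round(3))
```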

Please read and watch the following content prior to class.

Please watch until minute 48, whereupon Amini starts getting into transformers.

This is a pretty famous blog post. What’s amazing is that the RNNs Karpathy trains in this article are character-based: they output single letters instead of words. Even so, they learn to create unreasonably awesome output. You need to understand the material up until his "Fun with RNNs" section.

What’s happening is that these RNNs spit out characters one at a time. But they remember what they output previously, so they can make kinda sensible outputs at each step. (As we will see later, though, they don’t have a long memory. We’ll soon see how a model called a Transformer is able to remember better than an RNN. The “T” in ChatGPT is for Transformer.)
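Here's a rough sketch of that generation loop, again with made-up, untrained weights, just to show how the model's own output gets fed back in as the next input. (With random weights the output is gibberish; Karpathy's trained models are what make it remarkable.)

```python
import numpy as np

rng = np.random.default_rng(1)
chars = list("helo ")                       # a toy 5-character vocabulary
V, H = len(chars), 8
W_xh, W_hh, W_hy = (rng.normal(size=s) * 0.3 for s in [(V, H), (H, H), (H, V)])

h = np.zeros(H)
idx = chars.index("h")                      # seed the generation with "h"
out = ["h"]
for _ in range(15):
    x = np.eye(V)[idx]
    h = np.tanh(x @ W_xh + h @ W_hh)        # memory of everything emitted so far
    p = np.exp(h @ W_hy); p /= p.sum()      # probability of each next character
    idx = rng.choice(V, p=p)                # sample one character...
    out.append(chars[idx])                  # ...and feed it back in next time

print("".join(out))   # gibberish here, because these weights were never trained
```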

Having consumed that material, you should:

Optional further reading:


Day 3: Word Embeddings

In our previous classes, we saw how a basic neural network is trained to take numbers and make decisions/outputs. We also saw how to make a series of outputs with recurrent neural networks, and this was our first foray into text as data. In our RNN implementation we used the so-called "one-hot" encoding of words/characters/primitives. In this representation, if we have N words in our input sentence and M words in our vocabulary, our input matrix to the network is N x M.

The problem is that one-hot representations are really poor. Consider a word like "dog": it's quite similar to the word "cat" in many ways, isn't it? They're both pets, we might cuddle with each of them, we feed each of them. Indeed, "dog" can often be replaced with "cat" in a sentence and the sentence still works just fine! But in a one-hot encoding, "dog" is no more similar to "cat" than it is to "volcano". What a loss!

What we would like instead is a representation of those words with dimensionality < M that allows us to numerically capture that "dog" is similar to "cat" and not so similar to "volcano". This is not a massive leap forward for us: we discussed previously how Kyle might be [0.2, 0.99, 0.232, 0.487, 0.3], where each of those dimensions says something about me: my height, hair, rate of speaking errors, etc. But it surely doesn't capture everything about me! How could it?!
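To make the contrast concrete, here's a small sketch comparing one-hot vectors with dense embeddings. The dense numbers below are invented for illustration, not real word2vec vectors; the point is that cosine similarity between one-hot vectors is useless, while dense vectors can encode "dog is like cat."

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot: every word gets its own axis, so all distinct words are equally unrelated.
vocab = ["dog", "cat", "volcano"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cosine(one_hot["dog"], one_hot["cat"]))      # 0.0
print(cosine(one_hot["dog"], one_hot["volcano"]))  # 0.0 -- no more different than cat!

# Dense embeddings (numbers invented for illustration): similar words point in
# similar directions, so the similarity scores become meaningful.
emb = {
    "dog":     np.array([0.8, 0.7, 0.1, 0.0]),
    "cat":     np.array([0.7, 0.8, 0.2, 0.1]),
    "volcano": np.array([0.0, 0.1, 0.9, 0.8]),
}
print(cosine(emb["dog"], emb["cat"]))       # high, roughly 0.98
print(cosine(emb["dog"], emb["volcano"]))   # low,  roughly 0.12
```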

Now we're going to get the same kinds of vectors for words, and we call these "embeddings". We're going to rewind the clock a little bit and learn a famous embedding called word2vec. You likely won't use word2vec in products anymore because there are better embeddings now. But it is the right step for us, as relative newcomers to the concept of embeddings.
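If you want to poke at word2vec yourself, here's a minimal sketch using the gensim library (my choice of tooling, not part of the assigned readings). With a toy corpus this small the learned vectors won't be meaningful; it just shows the shape of the workflow.

```python
from gensim.models import Word2Vec

# A toy corpus: in real use you'd train on millions of sentences.
corpus = [
    ["i", "walked", "my", "dog", "in", "the", "park"],
    ["i", "walked", "my", "cat", "in", "the", "park"],
    ["the", "volcano", "erupted", "with", "lava"],
]

model = Word2Vec(sentences=corpus, vector_size=25, window=3, min_count=1, epochs=200)

print(model.wv["dog"][:5])                    # the learned embedding (first 5 dimensions)
print(model.wv.similarity("dog", "cat"))      # with enough real data, this ends up higher...
print(model.wv.similarity("dog", "volcano"))  # ...than this
```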

Please read the following content prior to class:

After that

Optional reading:

  • Efficient Estimation of Word Representations in Vector Space, by Tomas Mikolov et al., https://arxiv.org/abs/1301.3781. This is the original word2vec paper from Google authors in 2013.


Day 4: Attention

In our previous classes we learned:

  • How neural networks can learn complicated things by adjusting the values of big matrices so as to minimize some user-defined loss function.
  • How we could give these networks a little bit of memory with RNNs, so that 1) we can create some kind of hidden "thought" vector and 2) words can "remember" past words.
  • How we could represent words as embeddings: "dense" vectors that have semantic meaning instead of just "one hot" vectors without meaning.

Now we're going to build off those lessons and learn, finally, transformers. The "T" in ChatGPT stands for "Transformer". If transformers are the beating heart of ChatGPT, then "attention" is the beating heart of a transformer. In particular, we're going to be learning about "multi-head scaled dot-product attention". Quite a mouthful!

Alas, attention looks complicated (it isn't) and we're kinda learning it without knowing what we're going to do with it: almost like learning how a heart works without knowing about the rest of the body.

Here's what I want you to know: the single thing you should keep in mind as you read the items below. Remember from the last class that we looked at word embeddings or vectors, like "turkey" = [0.2, 0.99, 0.232, 0.487, 0.3, ....]. Also, remember from class how "turkey" could be either the country or the animal? When we get the word2vec embedding for "turkey", we're getting a vector that is a combination of these meanings. What attention is going to do for us is allow the word "turkey" to pay attention to the other words around it so that it knows which "turkey" it is. In the case of "Turkey approved the UN resolution", that is the country Turkey. So we're going to take a generic "turkey" embedding and get a context-aware "turkey", a super "turkey" vector that knows all kinds of things about itself and the context in which it occurs: adjectives that apply to it, subject-verb agreement, all kinds of stuff! The attention mechanism spits out this super duper smart vector for "turkey" instead of the dumb one with which we started.
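Here's a minimal sketch of scaled dot-product attention for a single head, with made-up matrices and sizes; the readings below derive the same computation with much more care, and real transformers run many such heads in parallel (that's the "multi-head" part).

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """One attention head: every word looks at every other word and re-describes
    itself as a weighted blend of what it finds (the "super turkey" vector)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how much each word cares about each other word
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sentence
    return weights @ V                               # one context-aware vector per word

# Made-up numbers: 4 words ("Turkey approved the resolution"), embedding size 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # generic (context-free) word embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) * 0.3 for _ in range(3))

context_aware = scaled_dot_product_attention(X, Wq, Wk, Wv)
print(context_aware.shape)   # (4, 8): same shape as the input, but each row now "knows" its context
```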

Please read or watch the following:

Having read those, you should

Optional reading:

  • "Attention is All You Need" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, ?ukasz Kaiser, and Illia Polosukhin https://arxiv.org/abs/1706.03762. This is it, the famous transformer paper. It's not so tough to read actually!


Day 5: Transformers

Now that we understand attention, you should have everything you need to understand transformers and ChatGPT. There's nothing new in this class; we're just putting it all together. ChatGPT is in the family of decoder-only transformer models. Given some input text, the model just continually asks "what is the next most likely word?", spits that word out, and does it again. This is a little different from what you'll read about below. The article below describes an encoder-decoder model. In these, we take in some data and create a "latent" or "hidden" state with the encoder. That state is just like the hidden state in an RNN. Then the decoder can look at that state when making its output. That's how we'd translate from English to German, or how we might add a caption to an image (the encoder makes a hidden representation of the image and the decoder turns that into words).
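To see that decoder-only "predict, append, repeat" loop in code, here's a rough sketch using GPT-2 via the Hugging Face transformers library (my choice of tooling, not something the readings prescribe). The library's model.generate() wraps this same loop with many more options; writing it out by hand just makes the idea explicit.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("I took a walk with my", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(15):
        logits = model(ids).logits                           # scores for every word in the vocabulary
        next_id = logits[0, -1].argmax()                     # greedily take the most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)    # append it and ask again

print(tok.decode(ids[0]))
```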

Pre-class reading:

After reading that, you should:

Optional reading:

That's it!

In class we spoke about a lot of other stuff: supervised fine-tuning, reinforcement learning from human feedback, etc. And we rehashed these materials, drew pretty pictures, and so on. It was fun! I think the most surprising part for us is that MBA students can really come up to speed on LLMs pretty quickly. Of course, most of our students won't be writing code for a living, but they will be managing developers and creating great products based on LLMs. It's nice to see their delight, to see them take joy in understanding what exactly is happening under the hood of products like ChatGPT.


We hope this was useful to you.

