课程: Hands-On AI: Build a Generative Language Model from Scratch

Measuring distance

- [Instructor] In order to dive into semantic similarity and understand which texts have similar meanings, we need to look at the concept of embeddings. We can think of an embedding as a vector representation of a word or a phrase. Imagine you could give a model a word and it could tell you where that word sits in a multi-dimensional space. Let's start by imagining words in a two-dimensional space. So what if we gave our model the words dog, cat, and car, and it could tell us where they were on this XY axis? Large language models may produce vectors with hundreds or even thousands of dimensions. Some models' entire purpose is to receive text as input and return vectors as output. So if we gave such a model the word tree, we'd get one vector, and if we gave it the word plant, we'd get another.

Now, there are many ways of comparing these vectors, but for our particular task, understanding which words have similar meanings using the type of model we'll be using, measuring cosine similarity is extremely useful. Let's head over to our code editor and check it out.

So here I am in my exercise files in 03/03_02_begin, and I do some imports. A few of them are related to numpy, which is extremely popular in artificial intelligence and machine learning. You'll notice that, at the top, I open a file and load this word-to-vector dictionary. This is a dictionary that has words as keys and their vectors as values, and I've created these vectors using a BERT-based transformer model, a large language model. If you want to create your own vectors, there are various services out there that can create embeddings for you, and you can also find embedding models on sites like Hugging Face and download them onto your computer. This word-to-vector dictionary is going to serve us the way a model would.

The first thing I want to do is define the cosine similarity function. It's going to take two vectors, vec_a and vec_b. To keep things neat, my numerator is the sum of a list comprehension: vec_a[i] * vec_b[i] for i in range(len(vec_a)). All I need now is my denominator, which is the norm of vec_a times the norm of vec_b, using numpy's norm function. With all that set up, I just return the numerator divided by the denominator.

In general, cosine similarity gives a number from -1 to 1, where 1 means the vectors point in the same direction, identical or nearly identical in meaning, and values near 0 mean the vectors are essentially unrelated; with the text embeddings we're using here, the scores we see will typically fall between 0 and 1. Let's test this out with a few of my vectors. I'll print the cosine similarity of word_to_vec for the word plant compared to word_to_vec for the word grow, then compare plant to minute, and finally plant to tree. So plant and grow should be close, plant and minute farther apart, and plant and tree close again. And, yep, plant-grow and plant-tree score higher than plant-minute. Now we have the ability to look at how close things are as far as meaning goes, and that's just what we'll do in our challenge. A runnable sketch of this script appears below.
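To make the walkthrough concrete, here is a minimal sketch of the script described above, assuming numpy. The exercise files load real BERT-based embeddings from disk; here, tiny made-up three-dimensional vectors stand in for them, and word_to_vec is a stand-in name for the dictionary in the video, so the names and values are illustrative rather than the course's actual data.

```python
import numpy as np

# Toy 3-dimensional "embeddings" standing in for the word-to-vector
# dictionary from the exercise files (real embeddings have hundreds
# of dimensions). The values here are made up for illustration.
word_to_vec = {
    "plant":  [0.9, 0.8, 0.1],
    "grow":   [0.8, 0.9, 0.2],
    "minute": [0.1, 0.2, 0.9],
    "tree":   [0.85, 0.75, 0.15],
}

def cosine_similarity(vec_a, vec_b):
    # Numerator: the dot product, written as a comprehension-style sum
    # to mirror the walkthrough.
    numerator = sum(vec_a[i] * vec_b[i] for i in range(len(vec_a)))
    # Denominator: the product of the two vector norms.
    denominator = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return numerator / denominator

# Compare "plant" against "grow", "minute", and "tree".
print(cosine_similarity(word_to_vec["plant"], word_to_vec["grow"]))
print(cosine_similarity(word_to_vec["plant"], word_to_vec["minute"]))
print(cosine_similarity(word_to_vec["plant"], word_to_vec["tree"]))
```

Run this and plant-grow and plant-tree should score well above plant-minute, mirroring the result in the video.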
