Course: Hands-On AI: Build a Generative Language Model from Scratch
Building a naive Bayes classifier
- [Instructor] There are various methods of performing sentiment analysis. The one we'll look at is called Naive Bayes. It's a pretty straightforward approach, and what I like about it is that while it's not as effective as more advanced approaches, you can implement it in pure Python, which will really help us wrap our heads around how the mechanism works. We're going to have a dataset that's a list of tuples, where each tuple holds a comment and a label: pos for positive, neg for negative. We'll go over each one, break it down into tokens, and keep count of how many times each token appears in negative comments and how many times it appears in positive comments. Then we'll take a brand-new input that we haven't classified yet, and for each token in that input we'll ask: what's the likelihood of it coming up in a positive comment, and what's the likelihood of it coming up in a negative comment, according to our training data? So let's take a look at this in code.

Here I am in my code editor. If you're following along with the exercise files, go ahead and open up 02_02_begin, and note that once we get to the challenge solution portion of this chapter, there will be an in-browser coding environment where you can check your answer. The first thing I do is bring in punctuation from the string module, as well as what's called a Counter, which is an efficient, clean way to count things in collections. Next, I have my defaultdict, and then some comment samples with labels. Since each sample is labeled positive or negative, this is considered a form of supervised learning. My first sample says, "I love this post," and it's labeled positive, so pos. If you scroll down a bit, you'll notice that starting on line 14 I have some negative comments like "Bad stuff" and "I hate this," and feel free to add some samples of your own.

Now, the classifier we're building is considered a type of Naive Bayes classifier, and on line 23 you'll see that's what I called our class. The first thing I do is set up a mapping for this classifier: positive maps to a list of positively associated tokens, and negative maps to a list of negatively associated tokens. Then I keep track of how many samples I had, which is going to come in handy. I iterate over the samples, tokenize them, and add the tokens to the right mapping, positive or negative. Then I create a counter for positive and one for negative, which gives me how many occurrences of each token there are. The tokenize method here is similar to our last one in that it removes punctuation and numbers.

Then there's the classify method, and it already starts off by tokenizing the new text it receives, which is great. It has a positive list and a negative list, so let's see what we'll put in those lists. I'm going to iterate over my tokens, so I'll say for token in tokens, and for each token I'll say pos (or positive) dot append, and add a score for it: self.pos_counter, where I look up how many times this token occurred. In other words, how many times did this token appear in a positive connotation, divided by the sample count, self.sample_count.
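To make this concrete, here is a minimal sketch of the setup described so far. It's reconstructed from the narration rather than copied from 02_02_begin, so the class name NaiveBayesClassifier, the exact samples, and the details of tokenize are assumptions:

    from string import punctuation
    from collections import Counter, defaultdict

    # Labeled (comment, label) tuples: "pos" for positive, "neg" for negative.
    # Stand-in samples; the exercise files may use different ones.
    samples = [
        ("I love this post", "pos"),
        ("What a great read", "pos"),
        ("Bad stuff", "neg"),
        ("I hate this", "neg"),
    ]

    class NaiveBayesClassifier:
        def __init__(self, samples):
            # Map each label to the list of tokens seen under that label.
            self.mapping = defaultdict(list)
            self.sample_count = len(samples)
            for text, label in samples:
                self.mapping[label].extend(self.tokenize(text))
            # Per-label counts of how often each token occurred.
            self.pos_counter = Counter(self.mapping["pos"])
            self.neg_counter = Counter(self.mapping["neg"])

        def tokenize(self, text):
            # Drop punctuation and digits, lowercase, split on whitespace.
            cleaned = "".join(
                ch for ch in text.lower()
                if ch not in punctuation and not ch.isdigit()
            )
            return cleaned.split()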
And I'll do the same thing for negative: neg.append, then self.neg_counter, where I look up how many occurrences of this token we had in a negative connotation, so I'll say token, and again divide by my sample count. In other words: how likely is this token to come up in a negative connotation, and how likely is it to come up in a positive one? We basically end up with a list of scores, so to speak. Now that we have these score lists, what I want you to do in the upcoming challenge is sum up the positive scores and sum up the negative scores (if it's been a while since you've written Python, you can use sum for this) and figure out whether the text is positive, negative, or perhaps neutral. I'll see you after the challenge.
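Here is a sketch of the classify method as narrated, along with one possible way to finish the upcoming challenge by summing the scores. It continues the NaiveBayesClassifier class sketched above; the final comparison and the neutral tie case are my assumptions, not the official exercise solution:

        def classify(self, text):
            # (continues the NaiveBayesClassifier class sketched above)
            tokens = self.tokenize(text)
            pos, neg = [], []
            for token in tokens:
                # Relative frequency of this token under each label;
                # a Counter returns 0 for tokens it has never seen.
                pos.append(self.pos_counter[token] / self.sample_count)
                neg.append(self.neg_counter[token] / self.sample_count)
            # Challenge portion: sum each list and compare the totals.
            pos_score, neg_score = sum(pos), sum(neg)
            if pos_score > neg_score:
                return "pos"
            if neg_score > pos_score:
                return "neg"
            return "neutral"

    # Usage with the stand-in samples above:
    clf = NaiveBayesClassifier(samples)
    print(clf.classify("I love it"))   # "pos" with these samples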