Text Classification Using scikit-learn

How to implement TF and TF-IDF text classification using Python, scikit-learn, and NLTK?

Now, let's try to understand what TF-IDF is. To define TF-IDF, we first need to understand TF and IDF separately. TF (term frequency) is the number of times a term appears in a particular document, divided by the total number of words in that document.

IDF (inverse document frequency), on the other hand, measures how rare a term is across the corpus: it is the logarithm of the total number of documents divided by the number of documents in which the term appears. Putting the two together, the TF-IDF score of a term is simply the product of its TF and IDF values. scikit-learn and NLTK simplify this task considerably. Let's walk through a sample to understand how we can do classification using TF-IDF.
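To make the formula concrete, here is a minimal sketch (not part of the original example) that computes the TF-IDF score of one term by hand for a tiny two-document corpus. It uses the plain textbook formula described above; note that scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization, so its numbers will differ slightly.

import math

docs = ["the movie was good", "the movie was bad"]  # tiny illustrative corpus
term = "good"

# TF: occurrences of the term in the first document divided by its word count
words = docs[0].split()
tf = words.count(term) / len(words)

# IDF: log of (total documents / documents containing the term)
df = sum(1 for d in docs if term in d.split())
idf = math.log(len(docs) / df)

print("TF:", tf, "IDF:", idf, "TF-IDF:", tf * idf)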

We import pandas as pd because we are going to load data from a file called sample.tsv, which contains structured data that can be vectorized. The separator in that file is \t, which indicates a tab. We use pd.read_csv to load the file and build a DataFrame.

import pandas as pd

data = pd.read_csv('sample.tsv', sep='\t')

Once you have the DataFrame, we import TfidfVectorizer. The purpose of TfidfVectorizer is to convert a collection of raw documents into a matrix of TF-IDF features. After vectorization, each document is represented as a vector whose components are the TF-IDF scores of its terms.

from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()

text_tf = tf.fit_transform(data['Phrase'])

Here we fit and transform the data, using the Phrase column as input. In other words, text_tf is now the feature matrix produced by TfidfVectorizer. Once you have text_tf, you can start building your classification model. To do that, the first step is to generate train and test data. For this, we import train_test_split from sklearn.model_selection. The next line calls train_test_split, passing the features, the labels, and the proportion of the data that should be reserved for testing. We use random_state=123 so that the random split is reproducible.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(text_tf, data['Sentiment'], test_size=0.3, random_state=123)

Now that we have train and test data available, we can build the model. We have selected MultinomialNB, the multinomial naive Bayes classifier, for this task. In the next line, we import metrics, which we will need to evaluate the classifier.

from sklearn.naive_bayes import MultinomialNB

from sklearn import metrics

The next line is the statement that actually builds the model: classification = MultinomialNB().fit(X_train, y_train). With this, our classification model is ready and can be used to make predictions and compute the metrics of your choice. Here, we have selected accuracy_score.

classification = MultinomialNB().fit(X_train, y_train)

predicted = classification.predict(X_test)

print ("MultinomialNB Accuracy:",metrics.accuracy_score(y_test, predicted))

The last line prints that score, and you also get the predicted values. Let's run the program and see what accuracy we achieve after applying TfidfVectorizer and building the model. When you execute the program, you will find that all the data in your file has been loaded into a DataFrame.

From that DataFrame, you build your train and test sets. If you inspect y_test, you will see that it contains the sentiment labels, although its indexes differ from the original DataFrame because of the random split. Finally, you can look at the predicted values and the accuracy score they produce.
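If you want to inspect the predictions next to the actual labels, a small sketch like the one below can help; it assumes the variables from the code above (y_test and predicted) are still in scope and uses .values to sidestep the index mismatch mentioned earlier.

comparison = pd.DataFrame({'actual': y_test.values, 'predicted': predicted})
print(comparison.head(10))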

Once you are able to understand and correlate the data, you can also check the current accuracy level. If this accuracy meets your needs, you are done; otherwise, you iterate on the same task until you reach the accuracy you need. For example, you can try increasing the amount of training data, as sketched below.
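Another possible way to iterate, offered as a sketch rather than a guaranteed improvement, is to change the vectorizer settings (for example, removing English stop words or adding bigrams) and re-run the same pipeline; the parameter values below are purely illustrative.

tf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))  # illustrative settings
text_tf = tf.fit_transform(data['Phrase'])
X_train, X_test, y_train, y_test = train_test_split(text_tf, data['Sentiment'], test_size=0.3, random_state=123)
classification = MultinomialNB().fit(X_train, y_train)
predicted = classification.predict(X_test)
print("MultinomialNB Accuracy:", metrics.accuracy_score(y_test, predicted))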
