Text Classification Using scikit-learn

How to implement TF and TF-IDF text classification using Python, scikit-learn, and NLTK?

Now, let's try to understand what TF-IDF is. To define TF-IDF, we first need to understand TF and IDF separately. TF (term frequency) is the number of times a term appears in a particular document, divided by the total number of words in that document.

IDF (inverse document frequency), on the other hand, measures how rare a term is across the corpus: it is the logarithm of the total number of documents divided by the number of documents in which the term appears. Putting the two together, the TF-IDF score of a term is simply the product of its TF and IDF values. scikit-learn and NLTK simplify this task considerably. Let's walk through a sample to understand how we can do classification using TF-IDF.
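To make the formula concrete, here is a minimal sketch (not part of the original example) that computes the TF-IDF score of one term by hand for a tiny two-document corpus. It uses the plain textbook formula described above; note that scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization, so its numbers will differ slightly.

import math

docs = ["the movie was good", "the movie was bad"]  # tiny illustrative corpus
term = "good"

# TF: occurrences of the term in the first document divided by its word count
words = docs[0].split()
tf = words.count(term) / len(words)

# IDF: log of (total documents / documents containing the term)
df = sum(1 for d in docs if term in d.split())
idf = math.log(len(docs) / df)

print("TF:", tf, "IDF:", idf, "TF-IDF:", tf * idf)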

We import pandas as pd because we are going to load data from a file called sample.tsv, which contains structured data that can be vectorized. The separator in that file is \t, which indicates a tab. We use pd.read_csv to load the file and build a DataFrame.

import pandas as pd

data = pd.read_csv('sample.tsv', sep='\t')

Once you have the DataFrame, we import TfidfVectorizer. The purpose of TfidfVectorizer is to convert a collection of raw documents into a matrix of TF-IDF features. After vectorization, each document is represented as a vector whose components are the TF-IDF scores of its terms.

from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()

text_tf = tf.fit_transform(data['Phrase'])

Here we fit and transform the data, using the Phrase column as input. In other words, text_tf is now the feature matrix produced by TfidfVectorizer. Once you have text_tf, you can start building your classification model. To do that, the first step is to generate train and test data. For this, we import train_test_split from sklearn.model_selection. The next line calls train_test_split, passing the features, the labels, and the proportion of the data that should be reserved for testing. We use random_state=123 so that the random split is reproducible.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(text_tf, data['Sentiment'], test_size=0.3, random_state=123)

Now that we have train and test data available, we can build the model. We have selected MultinomialNB, the multinomial naive Bayes classifier, for this task. In the next line, we import metrics, which we will need to evaluate the classifier.

from sklearn.naive_bayes import MultinomialNB

from sklearn import metrics

The next line is the statement that actually builds the model: classification = MultinomialNB().fit(X_train, y_train). With this, our classification model is ready and can be used to make predictions and compute the metrics of your choice. Here, we have selected accuracy_score.

classification = MultinomialNB().fit(X_train, y_train)

predicted = classification.predict(X_test)

print ("MultinomialNB Accuracy:",metrics.accuracy_score(y_test, predicted))

The last line prints that score, and you also get the predicted values. Let's run the program and see what accuracy we achieve after applying TfidfVectorizer and building the model. When you execute the program, you will find that all the data in your file has been loaded into a DataFrame.

From that DataFrame, you build your train and test sets. If you inspect y_test, you will see that it contains the sentiment labels, although its indexes differ from the original DataFrame because of the random split. Finally, you can look at the predicted values and the accuracy score they produce.
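If you want to inspect the predictions next to the actual labels, a small sketch like the one below can help; it assumes the variables from the code above (y_test and predicted) are still in scope and uses .values to sidestep the index mismatch mentioned earlier.

comparison = pd.DataFrame({'actual': y_test.values, 'predicted': predicted})
print(comparison.head(10))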

Once you are able to understand and correlate the data, you can also check the current accuracy level. If this accuracy meets your needs, you are done; otherwise, you iterate on the same task until you reach the accuracy you need. For example, you can try increasing the amount of training data, as sketched below.
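Another possible way to iterate, offered as a sketch rather than a guaranteed improvement, is to change the vectorizer settings (for example, removing English stop words or adding bigrams) and re-run the same pipeline; the parameter values below are purely illustrative.

tf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))  # illustrative settings
text_tf = tf.fit_transform(data['Phrase'])
X_train, X_test, y_train, y_test = train_test_split(text_tf, data['Sentiment'], test_size=0.3, random_state=123)
classification = MultinomialNB().fit(X_train, y_train)
predicted = classification.predict(X_test)
print("MultinomialNB Accuracy:", metrics.accuracy_score(y_test, predicted))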
