NLP: Spam - Ham classification through TF-IDF Vectorizer

L Koushik Kumar

Head of AI Practice at Aptagrim limited

Published: January 26, 2021

In the earlier blog, we built a spam-ham classifier with a bag-of-words vectorizer, which gave 90% accuracy with many misclassifications. We also went through the drawbacks of bag of words and saw how a TF-IDF vectorizer can overcome those shortcomings. In this blog, I am implementing the same classification problem, but with two changes.

Perform lemmatization instead of stemming.
Use a TF-IDF vectorizer instead of bag of words.

The dataset is available on Kaggle: https://www.kaggle.com/uciml/sms-spam-collection-dataset

Step 1: Read the dataset, keeping the required columns and dropping the rest

import pandas as pd

# Load only the label (v1) and message (v2) columns, then give them readable names
data = pd.read_csv('spam.csv', encoding="ISO-8859-1", usecols=['v1', 'v2'])
data.rename(columns={'v1': 'labels', 'v2': 'text'}, inplace=True)

data.head()

Step 2: Check for any imbalance in the dataset

import seaborn as sns

# Pass the column as x explicitly; recent seaborn releases expect keyword arguments here
sns.countplot(x=data['labels'])
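
The plot shows the imbalance visually; the same check can be made numerically. A minimal sketch in plain Python, where `class_shares` is a hypothetical helper and the label list is made-up toy data rather than the real dataset:

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels for illustration only (not the SMS dataset)
toy_labels = ['ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham']
shares = class_shares(toy_labels)
print(shares)
```

With the DataFrame from Step 1, `class_shares(data['labels'])` should give the same kind of summary, as should the pandas built-in `data['labels'].value_counts(normalize=True)`.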

Step 3: Clean the data, drop the stop words and apply lemmatization

import re
import nltk
# Uncomment on first run to fetch the NLTK resources used below
#nltk.download('stopwords')
#nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lm = WordNetLemmatizer()
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
corpus = []
for i in range(0, len(data)):
    # Keep letters only, lower-case the text and split it into tokens
    review = re.sub('[^a-zA-Z]', ' ', data['text'][i])
    review = review.lower()
    review = review.split()
    # Lemmatize every token that is not a stop word
    review = [lm.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)

    corpus.append(review)
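
The cleaning steps can also be wrapped in a function so that the same treatment is applied to new messages at prediction time. The sketch below is an illustration under assumptions: `clean_text` is a hypothetical helper, the stop-word set is a tiny hand-written stand-in for the NLTK list, and the lemmatizer is passed in as an optional callable so the example runs without any NLTK data.

```python
import re

# Tiny stand-in for stopwords.words('english'); an assumption for illustration
TOY_STOP_WORDS = {'a', 'an', 'the', 'is', 'to', 'you', 'have', 'now'}

def clean_text(text, stop_words=TOY_STOP_WORDS, lemmatize=None):
    """Keep letters only, lower-case, drop stop words, optionally lemmatize."""
    tokens = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    kept = [tok for tok in tokens if tok not in stop_words]
    if lemmatize is not None:
        kept = [lemmatize(tok) for tok in kept]
    return ' '.join(kept)

cleaned = clean_text("You have WON a FREE prize!!! Call 0800-123 now")
print(cleaned)
```

In the pipeline of this blog, passing `stop_words=set(stopwords.words('english'))` and `lemmatize=lm.lemmatize` would plug in the NLTK pieces.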

Step 4: Create TF-IDF features and labels for train and test sets

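from sklearn.feature_extraction.text import TfidfVectorizer
# Limit the vocabulary to the 2500 most frequent terms
cv = TfidfVectorizer(max_features=2500)

X = cv.fit_transform(corpus).toarray()
# y is used in the next step but was never defined; encoding spam as 1 and ham as 0 is an assumption
y = (data['labels'] == 'spam').astype(int).values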
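
To make the weighting concrete, here is a minimal plain-Python sketch of the TF-IDF computation. It assumes the smoothed IDF and L2 row normalization that `TfidfVectorizer` applies by default, and it tokenizes by simple whitespace splitting; `tfidf_rows` and the three toy documents are hypothetical and only for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return the sorted vocabulary and one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Toy documents for illustration only
docs = ['call free prize', 'call later', 'call me']
vocab, rows = tfidf_rows(docs)
weights = dict(zip(vocab, rows[0]))
print(weights)
```

The word `call` appears in every toy document, so in the first row it ends up with a lower weight than `free` or `prize`. Plain bag-of-words counts would give all three words the same value.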

Step 5: Handle the imbalanced data with SMOTE (SMOTETomek)

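from imblearn.combine import SMOTETomek
smk = SMOTETomek()

# fit_resample is the current method name; fit_sample is gone from newer imbalanced-learn releases
X_bal, y_bal = smk.fit_resample(X, y)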
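
SMOTETomek combines SMOTE oversampling with the removal of Tomek links. The oversampling half creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. Below is a minimal plain-Python sketch of that interpolation step only, with a hypothetical function name (`smote_like_samples`) and toy 2-D points; it is not the imbalanced-learn implementation.

```python
import random

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority-class points for illustration only
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Each synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.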

Step 6: Split into train and test sets, and train a Naive Bayes model

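from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal, test_size = 0.20, random_state = 0)


from sklearn.naive_bayes import MultinomialNB
# The model used in Step 7 was never defined; fitting with default settings is an assumption
spam_detect_model = MultinomialNB().fit(X_train, y_train)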

Step 7: Predict on X_test, then calculate accuracy and the confusion matrix

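y_pred = spam_detect_model.predict(X_test)


from sklearn.metrics import confusion_matrix, accuracy_score
# y_pred is passed first, so the matrix rows are predicted classes and the columns are true classes
print("accuracy score:", accuracy_score(y_pred, y_test))
print(confusion_matrix(y_pred, y_test))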

Accuracy score: 0.97979

Confusion matrix: [[911 14] [ 25 980]]
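
The reported accuracy can be checked against the confusion matrix, and the same four counts also give precision and recall. This sketch assumes class 1 is spam (the code does not show how the labels were encoded) and relies on the fact that the code passes `y_pred` first, so rows are predictions and columns are true labels.

```python
# Confusion matrix as reported above: rows = predicted class, columns = true class
matrix = [[911, 14],
          [25, 980]]

tn, fn = matrix[0]  # predicted 0: truly 0, truly 1
fp, tp = matrix[1]  # predicted 1: truly 0, truly 1

total = tn + fn + fp + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted class 1 that is truly class 1
recall = tp / (tp + fn)     # share of true class 1 that was found

print(f"accuracy={accuracy:.5f} precision={precision:.5f} recall={recall:.5f}")
```

The 39 misclassified messages out of 1930 test samples correspond to the accuracy of about 0.9798 reported above.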

A few changes to the code raised the accuracy from 90% to about 98% and reduced the misclassifications. Thank you for reading through this blog. I will be back with more exciting content tomorrow!

Abhishek Dutta

CS&E, Quantum-NN

2 years ago

where did you define 'spam_detect_model'?
