NLP: Spam - Ham classification through Bag of Words

In the last two blogs, we went through the basic concepts of NLP. Let's look at the implementation side through a spam-ham classification problem.

Dataset is available on: https://www.kaggle.com/uciml/sms-spam-collection-dataset

Step 1: Read the dataset, keeping only the required columns and dropping the rest

import pandas as pd

# Read only the label and text columns; the file uses Latin-1 encoding
data = pd.read_csv('spam.csv', encoding="ISO-8859-1", usecols=['v1', 'v2'])
data.rename(columns={'v1': 'labels', 'v2': 'text'}, inplace=True)
data.head()

[Output: first five rows of the dataframe, with the 'labels' and 'text' columns]

Step 2: Check for any imbalance in the dataset

import seaborn as sns

# Bar plot of how many messages fall into each class
sns.countplot(x='labels', data=data)


[Output: count plot showing far more ham messages than spam]
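
To put a number on the imbalance, a quick value count on the same column works as well (a small optional check, not part of the original post):

# How many messages fall into each class; ham dominates by a wide margin
print(data['labels'].value_counts())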

Step 3: Clean the text, dropping stop words and applying stemming

import re
import nltk
# nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # a set makes membership checks fast

corpus = []
for i in range(0, len(data)):
    # Keep only letters, lowercase the message and split it into words
    review = re.sub('[^a-zA-Z]', ' ', data['text'][i])
    review = review.lower()
    review = review.split()
    # Drop stop words and reduce every remaining word to its stem
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
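
A quick look at a few cleaned messages is a handy sanity check that the preprocessing behaves as expected (optional, assuming the loop above has run):

# Inspect the first few cleaned messages
print(corpus[:3])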

Step 4: Create the bag-of-words features and the label vector

from sklearn.feature_extraction.text import CountVectorizer

# Bag of words restricted to the 2500 most frequent terms
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()

# One-hot encode the labels and keep the 'spam' column: 1 = spam, 0 = ham
y = pd.get_dummies(data['labels'])
y = y.iloc[:, 1].values
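
To see what the vectorizer learned, the feature matrix shape and a slice of the vocabulary can be printed (an optional check; get_feature_names_out is the method name in recent scikit-learn releases, older ones use get_feature_names):

# Shape is (number of messages, 2500 features)
print(X.shape)
# First few terms in the learned vocabulary
print(cv.get_feature_names_out()[:10])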

Step 5: Handle the class imbalance with SMOTETomek (SMOTE oversampling combined with Tomek-link undersampling)

from imblearn.combine import SMOTETomek

# Oversample the minority class with SMOTE and remove overlapping Tomek-link pairs
smk = SMOTETomek()
X_bal, y_bal = smk.fit_resample(X, y)
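
A quick count of each class before and after resampling confirms the balancing worked (optional, assuming numpy is available):

import numpy as np

# Class counts before and after SMOTETomek
print(np.bincount(y.astype(int)))
print(np.bincount(y_bal.astype(int)))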

Step 6: Split into train and test sets, then train a Naive Bayes classifier

from sklearn.model_selection import train_test_split

# 80/20 split on the balanced data
X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal, test_size=0.20, random_state=0)


from sklearn.naive_bayes import MultinomialNB

# Multinomial Naive Bayes suits discrete word-count features
spam_detect_model = MultinomialNB().fit(X_train, y_train)

Step 7: Predict on X_test, then compute the accuracy and the confusion matrix

y_pred = spam_detect_model.predict(X_test)


from sklearn.metrics import confusion_matrix, accuracy_score

print("accuracy score:", accuracy_score(y_pred, y_test))
print(confusion_matrix(y_pred, y_test))



Accuracy score: 0.9041

Confusion matrix:
[[920 162]
 [ 23 825]]
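
Beyond a single accuracy number, per-class precision and recall give a fuller picture; scikit-learn's classification_report prints them directly (an optional extra, not part of the original run, assuming label 1 is spam as encoded above):

from sklearn.metrics import classification_report

# Precision, recall and F1 for both classes
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))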

Using lemmatization and a TF-IDF vectorizer may help boost the accuracy; a rough sketch of that change is included below, and I will test it out properly soon. Thank you for reading this blog. More exciting content coming soon!
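
For reference, swapping in lemmatization and TF-IDF would only touch Steps 3 and 4; a minimal sketch, assuming the WordNet data has been downloaded and everything else stays the same:

from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
# nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# In Step 3, replace ps.stem(word) with:
#     lemmatizer.lemmatize(word)

# In Step 4, replace CountVectorizer with a TF-IDF vectorizer,
# which weights down terms that appear in almost every message
tfidf = TfidfVectorizer(max_features=2500)
X = tfidf.fit_transform(corpus).toarray()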

