NLP: Spam - Ham classification through TF-IDF Vectorizer
In the earlier blog, we went through a spam - ham classifier built with a bag of words vectorizer, which gave 90% accuracy with many misclassifications. We also went through the drawbacks of bag of words and saw how a TF-IDF vectorizer can overcome those shortcomings. In this blog, I am implementing the same classification problem, but with two changes:
- Perform lemmatization instead of stemming
- Use TF-IDF vectorizer instead of Bag of words.
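To make the second change concrete, here is a minimal pure-Python sketch of TF-IDF weighting on three toy messages. The messages and the helper name `tfidf` are made up for illustration. The sketch assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is what scikit-learn documents as the default for TfidfVectorizer. It is not the code used in this blog.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not taken from the dataset).
docs = [
    "free prize call now",
    "call me later",
    "see you later",
]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

weights = tfidf(docs)
for doc, row in zip(docs, weights):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In the first message, "free" (present in one message) ends up with a larger weight than "call" (present in two), which is the behaviour a plain bag of words count cannot give.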
Dataset is available on: https://www.kaggle.com/uciml/sms-spam-collection-dataset
Step 1: Read the dataset, take the required columns dropping the rest
import pandas as pd

# Keep only the label (v1) and message (v2) columns
data = pd.read_csv('spam.csv', encoding="ISO-8859-1", usecols=['v1', 'v2'])
data.rename(columns={'v1': 'labels', 'v2': 'text'}, inplace=True)
data.head()
Step 2: Check for any imbalance in the dataset
import seaborn as sns

# Pass the column as x explicitly; newer seaborn versions expect keyword arguments
sns.countplot(x=data['labels'])
Step 3: Clean the data, drop the stop words, and lemmatize each word
import re
import nltk
# nltk.download('stopwords')
# nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lm = WordNetLemmatizer()
# Build the stop word set once instead of on every word lookup
stop_words = set(stopwords.words('english'))
corpus = []
for i in range(len(data)):
    # Keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]', ' ', data['text'][i])
    review = review.lower()
    review = review.split()
    review = [lm.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)
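To see what the cleaning does to a single message, here is a small standalone sketch. The sample message, the helper name `clean`, and the tiny `STOP_WORDS` set are hypothetical stand-ins (the NLTK stop word list is much longer), and lemmatization is left out so the sketch runs without downloading any NLTK data.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = {"a", "to", "you", "have", "the", "is", "for"}

def clean(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split into words, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

sample = "WINNER!! You have won a 1000 prize. Call 09061701461 to claim."
cleaned = clean(sample)
print(cleaned)  # winner won prize call claim
```

Digits and punctuation are replaced by spaces, so phone numbers and amounts disappear entirely before vectorization.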
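Step 4: Create TF-IDF features and labels for train and test sets
from sklearn.feature_extraction.text import TfidfVectorizer

cv = TfidfVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()
# y is used in the next step but was not defined; assuming spam = 1 and ham = 0
y = (data['labels'] == 'spam').astype(int).values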
Step 5: Handle the imbalanced data through SMOTE (SMOTETomek)
from imblearn.combine import SMOTETomek

smk = SMOTETomek()
# fit_sample was renamed to fit_resample in newer imbalanced-learn releases
X_bal, y_bal = smk.fit_resample(X, y)
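The core idea behind SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random position on the line between a minority point and one of its minority neighbours. The sketch below is a simplification under stated assumptions: it always uses the single nearest neighbour, the toy points and the helper name `smote_like_sample` are made up, and it does not attempt the Tomek link cleaning that SMOTETomek applies after oversampling.

```python
import random

def smote_like_sample(minority, rng):
    """Create one synthetic point between a minority sample and its nearest minority neighbour."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Nearest neighbour by squared Euclidean distance.
    neighbour = min(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]

# Toy minority-class points (made up for illustration).
minority = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.0]]
rng = random.Random(0)
synthetic = [smote_like_sample(minority, rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because each new point is an interpolation between two real minority points, it stays inside the region the minority class already occupies.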
Step 6: Split into train and test sets, then train a Naive Bayes model
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal, test_size=0.20, random_state=0)
# Fit the model that Step 7 uses for prediction
spam_detect_model = MultinomialNB().fit(X_train, y_train)
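For readers curious about what a multinomial Naive Bayes model computes, here is a minimal pure-Python sketch. It scores each class as the log prior plus the feature values weighted by the log of Laplace-smoothed per-class feature probabilities. The function names `fit_mnb` and `predict_mnb` and the toy feature matrix are hypothetical; this only illustrates the idea and is not a replacement for MultinomialNB.

```python
import math

def fit_mnb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log feature probabilities per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_prob = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Per-feature totals for this class, with additive (Laplace) smoothing.
        totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        denom = sum(totals)
        log_prob[c] = [math.log(t / denom) for t in totals]
    return log_prior, log_prob

def predict_mnb(model, x):
    """Return the class with the highest log score for feature vector x."""
    log_prior, log_prob = model
    scores = {
        c: log_prior[c] + sum(v * lp for v, lp in zip(x, log_prob[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy features: columns are counts for the words "free", "call", "later".
X_toy = [[2, 1, 0], [1, 1, 0], [0, 1, 2], [0, 0, 1]]
y_toy = [1, 1, 0, 0]  # 1 = spam, 0 = ham (toy labels)
model = fit_mnb(X_toy, y_toy)
print(predict_mnb(model, [1, 0, 0]))  # 1
print(predict_mnb(model, [0, 0, 1]))  # 0
```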
Step 7: Predict on X_test, then calculate the accuracy and confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = spam_detect_model.predict(X_test)
print("accuracy score:", accuracy_score(y_pred, y_test))
# With y_pred as the first argument, rows are predicted labels and columns are actual labels
print(confusion_matrix(y_pred, y_test))
Accuracy score: 0.97979
Confusion matrix: [[911 14] [ 25 980]]
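As a quick sanity check, the reported accuracy can be recomputed from the confusion matrix by dividing the diagonal (correct predictions) by the total. The variable names below are illustrative only.

```python
# Confusion matrix as reported above.
cm = [[911, 14],
      [25, 980]]

correct = cm[0][0] + cm[1][1]            # diagonal entries
total = sum(sum(row) for row in cm)      # all test samples
accuracy = correct / total

print(total - correct)      # 39 misclassified messages
print(round(accuracy, 5))   # 0.97979
```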
A few changes to the code lifted the accuracy from 90% to about 98% and reduced the misclassifications. Thank you for reading through this blog. I will come back with more exciting content tomorrow!