Building A SPAM Detector with Naïve Bayes and AdaBoost Machine Learning Classifiers in JupyterLab.

Dr. Ganapathi Pulipaka

Published: May 12, 2018

Natural language processing (NLP) is a branch of artificial intelligence. Building a machine or a tool that processes data through NLP draws on mathematics, statistics, algorithms, and Python programming. Techniques such as Word2Vec convert words into vectors, which makes text easier to process with mathematics and deep learning algorithms. Python offers libraries for handling the language humans speak, write, and understand. Before we begin the practical implementation of the Python code in JupyterLab, it is critical to understand the essentials of natural language processing and machine learning classifiers. The following figure depicts the branches of natural language processing.

Figure 1: Adapted from Python Natural Language Processing (Thanaki, 2017).

It is crucial to set the context and the business problem that we are trying to solve through natural language processing and machine learning in Python. Once the problem statement is defined, the next step is to identify the dataset and preprocess it into a form that a machine can work with in Python. Feature engineering, which deals with the linguistics of the text, is another critical part of a data science problem. Any machine learning classifier can then be applied to the preprocessed dataset to classify an email as spam or non-spam. In this scenario, I will be working with a preprocessed dataset that contains 48 columns. The key quantity is the word-frequency measure: the number of times a particular word appears in the document, divided by the total number of words in that document, multiplied by 100. In this preprocessed dataset from UCI, the last column is the label: 1 = SPAM and 0 = not SPAM. Using this dataset does not require much data wrangling or preprocessing before measuring the classification accuracy of the Naïve Bayes and AdaBoost classifiers.
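The word-frequency measure described above can be sketched in a few lines. This is a minimal illustration, not the dataset's own extraction code: the function name `word_frequency`, the sample email, and the whitespace-based tokenization are all assumptions made for the example.

```python
def word_frequency(word, text):
    """Return 100 * (occurrences of `word`) / (total number of words in `text`).

    Assumption: words are separated by whitespace and compared case-insensitively.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0  # an empty document has no word frequencies
    return 100.0 * tokens.count(word.lower()) / len(tokens)


# Hypothetical email text, used only for illustration.
email = "Free money now claim your free prize"
print(word_frequency("free", email))   # "free" appears 2 times out of 7 words
print(word_frequency("money", email))  # "money" appears 1 time out of 7 words
```

Each email then becomes one row of such percentages, one column per tracked word, which is the shape a classifier expects.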

Several other supervised machine learning classifiers can be applied to the email spam detection problem. To solve the problem, a classification model is first trained so that it learns the data features from the training samples. Once training is complete, the model can classify new data. Binary classification is the most common type of classification; here it determines whether an email is SPAM or non-SPAM. Multiclass classification goes beyond binary classification by allowing more than two possible classes. Handwritten digit recognition is a long-standing research and development problem of this kind: identifying the digits from 0 to 9, for example on bank check systems. Multi-label classification is another type, typically applied in bioinformatics and genomics, where a protein can have multiple functions. In this scenario, one solution is to break the multi-label classifier down into many binary classifiers, one per label.

Figure 2: Adapted from Python Machine Learning by Example.
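The idea of breaking a multi-label problem into many binary problems can be sketched as follows. This is a minimal illustration under assumptions: the helper `to_binary_problems` and the protein function names are hypothetical, and only the target vectors are built here, not the classifiers themselves.

```python
def to_binary_problems(label_sets):
    """Turn multi-label targets into one 0/1 target vector per label."""
    all_labels = sorted(set().union(*label_sets))
    return {
        label: [1 if label in labels else 0 for labels in label_sets]
        for label in all_labels
    }


# Hypothetical proteins, each annotated with one or more functions.
proteins = [
    {"binding", "transport"},
    {"catalysis"},
    {"binding", "catalysis"},
]

targets = to_binary_problems(proteins)
for label, vector in targets.items():
    print(label, vector)
```

Each resulting vector is an ordinary binary target, so a separate binary classifier can be trained for each label.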

Bayes’ theorem relates two events, A and B. Typical events are whether there will be a storm tomorrow, or whether a flipped coin lands heads or tails. P(A|B) is the probability of the hypothesis event A given the data B; it is the posterior probability. P(B|A) is the probability of observing B given that A is true. P(A) is the probability of A being true before any data is seen; it is the prior probability of A. P(B) is the probability of the data event B. The theorem combines these as P(A|B) = P(B|A) × P(A) / P(B).

Figure 3: Bayes’ Theorem
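A small worked example may make the theorem concrete for the spam setting. The probabilities below are invented for illustration and are not taken from the dataset; A stands for "the email is spam" and B for "the email contains a given word".

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).

    P(B) is expanded with the law of total probability:
    P(B) = P(B|A) * P(A) + P(B|not A) * P(not A).
    """
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Assumed numbers, for illustration only.
p_spam = 0.4             # P(A): prior probability that an email is spam
p_word_given_spam = 0.5  # P(B|A): the word appears in a spam email
p_word_given_ham = 0.1   # P(B|not A): the word appears in a non-spam email

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word_given_ham)
print(round(p_spam_given_word, 4))  # 0.2 / 0.26, roughly 0.7692
```

Seeing the word raises the probability of spam from the prior of 0.4 to about 0.77, which is the kind of update a Naïve Bayes classifier applies for every feature.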

A Python environment has to be set up on macOS or Linux. This article does not include a step-by-step guide for installing Python on either system. Either PyCharm or the Anaconda distribution can be downloaded to set up the Python and JupyterLab environments. If PyCharm is chosen, individual packages have to be installed with the pip command. If Anaconda is chosen, it installs the necessary Python 3.6 packages, so there won’t be any issues when the Naïve Bayes classifier is imported from scikit-learn.

Most of the time, natural language processing involves massive preprocessing. As long as the data has no imbalances and is a good fit, any machine learning classifier can be applied to the SPAM classification problem for emails. scikit-learn comes with a Naïve Bayes classifier for multinomial models. The multinomial Naïve Bayes classifier is suitable for discrete features such as word counts, and it is also known to work well with TF-IDF features. The data has been shuffled so that the Python program accesses different chunks of data, and the last 100 rows have been considered for training and testing. The machine learning model with the Naïve Bayes classifier has shown an accuracy of 87%. Any other classifier can be applied as well. The AdaBoost classifier has demonstrated an accuracy of 93%.

Figure 4: Adapted from Python Machine Learning by Example.
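The workflow described above (shuffle the rows, split off the last 100, fit both classifiers, compare accuracy) can be sketched with scikit-learn. This is not the shared notebook's code. It runs on synthetic data generated in place, so its scores will not match the 87% and 93% reported for the real dataset. Treating the last 100 rows as the test set, and the values chosen for `n_rows`, `n_estimators`, and the random seeds, are assumptions made for the sketch.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset: 48 non-negative feature columns
# and a binary label (1 = SPAM, 0 = not SPAM).
n_rows, n_features = 1000, 48
y = rng.integers(0, 2, size=n_rows)
rates = np.ones((n_rows, n_features))
rates[y == 1, :10] = 3.0  # spam rows use the first 10 "words" more often
X = rng.poisson(rates).astype(float)

# Shuffle the rows, then hold out the last 100 rows for testing (assumption).
order = rng.permutation(n_rows)
X, y = X[order], y[order]
X_train, y_train = X[:-100], y[:-100]
X_test, y_test = X[-100:], y[-100:]

models = {
    "MultinomialNB": MultinomialNB(),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out rows
    print(f"{name}: {scores[name]:.2f}")
```

To try this on the real data, the synthetic `X` and `y` would be replaced by the feature columns and the last (label) column of the UCI file.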

Results

The JupyterLab notebook has been shared on GitHub at GPSingularity.

References

Cox, T. (2018). Raspberry Pi 3 Cookbook for Python Programmers (3rd ed.). Birmingham, England: Packt Publishing.

Hardeniya, N. (2016). Natural Language Processing: Python and NLTK. Birmingham, England: Packt Publishing.

Liu, Y. H. (2017). Python Machine Learning By Example. Birmingham, England: Packt Publishing.

Thanaki, J. (2017). Python Natural Language Processing. Birmingham, England: Packt Publishing.
