
Building SMS SPAM Detector and Generating a WordCloud with Kaggle Dataset in JupyterLab

Dr. Ganapathi Pulipaka

Published: May 17, 2018

Background problem

At least 97% of Americans use text messages on their mobile phones every day. According to Portio Research, 8.3 trillion messages were exchanged over mobile phones in 2016. That works out to roughly 23 billion messages per day and 16 million messages per minute. There were around 6.4 billion mobile subscribers worldwide by the end of 2012. Portio Research forecast a compound annual growth rate (CAGR) of 4.8% in the mobile subscriber base from 2014 to 2017, and by the end of 2017 the subscriber base had reached 7.4 billion.

The proliferation of smart devices powered by exponential computing has driven a significant rise in the global smartphone system-on-chip market, led by Qualcomm, Apple, MediaTek, Samsung, HiSilicon, Spreadtrum, and a large number of other smartphone chip manufacturers. Powering these chips with artificial intelligence technology paves the path to 5G, with higher performance and better signal processing. Despite the multifunctional and advanced capabilities of smartphones, simple text messaging has continued to soar in markets worldwide, and the exponential growth of computing power has made it possible to generate this volume of data over text messages.

The timeline of the mobile handset industry from 1983 (when the first mobile handset launched) to 2002 (the first mobile phone with a touchscreen) shows a significant increase in computing power and in SoC (system-on-chip) architecture, which made this tsunami of SMS data possible. The first SMS communication service launched in 1992, 3G mobile services launched in 2002, and 4G networks launched in 2010. Faster delivery increased the use of SMS by businesses and individuals, who now manage a significant part of their lives through it. The mobile messaging service industry generated revenues of $212 billion in 2012 alone.

Figure 1. Adapted from Portio Research.

According to Portio Research, SMS traffic grew from just 0.5 billion messages to 100 billion messages between 1996 and 1999. By the end of 2003, four years later, SMS traffic had more than quadrupled to 450 billion messages. Only two years after that, in 2005, SMS traffic passed the one-trillion-message mark. By 2009, the world had seen traffic of five trillion messages. In 2015, traffic peaked at 8.3 trillion messages. SMS traffic grew by leaps and bounds with application-to-person and person-to-application messaging in the banking, mobile health, and mobile payments sectors. This left room for abundant SPAM from telemarketers sending SMS texts. Nowadays, many recruiters send SMS messages advertising job positions, sometimes to subscribers and sometimes to people who never subscribed. I receive a heavy volume of SMS messages from recruiting agencies without having subscribed. This has been coordinated by some of the people who hacked my accounts on Facebook and Twitter. I still have those messages.

According to two Forrester Research publications, Forrester Research Mobile Media Application Spending Forecast 2012–2017 EU-7 and Forrester Research Mobile Media Application Spending Forecast 2012–2017 US, six billion SMS messages are sent in the US alone every day. The majority of these texts, 80%, are generated by American adults. The rise of SPAM can be attributed to the 98% open rate of SMS, as opposed to a 20% open rate for email. The response rate is also higher for text messages, at 45% versus 6% for email. Americans exchange twice as many text messages as phone calls.

Naïve Bayes classifier

Given the exponential growth in big data and SMS traffic, there has been significant growth in SMS spam as a medium for committing fraud and for advertising job opportunities. Spam filtering can be done with a Naïve Bayes classifier that labels each SMS as either SPAM or HAM. In essence, a Naïve Bayes classifier can work as anti-spam software with high accuracy. The Python implementation here reached an accuracy of 99.38% on the training set and 98.15% on the test set. A Kaggle dataset was used to perform the SPAM detection. The Kaggle dataset file has two columns, labeled v1 and v2. The v1 column contains the label, either spam or ham, while the v2 column contains the actual SMS message. US users receive approximately 1.1 billion SPAM SMS messages and Chinese mobile users receive 8.29 billion every week from various advertising media and fraudulent corporations. Many classifiers can be applied to the SMS SPAM problem, such as rule induction, neural networks, decision trees, Naïve Bayes, k-nearest neighbors, and support vector machines. Classifying email is entirely different from classifying SMS text, because the length of an SMS is limited to 160 characters. The featurization therefore has to be adequate to distinguish ham from spam with very little text. Historically, the Naïve Bayes classification algorithm has proven highly effective at identifying SPAM.
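To make the classification step concrete, here is a minimal from-scratch sketch of a multinomial Naïve Bayes classifier. It is not the author's program: the toy messages, the whitespace tokenizer, the add-one (Laplace) smoothing, and the function names `tokenize`, `train`, and `predict` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Toy training data (illustrative only, not taken from the Kaggle file).
TRAIN = [
    ("spam", "free ringtone call now"),
    ("spam", "call now to claim your free prize"),
    ("ham", "are we still on for lunch"),
    ("ham", "call me when you get home"),
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def train(messages):
    """Collect per-label document counts, per-label word counts, and the vocabulary."""
    doc_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, text in messages:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log posterior under the naive independence assumption."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: fraction of training messages with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # Words never seen in training are skipped.
            # Add-one smoothing keeps unseen (label, word) pairs from zeroing the product.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(predict(model, "claim your free ringtone"))
print(predict(model, "see you at lunch"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters even for messages capped at 160 characters once the vocabulary grows.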

Figure 2. Kaggle SPAM dataset.
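The two-column layout can be read with the standard library alone. The sketch below uses a small in-memory stand-in for the file; the sample rows, the empty trailing columns, and the helper name `load_messages` are assumptions for illustration, and only the v1/v2 column names and the ISO-8859-1 encoding come from the article.

```python
import csv
import io

# Hypothetical stand-in for the dataset file: v1 (label), v2 (message), plus
# empty trailing columns, encoded as ISO-8859-1 (note the pound sign).
RAW = (
    "v1,v2,,,\n"
    "ham,See you at lunch?,,,\n"
    "spam,WINNER! Claim your £1000 prize now,,,\n"
).encode("iso-8859-1")

def load_messages(raw_bytes):
    """Decode the bytes as ISO-8859-1 and keep only the label and the message text."""
    text = raw_bytes.decode("iso-8859-1")
    reader = csv.DictReader(io.StringIO(text))
    # Keep v1 and v2; every other column is ignored, which drops the irrelevant ones.
    return [(row["v1"], row["v2"]) for row in reader]

messages = load_messages(RAW)
for label, text in messages:
    print(label, "|", text)
```

Decoding these bytes as UTF-8 would raise a `UnicodeDecodeError` on the pound sign, which is why the encoding has to be stated explicitly when reading the file.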

Figure 3. Adapted from The 2016 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob) Research Paper

TF-IDF Vectorizer vs. Common vectorizer strategies

As with other problems, the process starts by loading the dataset, reading it in Python with ISO-8859-1 encoding, and then applying the Naïve Bayes machine learning algorithm through the training and testing stages of building a model. Any irrelevant columns in the file need to be dropped. Feature extraction can be performed with either a count vectorizer or a TF-IDF vectorizer. In scikit-learn, CountVectorizer implements both tokenization and occurrence counting in a single class. In this common usage, the vectorizer tokenizes the text and counts word occurrences over a small corpus of text files or documents. Alternatively, the TF-IDF vectorizer can be applied. In a large text corpus, words such as "the", "a", or "is" occur very frequently in English yet carry little information about the content, so TF-IDF scales down their weight. In scikit-learn, TfidfTransformer reweights a matrix of word counts in this way, and TfidfVectorizer combines the counting and the reweighting in one class.
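The difference between raw counts and TF-IDF weights can be seen in a small pure-Python sketch. It uses the textbook formula tf × log(N / df); scikit-learn's TfidfVectorizer applies a smoothed and normalized variant by default, so its numbers would differ. The three documents and the function names are assumptions for illustration.

```python
import math
from collections import Counter

# Tiny illustrative corpus; "the" appears in every document.
DOCS = [
    "call the number for a free ringtone",
    "call me after the meeting",
    "the meeting is at noon",
]

def count_vector(doc):
    """Bag-of-words counts for one document (what a count vectorizer produces)."""
    return Counter(doc.lower().split())

def tfidf_vectors(docs):
    """Weight each count by log(N / df), where df is the number of documents containing the word."""
    counts = [count_vector(d) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # Each document contributes at most 1 per word.
    return [
        {word: tf * math.log(n_docs / df[word]) for word, tf in c.items()}
        for c in counts
    ]

vectors = tfidf_vectors(DOCS)
print(count_vector(DOCS[0])["the"], vectors[0]["the"])
print(vectors[0]["call"], vectors[0]["free"])
```

In the first document, "the" has a count of 1 but a TF-IDF weight of zero because it occurs in every document, while "free", which occurs in only one document, gets the largest weight.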

Data Visualization

Generating a word cloud with the wordcloud package shows the most frequently repeated SPAM words, such as call, free, now, UK, ringtone, customer service, chat, landline, and text, rendered in a combination of blue and green. The word cloud generated by the program is shown below:

For ham, the program produces the following word cloud.

Figure 5. Python program output from Jupyter console for HAM text.
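A word cloud is driven by per-word frequencies, so the counting step behind the two figures can be sketched as follows. The sample messages, the small stopword set, and the helper name `word_frequencies` are assumptions for illustration and are not taken from the author's program.

```python
import re
from collections import Counter

# Illustrative messages in (label, text) form, not taken from the Kaggle file.
MESSAGES = [
    ("spam", "Free ringtone! Call now"),
    ("spam", "Call customer service now for your free prize"),
    ("ham", "Call me when you get home"),
]

# A tiny hand-picked stopword set; a real run would use a fuller list.
STOPWORDS = {"a", "the", "to", "your", "for", "is", "and", "you"}

def word_frequencies(messages, label):
    """Count non-stopword tokens across all messages carrying the given label."""
    freq = Counter()
    for msg_label, text in messages:
        if msg_label != label:
            continue
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS:
                freq[word] += 1
    return freq

spam_freq = word_frequencies(MESSAGES, "spam")
ham_freq = word_frequencies(MESSAGES, "ham")
print(spam_freq.most_common(3))
print(ham_freq.most_common(3))
```

If the wordcloud package is installed, a mapping like `spam_freq` can be passed to `WordCloud().generate_from_frequencies(...)` to render the image; whether the original program builds its clouds from frequencies or from raw text is not stated in the article.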

Results

I have shared the program on GitHub at GPSingularity.
