Building SMS SPAM Detector and Generating a WordCloud with Kaggle Dataset in JupyterLab
Background problem
At least 97% of Americans use text messages on their mobile phones every day. In 2016, according to Portio Research, 8.3 trillion messages were exchanged over mobile phones, which works out to roughly 23 billion messages per day and 16 million messages per minute. There were around 6.4 billion mobile subscribers worldwide by the end of 2012. Portio Research projected a compound annual growth rate (CAGR) of 4.8% in the mobile subscriber base from 2014 to 2017, and by the end of 2017 the subscriber base had reached 7.4 billion.
The proliferation of smart devices, powered by exponentially growing computing power, has driven a significant rise in the global smartphone system-on-chip (SoC) market, led by Qualcomm, Apple, MediaTek, Samsung, HiSilicon, Spreadtrum, and a large number of other smartphone chip manufacturers. Powering these chips with artificial intelligence technology paves the way to 5G, with higher performance and better signal processing. Despite the multifunctional and advanced capabilities of smartphones, simple text messaging has continued to soar in markets worldwide. The exponential growth of computing power made it possible to generate such massive volumes of data through text messages.
The timeline of the mobile handset industry, from 1983 (when the first mobile handset launched) to 2002 (the first mobile phone with a touchscreen), shows a significant increase in computing power and in SoC architecture capable of handling this tsunami of SMS data. The first SMS communication service launched in 1992, 3G mobile services launched in 2002, and 4G networks launched in 2010. Faster delivery led businesses and individuals to communicate more through SMS and to manage a significant part of their lives with it. The mobile messaging industry generated revenues of $212 billion in 2012 alone.
Figure 1. Adapted from Portio Research.
According to Portio Research, SMS traffic grew from just 0.5 billion messages to 100 billion messages between 1996 and 1999. By the end of 2003, four years later, SMS traffic had more than quadrupled to 450 billion messages. In just two years, between 2003 and 2005, traffic passed the one trillion message mark. By 2009, the world had seen traffic of five trillion messages, and in 2015 traffic peaked at 8.3 trillion messages. SMS traffic grew by leaps and bounds with application-to-person and person-to-application messaging in the banking, mobile health, and mobile payments sectors. This left room for abundant SPAM from telemarketers sending SMS texts. Nowadays, many recruiters send SMS messages about job positions, sometimes with a subscription and sometimes without one. I receive a heavy volume of SMS messages from recruiting agencies that I never subscribed to. This has been coordinated by some of the people who hacked my accounts on Facebook and Twitter. I still have those messages.
According to two Forrester Research publications, Forrester Research Mobile Media Application Spending Forecast 2012–2017 EU-7 and Forrester Research Mobile Media Application Spending Forecast 2012–2017 US, six billion SMS messages are sent in the US alone every day. The majority of these texts, about 80%, are generated by American adults. The rise of SPAM can be attributed to the 98% open rate of SMS, as opposed to a 20% open rate for email. The response rate is also higher: 45% for text messages versus 6% for emails. Americans exchange twice as many text messages as phone calls.
Naïve Bayes classifier
Given the exponential growth in big data and SMS traffic, there has been significant growth in SMS spam as a medium for committing fraud and advertising job opportunities. Spam filtering can be performed with a Naïve Bayes classifier, which classifies each SMS as either SPAM or HAM. In essence, a Naïve Bayes classifier can work as anti-spam software with high accuracy. This Python implementation achieved an accuracy of 99.38% on the training set and 98.15% on the test set. A Kaggle dataset was used to perform the SPAM detection. The dataset file has two columns, labeled v1 and v2: v1 contains the label, either spam or ham, while v2 contains the actual SMS message. US users receive approximately 1.1 billion SPAM SMS messages and Chinese mobile users receive 8.29 billion SMS spams every week from various advertising media and fraudulent corporations. Many classifiers can be applied to the SMS SPAM filtering problem, such as rule induction, neural networks, decision trees, Naïve Bayes, k-nearest neighbors, and support vector machines. One has to consider that classifying email is quite different from classifying SMS text, because the length of an SMS is limited to 160 characters. Therefore, the featurization has to be adequate to distinguish ham from spam. Historically, the Naïve Bayes classification algorithm has proven to be highly effective in identifying SPAM.
Figure 2. Kaggle SPAM dataset.
Figure 3. Adapted from The 2016 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob) Research Paper
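To make the classification step more concrete, here is a minimal from-scratch sketch of a multinomial Naïve Bayes classifier with add-one smoothing. It is not the program behind the accuracy figures quoted above. The class name TinyNaiveBayes, the tokenizer, and the four toy messages are all made up for illustration; only the v1 (label) and v2 (message) naming follows the dataset layout.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyNaiveBayes:
    """Illustrative multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, messages, labels):
        self.class_counts = Counter(labels)      # number of messages per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for text, label in zip(messages, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Return log P(class) + sum of log P(word | class) for each class."""
        total_msgs = sum(self.class_counts.values())
        scores = {}
        for label, n_msgs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_msgs / total_msgs)
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy data (hypothetical), mirroring the v1 = label, v2 = message layout.
train_v2 = [
    "Free entry! Call now to claim your prize",
    "Urgent: you have won a free ringtone, text WIN now",
    "Are we still meeting for lunch today?",
    "I'll call you when I get home",
]
train_v1 = ["spam", "spam", "ham", "ham"]

model = TinyNaiveBayes().fit(train_v2, train_v1)
print(model.predict("Claim your free prize now"))
print(model.predict("See you at lunch"))
```

The "naïve" part is the assumption that words occur independently given the class, which is why the per-word log probabilities can simply be added. With only four training messages this proves nothing about real accuracy; it only shows the mechanics.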
TF-IDF Vectorizer vs. Common vectorizer strategies
As with other problems, the process starts by loading the dataset in Python with ISO-8859-1 encoding, then applying the Naïve Bayes machine learning algorithm through the training and testing stages of building a model. Any irrelevant columns in the file need to be dropped. Feature extraction can be performed with either a count vectorizer or a TF-IDF vectorizer. CountVectorizer implements both tokenization and occurrence counting in a single class: it tokenizes the text and counts how often each word occurs, even in a minimal corpus of text files or documents. Alternatively, a TF-IDF vectorizer can be applied. In a large text corpus, words such as the, a, or is occur very frequently in English while carrying little information about the content. The TfidfTransformer and TfidfVectorizer classes in scikit-learn re-weight the word counts so that such frequent words have less influence than rarer, more informative ones.
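The loading and feature extraction steps described above can be sketched as follows, assuming pandas and scikit-learn are installed. The in-memory CSV, its "extra" column, and the three messages are hypothetical stand-ins for the Kaggle file; with the real dataset you would pass the file path to read_csv instead.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical stand-in for the Kaggle CSV: the v1 and v2 columns plus an
# empty, irrelevant column, encoded as ISO-8859-1.
raw_csv = (
    "v1,v2,extra\n"
    "ham,Are we meeting at the café today,\n"
    "spam,Call now to claim the free prize,\n"
    "ham,The train is late,\n"
).encode("ISO-8859-1")

df = pd.read_csv(io.BytesIO(raw_csv), encoding="ISO-8859-1")
df = df[["v1", "v2"]]  # keep the label and the message, drop everything else

# Count vectorizer: tokenization and raw occurrence counts.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(df["v2"])

# TF-IDF vectorizer: the same counts, re-weighted by inverse document frequency.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(df["v2"])

spam_row = 1  # the single spam message in the toy data
the_col = count_vec.vocabulary_["the"]
prize_col = count_vec.vocabulary_["prize"]
print("counts:", counts[spam_row, the_col], counts[spam_row, prize_col])
print("tf-idf:", tfidf[spam_row, tfidf_vec.vocabulary_["the"]],
      tfidf[spam_row, tfidf_vec.vocabulary_["prize"]])
```

In the spam message, "the" and "prize" each occur once, so their raw counts are equal. Because "the" appears in every toy message and "prize" in only one, TF-IDF gives "prize" the larger weight, which is the effect that makes it attractive for a corpus full of common words.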
Data Visualization
Generating a word cloud with the wordcloud package shows the most frequently repeated SPAM words, such as call, free, now, UK, ringtone, customer service, chat, landline, and text, rendered in a combination of blue and green. The word cloud generated by the program is shown below:
The data visualization for ham shows the following word cloud from the program.
Figure 5. Python program output from Jupyter console for HAM text.
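A word cloud is essentially a picture of word frequencies, so the visualization above can be sketched in two steps: count the words, then hand the counts to the wordcloud package. The snippet below is a rough sketch rather than the program that produced the figures. The stop-word list, the sample messages, the function names, and the "winter" colormap (a blue-to-green matplotlib colormap) are assumptions chosen for illustration, and the rendering function needs the wordcloud package installed.

```python
import re
from collections import Counter

# A small, illustrative stop-word list; real projects usually use a longer one.
STOP_WORDS = {"a", "the", "to", "is", "your", "you", "for", "and", "or"}


def word_frequencies(messages):
    """Count how often each non-stop-word appears across the messages."""
    counts = Counter()
    for text in messages:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


def save_word_cloud(frequencies, path):
    """Render the frequencies as an image (requires the wordcloud package)."""
    from wordcloud import WordCloud  # imported lazily: optional dependency

    cloud = WordCloud(width=800, height=400, background_color="white",
                      colormap="winter")  # assumed blue-to-green palette
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)


# Made-up spam messages, for illustration only.
spam_messages = [
    "Call now for your free ringtone",
    "Free text chat, call the landline now",
    "Customer service: call to claim your free prize",
]
spam_freq = word_frequencies(spam_messages)
print(spam_freq.most_common(3))
# save_word_cloud(spam_freq, "spam_cloud.png")  # uncomment to write the image
```

To produce the ham cloud, the same two functions would be applied to the rows whose v1 label is ham.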
Results
I have shared the program on GitHub at GPSingularity.