Text Analysis - Word Cloud
Raja Saurabh Tiwari
Vice President @ Citi | Java , Cloud, ML Solutions | Gen AI enthusiast | Wildlife Photography
Text Analysis :
Text analysis is one of the richest areas in the Machine Learning space. Text analysis is the process of deriving meaningful insight from a text, sentence, or document, also known as a corpus.
More formally
Text mining, also referred to as text data mining, similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources."
Courtesy: Wikipedia
There are many use cases for text analysis. You may want to analyze people's tweets, the messages exchanged on WhatsApp, or the kinds of posts people share on Facebook and other social platforms.
Such analysis can give you a lot of opportunities to cluster people into groups, which can eventually help you target the right audience for your business or shape policies accordingly.
From what is trending on Twitter to what people are discussing, all of it can be analyzed using the various tools available in the market.
Today we'll talk about one such method, called WordCloud. Look at the image below. The image shows the words people used most frequently in a chat group. I took the text for this example from one of my WhatsApp groups.
Creating Word Cloud:
Let's understand how to create such a cloud using Python libraries.
Pre-requisite : Dump your data into a tab-delimited file. For this example, I exported WhatsApp data into a text file for demonstration purposes.
Natural Language Toolkit is one of the most widely used libraries in the Machine Learning domain for text analysis. Let's import it
import nltk
Let's import modules from nltk to help clean the text and remove the noise in it.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
Let's import pandas for various operations related to data import and manipulation.
import pandas as pd
Read the data into memory as a DataFrame
# Keep the result in a variable; pass sep="\t" if the file is tab-delimited
df = pd.read_csv("whatsppmessages.csv")
After reading the text into memory, you need to clean it. (The high-level steps are described below, but it can get really complex when you drill down into the details and do real text analysis.)
- The text in WhatsApp will be in the format : 22/11/2020, 23:16 - "$username": "the text"
- Split the text and get the message you want to analyze.
- Remove stop words such as 'is', 'the', 'a', etc.
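The cleaning steps above can be sketched roughly as follows. This is only a minimal illustration that uses the standard library, not the exact code behind the word cloud in this post. The line pattern assumes the export format described above without the quote marks, the names LINE_PATTERN, STOP_WORDS, extract_message and clean_messages are made up for this example, and the sample chat lines are invented. The short stop word set stands in for the fuller list that nltk's stopwords.words("english") provides.

```python
import re

# Assumed export format, based on the example line above:
#   22/11/2020, 23:16 - username: the text
LINE_PATTERN = re.compile(r"^\d{1,2}/\d{1,2}/\d{4}, \d{1,2}:\d{2} - ([^:]+): (.*)$")

# Small illustrative stop word set (an assumption for this sketch);
# nltk's stopwords.words("english") is a fuller alternative.
STOP_WORDS = {"is", "the", "a", "an", "and", "in", "of", "to", "at"}


def extract_message(line):
    """Return the message part of one exported chat line, or None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    return match.group(2) if match else None


def clean_messages(lines):
    """Extract messages, lowercase them, and drop stop words."""
    words = []
    for line in lines:
        message = extract_message(line)
        if message is None:
            continue  # skip system notices and lines that do not match the pattern
        for token in re.findall(r"[a-z']+", message.lower()):
            if token not in STOP_WORDS:
                words.append(token)
    return words


# Invented sample lines, only for demonstration
sample = [
    "22/11/2020, 23:16 - Alice: The power is back",
    "22/11/2020, 23:17 - Bob: Power cut again in the evening",
    "Messages and calls are end-to-end encrypted.",
]

# Join the cleaned words into one string, ready to pass to a word cloud generator
text = " ".join(clean_messages(sample))
print(text)
```

Joining the cleaned words back into a single string is a convenient choice here because the word cloud generator expects one block of text as input.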
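Create the word cloud using the code below
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# text is the cleaned message text as a single string
wordcloud = WordCloud(max_font_size=40).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()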
And there you go, that's it.
Now you can easily analyze it. The word "power" is used most frequently in the chat, followed by "done" and "today".
We had been discussing the electricity problem in Pune, so the word "power" has been the most frequent.
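If you want to confirm the ranking with numbers rather than by reading font sizes, a simple word count is one way to do it. The sketch below uses collections.Counter from the standard library. The helper name top_words and the sample string are made up for illustration and are not the actual chat data.

```python
from collections import Counter


def top_words(text, n=3):
    """Return the n most frequent whitespace-separated words with their counts."""
    return Counter(text.split()).most_common(n)


# Invented sample text, only for demonstration
cleaned_text = "power done today power done power today done power"
print(top_words(cleaned_text))
```

A count like this pairs well with the word cloud: the picture tells the story at a glance, and the numbers back it up if a stakeholder asks.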
In Machine Learning, one of the most important parts is presentation: how you can easily convey the summary or outcome of the analysis to the stakeholders. These kinds of diagrams and summary charts play a key role in telling the story and convincing the client.
Let's meet again with another interesting topic or application of Machine Learning.