Text Analysis - Word Cloud

Text Analysis:

Text analysis is one of the richest areas in the Machine Learning space. It is the process of deriving meaningful insight from text: a sentence, a document, or a whole collection of documents, also known as a corpus.

More formally

Text mining, also referred to as text data mining, similar to text analytics, is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources."

Courtesy: Wikipedia

There are many use cases for text analysis. You may want to analyze people's tweets, the messages exchanged on WhatsApp, or the kinds of posts people share on Facebook and other social platforms.

Such analysis can give you a lot of opportunity to cluster people into groups, which can eventually help you target the correct audience for your business or make policies accordingly.

From what is trending on Twitter to what people are discussing, all of this can be analyzed using various tools available in the market.

Today we'll talk about one such method, the word cloud. Look at the image below. It shows the words used most frequently in a chat group's discussion. I took this text example from one of my WhatsApp groups.

Creating Word Cloud:

Let's understand how to create such a cloud using Python libraries.

Pre-requisite: Dump your data into a tab-delimited file. For this example, I have exported WhatsApp data into a text file for demonstration purposes.

The Natural Language Toolkit (NLTK) is one of the most widely used libraries in the Machine Learning domain for text analysis. Let's import it:

import nltk

Let's import modules from nltk to help clean the text and remove the noise in it.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

Let's import pandas for various operations related to data import and manipulation.

import pandas as pd

Read the data into memory as a DataFrame:

# sep="\t" because the exported file is tab-delimited
df = pd.read_csv("whatsppmessages.csv", sep="\t")

After reading the text into memory, you need to clean it. (The high-level steps are described below, but this can get really complex once you drill down into the details of real text analysis.)

  • The text in WhatsApp will be in the format: 22/11/2020, 23:16 - "$username": "the text"
  • Split the text and get the message you want to analyze.
  • Remove stop words ('is', 'the', 'a', etc.).
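
The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code used for this article: it assumes each exported line looks like the format shown above without literal quotes, the sample lines and user names are made up, and the tiny STOP_WORDS set is only a stand-in (NLTK's stopwords.words("english") could be used instead once the stopwords corpus is downloaded). The helper names extract_message and clean_words are hypothetical.

```python
import re

# Pattern for one exported line, assuming the layout described above:
# "22/11/2020, 23:16 - username: the text"
LINE_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2} - ([^:]+): (.*)$")

# Tiny stand-in stop word list (an assumption, for illustration only).
STOP_WORDS = {"is", "the", "a", "an", "in", "of", "to", "and", "on"}


def extract_message(line):
    """Return the message part of an exported line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.group(2) if match else None


def clean_words(message):
    """Lowercase the message, keep alphabetic words, and drop stop words."""
    words = re.findall(r"[a-z']+", message.lower())
    return [w for w in words if w not in STOP_WORDS]


# Made-up sample lines; the last one is a system notice and is skipped.
sample_lines = [
    "22/11/2020, 23:16 - Asha: The power is out again",
    "22/11/2020, 23:18 - Ravi: Power is back in the east wing",
    "Messages and calls are end-to-end encrypted.",
]

messages = [m for m in map(extract_message, sample_lines) if m is not None]

# One cleaned string, ready to hand to a word cloud generator.
text = " ".join(w for m in messages for w in clean_words(m))
print(text)
```

The resulting text string is the kind of input a word cloud generator expects: one long string of cleaned words.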

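Create the word cloud using the code below. Here text is a single string holding the cleaned messages; WordCloud comes from the wordcloud package and plt from matplotlib.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud(max_font_size=40).generate(text)

plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()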


[Image: word cloud of the most frequently used words in the WhatsApp group chat]

And there you go. That's it.

Now you can easily analyze it. The word "power" is used most frequently in the chat, followed by "done" and "today".

We had been discussing the electricity problem in Pune, so the word "power" came up most often.
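
A word cloud sizes each word by how often it occurs, so the same ranking can also be checked numerically. The sketch below counts words with Python's collections.Counter; the text string here is made up for illustration and is not the real chat data.

```python
from collections import Counter

# Illustrative cleaned text (made up for this sketch, not the real chat data).
text = "power cut today power done power back done today power done"

# Count how many times each word occurs.
word_counts = Counter(text.split())

# The three most frequent words, highest count first.
top_words = word_counts.most_common(3)
for word, count in top_words:
    print(word, count)
```

Printing the counts next to the image is a quick sanity check that the largest words in the cloud really are the most frequent ones.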

In Machine Learning, one of the most important parts is presentation: how easily you can convey the summary or outcome of the analysis to the stakeholders. Diagrams and summary charts of this kind play a key role in telling the story and convincing the client.


Let's meet again with some other interesting topic/application of Machine Learning.

Raja Saurabh Tiwari
