Get started with Topic Modeling in Python and Amazon Comprehend - Part 1 Natural Language Processing
How do we understand our customers?
To deliver a great customer experience, we analyze structured data, such as the number of live chats, chat interactions and calls answered in a contact centre. We also analyze consumer behaviour, which includes unstructured data.
From a marketing perspective, consumer behaviour includes the actions of an individual or group that affect the purchase decision.
Analyzing consumer behaviour involves extracting value from unstructured data using text analytics and natural language processing.
Lesson Objectives
In this lesson you will learn about:
What is natural language processing?
According to IBM, natural language processing is a branch of artificial intelligence.
...combines computational linguistics—rule-based modeling of human language—with statistical and machine learning models to enable computers and digital devices to recognize, understand and generate text and speech.
Natural language processing may be used to perform tasks such as:
Solution Architecture
Below is the high-level solution architecture for uploading a pre-processed text file into Amazon Comprehend for text analytics and natural language processing tasks such as topic modeling analysis jobs.
What is topic modelling?
Amazon Comprehend uses a Latent Dirichlet Allocation-based learning model to determine the topics in a set of documents. It examines each document to determine the context and meaning of a word.
Topic modelling is a type of unsupervised learning: the documents have no labels or target variable, so the topics are inferred from the text itself.
- Documents must be UTF-8 formatted text files.
- Pre-select the number of topics, e.g. 10.
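The UTF-8 requirement above can be met when the pre-processed text is written out. The following is a minimal sketch, assuming the one-document-per-line input layout that Amazon Comprehend topic modeling accepts. The function name, file name and sample documents are hypothetical and only serve as an illustration.

```python
from pathlib import Path
import tempfile


def write_one_doc_per_line(documents, path):
    """Write documents to a UTF-8 text file, one document per line."""
    lines = []
    for doc in documents:
        # Collapse internal whitespace and newlines so each document stays on one line.
        flat = " ".join(str(doc).split())
        if flat:  # skip documents that are empty after cleaning
            lines.append(flat)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Hypothetical sample documents, for illustration only.
sample_docs = ["First comment\nspanning two lines.", "   ", "Café prices are rising."]
out_path = Path(tempfile.gettempdir()) / "comprehend_input.txt"
written = write_one_doc_per_line(sample_docs, out_path)
print(written, "documents written to", out_path)
```

Writing with an explicit encoding avoids relying on the platform default, which is not UTF-8 on every system.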
What are the industry use cases?
Topic modelling may be used in various industries to analyze:
How do you pre-process text data?
Dataset
For this tutorial, we will explore the open-source New York Times comments dataset, available from Kaggle.com. These articles and comments were published between January and May 2017 and between January and April 2018.
Tutorial 1: How to get started in topic modelling using a Jupyter notebook and Amazon CodeWhisperer
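# Drop the metadata columns that are not needed for topic modelling,
# then draw a random sample of 250 rows.
metadata_columns = ['abstract', 'articleID', 'articleWordCount', 'byline', 'documentType', 'headline', 'keywords', 'multimedia', 'newDesk', 'printPage', 'pubDate', 'sectionName', 'source', 'typeOfMaterial', 'webURL']
df = df.drop(columns=metadata_columns).sample(250)
df.head()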
pip install wordcloud
After the stop words were removed, the following tokens remained for the first document:
['liberals', 'discovering', 'conservatives', 'known', 'years', 'negative', 'energy', 'powerful', 'force']
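Stop-word removal can be sketched in a few lines of plain Python. This is only an illustration: the stop-word list and the example sentence below are hypothetical (the tutorial's actual list and source text are not shown), and the sentence was chosen so that the result matches the token list above.

```python
import re

# A tiny, hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "are", "for", "have", "is", "what"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphabetic tokens and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


# Hypothetical sentence, constructed for this example.
sentence = (
    "Liberals are discovering what conservatives have known for years: "
    "negative energy is a powerful force."
)
tokens = remove_stop_words(sentence)
print(tokens)
```

In practice a fuller stop-word list from an NLP library would replace the small set used here.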
The bag-of-words corpus for this document, as (token id, count) pairs:
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]
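Each pair in this output is a token id followed by the number of times that token occurs in the document. The sketch below reproduces the idea in plain Python. It is not the library call used in the notebook, and the helper names `build_vocabulary` and `doc_to_bow` are hypothetical.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def doc_to_bow(tokens, vocab):
    """Convert a token list into sorted (token id, count) pairs."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())


docs = [[
    "liberals", "discovering", "conservatives", "known", "years",
    "negative", "energy", "powerful", "force",
]]
vocab = build_vocabulary(docs)
corpus = [doc_to_bow(doc, vocab) for doc in docs]
print(corpus[0])
```

Because every token occurs once in this document, each count is 1, which matches the output shown above.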
These are the keywords from the 10 topics:
[(0, '0.008*"us" + 0.008*"care" + 0.006*"states" + 0.006*"president" + 0.006*"new" + 0.006*"york" + 0.004*"thought" + 0.004*"plan" + 0.004*"waive" + 0.004*"problems"'),
 (1, '0.011*"new" + 0.009*"back" + 0.007*"college" + 0.006*"would" + 0.005*"makes" + 0.005*"deal" + 0.005*"two" + 0.005*"one" + 0.005*"cities" + 0.005*"administration"'),
 (2, '0.005*"american" + 0.005*"good" + 0.005*"known" + 0.005*"comes" + 0.005*"next" + 0.005*"shows" + 0.004*"influence" + 0.003*"greeted" + 0.003*"seem" + 0.003*"uncooperative"'),
 (3, '0.010*"first" + 0.010*"could" + 0.007*"us" + 0.007*"trump" + 0.005*"companies" + 0.005*"force" + 0.005*"run" + 0.005*"newly" + 0.005*"provide" + 0.005*"food"'),
 (4, '0.009*"could" + 0.006*"would" + 0.006*"pay" + 0.006*"world" + 0.006*"century" + 0.006*"much" + 0.006*"mr" + 0.005*"new" + 0.004*"two" + 0.004*"said"'),
 (5, '0.012*"one" + 0.008*"trump" + 0.007*"history" + 0.006*"french" + 0.006*"president" + 0.005*"government" + 0.004*"party" + 0.004*"strike" + 0.004*"way" + 0.004*"look"'),
 (6, '0.015*"new" + 0.008*"york" + 0.008*"old" + 0.007*"president" + 0.005*"musical" + 0.005*"thursday" + 0.005*"set" + 0.005*"taking" + 0.005*"events" + 0.005*"may"'),
 (7, '0.009*"trump" + 0.009*"new" + 0.008*"administration" + 0.005*"took" + 0.005*"game" + 0.005*"supreme" + 0.005*"becomes" + 0.005*"stranger" + 0.005*"year" + 0.005*"american"'),
 (8, '0.011*"trump" + 0.010*"president" + 0.007*"need" + 0.005*"money" + 0.005*"election" + 0.005*"take" + 0.005*"say" + 0.005*"us" + 0.005*"place" + 0.005*"better"'),
 (9, '0.008*"say" + 0.007*"one" + 0.006*"show" + 0.006*"time" + 0.005*"power" + 0.005*"al" + 0.004*"officials" + 0.004*"get" + 0.004*"new" + 0.004*"first"')]
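Each topic is printed as a weighted sum of keywords, where the number before each word is that word's weight within the topic. To work with these values programmatically, the strings can be parsed into (word, weight) pairs. The sketch below assumes the weight*"word" layout shown above; the helper name `parse_topic` is hypothetical, and the sample string is the start of topic 8.

```python
import re

# Matches terms such as 0.011*"trump" and captures the weight and the word.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


topic_8 = '0.011*"trump" + 0.010*"president" + 0.007*"need" + 0.005*"money"'
terms = parse_topic(topic_8)
for word, weight in terms:
    print(f"{word:<10} {weight:.3f}")
```

The parsed pairs can then be sorted, filtered or plotted, for example as input to a word cloud.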
pip install mglearn
Reverse each row with [:, ::-1] so that the sorted values are in descending order.
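The slicing step can be illustrated with NumPy, assuming the rows being reversed come from an argsort over a topic-by-word weight matrix. The matrix and vocabulary below are hypothetical values made up for this example, not output from the tutorial's model.

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics (rows) x 4 vocabulary words (columns).
feature_names = np.array(["care", "election", "president", "trump"])
topic_word = np.array([
    [0.40, 0.10, 0.30, 0.20],
    [0.05, 0.25, 0.10, 0.60],
])

# argsort returns, per row, the column indices ordered from smallest to largest weight.
ascending = np.argsort(topic_word, axis=1)

# [:, ::-1] keeps every row but reverses the column order, giving largest weight first.
descending = ascending[:, ::-1]

# The two highest-weighted words for each topic.
top_words = feature_names[descending[:, :2]]
print(top_words)
```

Without the reversal, the first columns would hold the least important words of each topic instead of the most important ones.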
From the above 10 topics, we can see that the New York Times articles in April 2017 included a lot of topics about former President Donald Trump, Syria, the administration and the election.
Tutorial 2: How to get started in topic modelling using Amazon Comprehend
Select Topic Modeling as the analysis type.
Provide the S3 location of the input data and the S3 location for the output of the processing job.
Enter the number of topics, e.g. 10.
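The console steps above can also be expressed programmatically. The following is a minimal sketch using the boto3 `start_topics_detection_job` call, not the method used in this tutorial. The bucket names, role ARN, job name and region are hypothetical placeholders, and the one-document-per-line input format is an assumption about how the input file was prepared.

```python
def build_topics_job_request(input_s3_uri, output_s3_uri, role_arn,
                             number_of_topics=10, job_name="nyt-topic-modeling"):
    """Build the request parameters for an Amazon Comprehend topic modeling job."""
    # The API accepts between 1 and 100 topics.
    if not 1 <= number_of_topics <= 100:
        raise ValueError("number_of_topics must be between 1 and 100")
    return {
        "JobName": job_name,
        "NumberOfTopics": number_of_topics,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            "InputFormat": "ONE_DOC_PER_LINE",  # assumption: one document per line
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,
    }


def start_topics_job(request, region_name="us-east-1"):
    """Submit the job and return its id. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the request can be built without boto3 installed

    client = boto3.client("comprehend", region_name=region_name)
    response = client.start_topics_detection_job(**request)
    return response["JobId"]


# Hypothetical placeholder values; replace them with your own resources.
request = build_topics_job_request(
    input_s3_uri="s3://example-input-bucket/comprehend_input.txt",
    output_s3_uri="s3://example-output-bucket/topics/",
    role_arn="arn:aws:iam::123456789012:role/ExampleComprehendDataAccessRole",
)
print(request["NumberOfTopics"])
```

Calling `start_topics_job(request)` submits the job. It is left out of the example run because it needs valid AWS credentials and an IAM role that Amazon Comprehend can assume to read from and write to the S3 locations.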
Conclusion
Topic modeling is a type of unsupervised learning that can be used to explore key themes in documents and to cluster text data.
Combined with structured data, topic modelling can help improve customer experience by examining consumer behaviour in IT support tickets, social media trends and customer feedback, and by targeting subscribers with curated content for newspapers in print and digital media.
AWS Blogs on Amazon CodeWhisperer and Amazon Comprehend
To help you get started with installing and using the AI coding companion Amazon CodeWhisperer, you may read my previous blogs:
Reference
Next Edition
In the next edition, we will explore:
Until the next lesson, happy learning!
This Month
If you would like to review the code, you may refer to my GitHub account.
You may also read previous editions of my weekly newsletter or take my LinkedIn Learning course to learn how to get started with Amazon CodeWhisperer.