Get started with Topic Modeling in Python and Amazon Comprehend - Part 1 Natural Language Processing

Wendy Wong

Published: April 1, 2024

How do we understand our customers?

To deliver a great customer experience, we analyze structured data, such as the number of live chats, chat interactions and calls answered in a contact centre. We also analyze consumer behaviour, which is often captured as unstructured data.

From a marketing perspective, consumer behaviour covers the actions of an individual or group that affect a purchase decision, such as:

Consumer's attitude
Consumer's emotions
Consumer's preferences

Understanding consumer behaviour involves extracting value from unstructured data using text analytics and natural language processing.

Lesson Objectives

In this lesson you will learn about:

What is natural language processing?
Solution Architecture
What is topic modelling?
What are the industry use cases?
How do you pre-process text data?
How to get started in topic modelling using a Jupyter notebook and Amazon CodeWhisperer
How to get started in topic modelling using Amazon Comprehend

What is natural language processing?

According to IBM, natural language processing is a branch of artificial intelligence.

...combines computational linguistics—rule-based modeling of human language—with statistical and machine learning models to enable computers and digital devices to recognize, understand and generate text and speech.

Natural language processing may be used for tasks such as:

Analyze customer support tickets
Mine call centre analytics
Extract customer sentiment, key phrases from customer surveys
Analyze customer interactions
Find key topics from customer feedback
Classify and extract entities from documents

Solution Architecture

Below is the high-level solution architecture for uploading a pre-processed text file to Amazon Comprehend, which performs text analytics and natural language processing tasks such as topic modeling analysis jobs.

What is topic modelling?

Amazon Comprehend uses a Latent Dirichlet Allocation-based learning model to determine the topics in a set of documents. It examines each document to determine the context and meaning of a word.

Topic modelling is a type of unsupervised learning: the documents have no labels, so there is no target variable to predict.

Documents must be in UTF-8 formatted text files.

Pre-select the number of topics, e.g. 10.
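To make the idea concrete, here is a minimal teaching sketch of Latent Dirichlet Allocation using collapsed Gibbs sampling in plain Python. It is not how Amazon Comprehend implements topic modelling. The function name lda_gibbs, the toy documents and the alpha and beta values are all assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative only).

    docs is a list of token lists. Returns (vocab, topic_word, doc_topic),
    where topic_word[k][v] counts how often word v is assigned to topic k
    and doc_topic[d][k] counts how many tokens of document d are in topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample: favour topics common in this document and for this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, topic_word, doc_topic


# Hypothetical toy corpus with two rough themes, politics and sport.
docs = [
    ["president", "election", "vote", "president"],
    ["election", "vote", "party"],
    ["game", "team", "score", "game"],
    ["team", "score", "season"],
]
vocab, topic_word, doc_topic = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(zip(counts, vocab), reverse=True)[:3]
    print(k, [word for _, word in top])
```

The number of topics has to be chosen up front, just like the setting described above. The sampler only sees word co-occurrence, so naming each topic is left to the reader.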

What are the industry use cases?

Topic modelling may be used in various industries to analyze:

Marketing: social media
Media and Publishing: news articles
Contact Centre: customer feedback

How do you pre-process text data?

Create a Jupyter Notebook
Use Amazon CodeWhisperer as your AI coding companion
Perform Exploratory Analysis
Take the pre-processed data and create an Amazon Comprehend analysis job for topic modeling.

Dataset

For this tutorial, we will explore the open-source New York Times comments dataset available from Kaggle.com. The articles and comments were published between January and May 2017 and between January and April 2018.

Tutorial 1: How to get started in topic modelling using a Jupyter notebook and Amazon CodeWhisperer

Step 1: Open Anaconda Navigator and launch Jupyter Notebook

Step 2: Import python libraries

Step 3: Load the dataset. For this demo we examine one article from the New York Times

Step 4: Inspect the first three rows of data

Step 5: Drop the columns that are not needed and draw a random sample of 250 New York Times articles

# Drop the unused columns, then keep a random sample of 250 rows.
df = df.drop(columns=[
    'abstract', 'articleID', 'articleWordCount', 'byline', 'documentType',
    'headline', 'keywords', 'multimedia', 'newDesk', 'printPage', 'pubDate',
    'sectionName', 'source', 'typeOfMaterial', 'webURL',
]).sample(250)

Step 6: Inspect the first three rows of the dataframe again

df.head()

Step 7: Convert the words to lower case and remove punctuation. This step uses Python's re (regular expression) module.
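A minimal sketch of this cleaning step is shown below. The function name clean_text and the sample sentence are assumptions for illustration, not code from the original notebook. With pandas, a function like this could be applied to a text column using Series.map.

```python
import re


def clean_text(text):
    """Lower-case the text and strip punctuation, keeping word characters and spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation characters
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


print(clean_text("Liberals, Discovering a Powerful Force!"))
```

Note that removing punctuation also removes apostrophes, so a word such as "don't" becomes "dont".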

Step 8: Install the word cloud module

pip install wordcloud

Step 9: Create a word cloud to view the effect of the changes made in Step 7.
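A word cloud draws frequent words larger, so the numbers behind it are simple word counts. The sketch below shows those counts using only the standard library. The function name word_frequencies and the sample strings are assumptions for illustration.

```python
from collections import Counter


def word_frequencies(documents, top_n=5):
    """Count how often each whitespace-separated word occurs across documents."""
    counts = Counter(word for doc in documents for word in doc.split())
    return counts.most_common(top_n)


sample = ["trump new york", "new president trump", "new plan"]
print(word_frequencies(sample, top_n=2))
```

Checking the most frequent words this way is a quick test of whether the cleaning step left any unwanted tokens behind.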

Step 10: Pre-process the data for Latent Dirichlet Allocation

After removing stop words, the following words remain:

['liberals', 'discovering', 'conservatives', 'known', 'years', 'negative', 'energy', 'powerful', 'force']
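Stop-word removal itself can be sketched without any library. The stop-word list below is a deliberately tiny, hypothetical one, and the function name remove_stop_words is an assumption. The list actually used in the notebook is not shown in this article.

```python
# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "as", "for", "have", "in", "is", "of", "the", "to"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split cleaned text on whitespace and drop any token found in stop_words."""
    return [token for token in text.split() if token not in stop_words]


print(remove_stop_words("the liberals are discovering a powerful force"))
```

Only the content-bearing words survive, which is what Latent Dirichlet Allocation needs as input.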

Step 11: Export the pre-processed data as a CSV file so that it can be used for a topic modeling analysis job in Amazon Comprehend
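One possible way to write the export is sketched below. It assumes one document per line, which is one of the input layouts Amazon Comprehend accepts for topic modeling, and it writes UTF-8 as required. The function name export_documents and the sample tokens are assumptions. The file name output.csv is the one used in Tutorial 2.

```python
import csv
import os
import tempfile


def export_documents(documents, path):
    """Write one pre-processed document per row to a UTF-8 encoded CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for tokens in documents:
            writer.writerow([" ".join(tokens)])


documents = [["liberals", "discovering", "powerful", "force"], ["new", "york", "plan"]]
path = os.path.join(tempfile.mkdtemp(), "output.csv")  # temporary folder for the demo
export_documents(documents, path)
with open(path, encoding="utf-8") as handle:
    print(handle.read())
```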

Step 12: Import the gensim library to create a dictionary and a corpus

The output of the corpus, where each pair is (word id, count):

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]
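The tutorial uses gensim for this step. The sketch below is a library-free illustration of the same idea: the dictionary maps each word to an integer id, and the corpus stores (word id, count) pairs per document. The function names build_dictionary and doc_to_bow are hypothetical, and the order in which ids are assigned may differ from gensim's.

```python
from collections import Counter


def build_dictionary(documents):
    """Map every distinct token to an integer id, in order of first appearance."""
    token_to_id = {}
    for tokens in documents:
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def doc_to_bow(tokens, token_to_id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token_to_id[token] for token in tokens)
    return sorted(counts.items())


documents = [["liberals", "discovering", "conservatives", "known", "years",
              "negative", "energy", "powerful", "force"]]
token_to_id = build_dictionary(documents)
print(doc_to_bow(documents[0], token_to_id))
```

Nine distinct words that each occur once give nine pairs with a count of 1, the same shape as the corpus output above.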

Step 13: Train the model in Latent Dirichlet Allocation by specifying 10 topics

These are the keywords from the 10 topics:

[(0, '0.008*"us" + 0.008*"care" + 0.006*"states" + 0.006*"president" + 0.006*"new" + 0.006*"york" + 0.004*"thought" + 0.004*"plan" + 0.004*"waive" + 0.004*"problems"'),
 (1, '0.011*"new" + 0.009*"back" + 0.007*"college" + 0.006*"would" + 0.005*"makes" + 0.005*"deal" + 0.005*"two" + 0.005*"one" + 0.005*"cities" + 0.005*"administration"'),
 (2, '0.005*"american" + 0.005*"good" + 0.005*"known" + 0.005*"comes" + 0.005*"next" + 0.005*"shows" + 0.004*"influence" + 0.003*"greeted" + 0.003*"seem" + 0.003*"uncooperative"'),
 (3, '0.010*"first" + 0.010*"could" + 0.007*"us" + 0.007*"trump" + 0.005*"companies" + 0.005*"force" + 0.005*"run" + 0.005*"newly" + 0.005*"provide" + 0.005*"food"'),
 (4, '0.009*"could" + 0.006*"would" + 0.006*"pay" + 0.006*"world" + 0.006*"century" + 0.006*"much" + 0.006*"mr" + 0.005*"new" + 0.004*"two" + 0.004*"said"'),
 (5, '0.012*"one" + 0.008*"trump" + 0.007*"history" + 0.006*"french" + 0.006*"president" + 0.005*"government" + 0.004*"party" + 0.004*"strike" + 0.004*"way" + 0.004*"look"'),
 (6, '0.015*"new" + 0.008*"york" + 0.008*"old" + 0.007*"president" + 0.005*"musical" + 0.005*"thursday" + 0.005*"set" + 0.005*"taking" + 0.005*"events" + 0.005*"may"'),
 (7, '0.009*"trump" + 0.009*"new" + 0.008*"administration" + 0.005*"took" + 0.005*"game" + 0.005*"supreme" + 0.005*"becomes" + 0.005*"stranger" + 0.005*"year" + 0.005*"american"'),
 (8, '0.011*"trump" + 0.010*"president" + 0.007*"need" + 0.005*"money" + 0.005*"election" + 0.005*"take" + 0.005*"say" + 0.005*"us" + 0.005*"place" + 0.005*"better"'),
 (9, '0.008*"say" + 0.007*"one" + 0.006*"show" + 0.006*"time" + 0.005*"power" + 0.005*"al" + 0.004*"officials" + 0.004*"get" + 0.004*"new" + 0.004*"first"')]
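Each topic above is a weighted list of words in the form weight*"word". If you want to work with those weights as numbers, a small parser such as the sketch below may help. The names TERM_PATTERN and parse_topic are assumptions for illustration.

```python
import re

# Matches one weight*"word" term in a topic string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


topic_0 = '0.008*"us" + 0.008*"care" + 0.006*"states" + 0.006*"president"'
print(parse_topic(topic_0))
```

The parsed pairs can then be sorted, filtered or plotted like any other list.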

Steps 14-16: Analyze the LDA results. Use a count vectorizer to create a bag-of-words matrix.
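A bag-of-words matrix has one row per document and one column per vocabulary word, with each cell holding a count. The sketch below builds a small dense version by hand to show the structure. The function name bag_of_words_matrix and the two sample documents are assumptions, and a count vectorizer from a library such as scikit-learn would normally return a sparse matrix instead.

```python
def bag_of_words_matrix(documents):
    """Build a document-term count matrix: one row per document, one column per word."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in documents]
    for i, doc in enumerate(documents):
        for word in doc.split():
            matrix[i][column[word]] += 1
    return vocabulary, matrix


vocabulary, matrix = bag_of_words_matrix(["new york new plan", "trump plan"])
print(vocabulary)
for row in matrix:
    print(row)
```

Word order inside each document is discarded; only the counts are kept.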

Step 17: View the LDA instance

Step 18: Inspect the dimensions of the LDA components

Step 19: Install the mglearn module

pip install mglearn

Step 20: For each topic (a row in components_), sort the features in ascending order of weight.

Then reverse each row with [:, ::-1] to get descending order.
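The sort-then-reverse idea can be sketched in plain Python as follows. The weights and feature names are hypothetical stand-ins for components_ and the vectorizer vocabulary, and argsort_descending is an assumed helper name. With NumPy, the same result for a 2-D array is usually written as np.argsort(array, axis=1)[:, ::-1].

```python
def argsort_descending(weights):
    """Return feature indices ordered from the largest weight to the smallest."""
    ascending = sorted(range(len(weights)), key=weights.__getitem__)
    return ascending[::-1]  # same reversal idea as [:, ::-1] on a 2-D array


# Hypothetical topic-word weights: two topics (rows) over four words (columns).
feature_names = ["election", "game", "president", "team"]
components = [
    [3.0, 0.2, 5.0, 0.1],
    [0.3, 4.0, 0.1, 6.0],
]
for topic_id, row in enumerate(components):
    top_words = [feature_names[j] for j in argsort_descending(row)[:2]]
    print(topic_id, top_words)
```

Taking the first few indices of each reversed row gives the most important words for that topic.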

Step 21: Print the 10 topics

From the 10 topics above, we can see that the New York Times articles from April 2017 frequently covered former President Donald Trump, Syria, the administration and the election.

Step 22: View the 5 most important words from the 10 topics:

Tutorial 2: How to get started in topic modelling using Amazon Comprehend

Step 1: Log in to your AWS account as an IAM admin user
Step 2: Upload the output.csv file to an S3 bucket

Step 3: In the search bar navigate to Amazon Comprehend

Step 4: Click Launch Amazon Comprehend

Step 5: Click Analysis jobs

Step 6: Click Create job

Step 7: Provide a name for the job.

Select Topic Modeling as the analysis type.

Provide the S3 location of the input data and the S3 location where the job output should be saved.

Enter the number of topics e.g. 10

Step 8: Select "Create an IAM role" and click Create job.
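The console settings from Steps 7 and 8 correspond to the parameters of Comprehend's StartTopicsDetectionJob API. The sketch below only assembles those parameters and does not call AWS. The helper name build_topics_job_request, the bucket, the job name and the role ARN are placeholders, and the one-document-per-line input format is an assumption about how output.csv is laid out.

```python
def build_topics_job_request(job_name, input_s3_uri, output_s3_uri, role_arn,
                             number_of_topics=10):
    """Assemble keyword arguments for Comprehend's StartTopicsDetectionJob API."""
    return {
        "JobName": job_name,
        "NumberOfTopics": number_of_topics,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            "InputFormat": "ONE_DOC_PER_LINE",  # assumption: one document per line
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,
    }


# All values below are placeholders, not values from the tutorial.
request = build_topics_job_request(
    job_name="nyt-topic-modeling",
    input_s3_uri="s3://example-bucket/input/output.csv",
    output_s3_uri="s3://example-bucket/results/",
    role_arn="arn:aws:iam::123456789012:role/ExampleComprehendRole",
)
print(request)

# With boto3 installed and credentials configured, the job could be submitted with:
#   import boto3
#   boto3.client("comprehend").start_topics_detection_job(**request)
```

Scripting the job this way is optional. The console steps in this tutorial achieve the same result.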

Step 9: Job is submitted and being processed.

Step 10: The processing job is completed and the output may be examined.

Step 11: Download the output that is saved in your S3 bucket
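Once the output archive is downloaded and extracted, the topic terms can be summarised with a few lines of Python. The sketch below assumes a CSV with the columns topic, term and weight, which is my understanding of the topic-terms.csv file that Comprehend produces; check the header of your own file before relying on it. The sample rows are made up, and top_terms_by_topic is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict


def top_terms_by_topic(csv_text, top_n=3):
    """Group rows of a topic,term,weight CSV by topic, highest weight first."""
    topics = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        topics[int(row["topic"])].append((row["term"], float(row["weight"])))
    return {
        topic: sorted(terms, key=lambda pair: pair[1], reverse=True)[:top_n]
        for topic, terms in topics.items()
    }


# Made-up rows in the assumed layout, not real job output.
sample = "topic,term,weight\n0,trump,0.05\n0,president,0.07\n1,york,0.04\n1,new,0.06\n"
print(top_terms_by_topic(sample, top_n=1))
```

For a real file, read its contents with open(path, encoding="utf-8").read() and pass the text to the function.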

Conclusion

Topic modeling is a type of unsupervised learning that can be used to explore key themes in documents and to cluster text data.

Combined with structured data, topic modelling of unstructured data can help improve customer experience by examining consumer behaviour in IT support tickets, social media trends and customer feedback, and by targeting subscribers with curated content for print and digital newspapers.

AWS Blogs on Amazon CodeWhisperer and Amazon Comprehend

To help you get started with installing and using the AI coding companion Amazon CodeWhisperer, you may read my previous blogs:

Next Edition

In the next edition, we are exploring:

Get started with Sentiment Analysis in Python and Amazon Comprehend - Part 2 Natural Language Processing

Until the next lesson, happy learning!

This Month

If you would like to review the code, you may refer to my GitHub account.

You can also read previous editions of my weekly newsletter or take my LinkedIn Learning course to learn how to get started with Amazon CodeWhisperer.
