Get started with Topic Modeling in Python and Amazon Comprehend - Part 1 Natural Language Processing

How do we understand our customers?

To deliver a great customer experience, we analyze structured data, such as the number of live chats, chat interactions and calls answered in a contact centre. We also analyze consumer behaviour, which includes unstructured data.

From a marketing perspective, consumer behaviour includes the actions of an individual or group that affect the purchase decision, such as:

  • Consumer's attitude
  • Consumer's emotions
  • Consumer's preferences

Understanding consumer behaviour involves extracting value from unstructured data using text analytics and natural language processing.

Lesson Objectives

In this lesson you will learn about:

  • What is natural language processing?
  • Solution Architecture
  • What is topic modelling?
  • What are the industry use cases?
  • How do you pre-process text data?
  • How to get started in topic modelling using a Jupyter notebook and Amazon CodeWhisperer
  • How to get started in topic modelling using Amazon Comprehend

What is natural language processing?

According to IBM, natural language processing is a branch of artificial intelligence.

...combines computational linguistics—rule-based modeling of human language—with statistical and machine learning models to enable computers and digital devices to recognize, understand and generate text and speech.

Natural language processing may be used to perform tasks such as:

  • Analyze customer support tickets
  • Mine call centre analytics
  • Extract customer sentiment, key phrases from customer surveys
  • Analyze customer interactions
  • Find key topics from customer feedback
  • Classify and extract entities from documents

Solution Architecture

Below is the high-level solution architecture for uploading a pre-processed text file into Amazon Comprehend, which performs text analytics and natural language processing tasks such as topic modeling analysis jobs.

Image: Wendy Wong


What is topic modelling?

Amazon Comprehend uses a Latent Dirichlet Allocation-based learning model to determine the topics in a set of documents. It examines each document to determine the context and meaning of a word.

It is a type of unsupervised learning, where the data has no labels.

  • Documents must be in UTF-8 formatted text files.

  • Pre-select the number of topics, e.g. 10

What are the industry use cases?

Topic modelling may be used in various industries to analyze:

  • Marketing: social media
  • Media and Publishing: news articles
  • Contact Centre: customer feedback

How do you pre-process text data?

  1. Create a Jupyter Notebook
  2. Use Amazon CodeWhisperer as your AI coding companion
  3. Perform Exploratory Analysis
  4. Take the pre-processed data and create an Amazon Comprehend analysis job for topic modeling.

Dataset

For this tutorial, we will explore the open-source New York Times comments dataset available from Kaggle.com. The articles and comments were published between January and May 2017 and between January and April 2018.

Tutorial 1: How to get started in topic modelling using a Jupyter notebook and Amazon CodeWhisperer

  • Step 1: Open Anaconda Navigator and launch Jupyter Notebook


  • Step 2: Import Python libraries


  • Step 3: Load the dataset. For this demo we examine one article from the New York Times

  • Step 4: Inspect the first three rows of data
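
Steps 2 to 4 can be sketched roughly as follows. This is a minimal sketch, not the notebook's exact code: the tiny in-memory CSV and its three columns are a made-up stand-in so the example runs anywhere, and the file name in the comment is an assumption about how the Kaggle download is named.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV so this sketch runs anywhere.
# With the real download you would pass the file path instead, for example
# pd.read_csv("ArticlesApril2017.csv") (file name is an assumption).
sample_csv = io.StringIO(
    "articleID,headline,snippet\n"
    "a1,First headline,First snippet of text.\n"
    "a2,Second headline,Second snippet of text.\n"
    "a3,Third headline,Third snippet of text.\n"
    "a4,Fourth headline,Fourth snippet of text.\n"
)

# Step 3: load the dataset into a dataframe
df = pd.read_csv(sample_csv)

# Step 4: inspect the size and the first three rows
print(df.shape)
print(df.head(3))
```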


  • Step 5: Remove unused columns and take a random sample of 250 New York Times articles

# Drop the metadata columns, then draw a random sample of 250 rows
df = df.drop(columns=['abstract', 'articleID', 'articleWordCount', 'byline', 'documentType', 'headline', 'keywords', 'multimedia', 'newDesk', 'printPage', 'pubDate', 'sectionName', 'source', 'typeOfMaterial', 'webURL']).sample(250)


  • Step 6: Inspect the first three rows of the dataframe again

df.head(3)

  • Step 7: Make the words lower case and remove punctuation. Import the regex library.
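
One way Step 7 might look, using only the standard library. The helper name clean_text is hypothetical, and the exact regular expression used in the notebook is not shown in this article, so treat the pattern below as an assumption.

```python
import re


def clean_text(text: str) -> str:
    """Lower-case the text and strip punctuation (hypothetical helper)."""
    text = text.lower()
    # Remove every character that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse repeated whitespace left behind by the removal
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Hello, World!"))  # hello world
```

With a dataframe, the same helper could be applied to a text column with df[column].map(clean_text).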

  • Step 8: Install the word cloud module

pip install wordcloud

  • Step 9: Create a word cloud to view the changes made in Step 7.

  • Step 10: Pre-process the data for Latent Dirichlet Allocation

After removing stop words, the following tokens remain in the first document:

['liberals', 'discovering', 'conservatives', 'known', 'years', 'negative', 'energy', 'powerful', 'force']
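
To illustrate the idea behind Step 10, here is a minimal sketch of stop-word removal in plain Python. The short stop-word set and the sample sentence are assumptions made for this example; the notebook may rely on a library list instead.

```python
# Small illustrative stop-word set (an assumption, not the notebook's list)
STOP_WORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in",
    "is", "are", "for", "that", "it",
}


def remove_stop_words(text: str) -> list:
    """Split cleaned text on whitespace and drop the stop words."""
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = remove_stop_words("the president is in new york for a meeting")
print(tokens)  # ['president', 'new', 'york', 'meeting']
```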

  • Step 11: Export the pre-processed data as a CSV file to run a topic modeling analysis job in Amazon Comprehend
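
A possible sketch of the export in Step 11. The helper name export_documents is hypothetical. It writes one cleaned document per line in UTF-8, which for punctuation-free text is the same as a single-column CSV without a header.

```python
from pathlib import Path


def export_documents(documents, path="output.csv"):
    """Write one document per line as UTF-8 and return the number written."""
    # Collapse internal whitespace so each document stays on a single line,
    # and skip documents that are empty after cleaning
    lines = [" ".join(doc.split()) for doc in documents if doc.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Example (cleaned_docs is a hypothetical list of cleaned strings):
# export_documents(cleaned_docs, "output.csv")
```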

  • Step 12: Import the gensim library to create a dictionary and a corpus

The output of the corpus:

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]

  • Step 13: Train the Latent Dirichlet Allocation model, specifying 10 topics

These are the keywords from the 10 topics:

[(0, '0.008*"us" + 0.008*"care" + 0.006*"states" + 0.006*"president" + ' '0.006*"new" + 0.006*"york" + 0.004*"thought" + 0.004*"plan" + 0.004*"waive" ' '+ 0.004*"problems"'), (1, '0.011*"new" + 0.009*"back" + 0.007*"college" + 0.006*"would" + ' '0.005*"makes" + 0.005*"deal" + 0.005*"two" + 0.005*"one" + 0.005*"cities" + ' '0.005*"administration"'), (2, '0.005*"american" + 0.005*"good" + 0.005*"known" + 0.005*"comes" + ' '0.005*"next" + 0.005*"shows" + 0.004*"influence" + 0.003*"greeted" + ' '0.003*"seem" + 0.003*"uncooperative"'), (3, '0.010*"first" + 0.010*"could" + 0.007*"us" + 0.007*"trump" + ' '0.005*"companies" + 0.005*"force" + 0.005*"run" + 0.005*"newly" + ' '0.005*"provide" + 0.005*"food"'), (4, '0.009*"could" + 0.006*"would" + 0.006*"pay" + 0.006*"world" + ' '0.006*"century" + 0.006*"much" + 0.006*"mr" + 0.005*"new" + 0.004*"two" + ' '0.004*"said"'), (5, '0.012*"one" + 0.008*"trump" + 0.007*"history" + 0.006*"french" + ' '0.006*"president" + 0.005*"government" + 0.004*"party" + 0.004*"strike" + ' '0.004*"way" + 0.004*"look"'), (6, '0.015*"new" + 0.008*"york" + 0.008*"old" + 0.007*"president" + ' '0.005*"musical" + 0.005*"thursday" + 0.005*"set" + 0.005*"taking" + ' '0.005*"events" + 0.005*"may"'), (7, '0.009*"trump" + 0.009*"new" + 0.008*"administration" + 0.005*"took" + ' '0.005*"game" + 0.005*"supreme" + 0.005*"becomes" + 0.005*"stranger" + ' '0.005*"year" + 0.005*"american"'), (8, '0.011*"trump" + 0.010*"president" + 0.007*"need" + 0.005*"money" + ' '0.005*"election" + 0.005*"take" + 0.005*"say" + 0.005*"us" + 0.005*"place" ' '+ 0.005*"better"'), (9, '0.008*"say" + 0.007*"one" + 0.006*"show" + 0.006*"time" + 0.005*"power" + ' '0.005*"al" + 0.004*"officials" + 0.004*"get" + 0.004*"new" + 0.004*"first"')]

  • Steps 14-16: Analyze LDA results. Use a count vectorizer to create a bag-of-words matrix.

  • Step 17: View the LDA instance


  • Step 18: Inspect the dimensions of the LDA components
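
Steps 14 to 18 could look something like this with scikit-learn. The sample documents and the parameter values are assumptions for illustration, not the notebook's settings.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for the article text
documents = [
    "president trump administration election",
    "new york musical events thursday",
    "syria officials government strike",
    "college cities deal administration",
    "companies food pay money",
    "supreme court game year american",
]

# Bag-of-words matrix: one row per document, one column per word
vect = CountVectorizer(stop_words="english")
X = vect.fit_transform(documents)

lda = LatentDirichletAllocation(
    n_components=10,          # number of topics
    learning_method="batch",
    max_iter=25,
    random_state=0,
)
document_topics = lda.fit_transform(X)

# Step 17: view the LDA instance
print(lda)

# Step 18: components_ has one row per topic and one column per word
print(lda.components_.shape)
```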



  • Step 19: Install the mglearn module

pip install mglearn

  • Step 20: For each topic (a row in components_), sort the features in ascending order.

Invert the rows with [:, ::-1] to make the order descending.
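
The sorting trick can be seen on a tiny example. The 2 x 5 matrix and the word list below are made up, and the helper top_words is a plain NumPy substitute for illustration rather than the mglearn call used in the notebook.

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (2 topics x 5 words)
feature_names = np.array(["election", "food", "president", "syria", "trump"])
components = np.array([
    [0.1, 0.2, 3.0, 0.5, 4.0],
    [2.0, 0.3, 0.1, 5.0, 0.4],
])

# argsort orders each row ascending; [:, ::-1] reverses every row,
# so column 0 holds the index of the heaviest word in each topic
sorting = np.argsort(components, axis=1)[:, ::-1]


def top_words(topic: int, n_words: int = 3) -> list:
    """Return the n_words most important words for one topic."""
    return feature_names[sorting[topic, :n_words]].tolist()


print(top_words(0))  # ['trump', 'president', 'syria']
print(top_words(1))  # ['syria', 'election', 'trump']
```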

  • Step 21: Print the 10 topics

From the 10 topics above, we can see that the New York Times articles in April 2017 frequently covered former President Donald Trump, Syria, the administration and the election.

  • Step 22: View the 5 most important words from the 10 topics:

Tutorial 2: How to get started in topic modelling using Amazon Comprehend

  • Step 1: Log in to your AWS account as an IAM admin user
  • Step 2: Upload the output.csv file to an S3 bucket


  • Step 3: In the search bar navigate to Amazon Comprehend

  • Step 4: Click Launch Amazon Comprehend


  • Step 5: Click Analysis jobs


  • Step 6: Click Create job


  • Step 7: Provide a name for the job.

Select Topic Modeling as the analysis type.

Provide the S3 location of the input data and the S3 location to output the processing job.

Enter the number of topics e.g. 10
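
The same job can also be submitted from Python with boto3 instead of the console. This is a sketch under assumptions: the helper names are hypothetical, the region default is a placeholder, and you would supply your own bucket URIs and the ARN of an IAM role that Amazon Comprehend can assume.

```python
def build_topics_job_request(input_s3_uri, output_s3_uri, role_arn,
                             job_name, number_of_topics=10):
    """Build the arguments for a topic modeling job (hypothetical helper)."""
    return {
        "JobName": job_name,
        "NumberOfTopics": number_of_topics,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # Each line of the input file is treated as one document
            "InputFormat": "ONE_DOC_PER_LINE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,
    }


def start_topics_job(request, region_name="us-east-1"):
    """Submit the job and return its id. Needs boto3 and AWS credentials."""
    # Imported here so the request builder above works without boto3 installed
    import boto3

    client = boto3.client("comprehend", region_name=region_name)
    response = client.start_topics_detection_job(**request)
    return response["JobId"]
```

The job status can then be polled with the client's describe_topics_detection_job call, passing the returned job id.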

  • Step 8: Select Create an IAM role and click Create job.


  • Step 9: The job is submitted and is being processed.



  • Step 10: The processing job is completed and the output may be examined.



  • Step 11: Download the output that is saved in your S3 bucket
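
Once the archive is downloaded, the topic keywords can be read with the standard library. This sketch assumes the job output is a file named output.tar.gz that contains topic-terms.csv with topic, term and weight columns; the helper name is hypothetical.

```python
import csv
import io
import os
import tarfile


def read_topic_terms(archive_path):
    """Return the rows of topic-terms.csv from the job output archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Match on the base name in case the entry has a leading folder
            if os.path.basename(member.name) == "topic-terms.csv":
                raw = archive.extractfile(member).read().decode("utf-8")
                return list(csv.DictReader(io.StringIO(raw)))
    raise FileNotFoundError("topic-terms.csv not found in " + archive_path)


# Example: rows = read_topic_terms("output.tar.gz")
```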



Conclusion

Topic modeling is a type of unsupervised learning that can be used to explore key themes in documents and to cluster text data.

Combined with structured data, topic modelling of unstructured data can help improve customer experience by examining consumer behaviour in IT support tickets, social media trends and customer feedback. It can also help newspapers in print and digital media target subscribers with curated content.

AWS Blogs on Amazon CodeWhisperer and Amazon Comprehend

To help you get started with installing and using the AI coding companion Amazon CodeWhisperer, you may read my previous blogs.

Next Edition

In the next edition, we are exploring:

  • Get started with Sentiment Analysis in Python and Amazon Comprehend - Part 2 Natural Language Processing

Until the next lesson, happy learning!

This Month

If you would like to review the code, you may refer to my GitHub account.

You can also read previous editions of my weekly newsletter or take my LinkedIn Learning course to learn how to get started with Amazon CodeWhisperer.

