ContextBase - Topic Modeling

ContextBase - https://contextbase.github.io

All programming by John Akwei, ECMp ERMp Data Scientist

May 21, 2021


Table of Contents

The Problem

Our Solution

Section 1 - Data Import

Section 2 - Document Term Matrix

Section 3 - Topic Modeling

Section 4 - Tables and Charts

Section 5 - Conclusions

Section 6 - Appendix

Section 6a - Required Packages

Section 6b - Session Information

Section 7 - References


The Problem

The volume and complexity of text-based business data have increased exponentially, and now greatly exceed the processing capabilities of humans. The vast majority of online Business data is unorganized, and exists in textual form such as emails, support tickets, chats, social media, surveys, articles, and documents. Manually sorting through online Business data in order to gain hidden insights would be difficult, expensive, and impossibly time-consuming.

The Internet has also introduced complexity to the demands Customers place on Businesses, and the effectiveness of Marketing has been affected as a result. Growing databases of Customer Responses make it difficult to interpret Customers' basic requirements, and the indicators of Customer intent are becoming more complex.

New Machine Learning methods are required for improved extraction of business knowledge. New linguistic techniques are needed to mine Customer Response text data.




Our Solution

One result of the above demand is the attention Topic Modeling has been gaining in recent years. ContextBase provides Topic Modeling of Client text data to precisely refine Business Policies and Marketing material. Topic Modeling is a text mining method derived from Natural Language Processing and Machine Learning. Topic Models are a solution for classifying document terms into themes (or “topics”), and are applicable to the analysis of themes within novels, documents, reviews, forums, discussions, blogs, and micro-blogs.

ContextBase begins the process of Topic Modeling with Data Scientist/Programmer awareness of the sensitivity of Topic Modeling algorithms. After the Topic Modeling of Client text data, ContextBase manually characterizes the resulting topics in order to reduce their arbitrariness. ContextBase also maintains awareness of how Topic Models change with varying document contents.

The goal of ContextBase’s Topic Modeling of Client text data is to accomplish the programmatic deduction of stable Topic Models. As a result, ContextBase Topic Modeling allows for improvement in the Client’s business processes. This document is an unsupervised Machine Learning Topic Modeling analysis of customer response text data posted to https://www.yelp.com/. The programming language used is R. The analysis includes information on required R packages, session information, data importation, normalization of the text, creation of a document term matrix, Topic Modeling code, and output tables and graphs demonstrating the results of Topic Modeling.

This paper presents the topic modeling technique known as Latent Dirichlet Allocation (LDA): a form of text mining that aims to discover the hidden (latent) thematic structure in large archives of documents, and that can be applied as an intertemporal bimodal network to analyze the evolution of the semantic content of a business.


Section 1 - Data Import

The data imported for this project is a collection of 10,000 customer feedback comments posted to https://www.yelp.com/. To reduce the extensive amount of time required to process the full set of comments, only the first 1,000 comments were selected for processing. The column of comment data within the dataframe was formatted as character variables for the subsequent Natural Language Processing algorithms.

# Load the required packages (see Section 6a - Required Packages)
library(tm)           # corpus handling and text normalization
library(topicmodels)  # LDA topic models
library(stm)          # structural topic models and diagnostics
library(tidytext)     # tidy() methods for topic models
library(tidyverse)    # dplyr, tidyr, ggplot2
library(knitr)        # kable() tables

# Import Data
import_data <- read.csv("yelp.csv")

# Process imported data for Topic Modeling algorithms
project_data <- data.frame(import_data$text)
rm(import_data)
names(project_data) <- "Data"
project_data$Data <- as.character(project_data$Data)

# Examine the first comment
kable(head(project_data, 1), caption = "Table 1. Yelp Data.")

Table 1. Yelp Data.

"My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better. Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I’ve ever had. I’m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing. While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best “toast” I’ve ever had. Anyway, I can’t wait to go back!"


Section 2 - Document Term Matrix

The following function normalizes the text within the 1,000 selected customer responses by removing numbers, punctuation, and white space. Upper case letters are converted to lower case, and irrelevant stop words (“the”, “a”, “an”, etc.) are removed. Lastly, a “Document Term Matrix” is created to tabulate the frequency of each term across documents.

# Text mining function: normalize the corpus and return a Document Term Matrix
dtmCorpus <- function(df) {
  df_corpus <- Corpus(VectorSource(df))
  # Convert encoding and case via content_transformer() so the corpus structure is preserved
  df_corpus <- tm_map(df_corpus, content_transformer(function(x) iconv(x, to = 'ASCII')))
  df_corpus <- tm_map(df_corpus, removeNumbers)
  df_corpus <- tm_map(df_corpus, removePunctuation)
  df_corpus <- tm_map(df_corpus, stripWhitespace)
  df_corpus <- tm_map(df_corpus, content_transformer(tolower))
  df_corpus <- tm_map(df_corpus, removeWords, stopwords('english'))
  DocumentTermMatrix(df_corpus)
}


Section 3 - Topic Modeling

Topic Modeling treats each document as a mixture of topics, and each topic as a mixture of words (or “terms”). Each document may contain words from several topics in particular proportions. The content of documents usually merges continuously with the content of other documents, rather than existing in discrete groups, in the same way that individuals’ natural use of language blends continuously rather than splitting into separate vocabularies.

An example of a two-topic Topic Model of a journalism document is the decomposition of the document into “local” and “national” topics. The first topic, “local”, would contain terms like “traffic”, “mayor”, “city council”, and “neighborhood”, and the second topic might contain terms like “Congress”, “federal”, and “USA”. Topic Modeling would also statistically examine the terms that are common to both topics.

In probabilistic terms, if a document is a set of topics, the proportion of descriptive words appearing in the document would correspond with the proportions of the different topics. Topic Modeling applies a mathematical framework to this intuitive understanding of document contents. Statistical examination of document terms therefore reveals the set of document topics.

Topic Modeling is an application of the statistical technique Latent Dirichlet Allocation, or LDA. LDA is derived from Latent Semantic Analysis (developed in 1990) and Probabilistic Latent Semantic Analysis (developed in 1999). Latent Semantic Analysis is a Distributional Analysis technique for analyzing relationships between a set of documents and the terms they contain, by producing a set of concepts related to the documents and terms. Probabilistic Latent Semantic Analysis (or Probabilistic Latent Semantic Indexing) analyzes two-mode and co-occurrence data to derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables.

In Probabilistic Latent Semantic Analysis, observations are analyzed in the form of co-occurrences (w,d) of words and documents. The probability of each co-occurrence is a mixture of conditionally independent multinomial distributions:

P(w,d) = Σ_c P(c) P(d|c) P(w|c) = P(d) Σ_c P(c|d) P(w|c)

The term 'c' is the latent topic. The number of topics is an arbitrary hyperparameter that is determined by the Data Scientist. The first formulation is the symmetric formulation, where w and d are both generated from the latent class c using the probabilities P(d|c) and P(w|c), whereas the second formulation is the asymmetric formulation for document d, where a latent class is chosen according to P(c|d), and a word is then generated from that class according to P(w|c). It is possible to model the co-occurrence of any couple of discrete variables, not only words and documents.
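
To make the two formulations concrete, below is a minimal R sketch using made-up toy probabilities (two latent topics, one word w, one document d). The values are purely illustrative; the check confirms that both formulations yield the same joint probability.

# Toy illustration (hypothetical values) of the two pLSA formulations,
# for one word w and one document d with two latent topics c
p_c   <- c(0.6, 0.4)        # P(c)   - topic probabilities
p_d_c <- c(0.2, 0.5)        # P(d|c) - document given topic
p_w_c <- c(0.1, 0.3)        # P(w|c) - word given topic

# Symmetric formulation: P(w,d) = sum_c P(c) P(d|c) P(w|c)
p_wd_symmetric <- sum(p_c * p_d_c * p_w_c)

# Asymmetric formulation: P(w,d) = P(d) sum_c P(c|d) P(w|c)
p_d   <- sum(p_c * p_d_c)           # P(d) = sum_c P(c) P(d|c)
p_c_d <- (p_c * p_d_c) / p_d        # P(c|d) by Bayes' rule
p_wd_asymmetric <- p_d * sum(p_c_d * p_w_c)

all.equal(p_wd_symmetric, p_wd_asymmetric)   # TRUE - both give the same value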

# Set parameters for Gibbs sampling
burnin <- 4000
iter <- 2000
thin <- 500
seed <- list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE

# Number of topics
k <- 10

# Create Document Term Matrix (dtm)
dtm <- dtmCorpus(project_data$Data[1:1000])

# Find the sum of words in each Document
rowTotals <- apply(dtm, 1, sum)

# Remove all docs without words
dtm.new <- dtm[rowTotals > 0,]

# Run LDA using Gibbs sampling
# ldaOut <- LDA(dtm.new, k, method="Gibbs", control=list(nstart=nstart, seed=seed, best=best, burnin=burnin, iter=iter, thin=thin))

# Save the variable, "ldaOut", for faster processing
# saveRDS(ldaOut, "yelp_ldaOut.rds")
ldaOut <- readRDS("yelp_ldaOut.rds")

# Docs to topics
ldaOut.topics <- as.matrix(topics(ldaOut))

# Top 10 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,10))

# Probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)

# Find probability of term being generated from topic 
ap_topics <- tidy(ldaOut, matrix = "beta")

# Table of beta spread of terms per topic
beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

# Find the top 5 terms for each topic
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# Create table of topics per document
ap_documents <- tidy(ldaOut, matrix="gamma")

# STM Package processing to determine the optimal number of Topics.

# Process data using function textProcessor()
processed <- textProcessor(project_data$Data)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...
# Prepare data using function prepDocuments()
out <- prepDocuments(processed$documents, processed$vocab,
                     processed$meta, lower.thresh = 2)
## Removing 17112 of 25712 terms (19796 of 540641 tokens) due to frequency 
## Removing 1 Documents with No Words 
## Your corpus now has 9997 documents, 8600 terms and 520845 tokens.
# MODEL SELECTION
# Run diagnostic using function searchK()
# kResult <- searchK(out$documents[1:1000], out$vocab, K = c(5,10,15,20,40), init.type = "Spectral", data = out$meta)

# Save the "KResult" dataframe for quicker processing
# saveRDS(kResult, "yelp_kResult.rds")
kResult <- readRDS("yelp_kResult.rds")

# STM Modeling
# From the "Semantic Coherence-Exclusivity Plot,
# the optimal number of Topic Models is set to 10
# model <- stm(out$documents, out$vocab, K = 10, max.em.its = 150,
#              data = out$meta, init.type = "Spectral")

# Save the "model" stm Topic Modeling file for quicker processing
# saveRDS(model, "model.rds")
model <- readRDS("model.rds")

# LDA Topic Modeling
# abstract.dfm <- dfm(project_data$Data, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove = c(stopwords("english")))
# dfm.trim <- dfm_trim(abstract.dfm, min_termfreq = 2, max_termfreq = 75)
# dfm.trim
# n.topics <- 10
# dfm2topicmodels <- convert(dfm.trim, to = "topicmodels")
# lda.model <- LDA(dfm2topicmodels, n.topics)

# Save the "ldaOut" stm Topic Modeling file for quicker processing
# saveRDS(lda.model, "yelp_lda.model.rds")
lda.model <- readRDS("yelp_lda.model.rds")

# The topic with the highest proportion for each text.
Topics_Data <- data.frame(Topic = topics(lda.model))

# Topic Similarities: hierarchical clustering of topics by their term distributions (beta)
lda.similarity <- as.data.frame(lda.model@beta) %>%
  scale() %>%
  dist(method = "euclidean") %>%
  hclust(method = "ward.D2")


Section 4 - Tables and Charts

Below are tables and graphic visualizations of the results of Topic Modeling of the Yelp dataset.

## [1] "Figure 1. Diagnostic Values"
No alt text provided for this image

The “Figure 1. Diagnostic Values” plot helps with the selection of the number of topics by evaluating metrics such as residuals, the semantic coherence of the topics, and the held-out likelihood.
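
A diagnostic plot of this kind can be produced directly from the searchK() results; a minimal sketch, assuming the kResult object created in Section 3:

# Plot held-out likelihood, residuals, semantic coherence, and lower bound
# for each candidate number of topics evaluated by searchK()
plot(kResult)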



The “Figure 2. Semantic Coherence-Exclusivity Plot” shows two metrics that correlate with human judgment of topic quality. Semantic coherence measures the co-occurrence of the most probable words in a given topic. Exclusivity balances the Semantic Coherence metric, which is easily achieved when a few topics are dominated by very common words.
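
A coherence-versus-exclusivity plot can be sketched from the same searchK() results; this assumes the semcoh, exclus, and K columns that searchK() stores in kResult$results:

# Semantic Coherence-Exclusivity plot: each point is one candidate number of topics
res <- kResult$results
plot(unlist(res$semcoh), unlist(res$exclus),
     xlab = "Semantic Coherence", ylab = "Exclusivity",
     main = "Semantic Coherence-Exclusivity Plot")
text(unlist(res$semcoh), unlist(res$exclus), labels = unlist(res$K), pos = 3)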


Table 1. Semantic Coherence-Exclusivity


“Table 1. Semantic Coherence-Exclusivity” allows for the verification of optimal beta spread generated by the LDA algorithm.


Table 2. Top 10 Terms Per Topic


“Table 2. Top 10 Terms Per Topic” lists, for each topic found in the Yelp dataset, the ten terms most strongly associated with that topic, along with the probabilities with which each topic is assigned. The Yelp dataset of customer responses is considered to be a mixture of all topics (10 in this case).
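
A table of this kind can be rendered from the ldaOut.terms matrix created in Section 3, for example:

# Render the top-10-terms-per-topic matrix as a table
kable(ldaOut.terms, caption = "Table 2. Top 10 Terms Per Topic")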


## [1] "Table 3. Topic Model - Highest Probability Words."
## A topic model with 10 topics, 9997 documents and a 8600 word dictionary.
## Topic 1 Top Words:
##       Highest Prob: get, room, need, like, stay, call, use 
##       FREX: nail, massag, spa, appoint, desk, salon, pedicur 
##       Lift: aji, apprais, appt, armour, batteri, biagio, blister 
##       Score: massag, nail, spa, pedicur, theater, repair, pool 
## Topic 2 Top Words:
##       Highest Prob: time, love, alway, ive, place, will, friend 
##       FREX: alway, ive, staff, owner, visit, everi, never 
##       Lift: chick-fil-, congratul, helpful, immatur, kyle, macalpin, saint 
##       Score: alway, ive, love, staff, year, time, best 
## Topic 3 Top Words:
##       Highest Prob: park, phoenix, can, area, kid, lot, see 
##       FREX: tour, trail, park, stadium, art, class, airport 
##       Lift: architectur, autograph, bachelor, billboard, birdi, bunker, carousel 
##       Score: park, class, stadium, trail, gym, airport, dog 
## Topic 4 Top Words:
##       Highest Prob: wine, delici, perfect, salad, breakfast, love, chees 
##       FREX: bruschetta, brunch, omelet, asada, lobster, pancak, toast 
##       Lift: alberto, amaretto, argentinian, barbacoa, bfast, bombero, bouch 
##       Score: wine, salad, steve, steak, breakfast, burrito, chees 
## Topic 5 Top Words:
##       Highest Prob: chicken, sauc, flavor, fri, order, sweet, dish 
##       FREX: thai, bbq, brisket, cream, sauc, broth, chili 
##       Lift: anis, biet, bite-s, cardamom, chantilli, chiang, cumin 
##       Score: sauc, chicken, rice, fri, pork, flavor, thai 
## Topic 6 Top Words:
##       Highest Prob: place, bar, drink, beer, night, good, like 
##       FREX: coffe, beer, bar, music, starbuck, donut, tap 
##       Lift: afflict, axi, barkeep, broadcast, bula, cartel, coffeehous 
##       Score: bar, beer, drink, coffe, hat, music, night 
## Topic 7 Top Words:
##       Highest Prob: like, good, just, realli, pizza, dont, get 
##       FREX: pizza, gyro, sandwich, pita, crust, greek, wing 
##       Lift: acrid, baba, babaganoush, barro, buca, cpk, dolmad 
##       Score: pizza, sandwich, tast, salad, crust, meat, eat 
## Topic 8 Top Words:
##       Highest Prob: burger, store, shop, like, can, price, also 
##       FREX: burger, groceri, yogurt, cupcak, market, bakeri, produc 
##       Lift: -averag, -store, americana, amex, basha, butterburg, char-gril 
##       Score: burger, store, shop, cupcak, wal-mart, fri, buy 
## Topic 9 Top Words:
##       Highest Prob: order, wait, back, ask, tabl, time, came 
##       FREX: minut, waitress, ask, wait, manag, tabl, arriv 
##       Lift: -min, amend, apologet, busboy, errand, hasti, maam 
##       Score: order, minut, tabl, server, ask, told, wait 
## Topic 10 Top Words:
##       Highest Prob: food, good, great, place, restaur, servic, price 
##       FREX: sushi, mexican, pasti, taco, chines, salsa, food 
##       Lift: abuelo, ajo, bento, bollywood, bulgogi, c-fu, dosa 
##       Score: food, sushi, mexican, taco, restaur, salsa, chines

“Table 3. Topic Model - Highest Probability Words.” summarizes the high-likelihood co-occurring terms that comprise each topic within the examined dataset. The goodness of fit of the primary assignment can be assessed by taking the ratio of the highest to the second-highest probability, the second-highest to the third-highest probability, and so on.
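
As a sketch of the goodness-of-fit check described above, the ratio of the highest to the second-highest topic probability can be computed per document from the topicProbabilities data frame created in Section 3:

# Ratio of highest to second-highest topic probability for each document;
# larger ratios indicate more confident primary topic assignments
assignment_ratio <- apply(topicProbabilities, 1, function(p) {
  p_sorted <- sort(p, decreasing = TRUE)
  p_sorted[1] / p_sorted[2]
})
summary(assignment_ratio)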



“Figure 3. Topic Model Proportions” examines the probability distribution of topics within the Yelp dataset.
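
A topic-proportion summary of this kind can be drawn directly from the fitted stm model; a minimal sketch:

# Expected topic proportions across the corpus, one bar per topic
plot(model, type = "summary")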



“Figure 4. Topic Network” estimates a graph of topic correlations using a threshold of the covariances. Topics that have a high correlation are connected.
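
The correlation graph described here corresponds to the stm package's topicCorr() function; a minimal sketch, with the 0.01 cutoff chosen for illustration:

# Estimate topic correlations and plot topics connected above the cutoff
topic_corr <- topicCorr(model, method = "simple", cutoff = 0.01)
plot(topic_corr)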


Figure 5. Topic Proportion Network


“Figure 5. Topic Proportion Network” visualizes the correlation of topics combined with the log-proportion of each topic.
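
One possible way to combine the topic correlation graph with each topic's proportion uses the igraph package; the node-sizing rule below is an assumption for illustration only:

# Build a graph from the positive-correlation adjacency matrix and
# size each node by the (scaled) log-proportion of its topic in the corpus
library(igraph)
topic_props <- colMeans(model$theta)      # average topic proportions
g <- graph_from_adjacency_matrix(topicCorr(model)$posadj,
                                 mode = "undirected", diag = FALSE)
plot(g, vertex.size = 10 + 3 * log(topic_props * 100),
     vertex.label = paste("Topic", seq_along(topic_props)),
     main = "Topic Proportion Network")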



“Figure 6” through “Figure 8” demonstrate the distance between words within topics.



“Figure 9. Histogram of the topic shares within the documents” examines topic proportions within documents. The gamma of a topic represents the level of correspondence between the dataset and that topic. The assignments list the topic with the highest probability.
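
A histogram of per-document topic shares can be sketched from the tidy gamma table (ap_documents) created in Section 3:

# Distribution of topic shares (gamma) across documents, one panel per topic
ggplot(ap_documents, aes(x = gamma)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ topic) +
  labs(title = "Histogram of the topic shares within the documents",
       x = "Topic share (gamma)", y = "Number of documents")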



“Table 4: Table of Topics Per Document” examines the first Yelp customer responses, matching each response with the topic category indicated by its terms, and includes the probabilities with which each topic is assigned to a response. The gamma of a topic represents the level of correspondence between the responses and that topic. The Yelp dataset is considered to be a mixture of all topics (10 in this case). The topic assignments list the topic with the highest probability.
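
A table of this kind can be produced from the same ap_documents gamma table, for example showing the first few document-topic rows:

# First rows of the per-document topic (gamma) assignments
kable(head(ap_documents, 10), caption = "Table 4. Table of Topics Per Document")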



The “Figure 10. Per Document Classification” gamma matrix defines the probability that each document is generated from each topic.



“Figure 11. Distribution of Topic Models.” is a histogram of the count and gamma spread of topics throughout the entire Yelp dataset.



The “Table 5. One Topic Per Term Model” beta matrix gives the probability that each word is generated from each topic.



“Figure 12: Histogram of Top 5 Terms Per Topic” displays the top five terms for the Yelp dataset’s ten topics. The probability of the terms appearing in the topic is represented by the histogram bars. Each topic contains all terms (words) in the corpus, albeit with different probabilities.
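
A plot of this kind can be recreated from the ap_top_terms table built in Section 3; a minimal sketch using tidytext's reorder_within() helper:

# Top 5 terms per topic, ordered by beta within each facet
ap_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(x = term, y = beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Histogram of Top 5 Terms Per Topic", x = NULL)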



“Table 6. Probability of Term being generated from Topic” displays the beta spread of terms across the ten topics. The beta spread allows for the characterization of topics by terms that have a high probability of appearing within the topic.



“Figure 13. Words with greatest beta difference between topics 2 and 1”
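
This figure can be approximated from the beta_spread table created in Section 3 by plotting the terms with the largest absolute log2 ratio between topics 2 and 1; a minimal sketch:

# Terms with the greatest beta difference between topic 2 and topic 1
beta_spread %>%
  top_n(20, abs(log_ratio)) %>%
  ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) +
  geom_col() +
  coord_flip() +
  labs(title = "Words with greatest beta difference between topics 2 and 1",
       x = NULL, y = "Log2 ratio of beta in topic 2 / topic 1")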



“Table 7. Topic Models - Related Terms”



“Table 8. Terms vs Topics”



The “Table 9. Documents > Topics” gamma matrix defines the probability that each document is generated from each topic.


## [1] "Figure 14. Topic 10 Wordcloud"
No alt text provided for this image

“Figure 14. Topic 10 Wordcloud”
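
A topic word cloud can be drawn with the stm package's cloud() function (which relies on the wordcloud package being installed); a minimal sketch, assuming the fitted stm model from Section 3:

# Word cloud of the most probable terms in Topic 10
cloud(model, topic = 10, max.words = 50)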



“Figure 15. Topic shares on the corpus as a whole”



“Figure 16. Topic Model Similarity”
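
The similarity dendrogram can be plotted from the lda.similarity hierarchical clustering object created in Section 3:

# Dendrogram of topic similarity based on the scaled beta matrix
plot(lda.similarity,
     main = "Topic Model Similarity",
     xlab = "Topics", sub = "")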


Section 5 - Conclusions

The LDA Topic Model contains a voluminous amount of useful information. This analysis outputs the top terms in each topic, the document-to-topic assignments, and the probabilities within the Topic Model. Gibbs sampling usually finds a near-optimal solution, although the result can vary mathematically between specific analyses. Trialing a variety of parameter settings allows ContextBase to optimize the stability of Topic Modeling results.

Figure 1 demonstrates the arbitrariness of Topic Models. Below are ContextBase’s manual characterizations of the specific topics:

  1. The top words in Topic 1, “just”, “can”, “dont”, “one”, “get”, “also”, “make”, “better”, “think”, “around”, indicate this topic within the Yelp dataset is the topic of resolving a complaint with a business.
  2. Topic 2’s top words, “good”, “food”, “try”, “chicken”, “restaurant”, “ordered”, “menu”, “cheese”, “pizza”, “lunch”, indicate Topic 2 refers to food ordered at restaurants.
  3. Topic 3’s top words, “place”, “great”, “good”, “service”, “food”, “love”, “ive”, “best”, “always”, “like”, indicate Topic 3 refers to reasons customers liked businesses.
  4. Topic 4’s top words, “back”, “time”, “really”, “even”, “like”, “got”, “didnt”, “first”, “much”, “want”, very possibly refer to reasons customers returned to businesses.

The generated Topic Model demonstrates that the primary topic assignments are most reliable when the ratios of topic probabilities are highest. Different values of “k” produce different topic distributions, and can be tuned to optimize the model.


Section 6 - Appendix

Section 6a - Required Packages

The required R programming language packages are installed and loaded into the session library. The packages provide functions for Topic Modeling, Natural Language Processing, data manipulation, and plotting.

Table 10. List of Required Packages

‘topicmodels’ ‘stm’ ‘tidytext’ ‘tidyr’ ‘tidyverse’ ‘tm’ ‘syuzhet’ ‘plyr’ ‘dplyr’ ‘data.table’ ‘stringr’ ‘lattice’ ‘broom’ ‘scales’ ‘magrittr’ ‘lsa’ ‘car’ ‘LDAvis’ ‘ggplot2’ ‘RColorBrewer’ ‘gridExtra’ ‘igraph’ ‘visNetwork’ ‘knitr’


Section 6b - Session Information

Session information is provided for reproducible research. The Session Information below is for reference when running the required packages and R code.

R version 4.0.4 (2021-02-15)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows 10 x64 (build 19041)

RStudio: Integrated Development Environment for R, Version 1.0.153
