ContextBase - Topic Modeling
ContextBase - https://contextbase.github.io
All programming by John Akwei, ECMp ERMp Data Scientist
May 21, 2021
Table of Contents
The Problem
Our Solution
Section 1 - Data Import
Section 2 - Document Term Matrix
Section 3 - Topic Modeling
Section 4 - Tables and Charts
Section 5 - Conclusions
Section 6 - Appendix
Section 6a - Required Packages
Section 6b - Session Information
Section 7 - References
The Problem
The volume and complexity of text-based business data has increased exponentially and now greatly exceeds the processing capabilities of humans. The vast majority of online Business data is unstructured and exists in textual form such as emails, support tickets, chats, social media, surveys, articles, and documents. Manually sorting through online Business data in order to gain hidden insights would be difficult, expensive, and impossibly time-consuming.
The internet has also added complexity to the demands Customers place on Businesses, and the effectiveness of Marketing has been affected as a result. Growing databases of Customer responses make it difficult to interpret basic Customer requirements, and the indicators of Customer intent are becoming more complex.
New Machine Learning methods are required for improved extraction of business knowledge. New linguistic techniques are needed to mine Customer Response text data.
Our Solution
As a result of this demand, Topic Modeling has been gaining attention in recent years. ContextBase provides Topic Modeling of Client text data to precisely refine Business policies and Marketing material. Topic Modeling is a text mining method derived from Natural Language Processing and Machine Learning. Topic Models classify document terms into themes (or “topics”), and are applicable to the analysis of themes within novels, documents, reviews, forums, discussions, blogs, and micro-blogs.
ContextBase begins the process of Topic Modeling with Data Scientist/Programmer awareness of the sensitivity of Topic Modeling algorithms. After the Topic Modeling of Client text data, ContextBase manually characterizes the resulting topics to reduce their arbitrariness. ContextBase also remains aware that Topic Models change as document contents vary.
The goal of ContextBase’s Topic Modeling of Client text data is the programmatic deduction of stable Topic Models. As a result, ContextBase Topic Modeling allows for improvement in the Client’s business processes. This document presents an unsupervised Machine Learning Topic Modeling of customer response text data posted to https://www.yelp.com/. The programming language used is R. The analysis includes information on required R packages, session information, data importation, normalization of the text, creation of a document term matrix, Topic Modeling code, and output tables and graphs demonstrating the results of Topic Modeling.
This paper applies the topic modeling technique known as Latent Dirichlet Allocation (LDA), a form of text mining that aims to discover the hidden (latent) thematic structure in large archives of documents and that can be used to analyze the evolution of the semantic content of a business.
Section 1 - Data Import
The data imported for this project is a collection of approximately 10,000 customer feedback comments posted to https://www.yelp.com/. To reduce the extensive amount of time required to process the full set of comments, only the first 1000 comments were selected for processing. The column of comment data within the dataframe was formatted as character variables for the subsequent Natural Language Processing algorithms.
# Import Data
import_data <- read.csv("yelp.csv")
# Process imported data for Topic Modeling algorithms
project_data <- data.frame(import_data$text)
rm(import_data)
names(project_data) <- "Data"
project_data$Data <- as.character(project_data$Data)
# Examine data
kable(head(project_data, 1), caption = "Table 1. Yelp Data.")
Table 1. Yelp Data.
"My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better. Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I’ve ever had. I’m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing. While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best “toast” I’ve ever had. Anyway, I can’t wait to go back!"
Section 2 - Document Term Matrix
The following function normalizes the text within the 1000 selected customer responses by removing numbers, punctuation, and extra white space. Upper-case letters are converted to lower case, and irrelevant stop words (“the”, “a”, “an”, etc.) are removed. Lastly, a “Document Term Matrix” is created to record the frequency of each term in each document.
# Text Mining Function
dtmCorpus <- function(df) {
  df_corpus <- Corpus(VectorSource(df))
  df_corpus <- tm_map(df_corpus, function(x) iconv(x, to = 'ASCII'))
  df_corpus <- tm_map(df_corpus, removeNumbers)
  df_corpus <- tm_map(df_corpus, removePunctuation)
  df_corpus <- tm_map(df_corpus, stripWhitespace)
  df_corpus <- tm_map(df_corpus, tolower)
  df_corpus <- tm_map(df_corpus, removeWords, stopwords('english'))
  DocumentTermMatrix(df_corpus)
}
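As a quick sanity check of the function’s output, the resulting Document Term Matrix can be examined with the tm package’s helpers. The sketch below is illustrative only and is not part of the original analysis; the frequency threshold of 50 is an arbitrary choice.

# Build the matrix for the 1000 selected responses and examine it
dtm_check <- dtmCorpus(project_data$Data[1:1000])
dtm_check                               # dimensions and sparsity of the matrix
findFreqTerms(dtm_check, lowfreq = 50)  # terms appearing at least 50 times
inspect(dtm_check[1:2, 1:8])            # counts for the first documents and terms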
Section 3 - Topic Modeling
Topic Modeling treats each document as a mixture of topics, and each topic as a mixture of words (or “terms”). Each document may contain words from several topics in particular proportions. The content of a document usually blends continuously into the content of other documents rather than falling into discrete groups, in the same way that individuals’ use of natural language blends continuously.
An example of a two-topic Topic Model of a journalism document is the modeling of the document into “local” and “national” topics. The first topic, “local”, would contain terms like “traffic”, “mayor”, “city council”, and “neighborhood”, while the second topic might contain terms like “Congress”, “federal”, and “USA”. Topic Modeling would also statistically examine the terms that are common between topics.
In probabilistic terms, if a document is a mixture of topics, the proportions of descriptive words appearing in the document would correspond with the proportions of the different topics. Topic Modeling applies a mathematical framework to this intuitive understanding of document contents. Statistical examination of document terms therefore reveals the set of document topics.
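To make this intuition concrete, here is a small worked example in R using the two-topic journalism model described above. The topic proportions (gamma) and word probabilities (beta) are invented for illustration and are not taken from the Yelp analysis.

# Hypothetical topic proportions for one document and word probabilities per topic
gamma_d <- c(local = 0.7, national = 0.3)
beta <- rbind(local    = c(traffic = 0.10, mayor = 0.08, congress = 0.01),
              national = c(traffic = 0.01, mayor = 0.02, congress = 0.12))
# P(word | document) = sum over topics of P(topic | document) * P(word | topic)
p_word_given_doc <- drop(gamma_d %*% beta)
round(p_word_given_doc, 3)
#  traffic    mayor congress
#    0.073    0.062    0.043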
Topic Modeling is an application of the statistical technique Latent Dirichlet Allocation, or LDA. LDA is derived from Latent Semantic Analysis (developed in the late 1980s) and Probabilistic Latent Semantic Analysis (developed in 1999). Latent Semantic Analysis is a distributional analysis technique for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. Probabilistic Latent Semantic Analysis (or Probabilistic Latent Semantic Indexing) analyzes two-mode and co-occurrence data to derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables.
In Probabilistic Latent Semantic Analysis, observations are analyzed in the form of co-occurrences (w, d) of words and documents. The probability of each co-occurrence is a mixture of conditionally independent multinomial distributions:
P(w, d) = Σ_c P(c) P(d|c) P(w|c) = P(d) Σ_c P(c|d) P(w|c)
The term c is the word’s topic. The number of topics is an arbitrary hyperparameter that is determined by the Data Scientist. The first formulation is the symmetric formulation, in which w and d are both generated from the latent class c using the probabilities P(d|c) and P(w|c). The second formulation is the asymmetric formulation, in which, for each document d, a latent class is chosen according to P(c|d) and a word is then generated from that class according to P(w|c). It is possible to model the co-occurrence of any pair of discrete variables in this way, not only words and documents.
# Set parameters for Gibbs sampling
burnin <- 4000
iter <- 2000
thin <- 500
seed <- list(2003, 5, 63, 100001, 765)
nstart <- 5
best <- TRUE

# Number of topics
k <- 10

# Create Document Term Matrix (dtm)
dtm <- dtmCorpus(project_data$Data[1:1000])

# Find the sum of words in each document
rowTotals <- apply(dtm, 1, sum)

# Remove all docs without words
dtm.new <- dtm[rowTotals > 0, ]

# Run LDA using Gibbs sampling
# ldaOut <- LDA(dtm.new, k, method = "Gibbs",
#               control = list(nstart = nstart, seed = seed, best = best,
#                              burnin = burnin, iter = iter, thin = thin))
# Save the variable "ldaOut" for faster processing
# saveRDS(ldaOut, "yelp_ldaOut.rds")
ldaOut <- readRDS("yelp_ldaOut.rds")

# Docs to topics
ldaOut.topics <- as.matrix(topics(ldaOut))

# Top 10 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut, 10))

# Probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)

# Find probability of a term being generated from a topic
ap_topics <- tidy(ldaOut, matrix = "beta")

# Table of beta spread of terms per topic
beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

# Find top 5 terms for each topic
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# Create table of topics per document
ap_documents <- tidy(ldaOut, matrix = "gamma")

# STM package processing to determine the optimal number of topics
# Process data using function textProcessor()
processed <- textProcessor(project_data$Data)

## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...

# Prepare data using function prepDocuments()
out <- prepDocuments(processed$documents, processed$vocab,
                     processed$meta, lower.thresh = 2)

## Removing 17112 of 25712 terms (19796 of 540641 tokens) due to frequency
## Removing 1 Documents with No Words
## Your corpus now has 9997 documents, 8600 terms and 520845 tokens.

# MODEL SELECTION
# Run diagnostic using function searchK()
# kResult <- searchK(out$documents[1:1000], out$vocab, K = c(5, 10, 15, 20, 40),
#                    init.type = "Spectral", data = out$meta)
# Save the "kResult" dataframe for quicker processing
# saveRDS(kResult, "yelp_kResult.rds")
kResult <- readRDS("yelp_kResult.rds")

# STM Modeling
# From the "Semantic Coherence-Exclusivity Plot",
# the optimal number of Topic Models is set to 10
# model <- stm(out$documents, out$vocab, K = 10, max.em.its = 150,
#              data = out$meta, init.type = "Spectral")
# Save the "model" stm Topic Modeling file for quicker processing
# saveRDS(model, "model.rds")
model <- readRDS("model.rds")

# LDA Topic Modeling
# abstract.dfm <- dfm(project_data$Data, remove_numbers = TRUE, remove_punct = TRUE,
#                     remove_symbols = TRUE, remove = c(stopwords("english")))
# dfm.trim <- dfm_trim(abstract.dfm, min_termfreq = 2, max_termfreq = 75)
# dfm.trim
# n.topics <- 10
# dfm2topicmodels <- convert(dfm.trim, to = "topicmodels")
# lda.model <- LDA(dfm2topicmodels, n.topics)
# Save the "lda.model" LDA Topic Modeling file for quicker processing
# saveRDS(lda.model, "yelp_lda.model.rds")
lda.model <- readRDS("yelp_lda.model.rds")

# The topic with the highest proportion for each text
Topics_Data <- data.frame(Topic = topics(lda.model))

# Topic similarities
lda.similarity <- as.data.frame(lda.model@beta) %>%
  scale() %>%
  dist(method = "euclidean") %>%
  hclust(method = "ward.D2")
Section 4 - Tables and Charts
Below are tables and graphic visualizations of the results of Topic Modeling of the Yelp dataset.
## [1] "Figure 1. Diagnostic Values"
The “Figure 1. Diagnostic Values” plot helps with the selection of the number of topics by evaluating diagnostics such as residuals, the semantic coherence of the topics, and the held-out likelihood.
The “Figure 2. Semantic Coherence-Exclusivity Plot” correlates with human judgment of topic quality. Semantic coherence measures the co-occurrence of the most probable words in a given topic. Exclusivity counterbalances the semantic coherence metric, which can be high simply because a few topics are dominated by very common words.
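Figures 1 and 2 can be reproduced from the searchK() output saved above. The sketch below is an assumption about how they were drawn; in particular, it assumes kResult$results contains the K, semcoh, and exclus columns that the stm package’s searchK() normally returns.

# Diagnostic panels: held-out likelihood, residuals, semantic coherence, lower bound
plot(kResult)
# Semantic coherence vs. exclusivity, labelled by the candidate number of topics K
diag_df <- data.frame(lapply(kResult$results, unlist))
plot(diag_df$semcoh, diag_df$exclus, pch = 19,
     xlab = "Semantic Coherence", ylab = "Exclusivity",
     main = "Semantic Coherence vs. Exclusivity")
text(diag_df$semcoh, diag_df$exclus, labels = diag_df$K, pos = 3)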
Table 1. Semantic Coherence-Exclusivity
“Table 1. Semantic Coherence-Exclusivity” allows for the verification of optimal beta spread generated by the LDA algorithm.
Table 2. Top 10 Terms Per Topic
“Table 2. Top 10 Terms Per Topic” lists the ten terms most strongly associated with each topic found in the Yelp dataset, together with the probabilities with which each topic is assigned. The Yelp dataset of customer responses is considered to be a mixture of all topics (10 in this case).
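A table along these lines can be rendered directly from the ldaOut.terms matrix created in Section 3; the call below is a sketch (terms only, with the caption text assumed).

# Ten highest-probability terms for each of the ten LDA topics
kable(ldaOut.terms, caption = "Table 2. Top 10 Terms Per Topic")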
## [1] "Table 3. Topic Model - Highest Probability Words." ## A topic model with 10 topics, 9997 documents and a 8600 word dictionary. ## Topic 1 Top Words: ## Highest Prob: get, room, need, like, stay, call, use ## FREX: nail, massag, spa, appoint, desk, salon, pedicur ## Lift: aji, apprais, appt, armour, batteri, biagio, blister ## Score: massag, nail, spa, pedicur, theater, repair, pool ## Topic 2 Top Words: ## Highest Prob: time, love, alway, ive, place, will, friend ## FREX: alway, ive, staff, owner, visit, everi, never ## Lift: chick-fil-, congratul, helpful, immatur, kyle, macalpin, saint ## Score: alway, ive, love, staff, year, time, best ## Topic 3 Top Words: ## Highest Prob: park, phoenix, can, area, kid, lot, see ## FREX: tour, trail, park, stadium, art, class, airport ## Lift: architectur, autograph, bachelor, billboard, birdi, bunker, carousel ## Score: park, class, stadium, trail, gym, airport, dog ## Topic 4 Top Words: ## Highest Prob: wine, delici, perfect, salad, breakfast, love, chees ## FREX: bruschetta, brunch, omelet, asada, lobster, pancak, toast ## Lift: alberto, amaretto, argentinian, barbacoa, bfast, bombero, bouch ## Score: wine, salad, steve, steak, breakfast, burrito, chees ## Topic 5 Top Words: ## Highest Prob: chicken, sauc, flavor, fri, order, sweet, dish ## FREX: thai, bbq, brisket, cream, sauc, broth, chili ## Lift: anis, biet, bite-s, cardamom, chantilli, chiang, cumin ## Score: sauc, chicken, rice, fri, pork, flavor, thai ## Topic 6 Top Words: ## Highest Prob: place, bar, drink, beer, night, good, like ## FREX: coffe, beer, bar, music, starbuck, donut, tap ## Lift: afflict, axi, barkeep, broadcast, bula, cartel, coffeehous ## Score: bar, beer, drink, coffe, hat, music, night ## Topic 7 Top Words: ## Highest Prob: like, good, just, realli, pizza, dont, get ## FREX: pizza, gyro, sandwich, pita, crust, greek, wing ## Lift: acrid, baba, babaganoush, barro, buca, cpk, dolmad ## Score: pizza, sandwich, tast, salad, crust, meat, eat ## Topic 8 Top Words: ## Highest Prob: burger, store, shop, like, can, price, also ## FREX: burger, groceri, yogurt, cupcak, market, bakeri, produc ## Lift: -averag, -store, americana, amex, basha, butterburg, char-gril ## Score: burger, store, shop, cupcak, wal-mart, fri, buy ## Topic 9 Top Words: ## Highest Prob: order, wait, back, ask, tabl, time, came ## FREX: minut, waitress, ask, wait, manag, tabl, arriv ## Lift: -min, amend, apologet, busboy, errand, hasti, maam ## Score: order, minut, tabl, server, ask, told, wait ## Topic 10 Top Words: ## Highest Prob: food, good, great, place, restaur, servic, price ## FREX: sushi, mexican, pasti, taco, chines, salsa, food ## Lift: abuelo, ajo, bento, bollywood, bulgogi, c-fu, dosa ## Score: food, sushi, mexican, taco, restaur, salsa, chines
“Table 3. Topic Model - Highest Probability Words.” summarizes the high-likelihood co-occurring terms that comprise each topic within the examined dataset. The goodness of fit of the primary topic assignment can be assessed by taking the ratio of the highest to the second-highest probability, the second-highest to the third-highest probability, and so on.
“Figure 3. Topic Model Proportions” examines the probability distribution of topics within the Yelp dataset.
“Figure 4. Topic Network” estimates a graph of topic correlations using a threshold of the covariances. Topics that have a high correlation are connected.
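A plot along the lines of Figure 4 can be produced with the stm package’s topicCorr() function applied to the fitted model. The method and cutoff arguments shown below are assumptions, not necessarily the settings used for the original figure.

# Estimate topic correlations from the fitted STM model; edges connect topics
# whose correlation exceeds the cutoff
topic_corr <- topicCorr(model, method = "simple", cutoff = 0.01)
plot(topic_corr)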
Figure 5. Topic Proportion Network
“Figure 5. Topic Proportion Network” visualizes the correlation of topics combined with the log-proportion of each topic.
“Figure 6” through “Figure 8” demonstrate the distances between words within topics.
“Figure 9. Histogram of the topic shares within the documents” examines topic proportions within documents. The gamma value of a topic represents the level of correspondence between the dataset and that topic. The assignments list the topic with the highest probability.
“Table 4: Table of Topics Per Document” examines the first Yelp customer responses and matches each response with the topic category its terms indicate, together with the probabilities with which each topic is assigned to the response. The gamma value of a topic represents the level of correspondence between the responses and that topic. The Yelp dataset is considered to be a mixture of all topics (10 in this case). The topic assignments list the topic with the highest probability.
“Figure 10. Per Document Classification.” displays the gamma matrix, which defines the probability that each document is generated from each topic.
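A histogram of topic shares like Figure 9 can be drawn from the fitted STM model with the plot method’s “hist” type; this is a sketch of one plausible way to produce the figure.

# Histogram of estimated document-topic proportions, one panel per topic
plot(model, type = "hist")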
“Figure 11. Distribution of Topic Models.” is a histogram of the count and gamma spread of topics throughout the entire Yelp dataset.
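An approximation of Figure 11 can be built from the ap_documents table created in Section 3; the bin count below is an assumption.

# Distribution of per-document topic proportions (gamma), faceted by topic
ggplot(ap_documents, aes(gamma)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ topic) +
  labs(title = "Figure 11. Distribution of Topic Models",
       x = "Gamma (document-topic proportion)", y = "Count")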
The “Table 5. One Topic Per Term Model” beta matrix examines the topic model for probabilities that each word is generated from each topic.
“Figure 12: Histogram of Top 5 Terms Per Topic” displays the top five terms for the Yelp dataset’s ten topics. The probability of the terms appearing in the topic is represented by the histogram bars. Each topic contains all terms (words) in the corpus, albeit with different probabilities.
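A chart like Figure 12 can be drawn from the ap_top_terms table computed in Section 3. The sketch below assumes the tidytext helpers reorder_within() and scale_y_reordered() are available in the installed version of the package.

# Top five terms per topic, ordered by beta within each panel
ap_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  scale_y_reordered() +
  labs(title = "Figure 12. Top 5 Terms Per Topic", x = "Beta", y = NULL)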
“Table 6. Probability of Term being generated from Topic” displays the beta spread of terms across the ten topics. The beta spread allows topics to be characterized by the terms that have a high probability of appearing within them.
“Figure 13. Words with greatest beta difference between topics 2 and 1”
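Figure 13 corresponds to the log_ratio column of the beta_spread table created in Section 3; the number of terms shown below (20) is an assumption.

# Terms with the largest absolute log2 ratio of beta between topic 2 and topic 1
beta_spread %>%
  top_n(20, abs(log_ratio)) %>%
  ggplot(aes(reorder(term, log_ratio), log_ratio)) +
  geom_col() +
  coord_flip() +
  labs(title = "Figure 13. Words with greatest beta difference between topics 2 and 1",
       x = NULL, y = "Log2 ratio of beta in topic 2 / topic 1")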
“Table 7. Topic Models - Related Terms”
“Table 8. Terms vs Topics”
The “Table 9. Documents > Topics” gamma matrix defines the probability that each document is generated from each topic.
## [1] "Figure 14. Topic 10 Wordcloud"
“Figure 14. Topic 10 Wordcloud”
“Figure 15. Topic shares on the corpus as a whole”
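Figure 15 matches the kind of output produced by the STM plot method’s “summary” type; the call below is a sketch, and the number of labelling words is an assumption.

# Expected topic proportions across the whole corpus, labelled with top topic words
plot(model, type = "summary", n = 5)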
“Figure 16. Topic Model Similarity”
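A dendrogram like Figure 16 can be reproduced from the lda.similarity hierarchical clustering object built in Section 3.

# Dendrogram of topic-to-topic Euclidean distances computed from the LDA beta matrix
plot(lda.similarity, main = "Figure 16. Topic Model Similarity",
     xlab = "Topics", sub = "")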
Section 5 - Conclusions
The LDA algorithm Topic Model contains a voluminous amount of useful information. This analysis outputs the top terms in each topic, the document-to-topic assignments, and the probabilities within the Topic Model. Gibbs sampling usually finds a good solution, although the result can vary from run to run for a specific analysis. Trying a variety of parameter settings allows ContextBase to optimize the stability of Topic Modeling results.
Figure 1 demonstrates the arbitrariness of Topic Models. Below is ContextBase’s manual characterization of the specific topics:
- The top words in Topic 1, “just”, “can”, “dont”, “one”, “get”, “also”, “make”, “better”, “think”, “around”, indicate this topic within the Yelp dataset is the topic of resolving a complaint with a business.
- Topic 2’s top words, “good”, “food”, “try”, “chicken”, “restaurant”, “ordered”, “menu”, “cheese”, “pizza”, “lunch”, indicate Topic 2 refers to food ordered at restaurants.
- Topic 3’s top words, “place”, “great”, “good”, “service”, “food”, “love”, “ive”, “best”, “always”, “like”, indicate Topic 3 refers to reasons customers liked businesses.
- Topic 4’s top words, “back”, “time”, “really”, “even”, “like”, “got”, “didnt”, “first”, “much”, “want”, very possibly refer to reasons customers returned to businesses.
The generated Topic Model demonstrates that the primary topic assignments are optimal when the ratios of probabilities are highest. Different values of “k” optimize the topic distributions.
Section 6 - Appendix
Section 6a - Required Packages
The needed R programming language packages are installed and included in the package library. The R packages included are packages for Topic Modeling, Natural Language Processing, data manipulation, and plotting.
Table 10. List of Required Packages
‘topicmodels’ ‘stm’ ‘tidytext’ ‘tidyr’ ‘tidyverse’ ‘tm’ ‘syuzhet’ ‘plyr’ ‘dplyr’ ‘data.table’ ‘stringr’ ‘lattice’ ‘broom’ ‘scales’ ‘magrittr’ ‘lsa’ ‘car’ ‘LDAvis’ ‘ggplot2’ ‘RColorBrewer’ ‘gridExtra’ ‘igraph’ ‘visNetwork’ ‘knitr’
Section 6b - Session Information
Session information is provided for reproducible research. The Session Information below is for reference when running the required packages, and R code.
R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
RStudio: Integrated Development Environment for R, Version 1.0.153