Topic Modeling and LDA in Python
Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. In a nutshell, topic models are statistical language models used to uncover hidden structure in a collection of texts.
There are several existing algorithms you can use to perform topic modeling. The most common are Latent Semantic Analysis (LSA/LSI), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA).
In this article, we’ll cover LDA and implement a basic topic model.
Introduction
Latent Dirichlet Allocation (LDA) is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topics.
The Data
The data set we’ll use is a list of over one million news headlines published over a period of 15 years; it can be downloaded from Kaggle (https://www.kaggle.com/therohk/million-headlines/data).
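A minimal way to load the headlines with pandas might look like the sketch below; the file name abcnews-date-text.csv and the headline_text column are what the Kaggle download uses, so adjust the path if your copy differs.

```python
import pandas as pd

# Load the headlines. The file name and column below assume the Kaggle
# download (abcnews-date-text.csv with a 'headline_text' column);
# adjust the path if your copy differs.
data = pd.read_csv('abcnews-date-text.csv')
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text

print(len(documents))
print(documents.head())
```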
Data Preprocessing
To preprocess the data set, we will tokenize the headlines, remove stopwords and very short tokens, and then lemmatize and stem the remaining words.
Loading Gensim and NLTK libraries:
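A typical set of imports for this pipeline, assuming Gensim and NLTK are already installed:

```python
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import nltk

# WordNet data is required by the lemmatizer.
nltk.download('wordnet')
```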
Function to perform lemmatization and stemming steps on the data set:
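One possible implementation, combining NLTK’s WordNetLemmatizer and SnowballStemmer with Gensim’s simple_preprocess tokenizer and STOPWORDS list (imports repeated here for completeness):

```python
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize first (treating each token as a verb), then stem the result.
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, drop stopwords and very short tokens,
    # then lemmatize and stem what remains.
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
```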
Data set preview after preprocessing step:
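A quick sanity check might look like this sketch, which assumes the documents DataFrame and preprocess function from the snippets above; index 4310 is just an arbitrary headline chosen for illustration:

```python
# Pick an arbitrary headline (index 4310 is only an example) and compare
# the raw text with its preprocessed tokens.
doc_sample = documents[documents['index'] == 4310].values[0][0]

print('original document:')
print(doc_sample.split(' '))
print('\ntokenized, lemmatized and stemmed document:')
print(preprocess(doc_sample))
```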
Preprocess the headline text, saving the results as ‘processed_docs’
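Assuming the documents DataFrame and preprocess function sketched above, this is one way to do it:

```python
# Apply the preprocessing function to every headline; on the full
# one-million-headline data set this can take a few minutes.
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]
```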
Bag of Words on the Data set
Create a dictionary from ‘processed_docs’ containing the number of times each word appears in the training set.
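With Gensim this is a one-liner; the loop below just previews a few (id, token) pairs:

```python
import gensim

# Map every unique token to an integer id and count document frequencies.
dictionary = gensim.corpora.Dictionary(processed_docs)

# Preview a few (id, token) pairs.
count = 0
for token_id, token in dictionary.items():
    print(token_id, token)
    count += 1
    if count > 10:
        break
```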
Gensim filter_extremes
Filter out tokens that appear in very few documents or in a large fraction of the documents, and keep only the most frequent remaining tokens, as in the sketch below.
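A sketch using Gensim’s filter_extremes; the thresholds (no_below=15, no_above=0.5, keep_n=100000) are illustrative choices, not prescribed values:

```python
# Keep tokens that appear in at least 15 documents but in no more than
# half of all documents, and cap the vocabulary at the 100,000 most
# frequent tokens. These thresholds are illustrative; tune them for
# your own corpus.
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
```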
Gensim doc2bow
For each document we create a bag-of-words representation reporting which words appear and how many times they appear.
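Using the dictionary built above, doc2bow converts each document into this representation:

```python
# Convert each preprocessed document into a list of (token_id, count) pairs.
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
```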
Preview Bag of Words for our sample preprocessed document:
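Continuing with the arbitrary sample index 4310 used earlier:

```python
# Inspect the bag-of-words of the sample headline used earlier
# (index 4310 is still just the arbitrary example from above).
bow_doc_sample = bow_corpus[4310]

for token_id, count in bow_doc_sample:
    print('Word {} ("{}") appears {} time(s).'.format(
        token_id, dictionary[token_id], count))
```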
Running LDA using Bag of Words
Train our LDA model using gensim.models.LdaMulticore and save it to ‘lda_model’
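A minimal training call might look like this; num_topics, passes, and workers are illustrative settings rather than values taken from the original notebook:

```python
import gensim

# Train a 10-topic LDA model on the bag-of-words corpus. num_topics,
# passes and workers are illustrative settings; adjust them to your
# data set and hardware.
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=10,
                                       id2word=dictionary,
                                       passes=2,
                                       workers=2)
```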
For each topic, we will explore the words occurring in that topic and their relative weights.
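Gensim’s print_topics lists the highest-weighted words per topic:

```python
# List every topic with its highest-weighted words.
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {}\nWords: {}'.format(idx, topic))
```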
You can view/download the Jupyter Notebook here (https://nbviewer.org/github/just-arvind/article_src/tree/main/LDA_Topic_Modeling_Article.ipynb)