Topic Modeling and LDA in Python


Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. In a nutshell, topic models are statistical language models used for uncovering hidden structure in a collection of texts.

There are several existing algorithms you can use for topic modeling. The most common are Latent Semantic Analysis (LSA/LSI), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA).

In this article, we’ll cover LDA, and implement a basic topic model.

Introduction

Latent Dirichlet Allocation (LDA) is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topics.

The Data

The data set we’ll use is a list of over one million news headlines published over a period of 15 years and can be downloaded from Kaggle. (https://www.kaggle.com/therohk/million-headlines/data)

Data Preprocessing

To preprocess the data set, we will perform the following steps:

  • Tokenization: split the text into sentences and the sentences into words; lowercase the words and remove punctuation.
  • Remove words with fewer than 3 characters.
  • Remove stopwords.
  • Lemmatize and stem the remaining words: lemmatization reduces a word to its dictionary form, while stemming trims it to its root.

Loading Gensim and NLTK libraries:


Function to perform lemmatization and stemming steps on the data set:


Data set preview after preprocessing step:

Preprocess the headline text, saving the results as ‘processed_docs’


Bag of Words on the Data set

Create a dictionary from the preprocessed data set that maps each unique word to an integer id and records how many times it appears in the training set.


Gensim filter_extremes

Filter out tokens that appear in:

  • fewer than 15 documents (absolute number), or
  • more than 50% of documents (a fraction of the total corpus size, not an absolute number);
  • after the above two filters, keep only the 100,000 most frequent tokens.


Gensim doc2bow

For each document we create a bag-of-words representation: a sparse list of (token id, token count) pairs recording which words appear and how many times.


Preview Bag of Words for our sample preprocessed document:

Running LDA using Bag of Words

Train our LDA model using gensim.models.LdaMulticore and save it to ‘lda_model’


For each topic, we will explore the words occurring in that topic and their relative weights.


You can view/download the Jupyter Notebook here (https://nbviewer.org/github/just-arvind/article_src/tree/main/LDA_Topic_Modeling_Article.ipynb)
