Understanding Latent Dirichlet Allocation (LDA) with simple examples

We live in an era of data everywhere, with huge volumes collected every day. We constantly need to organize, search and understand this data, and most of it is unstructured, which makes our work more complicated. Topic modelling is a natural language processing approach widely used to identify topics in unstructured data.

Topic modelling

Topic modelling provides us with methods to organize, understand and summarize large collections of free-text data. In simple terms, topic modelling is a method for finding groups of words (i.e. topics) in a collection of unstructured text, where each group best represents keywords that occur close to each other. We can think of this as a way to extract the "dominant patterns of words" from the text.

Although many techniques are used for topic modelling, this article attempts to explain the widely used Latent Dirichlet Allocation (LDA) technique with simple examples, without diving deep into the technical details.

LDA - Latent Dirichlet Allocation

In the Latent Dirichlet Allocation (LDA) model, each document in a text corpus is viewed as a mixture of topics, where a topic is a group of closely related keywords. The model proposes that each word in the text is attributable to one of the topics.
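Under the hood, LDA is usually fitted with variational inference or Gibbs sampling; the sketch below is a toy collapsed Gibbs sampler in pure Python, written only to make the "each word belongs to one topic" idea concrete. It is not how production libraries implement LDA, and the corpus, function name and hyperparameters are all illustrative assumptions.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustration only)."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    V = len(vocab)
    # z[d][i] = topic currently assigned to the i-th word of document d
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    ndk = [[0] * n_topics for _ in docs]               # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic -> word counts
    nk = [0] * n_topics                                # words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove current assignment, then resample
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # p(topic t) ∝ (ndk + alpha) * (nkw + beta) / (nk + V*beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # per-document topic proportions (these are the "X% Topic 1" numbers)
    theta = [[(ndk[d][t] + alpha) / (len(docs[d]) + n_topics * alpha)
              for t in range(n_topics)]
             for d in range(len(docs))]
    return theta, nkw

# Content words from the five example texts below (stop words dropped by hand)
docs = [
    "cheese dosa chips breakfast".split(),
    "three bedroom apartment".split(),
    "snacks chips samosa cheese cashews".split(),
    "area two bedroom apartments".split(),
    "son cheese chips bedroom".split(),
]
theta, topic_words = lda_gibbs(docs, n_topics=2)
for d, mix in enumerate(theta):
    print(f"Text {'ABCDE'[d]}:", [round(p, 2) for p in mix])
```

After enough iterations the sampler tends to settle on one food-like topic and one housing-like topic, though which index each topic lands on varies from run to run.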

Simple example 

Text A: I like cheese dosa and chips for breakfast.

Text B: We live in a three-bedroom apartment.

Text C: For snacks I had chips, samosa and cheese cashews.

Text D: In my area there are many two bedroom apartments.

Text E: My son avoids having cheese & chips in our bedroom.

LDA Topic Modelling

Top topics (after several iterations of the algorithm)

Topic 1: cheese - 30%, chips - 30%, breakfast - 10%, snacks - 10%, etc. (We can see this topic revolves around food.)

Topic 2: bedroom - 30%, apartment - 20%, cheese - 5%, area - 5%, etc. (Similarly, this topic revolves around apartments or housing.)

Applying on our Text example

Text A: 100% Topic 1 (deals with food)

Text B: 100% Topic 2 (deals with apartment / house)

Text C: 100% Topic 1 (deals with food)

Text D: 100% Topic 2 (deals with apartment / house)

Text E: 80% Topic 1, 20% Topic 2 (deals with food and apartment / house)
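One way to see where a mixture like Text E's comes from is to assign each content word to its strongest topic and count the proportions. The snippet below does exactly that with topic-word weights loosely following the hypothetical numbers above; the weights, the `floor` smoothing value and the exact 80/20 vs. two-thirds/one-third split are all assumptions, since the real split depends on the fitted model.

```python
# Hypothetical topic-word weights, loosely following the example above
topic_food = {"cheese": 0.30, "chips": 0.30, "breakfast": 0.10, "snacks": 0.10}
topic_house = {"bedroom": 0.30, "apartment": 0.20, "cheese": 0.05, "area": 0.05}

def topic_mixture(words, topics, floor=1e-6):
    """Assign each word to its highest-weight topic, return topic proportions."""
    counts = [0] * len(topics)
    for w in words:
        weights = [t.get(w, floor) for t in topics]  # unseen words get a tiny floor
        counts[weights.index(max(weights))] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

# Content words from Text E: cheese and chips lean food, bedroom leans housing,
# so the mixture comes out roughly two-thirds Topic 1, one-third Topic 2
print(topic_mixture(["cheese", "chips", "bedroom"], [topic_food, topic_house]))
```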

Final Topic (of this article!)

At Changepond, we mostly use spaCy + Gensim for text processing on huge volumes of free text, with various data models including LDA.
