Understanding Latent Dirichlet Allocation (LDA) with simple examples

We live in an era of data everywhere, with huge volumes collected every day. We constantly need to organize, search and understand this data, and most of it is unstructured, which makes our work more complicated. Topic modelling is a natural language processing approach widely used to identify topics in unstructured data.

Topic modelling

Topic modelling provides us with methods to organize, understand and summarize large collections of free-text data. In simple terms, topic modelling is a method for finding groups of words (i.e. topics) in a collection of unstructured text, where each group best represents keywords that occur close to each other. We can think of this as a way to extract the "dominant patterns of words" from the text.

Although many techniques are used for topic modelling, this article attempts to explain the widely used Latent Dirichlet Allocation (LDA) technique with simple examples, without diving deep into the technical details.

LDA - Latent Dirichlet Allocation

In the Latent Dirichlet Allocation (LDA) model, each document in a text corpus is viewed as a mixture of topics, where a topic is a group of closely related keywords. The model proposes that each word in the text is attributable to one of the topics.
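Under the hood, LDA is usually fitted with variational inference or Gibbs sampling; the sketch below is a toy collapsed Gibbs sampler in pure Python, written only to make the "each word belongs to one topic" idea concrete. It is not how production libraries implement LDA, and the corpus, function name and hyperparameters are all illustrative assumptions.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustration only)."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    V = len(vocab)
    # z[d][i] = topic currently assigned to the i-th word of document d
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    ndk = [[0] * n_topics for _ in docs]               # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic -> word counts
    nk = [0] * n_topics                                # words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove current assignment, then resample
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # p(topic t) ∝ (ndk + alpha) * (nkw + beta) / (nk + V*beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # per-document topic proportions (these are the "X% Topic 1" numbers)
    theta = [[(ndk[d][t] + alpha) / (len(docs[d]) + n_topics * alpha)
              for t in range(n_topics)]
             for d in range(len(docs))]
    return theta, nkw

# Content words from the five example texts below (stop words dropped by hand)
docs = [
    "cheese dosa chips breakfast".split(),
    "three bedroom apartment".split(),
    "snacks chips samosa cheese cashews".split(),
    "area two bedroom apartments".split(),
    "son cheese chips bedroom".split(),
]
theta, topic_words = lda_gibbs(docs, n_topics=2)
for d, mix in enumerate(theta):
    print(f"Text {'ABCDE'[d]}:", [round(p, 2) for p in mix])
```

After enough iterations the sampler tends to settle on one food-like topic and one housing-like topic, though which index each topic lands on varies from run to run.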

Simple example 

Text A: I like cheese dosa and chips for breakfast.

Text B: We live in a three-bedroom apartment.

Text C: For snacks I had chips, samosa and cheese cashews.

Text D: In my area there are many two bedroom apartments.

Text E: My son avoids having cheese & chips in our bedroom.

LDA Topic Modelling

Top topics (after several iterations of the algorithm)

Topic 1: cheese - 30%, chips - 30%, breakfast - 10%, snacks - 10%, etc. (We can see this topic revolves around food.)

Topic 2: bedroom - 30%, apartment - 20%, cheese - 5%, area - 5%, etc. (Similarly, this topic revolves around apartments or housing.)

Applying on our Text example

Text A: 100% Topic 1 (deals with food)

Text B: 100% Topic 2 (deals with apartment / house)

Text C: 100% Topic 1 (deals with food)

Text D: 100% Topic 2 (deals with apartment / house)

Text E: 80% Topic 1, 20% Topic 2 (deals with food and apartment / house)
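One way to see where a mixture like Text E's comes from is to assign each content word to its strongest topic and count the proportions. The snippet below does exactly that with topic-word weights loosely following the hypothetical numbers above; the weights, the `floor` smoothing value and the exact 80/20 vs. two-thirds/one-third split are all assumptions, since the real split depends on the fitted model.

```python
# Hypothetical topic-word weights, loosely following the example above
topic_food = {"cheese": 0.30, "chips": 0.30, "breakfast": 0.10, "snacks": 0.10}
topic_house = {"bedroom": 0.30, "apartment": 0.20, "cheese": 0.05, "area": 0.05}

def topic_mixture(words, topics, floor=1e-6):
    """Assign each word to its highest-weight topic, return topic proportions."""
    counts = [0] * len(topics)
    for w in words:
        weights = [t.get(w, floor) for t in topics]  # unseen words get a tiny floor
        counts[weights.index(max(weights))] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

# Content words from Text E: cheese and chips lean food, bedroom leans housing,
# so the mixture comes out roughly two-thirds Topic 1, one-third Topic 2
print(topic_mixture(["cheese", "chips", "bedroom"], [topic_food, topic_house]))
```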

Final Topic (of this article!)

At Changepond, we mostly use spaCy + Gensim for text processing on huge volumes of free text, with various data models including LDA.
