Topic Modeling using NMF and LDA

Topic modeling is a statistical technique for discovering hidden semantic patterns in an unstructured collection of documents. A large collection of documents is represented in terms of topics, and topics are represented in terms of words. This top-down approach helps expose hidden insights in the corpus. In this approach, every document is a distribution over topics and every topic is a distribution over words. The topics extracted by topic modeling are collections of related words. The intuition behind topic modeling rests on a mathematical framework based on the probability and statistics of words in each topic.

Of the existing algorithms for topic modeling, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) are used extensively by data modelers and are widely accepted in the scientific community for topic extraction. LDA is a probabilistic model, whereas NMF is a matrix factorization and multivariate analysis technique.

The basic idea in topic modeling is to vectorize the given corpus by term frequency (TF) or term frequency-inverse document frequency (TF-IDF), split the resulting document-term matrix into a document-topic matrix and a topic-word matrix, and then optimize those two matrices using either probabilistic or factorization techniques.
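To make the factorization idea concrete, here is a minimal sketch in pure Python. It builds a term-frequency document-term matrix V and approximates it as V ≈ W H, where W is document-topic and H is topic-word, using the classic multiplicative update rules for NMF. The toy corpus, the helper names (`doc_term_matrix`, `nmf`, `frobenius_error`) and the parameter values are assumptions made for illustration only; in practice one would more likely reach for a library such as scikit-learn, which provides `NMF` and `LatentDirichletAllocation`.

```python
import random
from collections import Counter


def doc_term_matrix(docs):
    """Return (matrix, vocab): one row per document, one column per term count."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([float(counts[term]) for term in vocab])
    return rows, vocab


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H, i.e. the reconstruction error."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    # Strictly positive random start, so the updates keep every entry non-negative.
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Toy corpus, invented for this example.
docs = [
    "the team won the match",
    "the match was a great team effort",
    "bake the bread in the oven",
    "the oven recipe for fresh bread",
]
v, vocab = doc_term_matrix(docs)
w, h = nmf(v, n_topics=2)

# Show the three highest-weighted terms of each topic row in H.
for k, row in enumerate(h):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print(f"topic {k}:", [term for _, term in top])
```

Each row of `w` can be read as a document's weights over topics, and each row of `h` as a topic's weights over words. Swapping the raw counts for TF-IDF weights only changes how `v` is built; the factorization step stays the same.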

The main challenge and source of ambiguity in topic modeling is validation. Extracting topics from a large collection of documents is itself unsupervised, i.e., the documents are not labelled prior to modeling. Validating the topics obtained from an unsupervised approach is therefore a tedious task. One has to come up with a validation technique suited to the application. With dimensionality reduction techniques and modern computational packages, one can visualize the similarity between the topics extracted from a corpus.
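One simple way to quantify the similarity between topics, before handing it to a plotting or dimensionality reduction tool, is the cosine similarity between topic-word vectors. The sketch below assumes a small, hypothetical topic-word matrix; the numbers are invented for illustration and cosine similarity is only one of several measures that could be used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_similarity_matrix(topic_word):
    """Pairwise cosine similarity between the rows of a topic-word matrix."""
    return [[cosine_similarity(a, b) for b in topic_word] for a in topic_word]


# Hypothetical topic-word weights: 3 topics over a 4-word vocabulary.
topics = [
    [0.6, 0.3, 0.1, 0.0],
    [0.5, 0.4, 0.1, 0.0],
    [0.0, 0.1, 0.3, 0.6],
]

sim = topic_similarity_matrix(topics)
for row in sim:
    print(" ".join(f"{s:.2f}" for s in row))
```

Topics whose rows are nearly parallel score close to 1 and are candidates for merging, while scores near 0 suggest the topics capture distinct vocabulary.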

There are numerous applications of topic modeling. Keyword search over a corpus can be enhanced considerably by integrating topic modeling into search engines, since topic models can pinpoint relevant words and documents by applying a threshold to the probability distributions. Topic modeling is widely used in research labs in the domains of healthcare, journalism, politics and law enforcement. Modeling topics helps users do targeted research, which leads to more efficient results.
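The threshold-based search idea can be sketched as follows. The code assumes a fitted model has already produced two tables: the probability of each word given a topic, and the probability of each topic given a document. All topic names, document ids, probabilities and the default threshold of 0.5 are hypothetical values chosen for illustration.

```python
# Hypothetical model output, for illustration only.
topic_word = {  # P(word | topic)
    "sports": {"match": 0.5, "team": 0.4, "recipe": 0.1},
    "cooking": {"recipe": 0.6, "oven": 0.3, "team": 0.1},
}
doc_topic = {  # P(topic | document)
    "doc_a": {"sports": 0.9, "cooking": 0.1},
    "doc_b": {"sports": 0.2, "cooking": 0.8},
    "doc_c": {"sports": 0.55, "cooking": 0.45},
}


def best_topic(keyword, topic_word):
    """Return the topic that gives the keyword the highest probability, or None."""
    scored = [(probs.get(keyword, 0.0), topic) for topic, probs in topic_word.items()]
    prob, topic = max(scored)
    return topic if prob > 0 else None


def search(keyword, topic_word, doc_topic, threshold=0.5):
    """Return (document, probability) pairs whose weight on the keyword's topic meets the threshold."""
    topic = best_topic(keyword, topic_word)
    if topic is None:
        return []
    hits = [(doc, probs[topic]) for doc, probs in doc_topic.items()
            if probs[topic] >= threshold]
    # Most relevant documents first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


print(search("recipe", topic_word, doc_topic))
print(search("team", topic_word, doc_topic))
```

Raising the threshold returns fewer, more focused documents; lowering it widens the result set at the cost of relevance.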

https://engineering.fissionlabs.com/topics/machine-learning/

