TOPIC MODELLING METHODS’ COMPARISON

#Topic_modelling is a popular #statistical #analytical tool for extracting latent variables from large datasets and is well suited to #text_processing (Blei 2012): it #summarizes the contents of a text into topics derived from it, clarifying what the text is mainly about in a much shorter form. There are different types of topic modelling algorithms, briefly explained later on; some can deal with correlation between topics, and some are even suitable for short texts such as social media data (Hong and Davison 2010).

Moreover, some algorithms can optimize the outcome regardless of the model type, leading to more meaningful results. Topic modelling emerged in the 1980s with the aim of briefly describing the elements of an extensive collection of data (Blei et al. 2003). Choosing the proper method is key to extracting meaningful results, and in this part four methods (#LSA, #NMF, #LDA, and #HDP) are briefly explained and compared to one another.
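Whichever of the four methods is chosen, they all start from the same kind of input: the raw text has to be turned into a document-term matrix before any model is fitted. The snippet below is only a minimal sketch of that shared first step; the toy corpus and the use of scikit-learn's vectorizers are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch (assumes scikit-learn is installed): turn a toy corpus
# into the document-term matrices the four models consume.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [                       # hypothetical toy corpus
    "topic models summarize large text collections",
    "social media posts are short noisy texts",
    "latent topics describe what a document is about",
]

# Raw term counts are the usual input for LDA and HDP ...
count_matrix = CountVectorizer(stop_words="english").fit_transform(docs)

# ... while TF-IDF weights are commonly fed to LSA and NMF.
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

print(count_matrix.shape, tfidf_matrix.shape)  # (3 documents, vocabulary size)
```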

[Table: summary of the four models (LSA, NMF, LDA, HDP), their characteristics, limitations, and benefits]

To summarize the topic modelling methods, the above table was made to capture the differences between the models, their characteristics, their limitations, and how they can be beneficial. As was discussed previously, LSA's goal is to uncover hidden meaning and to determine the similarity between terms and documents; it does so by creating a multidimensional space in which terms and documents are placed, and their similarity is determined by their distance from one another (George et al. 2017).
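As a rough illustration of that idea, LSA is usually computed as a truncated SVD of a TF-IDF matrix, which places documents in a low-dimensional latent space where distance reflects similarity. The sketch below assumes scikit-learn and an invented two-theme corpus; it is not the exact setup used in the cited survey.

```python
# Minimal LSA sketch: TF-IDF followed by truncated SVD (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [                                   # hypothetical corpus, two themes
    "cats and dogs are popular pets",
    "dogs chase cats in the garden",
    "stock markets rose after the report",
    "investors watched the stock report closely",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)              # documents x terms matrix

svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(X)            # documents placed in the latent space

# Documents about the same theme end up close together in the reduced space.
print(cosine_similarity(doc_vecs))
```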

LDA seeks to characterize the documents by finding a predetermined number of topics or themes in them, summarizing what they are mainly about and which documents contribute most to which topics. In terms of algorithms the two are very different, as presented before and shown in the table. Both LSA and LDA are widely used, depending on the goal and the type of corpus, but what is clear is that the corpus must be preprocessed before being fed into either model (Kalepalli et al. 2020). HDP is very much like LDA, but with some extra advantages, such as not being limited to a fixed number of topics and being able to learn and evolve as training takes place (Munir et al. 2019). The same goes for NMF, which is similar to LSA but simplified, without a separate topic-importance matrix.
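To make the contrast concrete, the following sketch fits LDA (on term counts) and NMF (on TF-IDF weights) over the same invented corpus and lists the top words per topic. The corpus, the topic count, and the top_words helper are illustrative assumptions; HDP is only pointed to in a comment, since scikit-learn has no implementation of it.

```python
# Hedged sketch comparing LDA and NMF on a toy corpus (assumes scikit-learn >= 1.0).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [                                        # hypothetical corpus
    "cats and dogs are popular pets",
    "dogs chase cats in the garden",
    "stock markets rose after the earnings report",
    "investors watched the stock report closely",
]
n_topics = 2                                    # LDA and NMF need this fixed up front

def top_words(model, feature_names, n=4):
    """Return the n highest-weighted words for each topic."""
    return [[feature_names[i] for i in comp.argsort()[::-1][:n]]
            for comp in model.components_]

# LDA works on raw term counts (a probabilistic, Dirichlet-based model).
cv = CountVectorizer(stop_words="english")
counts = cv.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
print("LDA topics:", top_words(lda, cv.get_feature_names_out()))

# NMF factorizes a TF-IDF matrix into two non-negative matrices,
# with no separate topic-importance (singular-value) matrix as in LSA.
tv = TfidfVectorizer(stop_words="english")
tfidf = tv.fit_transform(docs)
nmf = NMF(n_components=n_topics, random_state=0).fit(tfidf)
print("NMF topics:", top_words(nmf, tv.get_feature_names_out()))

# HDP (not in scikit-learn) infers the number of topics from the data, e.g.
# with gensim: HdpModel(corpus, id2word) -- mentioned here only as a pointer.
```

On a corpus this small the topics are trivial, but the same pattern scales to real collections, where the choice between a fixed topic count (LDA, NMF, LSA) and an inferred one (HDP) becomes the practical difference discussed above.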

References:

Blei. 2012. Probabilistic topic models: surveying a suite of algorithms that offer a solution to managing large document archives. Communications of the ACM 55(4), pp. 77-84.

Hong and Davison. 2010. Empirical study of topic modelling in Twitter. SOMA '10: Proceedings of the First Workshop on Social Media Analytics, pp. 80-88. https://doi.org/10.1145/1964858.1964870

Blei, Ng, and Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(4-5), pp. 993-1022.

George, Soundarabai, and Krishnamurthi. 2017. Impact of topic modelling methods and text classification techniques in text mining: a survey. International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835. https://www.iraj.in/journal/journal_file/journal_pdf/12-351-149622472172-77.pd

Kalepalli, Tasneem, Phani Teja, and Manne. 2020. Effective comparison of LDA with LSA for topic modelling. 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, pp. 1245-1250. doi: 10.1109/ICICCS48265.2020.9120888

Munir, Wasi, and Jami. 2019. Comparison of topic modelling approaches for Urdu text. Indian Journal of Science and Technology 12(45). https://www.researchgate.net/profile/Siraj-Munir-2/publication/338209568_A_Comparison_of_Topic_Modelling_Approaches_for_Urdu_Text/links/5e07274f4585159aa49f9323/A-Comparison-of-Topic-Modelling-Approaches-for-Urdu-Text.pdf
