A comparative study of methods for topic modelling in news articles

Rajan, S.D, Coombs, T, Jayabalan, M ORCID: 0000-0002-1599-965X and Ismail, N.A (2024) 'A comparative study of methods for topic modelling in news articles.' In: Bee Wah, Y, Al-Jumeily OBE, D and Berry, M.W, eds. Data science and emerging technologies. DaSET 2023. Lecture notes on data engineering and communications technologies (191). Springer, Singapore, pp. 269-277. ISBN 9789819702923

Official URL: https://doi.org/10.1007/978-981-97-0293-0_20

Abstract

The past few decades have seen an increase in textual data and in the influence of the news media. With the rise in available data, especially textual data from news media, it is imperative to categorise news topics quickly. The primary aim of this research is to suggest a method for automatically identifying news topics in articles. The dataset used was the News Category dataset published on Kaggle, which comprises 210,294 headlines and abstracts from HuffPost between 2012 and 2022. The dataset consists of a total of 42 categories and six columns. Traditional modelling techniques did not perform well in comparison with Top2Vec, NMF or BERTopic. This research confirms the efficacy of Top2Vec and BERTopic, followed by NMF, LDA and LSA, for analysing news category data from a human-interpretation perspective. Although BERTopic was able to deduce 1145 topics from the data, it could not discard uninformative words such as “to”, “say” and “for”, which add no value to the topic semantics. In summary, TF-IDF proved to be the best feature extraction technique and Top2Vec the best topic modelling technique.
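The abstract names TF-IDF as the best-performing feature extraction technique but does not describe how it was computed. As a rough illustration only, the following Python sketch shows one common TF-IDF formulation (term frequency multiplied by the logarithm of inverse document frequency) applied to a few invented headlines. The stop-word list, the tokeniser, the function names and the example headlines are all assumptions made for this sketch; they are not taken from the chapter or from the HuffPost dataset.

```python
import math
from collections import Counter

# Hypothetical stop-word list for illustration; not the list used in the study.
STOP_WORDS = {"to", "say", "for", "the", "a", "in", "of", "on"}


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lowercase, drop stop words."""
    tokens = (word.strip(".,!?\"'").lower() for word in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / tokens in doc) * ln(n_docs / docs containing term)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Invented example headlines, not drawn from the dataset.
headlines = [
    "Senate votes to pass the budget",
    "Team wins the final for the title",
    "Senate debates budget cuts",
]
weights = tf_idf(headlines)
```

In this toy corpus, terms that appear in only one headline (such as "votes") receive a higher weight than terms shared across headlines (such as "senate"), and stop words such as "to" and "for" are removed before weighting. Library implementations typically add smoothing and normalisation, so their values would differ from this sketch.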

Item Type: Book Chapter or Section
Divisions: Bath School of Design
Date Deposited: 22 Aug 2025 14:23
Last Modified: 22 Aug 2025 15:26
URN: https://researchspace.bathspa.ac.uk/id/eprint/17219