Improving Public Service Grievance Analysis: A Comparative Study of Topic Modelling Techniques with a Multi-Metric Data Cleaning Framework

Authors

  • Rahul Deka
  • Reetesh Kumar Srivastava

      DOI:

      https://doi.org/10.70715/jitcai.2025.v2.i3.031

      Keywords:

      E-governance, Natural Language Processing, Artificial Intelligence, Machine learning, Topic Modelling, LDA, BERTopic, NMF, LSA

      Abstract

      The SewaSetu portal, a single-window system for government services in Assam, India, processes thousands of applications and hundreds of grievances daily. Many government grievance portals likewise receive a substantial volume of public complaints, each containing valuable information that is often embedded in unstructured text. Extracting patterns from such data can help public agencies respond more efficiently and allocate resources more strategically, but manual classification of these grievances is a time-consuming bottleneck. This paper describes a methodology for reliable topic discovery in this noisy domain. Unlike standard studies that rely solely on stopword removal, it introduces a “Multi-Metric Gibberish Filtering Pipeline” intended to ensure that the resulting dataset is free of incoherent noise. We then benchmark the coherence of four primary topic modelling algorithms, Latent Semantic Analysis (LSA), Non-Negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), and BERTopic, across topic counts from K=5 to K=50 on the cleaned grievance data. Measured by the Cv coherence score, NMF considerably outperformed the alternatives, reaching the highest overall score of 0.7898 at K=35. This work sets a benchmark for preparing noisy government feedback data and identifies NMF with a 35-topic configuration as the most effective and coherent model for extracting interpretable themes from public service delivery grievances.
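      The abstract does not list the individual metrics used in the gibberish filter, so the following is only a minimal sketch of what a multi-metric filter of this kind could look like. The four metrics (character entropy, vowel ratio, longest repeated-character run, alphabetic ratio), all thresholds, and the "at least two failed checks" voting rule are assumptions chosen for illustration and are not taken from the paper.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def char_entropy(text):
    """Shannon entropy (bits per character) over the letters in text."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    total = len(letters)
    counts = Counter(letters)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def vowel_ratio(text):
    """Fraction of letters that are vowels."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)


def max_char_run(text):
    """Length of the longest run of one repeated character."""
    longest = run = 0
    prev = None
    for c in text.lower():
        run = run + 1 if c == prev else 1
        prev = c
        longest = max(longest, run)
    return longest


def alpha_ratio(text):
    """Fraction of non-whitespace characters that are letters."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    return sum(c.isalpha() for c in visible) / len(visible)


def is_gibberish(text, min_entropy=2.5, vowel_range=(0.2, 0.6),
                 max_run=3, min_alpha=0.7, min_failures=2):
    """Flag text when at least `min_failures` metrics fail.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    failures = [
        char_entropy(text) < min_entropy,
        not (vowel_range[0] <= vowel_ratio(text) <= vowel_range[1]),
        max_char_run(text) > max_run,
        alpha_ratio(text) < min_alpha,
    ]
    return sum(failures) >= min_failures


def clean(grievances):
    """Keep only the grievances that are not flagged as gibberish."""
    return [g for g in grievances if not is_gibberish(g)]
```

      Requiring more than one failed metric before discarding a record is one way to reduce false positives on short but legitimate complaints; in practice the thresholds would need tuning against labelled samples from the actual grievance data.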

       



      Published

      12/25/2025

      Data Availability Statement

      Our analysis was conducted using an in-house dataset of government public grievance data. This data is subject to official restrictions, and as such, it cannot be made available to external readers.

      How to Cite

      [1]
      R. Deka and R. K. Srivastava, “Improving Public Service Grievance Analysis: A Comparative Study of Topic Modelling Techniques with a Multi-Metric Data Cleaning Framework”, Journal of IT, Cybersecurity, & AI, vol. 2, no. 3, pp. 41–53, Dec. 2025, doi: 10.70715/jitcai.2025.v2.i3.031.
