Comparative Study of K-Mean, K-Medoid and Hierarchical Clustering using Data of Tuberculosis Indicators in Indonesia

Nanda Rizqia Pradana Ratnasari

doi:10.54250/ijls.v5i02.181

Nanda Rizqia Pradana Ratnasari Institut Bio Scientia Internasional Indonesia, Jakarta, Indonesia

Keywords: k-mean, k-medoid, hierarchical clustering

Abstract

Cluster analysis aims to partition data into groups of similar observations. The most widely applied clustering algorithms are k-means, k-medoids and hierarchical clustering. This study therefore compared these three methods using healthcare data on attributes related to tuberculosis (TB). The best method was determined from the accuracy of each algorithm and the number of clusters. The analysis comprised four main steps: feature selection, application of the clustering algorithm, cluster validation and interpretation. Each algorithm was run with 2, 3 and 4 clusters. K-medoids achieved the highest accuracy of the three algorithms on both the training and the testing data. It outperformed k-means and hierarchical clustering because it is more robust to the noise and outliers found in the datasets. In terms of the number of clusters, the two-cluster model was better than the three-cluster and four-cluster models because it separated the groups most clearly. This result was consistent across k-means, k-medoids and hierarchical clustering, with the smallest sum of squares value, 24.7%, obtained for k-means. For the k-medoids models, the smallest diameter and the smallest average dissimilarity were found in group 1. This indicates that, in all algorithms, group 1 was more compact and more homogeneous than the other groups.
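
The robustness argument in the abstract can be illustrated with a minimal sketch. The code below is not the study's implementation: the data points are made up, the farthest-first initialization and the simple alternating update are assumptions chosen for brevity, and the helper names (k_medoids, cluster_summary) are hypothetical. It shows that a medoid stays on an actual observation while the cluster mean is pulled toward an outlier, and it computes a diameter (largest pairwise distance within a cluster) and an average dissimilarity (here taken as the mean distance of members to their medoid) for each group.

```python
import math
from itertools import combinations

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda j: dist(p, medoids[j]))
            for p in points]

def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids with farthest-first initialization (assumption)."""
    medoids = [points[0]]
    while len(medoids) < k:
        # Next medoid: the point farthest from the medoids chosen so far.
        medoids.append(max(points, key=lambda p: min(dist(p, m) for m in medoids)))
    for _ in range(max_iter):
        labels = assign(points, medoids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:          # keep the old medoid if a cluster empties
                updated.append(medoids[j])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members, key=lambda c: sum(dist(c, q) for q in members)))
        if updated == medoids:       # converged
            break
        medoids = updated
    return medoids, assign(points, medoids)

def cluster_summary(points, labels, medoids, j):
    """Diameter and average dissimilarity to the medoid for cluster j."""
    members = [p for p, lab in zip(points, labels) if lab == j]
    diameter = max((dist(a, b) for a, b in combinations(members, 2)), default=0.0)
    avg_diss = sum(dist(p, medoids[j]) for p in members) / len(members)
    return diameter, avg_diss

# Made-up 2-D data: two tight groups plus one outlier at (9, 5).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (9.0, 5.0)]

medoids, labels = k_medoids(points, k=2)

# Mean of the second cluster, for comparison with its medoid.
second = [p for p, lab in zip(points, labels) if lab == 1]
mean_second = tuple(sum(c) / len(second) for c in zip(*second))

print("medoids:", medoids)
print("mean of second cluster:", mean_second)
for j in range(2):
    print("cluster", j, "diameter, avg dissimilarity:",
          cluster_summary(points, labels, medoids, j))
```

In this toy example the outlier moves the mean of the second cluster away from the dense group, while the medoid remains one of the observations inside it. The first cluster also has the smaller diameter and average dissimilarity, which is the sense in which a group is described as more compact.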

Author Biography

Department of Bioinformatics, Institut Bio Scientia Internasional Indonesia
