Scalable Distributed K-Means Clustering Using the Firefly Algorithm with Tree- and Hash-Based Optimization for Big Data
Received: 4 February 2026 | Revised: 5 April 2026 | Accepted: 18 April 2026 | Online: 6 June 2026
Corresponding author: Shivlingappa Battur
Abstract
The rapid growth of digital data has exposed significant limitations in traditional clustering methods, particularly with respect to scalability, computational overhead, and clustering quality. To address these challenges, this paper proposes Firefly–K-Means with Tree- and Hash-based optimization (FKTH), a scalable distributed clustering framework that integrates an adaptive Firefly Algorithm (FA) with K-Means, enhanced through KD-Tree–based distance computation, hash map–based constant-time centroid updates, and Hadoop MapReduce–based parallel processing. The adaptive Firefly component dynamically adjusts attraction, absorption, and randomness parameters during optimization to balance exploration and exploitation and avoid premature convergence. The proposed framework is evaluated on large-scale real-world datasets ranging from 100K to over 1M records across varying cluster node configurations. Experimental results demonstrate that FKTH achieves superior scalability and consistently outperforms existing metaheuristic-based clustering methods in terms of execution time, Silhouette Score, Davies–Bouldin Index (DBI), and F1-score, making it well suited for large-scale distributed data analytics.
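As an illustration only, the KD-Tree lookup and the hash map centroid update named above can be sketched for a single K-Means iteration. This is a minimal sketch under stated assumptions, not the FKTH implementation: plain Python functions stand in for the Hadoop MapReduce mapper and reducer, the adaptive Firefly step is omitted, and every name (`build_kdtree`, `nearest`, `map_partition`, `reduce_partials`) is hypothetical.

```python
# Illustrative sketch only; all function names are hypothetical.

def build_kdtree(items, depth=0):
    """Build a KD-tree over (centroid_id, point) pairs."""
    if not items:
        return None
    axis = depth % len(items[0][1])
    items = sorted(items, key=lambda it: it[1][axis])
    mid = len(items) // 2
    return (items[mid], axis,
            build_kdtree(items[:mid], depth + 1),
            build_kdtree(items[mid + 1:], depth + 1))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, point, best=None):
    """Return (squared_distance, centroid_id) for the closest centroid."""
    if node is None:
        return best
    (cid, cpoint), axis, left, right = node
    cand = (sq_dist(point, cpoint), cid)
    if best is None or cand < best:
        best = cand
    diff = point[axis] - cpoint[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, point, best)
    # Visit the far branch only if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, point, best)
    return best

def map_partition(points, tree):
    """Mapper stand-in: accumulate per-centroid sums and counts in a hash map."""
    acc = {}  # centroid_id -> [coordinate sums, count]
    for p in points:
        _, cid = nearest(tree, p)
        if cid not in acc:
            acc[cid] = [[0.0] * len(p), 0]
        for i, x in enumerate(p):
            acc[cid][0][i] += x
        acc[cid][1] += 1
    return acc

def reduce_partials(partials, centroids):
    """Reducer stand-in: merge partial hash maps and recompute centroids."""
    total = {}
    for acc in partials:
        for cid, (sums, n) in acc.items():
            if cid not in total:
                total[cid] = [list(sums), n]
            else:
                total[cid][0] = [a + b for a, b in zip(total[cid][0], sums)]
                total[cid][1] += n
    new_centroids = dict(centroids)  # clusters with no points keep their centroid
    for cid, (sums, n) in total.items():
        new_centroids[cid] = tuple(s / n for s in sums)
    return new_centroids

# Toy data: two centroids, two data partitions.
centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}
tree = build_kdtree(list(centroids.items()))
partitions = [[(0, 1), (9, 10)], [(1, 0), (10, 11), (11, 9)]]
updated = reduce_partials([map_partition(p, tree) for p in partitions], centroids)
```

In this sketch each point costs one tree lookup rather than a scan over all centroids, and each accumulator update is a dictionary access keyed by centroid ID, which is the role the abstract assigns to the tree and hash structures.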
Keywords:
distributed clustering, adaptive Firefly Algorithm, K-Means, Hadoop MapReduce, KD-Tree, large-scale data analytics, metaheuristic optimization
License
Copyright (c) 2026 Shivlingappa Battur, Shashikumar Totad

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
