Exploratory Data Analysis and Water Potability Classification using Supervised Machine Learning Algorithms
Received: 11 September 2024 | Revised: 10 December 2024, 23 December 2024, and 07 January 2025 | Accepted: 11 January 2025 | Online: 3 April 2025
Corresponding author: Anupkumar Bongale
Abstract
This study assesses water potability using supervised machine-learning techniques. The task is to predict potability accurately from chemical and physical parameters, a problem that matters for public health and environmental sustainability. Exploratory Data Analysis (EDA) revealed feature distributions and correlations that guided the preprocessing steps and model selection. The Synthetic Minority Oversampling Technique (SMOTE) was applied to mitigate class imbalance during model training. Three classification algorithms, namely Logistic Regression (LR), K-Nearest Neighbors (KNN), and Random Forest (RF), were evaluated. RF performed best after Optuna hyperparameter tuning, achieving an accuracy of 68%. Building on the performance of RF and KNN, a weighted voting-based ensemble achieved an accuracy of 71%. This study emphasizes the importance of leveraging machine learning to support water quality assessment, offering reliable tools for decision-making in public health and environmental management.
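The weighted voting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state whether the ensemble votes on class labels or on predicted probabilities, nor which weights were used, so the version below assumes soft voting over the predicted probability of the "potable" class. The function name `weighted_soft_vote`, the weights, and the 0.5 threshold are hypothetical choices for illustration, not the authors' settings.

```python
def weighted_soft_vote(prob_rf, prob_knn, w_rf=0.7, w_knn=0.3, threshold=0.5):
    """Combine two models' potability probabilities into class labels.

    prob_rf, prob_knn: per-sample probabilities of the "potable" class,
    e.g. from a Random Forest and a KNN classifier (assumed inputs).
    w_rf, w_knn: illustrative weights, not values reported in the paper.
    Returns a list of 0/1 labels (1 = potable).
    """
    if len(prob_rf) != len(prob_knn):
        raise ValueError("both models must score the same samples")
    total = w_rf + w_knn
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    labels = []
    for p_rf, p_knn in zip(prob_rf, prob_knn):
        # Weighted average of the two probabilities, normalized by the weight sum.
        p = (w_rf * p_rf + w_knn * p_knn) / total
        labels.append(1 if p >= threshold else 0)
    return labels


if __name__ == "__main__":
    rf_scores = [0.8, 0.3]   # made-up probabilities for two water samples
    knn_scores = [0.1, 0.9]
    # Favoring RF follows the RF scores; favoring KNN flips both decisions.
    print(weighted_soft_vote(rf_scores, knn_scores, w_rf=0.7, w_knn=0.3))
    print(weighted_soft_vote(rf_scores, knn_scores, w_rf=0.3, w_knn=0.7))
```

In practice the weights would usually be chosen from validation performance, for example in proportion to each model's accuracy, which is one plausible reading of "based on the performance of RF and KNN".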
Keywords:
SMOTE, machine learning, water quality, water potability, random forest, k-nearest neighbors, logistic regression
License
Copyright (c) 2025 Priya B. Kamath, Geetanjali Sharma, Anupkumar Bongale, Deepak Dharrao, Modisane Seitshiro

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.