Exploratory Data Analysis and Water Potability Classification using Supervised Machine Learning Algorithms

Authors

  • Priya Kamath B., Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal-576104, Udupi, Karnataka, India, https://orcid.org/0000-0002-5471-8822
  • Geetanjali Sharma, Department of Computer Science and Engineering, Pimpri Chinchwad College of Engineering, Pune, India
  • Anupkumar Bongale, Department of Artificial Intelligence and Machine Learning, Symbiosis Institute of Technology, Pune Campus, Symbiosis International (Deemed University), Pune, India, https://orcid.org/0000-0002-5897-0283
  • Deepak Dharrao, Department of Computer Science and Engineering, Symbiosis Institute of Technology, Pune Campus, Symbiosis International (Deemed University), Pune, India, https://orcid.org/0000-0002-2540-6942
  • Modisane Seitshiro, Centre for Business Mathematics and Informatics, North-West University, Potchefstroom, South Africa | National Institute for Theoretical and Computational Sciences (NITheCS), South Africa, https://orcid.org/0000-0001-9557-3714
Volume: 15 | Issue: 2 | Pages: 20898-20903 | April 2025 | https://doi.org/10.48084/etasr.8904

Abstract

This study investigates the assessment of water potability using supervised machine learning techniques. The task is to predict accurately whether water is potable from its chemical and physical parameters, a question that matters for public health and environmental sustainability. Exploratory Data Analysis (EDA) characterized the feature distributions and correlations, and these findings guided the preprocessing steps and model selection. The Synthetic Minority Oversampling Technique (SMOTE) was applied to mitigate class imbalance in the training data. Three classification algorithms were evaluated: Logistic Regression (LR), K-Nearest Neighbors (KNN), and Random Forest (RF). RF performed best, reaching an accuracy of 68% after hyperparameter tuning with Optuna. A weighted voting ensemble of RF and KNN, chosen on the basis of their performance, achieved an accuracy of 71%. The study emphasizes the value of machine learning in supporting water quality assessment and in providing decision-making tools for public health and environmental management.
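
The two less familiar steps named in the abstract, SMOTE oversampling and weighted voting, can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, the feature values, the model probabilities, and the voting weights below are all hypothetical placeholders, and the abstract does not state whether the ensemble voted on probabilities or on hard labels (soft voting is assumed here).

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the chosen point (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # random position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def weighted_vote(probabilities, weights):
    """Weighted soft vote: weighted average of each model's P(potable)."""
    total = sum(weights)
    score = sum(w * p for w, p in zip(weights, probabilities)) / total
    return (1 if score >= 0.5 else 0), score


# Hypothetical minority-class samples with two made-up features.
minority = [[7.0, 200.0], [7.2, 210.0], [6.8, 190.0], [7.1, 205.0]]
new_points = smote_oversample(minority, n_new=5)

# Hypothetical P(potable) from an RF and a KNN model; the weights are placeholders.
label, score = weighted_vote([0.62, 0.35], weights=[0.6, 0.4])
```

In practice these steps would more likely rely on library implementations such as `SMOTE` from imbalanced-learn and `VotingClassifier` from scikit-learn; the sketch only shows the underlying arithmetic. Oversampling is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.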

Keywords:

SMOTE, machine learning, water quality, water potability, random forest, k-nearest neighbors, logistic regression

How to Cite

Kamath B., P., Sharma, G., Bongale, A., Dharrao, D. and Seitshiro, M. 2025. Exploratory Data Analysis and Water Potability Classification using Supervised Machine Learning Algorithms. Engineering, Technology & Applied Science Research. 15, 2 (Apr. 2025), 20898–20903. DOI:https://doi.org/10.48084/etasr.8904.