Analyzing the Impact of Data Resampling on Stroke Prediction using Machine Learning
Received: 27 November 2024 | Revised: 8 January 2025 | Accepted: 12 January 2025
Corresponding author: Majid Rahardi
Abstract
This study evaluates how data resampling affects stroke prediction with machine learning algorithms. Three versions of the dataset are compared: the original (unresampled) data, an under-sampled version, and an over-sampled version. The classifiers used are Random Forest (RF), Decision Tree (DT), Gradient Boosting (GB), and K-Nearest Neighbor (KNN). Each model was trained and evaluated using accuracy, precision, recall, F1-score, and AUC. RF trained on the over-sampled dataset achieved the best performance, with an accuracy of 94.31%, a precision of 93.52%, a recall of 95.27%, an F1-score of 94.39%, and an AUC of 98.46% on the test set. These findings highlight the effectiveness of over-sampling for handling imbalanced datasets and show that RF outperformed the other classifiers evaluated for stroke prediction.
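The resampling and scoring steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. It assumes plain random over-sampling (duplicating minority rows) and random under-sampling (dropping majority rows); the abstract does not state which specific resampling methods the authors used, so the function names, the toy data, and the seed below are all hypothetical. AUC is left out because it needs predicted scores rather than hard labels.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class is as large as the largest one (assumed technique)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class
    (assumed technique)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 from hard binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy data: 2 positive (stroke) cases among 10 records.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
toy_scores = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```

In a sketch like this, resampling would normally be applied to the training split only, so that the test set keeps the original class balance; whether the study followed that protocol is not stated in the abstract.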
Keywords:
machine learning, stroke classification, resampling
License
Copyright (c) 2025 Majid Rahardi, Afrig Aminuddin, Ferian Fauzi Abdulloh, Bima Pramudya Asaddulloh, Hesmeralda Rojas Enriquez, Kusnawi Kusnawi

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.