Optimized Machine Learning for Cancer Classification via Three-Stage Gene Selection
Received: 1 November 2024 | Revised: 8 December 2024 and 20 December 2024 | Accepted: 22 December 2024 | Online: 3 April 2025
Corresponding author: Sara Haddou Bouazza
Abstract
Gene selection from high-dimensional microarray data faces challenges such as overfitting, computational inefficiency, and feature redundancy. Despite significant advances, existing methods often scale poorly and are hard to interpret, which limits their use in precision oncology. This study introduces a novel Three-Stage Gene Selection (3SGS) strategy that addresses these issues by combining filter-based methods (signal-to-noise ratio, correlation coefficient, ReliefF) with accuracy-driven refinement and redundancy reduction. The 3SGS approach identifies minimal but highly predictive gene subsets, achieving 100% accuracy for leukemia and 98% for prostate cancer using only 3-4 genes. Compared to traditional methods, 3SGS improves efficiency and interpretability, offering a scalable and robust solution for cancer classification.
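The abstract names three ingredients (a filter ranking, accuracy-driven refinement, and redundancy reduction) without giving their details. The following is a minimal sketch of how such a pipeline could be chained, not the authors' 3SGS implementation. It assumes a signal-to-noise filter only, a nearest-centroid classifier scored by leave-one-out accuracy as a stand-in for the refinement step, and a Pearson-correlation cutoff for redundancy. The stage order, the function names, the parameters `top_k`, `corr_max`, and `max_genes`, and the synthetic data are all hypothetical.

```python
import numpy as np


def snr_scores(X, y):
    """Signal-to-noise ratio per gene: |mean0 - mean1| / (std0 + std1)."""
    a, b = X[y == 0], X[y == 1]
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + 1e-12)


def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed stand-in)."""
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i
        Xtr, ytr = X[mask], y[mask]
        c0 = Xtr[ytr == 0].mean(axis=0)
        c1 = Xtr[ytr == 1].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        correct += int(pred == y[i])
    return correct / n


def three_stage_select(X, y, top_k=50, corr_max=0.9, max_genes=4):
    # Stage 1 (filter): keep the top_k genes ranked by SNR.
    ranked = np.argsort(snr_scores(X, y))[::-1][:top_k]

    # Stage 2 (accuracy-driven refinement): greedy forward selection that
    # adds a gene only while leave-one-out accuracy strictly improves.
    selected, best = [], 0.0
    for _ in range(max_genes):
        gains = [(loo_accuracy(X[:, selected + [int(g)]], y), int(g))
                 for g in ranked if int(g) not in selected]
        if not gains:
            break
        acc, g = max(gains, key=lambda t: t[0])
        if acc <= best:
            break
        selected.append(g)
        best = acc

    # Stage 3 (redundancy reduction): drop a gene if it is highly
    # correlated with a gene that was kept before it.
    kept = []
    for g in selected:
        if all(abs(np.corrcoef(X[:, g], X[:, h])[0, 1]) < corr_max for h in kept):
            kept.append(g)
    return kept, loo_accuracy(X[:, kept], y)


# Synthetic demo (not real microarray data): 40 samples, 200 genes,
# with gene 0 made strongly informative for class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, 0] += 5.0
genes, acc = three_stage_select(X, y)
print("selected genes:", genes, "LOO accuracy:", acc)
```

Because this sketch ranks and selects genes on the full dataset before scoring them, the reported accuracy is optimistic. An unbiased estimate would repeat the selection inside each cross-validation fold.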
Keywords:
artificial intelligence, data mining, machine learning, pattern recognition, computer science
License
Copyright (c) 2025 SARA HADDOU BOUAZZA

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.