Challenges in the classification of medical pathologies in older adults: a review of the impact of imbalanced datasets in Machine Learning

Authors

  • Amanda Milena Santacruz Madroñero, Universidad Pontificia Bolivariana
  • Leonardo Betancur Agudelo, Universidad Pontificia Bolivariana

DOI:

https://doi.org/10.26507/paper.4709

Keywords:

Imbalanced datasets, Machine learning, Classification models, Older adults' health, AI-assisted medical diagnosis

Abstract

The development of Machine Learning (ML) models for the diagnosis and monitoring of medical pathologies in older adults faces multiple challenges, one of the most critical being imbalanced datasets. When data on less frequent diseases are scarce, algorithms struggle to learn representative patterns, which yields models biased toward the majority class and a high rate of false negatives in critical conditions.

The objective of this study is to review the impact of class imbalance on the classification of medical pathologies in older adults and critically analyze current strategies to mitigate this issue. Traditional approaches such as oversampling (SMOTE, ADASYN), undersampling, and hybrid methods (SMOTE-ENN, SMOTE-Tomek Links) are compared, along with advanced Deep Learning techniques and synthetic data generation. Additionally, the study identifies more suitable evaluation metrics for assessing performance in imbalanced scenarios, such as AUC-PR, F1-score, and Balanced Accuracy.
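To illustrate why plain accuracy can mislead in imbalanced scenarios, the following minimal sketch computes accuracy, Balanced Accuracy, and F1-score from confusion-matrix counts. The function name `imbalance_metrics` and the patient counts are hypothetical and chosen only for illustration; they do not come from the reviewed studies. AUC-PR is omitted because it needs ranked prediction scores rather than counts.

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Return accuracy, balanced accuracy and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Recall (sensitivity): share of true positives that were detected.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of true negatives that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (recall + specificity) / 2
    # F1 written directly in terms of counts: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }


# Hypothetical cohort: 1000 patients, 50 with the rare condition.
# The classifier detects only 10 of the 50 and raises 5 false alarms.
metrics = imbalance_metrics(tp=10, fn=40, fp=5, tn=945)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case accuracy is above 0.95 even though 40 of the 50 positive cases are missed, whereas Balanced Accuracy and F1-score both expose the weak minority-class performance.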

The methodology follows a systematic literature review based on the PRISMA model, compiling relevant studies from high-impact scientific databases (PubMed, Scopus, IEEE Xplore). Inclusion and exclusion criteria were established to select research that specifically addresses class imbalance in medical applications for older adults.

Preliminary findings suggest that, while oversampling improves the representation of the minority class, it introduces the risk of overfitting and noise in the data. Undersampling, on the other hand, can compromise the information of the majority class, reducing the model’s overall predictive capacity. Hybrid techniques, according to the reviewed literature, appear to be a viable and superior alternative to the previous methods, as they combine the best aspects of both approaches and optimize class distribution within the dataset. However, they require precise calibration and careful evaluation in each specific application. Deep Learning models, such as Generative Adversarial Networks (GANs) and autoencoders, have been explored for synthetic data generation, showing potential in creating more realistic samples but with limitations in interpretability and high computational costs.
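The interpolation idea behind SMOTE-style oversampling can be sketched as follows. This is a simplified illustration, not the reference algorithm or any library's implementation: it picks base samples at random rather than iterating over every minority sample, and the function name `smote_sketch`, its parameters, and the toy points are assumptions made for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point somewhere on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (hypothetical values).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points), "synthetic samples generated")
```

Because every synthetic point lies on a segment between two existing minority samples, the method adds no information beyond what those samples already contain. This is consistent with the overfitting and noise risks noted above, especially when minority samples sit close to majority-class regions.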

Given the above, although the hybrid approach presents relevant advantages, a definitive solution that completely eliminates the adverse effects of data imbalance in AI models applied to healthcare has yet to be identified. The issue remains open and represents an active area of interest within the scientific community, particularly at the doctoral research level, where new methodologies continue to be explored to enhance the fairness and reliability of predictive models in clinical settings. Therefore, it is crucial to continue advancing the development of more robust techniques that improve the detection of underrepresented pathologies in older adults, ensuring more accurate and effective diagnoses and medical care.

References

Tarekegn, A. N., Michalak, K., Costa, G., Ricceri, F., & Giacobini, M. (2020). Predictive Modeling for Frailty Conditions in Elderly People. JMIR Medical Informatics, 8(6), e16678. https://doi.org/10.2196/16678

Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220–239. https://doi.org/10.1016/j.eswa.2016.12.035

Charte, F., Rivera, A. J., del Jesus, M. J., & Herrera, F. (2019). Dealing with difficult minority labels in imbalanced multilabel data sets. Neurocomputing, 326–340. https://doi.org/10.1016/j.neucom.2016.08.158

Xiao, C., Choi, E., & Sun, J. (2018). Opportunities and challenges in developing deep learning models using electronic health records data: A systematic review. Journal of the American Medical Informatics Association, 25(10), 1419–1428.

Yang, G., Wang, G., Wan, L., Wang, X., & He, Y. (2025). Utilizing SMOTE-TomekLink and machine learning to construct a predictive model for elderly medical and daily care services demand. Scientific Reports, 15, 8446. https://doi.org/10.1038/s41598-025-92722-1

Lee, J., Lee, S., Street, W. N., & Polgreen, L. A. (2022). Machine learning approaches to predict the 1-year-after-initial-AMI survival of elderly patients. BMC Medical Informatics and Decision Making, 22, 115. https://doi.org/10.1186/s12911-022-01854-1

Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., ... & Moher, D. (2021). Declaración PRISMA 2020: una guía actualizada para la publicación de revisiones sistemáticas. Revista Española de Cardiología, 74(9), 790–799. https://doi.org/10.1016/j.recesp.2021.06.016

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953

He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322–1328). IEEE. https://doi.org/10.1109/IJCNN.2008.4633969

Douzas, G., & Bação, F. (2018). Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Systems with Applications, 91, 464–471. https://doi.org/10.1016/j.eswa.2017.09.030

Saito, T., & Rehmsmeier, M. (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432

How to Cite

[1]
A. M. Santacruz Madroñero and L. Betancur Agudelo, “Challenges in the classification of medical pathologies in older adults: a review of the impact of imbalanced datasets in Machine Learning”, EIEI ACOFI, Sep. 2025.

Published

2025-09-08