Machine learning (ML) approaches have advanced glaucoma detection and report high diagnostic accuracy.1,2 However, many of these models lack external validation, which makes their results harder to interpret and limits their clinical applicability.3,4
In a prospective study, Li et al. developed ML models using retinal nerve fiber layer (RNFL) thickness obtained from a compensation model and optical coherence tomography (OCT) data, in a dataset of Asian and Caucasian glaucoma patients and controls. The models performed better in the Asian dataset (AUC = 0.96) than in the Caucasian dataset. Notably, the model trained with compensated RNFL thickness (AUC = 0.93) outperformed the models trained with original (AUC = 0.83) and measured data (AUC = 0.82).
This study is important because it highlights a concrete challenge in glaucoma ML: models trained on one dataset perform worse on others. Nevertheless, the methods have limitations. The European validation set is relatively small, and the demographic differences between the European and Asian datasets compound the problem. It is therefore necessary to investigate whether the differences in structural parameters between datasets arise from variations in data distribution or from intrinsic differences among ethnic groups.
More broadly, glaucoma ML research faces a pressing problem that needs to be resolved: model performance declines on external datasets. Data-sharing and privacy-preserving frameworks, such as federated learning, should be developed and implemented. Future research should also prioritize compensation models and larger, more diverse datasets. Building multicenter public datasets and databases of structural parameters, such as RNFL thickness, across ethnic populations is essential to improving the generalizability and reliability of ML-based glaucoma detection.