We are witnessing an explosion of research exploring artificial intelligence (AI) applications in clinical medicine in general and in ophthalmology in particular. It is rare for a technology to emerge that has both the potential to disrupt clinical practice and such a wide range of applications across virtually all clinical subspecialties. Artificial intelligence is such a technology. As the number of publications has grown, their quality has come to vary considerably, which can limit the validity, generalizability and comparability of the research. To address this issue, there has been a recent focus on developing standards and guidelines for reporting artificial intelligence in medical research.1-4
One concern regarding AI studies is that it is difficult to assess the quality and validity of the label, or ground truth, used to train the models, which can often be subjective and variable. Jammal et al. build on their previous work, which avoided using the subjective and often variable human assessment of fundus photographs as the ground truth for glaucoma. Instead, they applied a 'machine to machine (M2M)' approach that used objective optical coherence tomography (OCT) measurements of the retinal nerve fiber layer (RNFL) as the reference label for training deep learning algorithms to detect glaucomatous damage from fundus photographs.5 Their M2M approach was able to predict RNFL thickness from fundus photographs with high accuracy and thus avoided the need for subjective ground truth labelling.
In the current study, the authors extend this strategy to compare the ability of human graders and their M2M deep learning algorithm to detect visual field damage. Two human experts provided estimates of glaucoma likelihood (on a scale from 0 to 10), while the M2M AI method provided estimates of RNFL thickness from fundus photographs. These quantitative metrics were then used to identify perimetric glaucoma (determined by expert graders using visual field data) and were compared with visual field mean deviation. A strength of this approach is that it gives the authors an additional, visual field-based reference standard against which to validate their AI approach. The study provides further confirmation that deep learning models trained on objective labels (e.g., RNFL thickness) can be effective in glaucoma detection. Specifically, it found that, compared with the subjective grades, the output of the objective M2M algorithm was more strongly correlated with visual field metrics, particularly in the high-specificity range relevant for screening. Another strength of this study is its thorough reporting of model accuracy. The authors reported not only the area under the receiver operating characteristic curve (AUC), but also the partial AUC in the high-specificity range (85-100%) relevant for screening, as well as precision-recall curves, which can help avoid overly optimistic estimates of model performance when the data are imbalanced (e.g., unequal numbers of cases with and without glaucomatous optic neuropathy [GON]).
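For readers less familiar with these accuracy metrics, the following is a minimal illustrative sketch of how a full AUC, a partial AUC restricted to the 85-100% specificity range (i.e., a false positive rate of 0 to 0.15), and an average precision (a single-number summary of the precision-recall curve) might be computed. It is not the authors' code: the function names, the toy labels and scores, and the choice to report the raw (unnormalized) partial area are all assumptions made for illustration only.

```python
# Illustrative sketch only; toy data and function names are hypothetical,
# not taken from the study under discussion.

def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate),
    one per distinct score threshold, starting at (0, 0)."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process tied scores together so each threshold yields one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def partial_auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr].
    With max_fpr=1.0 this is the ordinary AUC."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the curve at the FPR cut-off.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive case
    (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy example: 4 cases with disease (1) and 6 without (0).
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

points = roc_points(labels, scores)
full_auc = partial_auc(points, max_fpr=1.0)
pauc_85_100 = partial_auc(points, max_fpr=0.15)  # specificity 85-100%
ap = average_precision(labels, scores)
```

In this sketch the partial AUC is bounded above by 0.15 (the width of the specificity window), so dividing by 0.15 gives a value between 0 and 1 that is easier to compare across studies. Because precision depends on disease prevalence in the test set, the average precision shifts with class balance in a way that the AUC does not, which is why the two are usefully reported together.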
Other concerns regarding the reporting of AI studies include providing information on how well the model will perform in other populations and on what the model is using to make its predictions (opening the black box). Estimating model generalizability on external test sets collected from diverse populations is becoming a standard for reporting AI results, but this was not done in this publication. Although the authors did not include visualization strategies such as class activation maps to provide insight into model predictions, these analyses were included in their original M2M publication. In future work, it will be important to understand how disease stage and severity (e.g., pre-perimetric vs. perimetric) affect M2M-based predictions, how well the approach performs on external datasets, and whether the objective quantitative metric provided by the M2M approach can be used to detect glaucomatous progression.