The authors compared test results from an FDA-registered head-mounted smartphone perimeter (PalmScan VF2000, Micro Medical Devices, Calabasas, California) to HFA SITA Standard 24-2 results in terms of MD and PSD using t-tests and Bland-Altman plots. Bland-Altman analyses were also applied to quadrant means of deviations from age-normal sensitivity. The authors also assessed the PalmScan’s test-retest repeatability using Spearman’s correlations and intraclass correlation coefficients but made no comparisons to the HFA.
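To make the agreement analyses being discussed concrete, the following is a minimal Python sketch of a Bland-Altman comparison of the kind applied to MD and PSD, with a commented paired t-test alongside. The variable names (md_palmscan, md_hfa) are illustrative placeholders, not the authors’ data or code.

```python
# Minimal sketch of a Bland-Altman agreement analysis for paired MD values.
# Inputs are hypothetical arrays of per-eye MD from each device.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def bland_altman(a, b, label="MD (dB)"):
    """Bias and 95% limits of agreement for paired measurements a vs. b."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    mean = (a + b) / 2.0            # per-eye average of the two devices
    diff = a - b                    # per-eye difference between devices
    bias = diff.mean()
    loa = 1.96 * diff.std(ddof=1)   # half-width of the 95% limits of agreement
    fig, ax = plt.subplots()
    ax.scatter(mean, diff, s=15)
    for y, style in [(bias, "-"), (bias + loa, "--"), (bias - loa, "--")]:
        ax.axhline(y, linestyle=style, color="gray")
    ax.set_xlabel(f"Mean of the two devices, {label}")
    ax.set_ylabel(f"Difference between devices, {label}")
    return bias, (bias - loa, bias + loa)

# Paired t-test on MD, analogous to the comparison reported (placeholder data):
# t, p = stats.ttest_rel(md_palmscan, md_hfa)
```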
Perhaps more interesting than what the authors assessed is what they did not assess. The authors pointed out that this was only a pilot study, which may explain why they did not recruit normal subjects to compare the specificities of the two devices, and why they did not collect HFA test-retest data to compare the intertest variabilities of the two devices in the same patient cohort. However, they also seem to have overlooked potentially interesting analyses of data they appear to have had at hand.
Example: The authors compared pointwise decibel deviations from age-normal values but did not compare how often test points were outside normal limits for Total and Pattern Deviation. If the two devices had been found to identify similar patterns of VF damage in the probability plots and similar numbers of test points outside normal limits, we might have said, “Wow, as a follow-on study, let’s go see how similar the specificities of the two devices are.” And if the results had been significantly dissimilar, that might have suggested a different next step.
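A pointwise comparison of the sort we have in mind could be as simple as the sketch below. It assumes hypothetical boolean flags marking which test points fall outside normal limits on each device’s Total or Pattern Deviation probability plot, counts flagged points per eye, and cross-tabulates agreement between the devices.

```python
# Sketch of a pointwise agreement comparison between two devices.
# flags_a, flags_b: hypothetical boolean arrays (n_eyes x n_points),
# True = test point outside normal limits on that device's probability plot.
import numpy as np

def pointwise_agreement(flags_a, flags_b):
    flags_a, flags_b = np.asarray(flags_a, bool), np.asarray(flags_b, bool)
    n_flagged_a = flags_a.sum(axis=1)       # flagged points per eye, device A
    n_flagged_b = flags_b.sum(axis=1)       # flagged points per eye, device B
    both = np.sum(flags_a & flags_b)        # points flagged by both devices
    only_a = np.sum(flags_a & ~flags_b)     # flagged by device A only
    only_b = np.sum(~flags_a & flags_b)     # flagged by device B only
    neither = np.sum(~flags_a & ~flags_b)   # flagged by neither device
    agreement_table = np.array([[both, only_a], [only_b, neither]])
    return n_flagged_a, n_flagged_b, agreement_table
```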
Other Examples: What about comparison of test duration as a function of the severity of VF loss? How about comparing the PalmScan’s test-retest variability, for instance in the 32 available eyes with mild glaucoma, to reports in the literature of HFA variability in similar cohorts? Does the PalmScan software have analyses that are functionally similar to the HFA’s Glaucoma Hemifield Test and/or Visual Field Index? If so, how did those analyses compare? How do the authors’ Bland-Altman plots of Mean Deviation differences presented in Figure 3 compare to similar plots for the HFA in the literature, for instance in Heijl et al., 2019?1
In our experience, pilot studies are performed during product development as a way of confirming proposed design decisions. Clinical evaluations of FDA-cleared, commercially released devices should not be treated as pilot studies but should be scaled to address and document clinically critical performance metrics. Critical metrics include comparisons of diagnostic sensitivity to early, moderate, and advanced disease; specificity of pointwise and summary analyses; pointwise intertest variability across the dynamic range; and test durations across the full range of VF damage. Devices being evaluated must be equipped with the testing strategies, normative data, and analysis methods intended for actual clinical use; otherwise, study findings may not usefully apply to clinical care.
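As one way of making these metrics operational, the following sketch computes per-stratum sensitivity and overall specificity from hypothetical per-eye reference diagnoses, severity strata, and binary device classifications. The variable names and the idea of stratifying by MD-based severity are illustrative assumptions, not a prescription for any particular study design.

```python
# Sketch of sensitivity by disease-severity stratum plus overall specificity.
# Inputs are hypothetical per-eye arrays of equal length.
import numpy as np

def sens_spec_by_stratum(device_abnormal, has_disease, stratum):
    """device_abnormal: device classified eye as abnormal (bool);
    has_disease: reference-standard diagnosis (bool);
    stratum: severity label per eye, e.g. 'mild', 'moderate', 'advanced'."""
    device_abnormal = np.asarray(device_abnormal, bool)
    has_disease = np.asarray(has_disease, bool)
    stratum = np.asarray(stratum)
    sensitivity = {}
    for s in np.unique(stratum[has_disease]):
        eyes = has_disease & (stratum == s)
        sensitivity[s] = device_abnormal[eyes].mean()   # hit rate within stratum
    normals = ~has_disease
    specificity = (~device_abnormal[normals]).mean() if normals.any() else np.nan
    return sensitivity, specificity
```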
We view publication of instrument evaluations that deal solely with correlations and/or only evaluate MD & PSD and/or worry about comparisons of raw threshold sensitivities as totally insufficient and, therefore, misleading. Of course the results are correlated; how could they not be, given that the instruments being compared were designed to measure visual field sensitivity? MD & PSD are poor tools for diagnosing glaucoma or other disease; pointwise analyses should be used instead. And, of course, the raw sensitivities found by the devices being compared do not have to match; that’s what normative databases are for.2
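The role of a normative database can be stated in a few lines of arithmetic, sketched below with hypothetical inputs: each device’s raw sensitivities are referenced to that device’s own age-normal values, and Pattern Deviation further removes a general-height estimate (approximated here, roughly, by the 85th percentile of Total Deviation).

```python
# Sketch of why raw sensitivities need not match across instruments:
# each device is interpreted against its own age-normal expectations.
import numpy as np

def total_deviation(measured_db, age_normal_db):
    """Pointwise Total Deviation = measured sensitivity minus the device's own age-normal value."""
    return np.asarray(measured_db, float) - np.asarray(age_normal_db, float)

def pattern_deviation(total_dev):
    """Pattern Deviation: Total Deviation with a general-height estimate removed
    (approximated here by the 85th percentile of the Total Deviation values)."""
    td = np.asarray(total_dev, float)
    general_height = np.percentile(td, 85)
    return td - general_height
```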
If we are going to spend time and energy performing and publishing diagnostic studies involving devices that are FDA-cleared and commercially available, we must think carefully about which metrics are critical for clinical success. Sensitivity, specificity, test-retest reproducibility, and testing time all have to be on the list.