Transformer-Based DME Classification Using Retinal OCT Images Without Data Augmentation: An Evaluation of ViT-B16 and ViT-B32 With Optimizer Impact
Journal article   Peer reviewed


K. C. Pavithra, Preetham Kumar, M. Geetha, Sulatha V. Bhandary, K. B. Ajitha Shenoy, Guruprasad Rao, Steven Fernandes and Akshat Tulsani
IEEE Access, Vol. 13, pp. 180781-180798
01/01/2025

Abstract

Subjects: Computer Science; Computer Science, Information Systems; Engineering; Engineering, Electrical & Electronic; Science & Technology; Technology; Telecommunications
Diabetic macular edema (DME) remains the most prevalent cause of impaired vision in people with diabetes. To assess DME, ophthalmologists routinely use optical coherence tomography (OCT), a retinal imaging modality. Alongside clinical assessment, computerized diagnosis based on deep learning (DL) and OCT has emerged as a vital tool. The main limitation of DL is that model training requires a large amount of data, and most medical datasets are too small for training DL models. Classical data augmentation often fails to deliver the anticipated gains, so transfer learning (TL) is an appropriate strategy for dealing with this issue. Without using any augmentation, we investigate the effectiveness of Vision Transformer (ViT) models in classifying DME OCT images. Two ViT variants, ViT-B16 and ViT-B32, were fine-tuned on a public and a private dataset using three optimization algorithms: Adam, SGD, and RMSProp. The statistical measures accuracy (AC), recall (RE), and precision (PR) are reported. Additionally, gradient-weighted class activation mapping (Grad-CAM) heatmaps are used to illustrate the model's predictions, providing insight into its decision-making process. The findings show that ViT-B16 consistently outperformed ViT-B32 on both datasets; the Adam optimizer produced the best recall (with a highest score of 100%), while in certain cases RMSProp delivered the maximum precision. For statistical rigor, we used 5-fold cross-validation and also compared ViT-B16 to CNN baselines (ResNet-50, ResNet-101, and EfficientNet-B3), showing that ViTs consistently outperform these baselines, albeit at greater computational cost.
Our findings reveal that OCT image classification performance can be improved by using finer-resolution transformer models in conjunction with suitable strategies for optimization.
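The measures named in the abstract (AC, RE, PR) are standard confusion-matrix statistics. A minimal sketch of how they are computed for a binary DME-vs-normal task; the counts below are illustrative only, not results from the paper:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy (AC), recall (RE), and precision (PR)
    from binary confusion-matrix counts (positive class = DME)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total   # fraction of all scans classified correctly
    recall = tp / (tp + fn)        # fraction of DME scans that were detected
    precision = tp / (tp + fp)     # fraction of DME predictions that were correct
    return accuracy, recall, precision

# Hypothetical counts: 95 detected DME scans, 3 false alarms,
# 0 missed DME scans, 102 correctly identified normals.
ac, re, pr = classification_metrics(tp=95, fp=3, fn=0, tn=102)
print(f"AC={ac:.3f} RE={re:.3f} PR={pr:.3f}")  # AC=0.985 RE=1.000 PR=0.969
```

Note that a recall of 100% (as the Adam-optimized models reach here) simply means no DME case was missed (fn = 0); it says nothing about false alarms, which is why precision is reported alongside it.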
URL: https://doi.org/10.1109/ACCESS.2025.3620945
Published (Version of record), Open Access
