Database Credentialed Access

Embedding-Based Representations for BRSET and mBRSET

David Restrepo, Chenwei Wu, Michael Morley, Leo Anthony Celi, Luis Filipe Nakayama

Published: March 30, 2026. Version: 1.0.0


When using this resource, please cite:
Restrepo, D., Wu, C., Morley, M., Celi, L. A., & Nakayama, L. F. (2026). Embedding-Based Representations for BRSET and mBRSET (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/1h4p-vz70

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

BRSET and mBRSET are publicly available Brazilian ophthalmological datasets composed of curated retinal fundus photographs with associated clinical and demographic information. While these resources enable diverse research applications, training deep learning models directly on high-resolution images is computationally intensive and often restricted by privacy regulations limiting the circulation of identifiable medical images. To address these challenges and facilitate equitable reuse, this project provides a comprehensive release of precomputed image embeddings for both datasets. These representations were generated using state-of-the-art vision backbones: DINOv3 ViT-S/16 (384-d) and ViT-B/16 (768-d) for transformer-based features, alongside ConvNeXt-Tiny (768-d) and ConvNeXt-Base (1024-d) for convolutional features. All models were applied in inference-only mode with a standardized preprocessing pipeline. Each fundus photograph was converted into a fixed-length numerical vector and exported as a CSV file, where each row corresponds to a single image and its respective embedding. These representations preserve critical semantic and structural information, enabling downstream tasks such as clustering, similarity search, multimodal modeling, disease classification, and fairness assessment without requiring raw pixel access. By providing scalable, privacy-preserving embeddings derived from Brazilian ophthalmic data, this resource reduces computational barriers, accelerates AI model development, and supports global research participation, particularly in low-resource environments, ensuring that advanced ophthalmic AI tools are accessible to a broader scientific community.


Background

Ophthalmology imaging exams, such as fundus photography, optical coherence tomography (OCT), and corneal topography, are foundational tools for the diagnosis, staging, and longitudinal monitoring of ocular diseases, including diabetic retinopathy (DR), glaucoma, and keratoconus [1]. These modalities enable early detection of pathological changes, support timely therapeutic decisions, and play a central role in preventing avoidable vision loss [1, 2].

In recent years, artificial intelligence (AI) has emerged as a transformative force in ophthalmology, enhancing diagnostic accuracy, facilitating automated triage, and expanding access to screening programs, particularly in low-resource settings [2–4]. Still, the development of robust AI systems in ophthalmic imaging typically depends on large labeled datasets, high-performance computing resources, and access to raw image files. These requirements impose operational, financial, and regulatory barriers that limit both scalability and equitable participation of institutions, especially those outside high-income regions.

To bridge part of this gap, two publicly available datasets, BRSET and mBRSET, were introduced, providing curated retinal images with demographic and clinical annotations representative of real-world Brazilian populations [5, 6]. Despite their importance, workflows relying on raw images remain inaccessible to many research groups due to hardware constraints, long training times, and institutional policies that restrict the distribution of identifiable ophthalmic data.

Precomputed image embeddings offer a powerful alternative [7]. Embeddings correspond to fixed-length numerical vectors generated by advanced vision models that encode semantic, structural, and statistical features of the original retinal images. These compact representations reduce storage requirements, accelerate model development, and facilitate reproducibility, while mitigating privacy concerns by abstracting away pixel-level data [7].

This repository provides the vector embeddings for BRSET and mBRSET, computed using state-of-the-art foundation and embedding-based vision architectures. These resources aim to enable rapid experimentation, cross-institutional collaboration, bias and fairness assessments, and the development of multimodal or lightweight AI systems in ophthalmology.


Methods

The Brazilian Multilabel Ophthalmological Dataset (BRSET) [5] is a multimodal dataset designed to improve the representation of diverse populations in ophthalmological artificial intelligence research. Collected from three ophthalmology centers in São Paulo, BRSET comprises 16,266 macula-centered fundus photographs from 8,524 patients, captured using Nikon NF505 and Canon CR-2 retinal cameras under pharmacologic mydriasis. The dataset reflects the demographic heterogeneity of the Brazilian population, encompassing multiple nationalities, a broad age distribution, and a predominance of female participants (65.1%). All images were anonymized to ensure the removal of identifiable metadata.

Each image was labeled by retinal specialists based on a comprehensive multimodal annotation protocol, including:

  • anatomical features (optic disc, macula, vascular arcades),
  • image quality indicators,
  • multi-label pathology assessment, and
  • diabetic retinopathy grading using both ICDR and SDRG systems.

In addition to image-level annotations, BRSET includes structured clinical metadata extracted from electronic medical records, such as age, sex, clinical history, insulin therapy, and diabetes duration. By combining high-quality retinal imaging with detailed metadata, BRSET provides a robust foundation for developing and benchmarking AI models focused on demographic prediction, disease classification, and fairness analysis in underrepresented populations.

The Mobile Brazilian Ophthalmological Dataset (mBRSET) [6] was developed to enable research in mobile screening, teleophthalmology, and resource-constrained environments. It contains 5,164 fundus images from 1,291 patients, collected during the Itabuna Diabetes Campaign in Bahia, Brazil, using the Phelcom Eyer handheld retinal camera. Exams were performed under pharmacologic mydriasis. The cohort reflects the ethnic diversity of Bahia, a region with a high proportion of Afro-Brazilian and mixed-ancestry populations. The dataset also includes a majority female demographic (65.1%) with a mean age of 61.4 years.

mBRSET is notable for its real-world disease distribution, with 23.2% of examinations yielding positive results for diabetic retinopathy. Images were anonymized and graded by retinal specialists for:

  • image quality,
  • anatomical structures, and
  • diabetic retinopathy severity using the ICDR system.

Clinical metadata associated with each exam, including age, sex, insulin use, and diabetes duration, is provided alongside the image annotations. Designed as a benchmark for AI development in portable and community-level clinical workflows, mBRSET supports research on mobile diagnostics, scalability of screening, and healthcare equity in underserved populations.

Embedding Models

Embedding vectors for BRSET and mBRSET were generated using state-of-the-art computer vision architectures: DINOv3 [8], ConvNeXt-Tiny [9], and ConvNeXt-Base [10]. These models were selected for their complementary inductive biases, representational diversity, and robust performance across heterogeneous retinal image quality. All models were applied in inference-only mode with frozen weights, ensuring full reproducibility of the feature extraction process and preventing variability associated with fine-tuning or domain adaptation.

DINOv3 [8] represents the latest evolution of the DINO family of self-supervised vision transformers, learning expressive visual representations without manual labels. In this project, we employed two configurations: ViT-S/16 (Small), a balanced architecture that captures essential semantic structure and local texture relevant to retinal imaging, and ViT-B/16 (Base), a larger variant with increased depth and attention heads. This latter configuration provides a higher-dimensional embedding space and enhanced representational capacity, capturing more complex global morphological patterns and subtle pathological features that may be less apparent in smaller architectures. The two variants produce 384-dimensional and 768-dimensional embeddings, respectively, that retain meaningful anatomical and pathological information, making them suitable for downstream tasks such as classification, clustering, fairness assessments, and cross-device generalization studies.

ConvNeXt-Tiny [9] was included as a lightweight and computationally efficient backbone that combines design principles derived from Vision Transformers with the inductive biases of classical convolutional networks. Trained in a supervised manner on ImageNet-1k, it generates 768-dimensional embeddings with excellent inference speed, which is advantageous when processing large-scale datasets. Its compact architecture delivers stable representations even under variations in illumination, contrast, or image noise, common challenges in real-world retinal screening environments.

ConvNeXt-Base [10], a deeper and wider variant of the ConvNeXt family, offers increased representational capacity and produces richer features. Also trained on ImageNet-1k, it generates 1024-dimensional embeddings that can capture subtle anatomical differences, fine-grained texture patterns, and disease-related cues present in retinal fundus photography. The model balances depth, performance, and computational efficiency, making it a suitable option for advanced downstream analyses that require expressive features.

All embeddings were generated using a standardized preprocessing workflow that included consistent image resizing, normalization, and deterministic inference with fixed random seeds. The resulting vectors were exported in CSV format using float32 precision and are accompanied by an index file containing image_id, model name, and dataset origin. This structure ensures interoperability across experiments, supports reproducibility, and enables direct use of embeddings in downstream machine learning pipelines without requiring access to the original retinal images.
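
As an illustration of this workflow, the following Python sketch extracts ConvNeXt-Tiny embeddings with the timm library and exports them in the CSV layout described above. It is a minimal reconstruction under stated assumptions, not the release script: the image folder and output filename are placeholders, and the exact preprocessing constants used for the official release may differ.

    import glob
    import numpy as np
    import pandas as pd
    import timm
    import torch
    from PIL import Image
    from timm.data import create_transform, resolve_data_config

    torch.manual_seed(0)  # fixed seed for deterministic inference

    # Inference-only backbone with frozen weights; num_classes=0 returns
    # the 768-dimensional pooled features instead of classification logits.
    model = timm.create_model("convnext_tiny", pretrained=True, num_classes=0)
    model.eval()

    # Resizing and normalization resolved from the model's own data config
    transform = create_transform(**resolve_data_config({}, model=model))

    rows = []
    with torch.no_grad():
        for path in sorted(glob.glob("fundus_images/*.jpg")):  # placeholder folder
            img = transform(Image.open(path).convert("RGB")).unsqueeze(0)
            vec = model(img).squeeze(0).numpy().astype(np.float32)
            rows.append([path] + vec.tolist())

    cols = ["image_id"] + [str(i) for i in range(len(rows[0]) - 1)]
    pd.DataFrame(rows, columns=cols).to_csv("embeddings_convnext_tiny.csv", index=False)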


Data Description

The embedding outputs are provided at the image level, where each row corresponds to a single retinal fundus photograph. The first column contains the image identifier—image_id for BRSET files and file for mBRSET files—which links each embedding vector to the corresponding image in the original BRSET or mBRSET dataset. Each file contains a set of fixed-length numerical feature vectors generated from state-of-the-art vision models, representing the semantic and structural information present in the original fundus images without requiring access to pixel-level data.

The embeddings encode visual characteristics of the retinal photographs, such as anatomical structures, texture patterns, contrast, and disease-related abnormalities, into compact, high-dimensional feature vectors. Images with similar morphological or pathological patterns tend to produce embeddings that lie close together in the feature space, while images with different abnormalities or combinations of normal and abnormal features yield more distinct vector representations. These embeddings enable downstream tasks such as clustering, disease classification, bias analysis, and multimodal integration.
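
As a concrete example of this property, the short sketch below performs cosine-similarity retrieval over one of the released files (listed in the next section). The query index is arbitrary and chosen only for illustration.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("Embeddings_brset_dinov3_vits16.csv")
    X = df.drop(columns=["image_id"]).to_numpy(dtype=np.float32)
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows

    query = X[0]                          # embedding of an arbitrary image
    scores = X @ query                    # cosine similarity to every image
    top5 = np.argsort(scores)[::-1][1:6]  # nearest neighbors, excluding the query
    print(df["image_id"].iloc[top5].tolist())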

Six CSV files are provided, corresponding to embeddings generated from different model architectures and dataset origins. Each file contains one identifier column (image_id for BRSET files or file for mBRSET files) followed by a fixed number of numerical dimensions (0 … N-1), exported in float32 precision:

  • Embeddings_brset_dinov3_convnext_tiny.csv — image_id + 768-dimensional embedding vectors for each BRSET image, produced using ConvNeXt-Tiny.
  • Embeddings_brset_dinov3_vits16.csv — image_id + 384-dimensional embedding vectors for each BRSET image, produced using DINOv3 ViT-S/16.
  • Embeddings_mbrset_dinov3_convnext_base.csv — file + 1024-dimensional embedding vectors for each mBRSET image, produced using ConvNeXt-Base.
  • Embeddings_mbrset_dinov3_vitb16.csv — file + 768-dimensional embedding vectors for each mBRSET image, produced using DINOv3 ViT-B/16.
  • Embeddings_mbrset_dinov3_convnext_tiny.csv — file + 768-dimensional embedding vectors for each mBRSET image, produced using ConvNeXt-Tiny.
  • Embeddings_mbrset_dinov3_vits16.csv — file + 384-dimensional embedding vectors for each mBRSET image, produced using DINOv3 ViT-S/16.

Columns

  • image_id (BRSET files) or file (mBRSET files) — unique identifier corresponding to a retinal fundus photograph in the respective parent dataset.
  • 0 … N-1 — numerical features representing the embedding vector, where N corresponds to the dimensionality of the model used (384, 768, or 1024).
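
A minimal loading sketch is shown below, linking an embedding file back to the parent BRSET metadata. The metadata filename (labels.csv) and its columns are assumptions based on the parent project and should be adjusted to match your local copy.

    import pandas as pd

    emb = pd.read_csv("Embeddings_brset_dinov3_convnext_tiny.csv")
    labels = pd.read_csv("labels.csv")  # parent BRSET metadata (assumed filename)

    # One row per image: identifier, 768 embedding dimensions, plus metadata
    merged = emb.merge(labels, on="image_id", how="inner")
    print(merged.shape)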

Usage Notes

The embedding vectors provided in this repository enable a wide range of downstream machine learning applications without requiring access to raw retinal fundus images. Researchers can use these compact feature representations to train lightweight classifiers or regressors for tasks such as disease detection, diabetic retinopathy grading, image quality assessment, or demographic prediction. Since embeddings capture semantic and structural information inherent to the original images, they are also suitable for unsupervised tasks, including clustering, similarity-based retrieval, and visualization of latent representations. When combined with tabular clinical or demographic metadata, embeddings can support multimodal modeling pipelines, enabling prediction tasks that integrate anatomical, clinical, and epidemiological information. Additionally, because these representations preserve patient-invariant structure while abstracting sensitive image content, they offer an efficient avenue for studying bias, fairness, and model generalization across demographic or device-specific subgroups.
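
For example, a lightweight diabetic retinopathy classifier can be trained directly on the vectors. The sketch below assumes an ICDR grade column named DR_ICDR in the parent BRSET metadata; actual column names should be checked against the parent dataset's documentation.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    emb = pd.read_csv("Embeddings_brset_dinov3_convnext_tiny.csv")
    labels = pd.read_csv("labels.csv")              # assumed metadata filename
    df = emb.merge(labels, on="image_id")

    X = df[[str(i) for i in range(768)]].to_numpy()
    y = (df["DR_ICDR"] > 0).astype(int)             # any-DR vs. no-DR, as an example

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))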

To ensure optimal use of these embeddings, several best practices should be considered. Normalizing the embedding vectors, through L2-normalization or standardized scaling, is recommended before training downstream classifiers, as this improves stability and reduces sensitivity to magnitude differences across dimensions. When applying the embeddings to external datasets or images captured by different devices, domain adaptation or calibration techniques may be necessary to account for shifts in acquisition conditions, camera characteristics, or population differences. Finally, users are encouraged to systematically evaluate performance across demographic subgroups, imaging devices, and clinical cohorts to identify potential sources of bias and ensure that the model behaves equitably and robustly across diverse populations.
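
The sketch below illustrates both practices on the classifier example above: L2-normalizing the vectors before fitting and reporting discrimination per subgroup. The subgroup column name (patient_sex) is an assumption and should be replaced with the actual metadata field.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import normalize

    emb = pd.read_csv("Embeddings_brset_dinov3_convnext_tiny.csv")
    labels = pd.read_csv("labels.csv")               # assumed metadata filename
    df = emb.merge(labels, on="image_id")

    X = normalize(df[[str(i) for i in range(768)]])  # L2-normalize each embedding
    y = (df["DR_ICDR"] > 0).astype(int).to_numpy()   # assumed label column
    g = df["patient_sex"].to_numpy()                 # assumed subgroup column

    X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
        X, y, g, test_size=0.2, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    probs = clf.predict_proba(X_te)[:, 1]

    for group in np.unique(g_te):                    # per-subgroup AUC
        m = g_te == group
        print(group, roc_auc_score(y_te[m], probs[m]))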


Release Notes

Version 1.0.0: This is the first release.


Ethics

This project relies on the existing institutional review board (IRB) approvals associated with the parent datasets. The BRSET project was approved by the São Paulo Federal University institutional review board (CAAE 33842220.7.0000.5505), and the mBRSET project was approved by the Instituto de Ensino Superior Presidente Tancredo de Almeida Neves institutional review board (CAAE 64219922.3.0000.9667). All data processing for the generation of embeddings was conducted in accordance with these original ethical approvals.


Acknowledgements

The authors thank the BRSET and mBRSET data collection teams.


Conflicts of Interest

None to declare.


References

  1. Nakayama LF, Matos J, Quion J, Novaes F, Mitchell WG, Mwavu R, et al. Unmasking biases and navigating pitfalls in the ophthalmic artificial intelligence lifecycle: A narrative review. PLOS Digit Health. 2024;3: e0000618. doi:10.1371/journal.pdig.0000618
  2. Ting DSW, Pasquale LR, Peng L, Campbell JP, Lee AY, Raman R, et al. Artificial intelligence and deep learning in ophthalmology. Br J Ophthalmol. 2019;103: 167–175. doi:10.1136/bjophthalmol-2018-313173
  3. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nat Med. 2019;25: 24–29. doi:10.1038/s41591-018-0316-z
  4. Zhang J, Lin S, Cheng T, Xu Y, Lu L, He J, et al. RETFound-enhanced community-based fundus disease screening: real-world evidence and decision curve analysis. NPJ Digit Med. 2024;7: 108. doi:10.1038/s41746-024-01109-5
  5. Nakayama LF, Goncalves M, Zago Ribeiro L, Santos H, Ferraz D, Malerbi F, et al. A Brazilian Multilabel Ophthalmological Dataset (BRSET). PhysioNet; 2023. doi:10.13026/XCXW-8198
  6. Nakayama LF, Zago Ribeiro L, Restrepo D, Santos Barboza N, Dias Fiterman R, Vieira Sousa ML, et al. mBRSET, a Mobile Brazilian Retinal Dataset. PhysioNet; 2024. doi:10.13026/QXPD-1Y65
  7. Restrepo D, Wu C, Cajas SA, Nakayama LF, Celi LA, López DM. Multimodal deep learning for low-resource settings: A vector embedding alignment approach for healthcare applications. arXiv [cs.LG]. 2024. doi:10.48550/arXiv.2406.02601
  8. Siméoni O, Vo HV, Seitzer M, Baldassarre F, Oquab M, Jose C, et al. DINOv3. arXiv [cs.CV]. 2025. doi:10.48550/ARXIV.2508.10104
  9. Xia J, Yin Y, Li X. An efficient medical image classification method based on a lightweight improved ConvNeXt-Tiny architecture. arXiv [cs.CV]. 2025. doi:10.48550/ARXIV.2508.11532
  10. Liu Z, Mao H, Wu C-Y, Feichtenhofer C, Darrell T, Xie S. A ConvNet for the 2020s. arXiv [cs.CV]. 2022. doi:10.48550/ARXIV.2201.03545

Parent Projects
Embedding-Based Representations for BRSET and mBRSET was derived from the BRSET [5] and mBRSET [6] parent projects. Please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
