Database Contributor Review
InReDD-Dataset-PAN924
Caio Uehara Martins, Camila Tirapelli, Hugo Gaêta-Araujo, Jose Augusto Baranauskas, Breno Zancan, Jose Carneiro, Alessandra Macedo
Published: Nov. 22, 2025. Version: 1.0.0
When using this resource, please cite:
Uehara Martins, C., Tirapelli, C., Gaêta-Araujo, H., Baranauskas, J. A., Zancan, B., Carneiro, J., & Macedo, A. (2025). InReDD-Dataset-PAN924 (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/r5nt-we67
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
InReDD-Dataset-PAN924 is a collection of 924 radiographic images annotated with mouth and teeth labels by specialists from the InReDD research group.
InReDD (Interdisciplinary Research Group in Digital Dentistry) is a collaborative research initiative at the University of São Paulo’s Ribeirão Preto Campus (USP-RP), uniting the Department of Computation and Mathematics (DCM-USP-RP) and the School of Dentistry of Ribeirão Preto (FORP-USP-RP). The group is dedicated to developing applied technologies for the field of Odontology.
In this context, InReDD-Dataset-PAN924 is an image collection from the field of Odontology. It was developed to support descriptive analyses and to facilitate the creation and validation of artificial intelligence models. The data were collected primarily through clinical work at FORP-USP-RP.
This manuscript draws upon a previously published work, “Development of a dental digital dataset for research in artificial intelligence: the importance of labeling performed by radiologists.” However, certain details have been adjusted or updated to account for temporal adaptations and contextual revisions. As a result, portions of the content may not correspond verbatim to the original publication, although the scientific essence and core contributions remain preserved.
Background
Radiographic imaging is a cornerstone of dental diagnostics and treatment planning. In recent years, Artificial Intelligence (AI) has emerged as a powerful auxiliary tool for interpreting these images, showing promise in detecting caries, periodontal disease, and other oral pathologies [1]. The predominant approach, supervised learning, requires training models on large sets of accurately labeled data [2], which are often a significant bottleneck in medical AI development [3]. The reliability of any AI model is fundamentally dependent on the quality of this "ground truth," the reference standard used for training and validation.
This work is part of a broader effort to create an automated solution for the SB Brasil survey, enhancing the "Brasil Sorridente" program, which aims to classify the oral health status of the Brazilian population [4]. To address the need for high-quality training data, we introduce a new dataset with three unique characteristics.
First, it is composed of panoramic radiographs, a common imaging modality in clinical practice. Second, the dataset represents a Brazilian population sample from the School of Dentistry (FORP) in Ribeirão Preto, São Paulo. Third, and most critically, the ground truth was established by a consensus of experienced radiologists, ensuring a high-quality, reliable reference standard for model training.
Making accurately labeled, heterogeneous datasets publicly available is crucial for advancing the field. This dataset provides a valuable resource for the research community to develop, test, and optimize new AI models.
Methods
All annotations were managed and performed using LyriaPACS [5], a web-based platform connected to the I-Medsys image server [6]. I-Medsys customized the PACS to support the research, providing features such as individual work areas with personal keywords, trackable access to the system, and a checklist for image annotation. These features ensured a blinded workflow, data security, and an organized annotation process.
Annotations were conducted by radiologists in a dimly lit room using a monitor with 1024 × 768 resolution and 24-bit color depth. The built-in enhancement tools of the Lyria software (i.e., zoom, brightness, and contrast) were adjusted by each radiologist as needed to assist in the diagnostic task.
For labeling, three dentomaxillofacial radiologists, each with a decade of experience, were involved in the evaluation and labeling of the radiographic images. The process was divided into two distinct tasks:
- Labeling: This involved numbering each tooth, which was considered an immutable attribute. For example, tooth 12 will always be identified as tooth 12.
- Annotation: This task focused on indicating the condition of a tooth, which is a changeable attribute. For instance, tooth 12 might have decay in one radiograph, be healthy in another, or be an implant in a third.
To ensure accuracy and avoid bias in the AI training data, a forced consensus methodology was used:
- One radiologist individually labeled and annotated all panoramic radiographs.
- A second radiologist then independently reviewed this work.
- Any disagreements between the first two radiologists were resolved by consulting a third radiologist, whose decision established the final consensus.
This consensus became the ground truth for each identified tooth. Further details are provided in the referenced article.
The labeling process followed the Fédération Dentaire Internationale (FDI) tooth numbering system and proceeded in a clockwise sequence around the dental arch, starting with the right maxillary molars and ending with the right mandibular molars. A separate JSON file was generated for each annotation.
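For readers unfamiliar with FDI notation, the two-digit code combines a quadrant digit with a position digit counted from the midline. The sketch below decodes such a code; the helper name and return format are illustrative and not part of the dataset's tooling.

```python
# Minimal sketch of decoding an FDI two-digit tooth number; the helper
# name and return format are illustrative, not part of the dataset tooling.
FDI_QUADRANTS = {
    1: "upper right",  # right maxilla
    2: "upper left",   # left maxilla
    3: "lower left",   # left mandible
    4: "lower right",  # right mandible
}

def decode_fdi(code: int) -> tuple[str, int]:
    """Split an FDI code such as 12 into (quadrant, position from midline)."""
    quadrant, position = divmod(code, 10)
    return FDI_QUADRANTS[quadrant], position

print(decode_fdi(12))  # ('upper right', 2): right maxillary lateral incisor
print(decode_fdi(48))  # ('lower right', 8): right mandibular third molar
```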
Data Description
The collection contains 924 anonymized panoramic dental radiographs, designed to support research in digital dentistry.
| Item | Count |
|---|---|
| Images | |
| Total images | 924 |
| – Subset of images labeled with teeth and mouth (polyline) (1) | 924 |
| – Subset of images labeled with teeth segmentation (polyline) (2) | 200 |
| Annotations | |
| Total rectangle box annotations (1) [924 mouth, 20 033 teeth] | 20 957 |
| Total teeth segmentation masks (2) | 4 621 |
| Categories of tooth conditions | 14 |
| Categories of mouth conditions | 4 |
| Categories of tooth positions (FDI) | 32 |
Key Features
- Image resolution: 2903 × 1536 px (95 dpi), stored as JPG files.
- Annotations: Provided in a COCO-compatible JSON format with dental-specific fields, in both split and combined versions.
- Tooth-level labels: Includes bounding boxes and hierarchical condition annotations.
- Metadata: Contains patient details (age and sex).
Dataset Statistics
- Age distribution: Patients range from 14 to 81 years old, with a median age of 35 years.
- Sex distribution:
- Female: ~60%
- Male: ~40%
- Tooth conditions:
- Healthy teeth: ~45%
- Restored teeth: ~25%
- Caries: ~15%
- Other conditions (e.g., implants, residual roots): ~15%
Annotations
Annotations are distributed as JSON files in a COCO-compatible format, while preserving dental-specific fields.
Labels are provided in two versions:
- Split format: The `teeth_fdi_labels` and `mouth_and_teeth_labels` directories contain one JSON file per image, with annotations specific to that image.
- Combined format: The `teeth_fdi_labels.json` and `mouth_and_teeth_labels.json` files contain all annotations combined into a single JSON file for the entire dataset.
`teeth_fdi_labels` contains teeth segmentation masks and bounding box positions with FDI labels.
`mouth_and_teeth_labels` contains mouth and teeth positions based on rectangular segmentation, with the corresponding condition labels.
Each annotation can include:
1. Position:
- bbox: Defined for teeth and mouth regions.
- segmentation: Defined for teeth and mouth regions.
2. Instance:
- Tooth-level labels: Following the FDI numbering system (00–88).
- Condition annotations: Binary flags for 12 common findings (e.g., caries, crown, implant, root canal treatment).
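As an illustration, the combined file can be read with the standard library and indexed by image. This is a minimal sketch, assuming the combined `mouth_and_teeth_labels.json` file sits in the working directory; the field names follow the schema documented below.

```python
import json
from collections import defaultdict

# Read the combined COCO-style annotation file (assumed to be local).
with open("mouth_and_teeth_labels.json") as f:
    coco = json.load(f)

# Look-up tables for image metadata and category names.
images = {img["id"]: img for img in coco["images"]}
categories = {cat["id"]: cat["name"] for cat in coco["categories"]}

# Index annotations by the image they belong to.
anns_by_image = defaultdict(list)
for ann in coco["annotations"]:
    anns_by_image[ann["image_id"]].append(ann)

# Example: list the condition labels found on one image.
first_id = next(iter(images))
for ann in anns_by_image[first_id]:
    print(images[first_id]["file_name"], categories[ann["category_id"]])
```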
Images ("images")
Each image entry contains metadata about the image:
| Field | Type | Description |
|---|---|---|
| `id` | int | Unique identifier for the image. |
| `license` | int | License ID for the image. |
| `file_name` | string | Name of the image file (e.g., `2-F-70.jpg`). |
| `height` | int | Height of the image in pixels. |
| `width` | int | Width of the image in pixels. |
| `sex` | string | Patient's sex (`M` for male, `F` for female). |
| `age` | string | Patient's age (e.g., `70`). |
`file_name` encodes the ID, sex, and age; dedicated fields were nevertheless created for better usability. IDs are random values assigned during the anonymization process and do not follow a sequential order, because some raw data did not meet quality standards and were removed from the dataset.
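A minimal sketch of recovering these values from the file-name convention (e.g., `2-F-70.jpg`); in practice, the dedicated `sex` and `age` fields should be preferred.

```python
# Illustrative parsing of the "<id>-<sex>-<age>.jpg" naming convention.
def parse_file_name(file_name: str) -> dict:
    stem = file_name.rsplit(".", 1)[0]       # drop the extension
    patient_id, sex, age = stem.split("-")   # split the three components
    return {"id": int(patient_id), "sex": sex, "age": int(age)}

print(parse_file_name("2-F-70.jpg"))  # {'id': 2, 'sex': 'F', 'age': 70}
```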
Labels ("annotations")
Each annotation entry contains information about a specific object position (e.g., a tooth or the mouth region) in the image and includes a `category_id` representing the label (e.g., Mouth Edentulous or Teeth Implant).
| Field | Type | Description |
|---|---|---|
| `id` | int | Unique identifier for the annotation. |
| `image_id` | int | ID of the image this annotation belongs to. |
| `category_id` | int | ID of the category. |
| `bbox` (optional) | list | Bounding box coordinates for the detection. |
| `segmentation` (optional) | list[list] | Polygon coordinates for the segmentation mask. |
`bbox` and `segmentation` follow COCO standards. Positions are stored in bounding box or segmentation formats:

- `bbox`: Defined as `[x, y, width, height]`, where `x` and `y` specify the top-left corner of the bounding box in pixel coordinates, and `width` and `height` represent its size in pixels.
- `segmentation`: Defined as a list of polygon points `[[x1, y1, x2, y2, …, xn, yn]]` outlining the object mask. Each pair of values represents the `x` and `y` coordinates of a vertex in the 2D image space.
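As an illustration, the sketch below converts both storage formats into more convenient shapes; the coordinate values are invented for the example.

```python
# Illustrative conversion of the two storage formats; the coordinate
# values below are made up for the example.
bbox = [120.0, 340.0, 85.0, 110.0]  # [x, y, width, height]
segmentation = [[130.0, 350.0, 190.0, 352.0, 185.0, 440.0, 128.0, 438.0]]

# COCO bbox -> (x_min, y_min, x_max, y_max) corner form
x, y, w, h = bbox
corners = (x, y, x + w, y + h)

# COCO polygon -> list of (x, y) vertices
vertices = list(zip(segmentation[0][0::2], segmentation[0][1::2]))

print(corners)   # (120.0, 340.0, 205.0, 450.0)
print(vertices)  # [(130.0, 350.0), (190.0, 352.0), ...]
```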
Categories ("categories")
The `categories` field defines the possible classes for annotations:
| Field | Type | Description |
|---|---|---|
| `id` | int | Unique identifier for the category. |
| `name` | string | Name of the category (e.g., `Ed`). |
| `supercategory` | string / None | Higher-level grouping for the category (can be `None`). |
`supercategory` is used only in the `mouth_and_teeth_labels` split, grouping categories into Mouth and Teeth (Artificial, Natural, or Mixed) superclasses.
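A minimal sketch of grouping categories by supercategory, again assuming the combined `mouth_and_teeth_labels.json` file is available locally.

```python
import json
from collections import defaultdict

# Group category names under their supercategory; in the other split,
# supercategory may be absent or None.
with open("mouth_and_teeth_labels.json") as f:
    coco = json.load(f)

by_super = defaultdict(list)
for cat in coco["categories"]:
    by_super[cat.get("supercategory")].append(cat["name"])

for super_name, names in by_super.items():
    print(super_name, "->", names)
```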
Tooth/Mouth Condition Categories
- Major Bounding Box (Mouth Labels):
  - Ed: Edentulous
  - De: Dentate
  - Me: Maxilla edentulous
  - Mne: Mandible edentulous
- Minor Bounding Box (Teeth Labels):
  - Artificial Teeth (DA):
    - Im: Implant
    - Cp: Single prosthetic crown
    - P: Pontic
  - Natural Teeth (DN):
    - H: Healthy
    - Rr: Residual root
    - M3i: Impacted third molar
    - M3f: Developing third molar
    - Te: Endodontic treatment
    - Ri: Intraradicular post
    - Dc: Crown destruction
    - Di: Incisal wear
    - C: Caries
    - R: Restored
    - I: Impacted
  - Mixed Teeth (DM):
    - TeM: Endodontic treatment
    - RiM: Intraradicular post
    - CpuM: Single prosthetic crown
Observations
- Tooth-level bounding boxes: Following the FDI two-digit numbering system (00–88).
- Condition annotations: Binary flags for 12 common findings (e.g., caries, crown, implant, root canal treatment, periapical lesion).
Usage Notes
Annotations are distributed as JSON files in a COCO-compatible format [7], while preserving dental-specific fields. We particularly recommend using FiftyOne as a tool for organizing and exploring the dataset. Two Python scripts are provided: one demonstrates how to load the data in FiftyOne, and the other generates dataset statistics.
You can use these files to load the images and map each annotation's `image_id` to the corresponding image entry. The annotation schema gives access to all label information, where the `bbox` and `segmentation` fields represent points in the 2D image space.
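A minimal FiftyOne loading sketch follows; the `images/` directory name is an assumption about the local layout, and only the detection-style labels are imported here.

```python
import fiftyone as fo

# Load the dataset as a COCO-style detection dataset; paths are assumptions
# about the local layout, not fixed by the distribution.
dataset = fo.Dataset.from_dir(
    dataset_type=fo.types.COCODetectionDataset,
    data_path="images/",                        # folder of JPG files
    labels_path="mouth_and_teeth_labels.json",  # combined COCO-style file
    name="InReDD-Dataset-PAN924",
)

# Browse images and labels interactively in the FiftyOne App.
session = fo.launch_app(dataset)
session.wait()
```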
The images were converted to a 16-bit format using lossless compression, which reduces file size without compromising image quality; this format is standard in the clinic's image bank and widely adopted in medical imaging. All images were anonymized, coded, and saved as lossless JPEGs with an original resolution of 2903 × 1536 pixels at 300 dpi, later reduced to 90 dpi for subsequent processing.
Note that the dataset represents a demographic sample of patients who visited the School of Dentistry and therefore predominantly includes individuals from Ribeirão Preto, São Paulo, Brazil.
Release Notes
Version 1.0.0
This version represents the initial set of images, with the label subsets described above. Future work will add further labels to these images and include more images in the collection.
In the broader view of our dataset development plan, we intend to add other file types representing additional modalities of information (e.g., 3D intraoral scans, CBCT image files, patient anamnesis text), combining this image collection with other modality collections into a multimodal, general dataset.
Ethics
The use of the images was submitted to and approved by the Ethics Committee (Plataforma Brasil, CAAE: 51238021.2.0000.5419).
Acknowledgements
We acknowledge the University of São Paulo’s Ribeirão Preto Campus (USP-RP), the Department of Computation and Mathematics (DCM-USP/RP), the Faculty of Philosophy, Sciences and Letters at Ribeirão Preto, and the School of Dentistry of Ribeirão Preto (FORP-USP/RP) for providing the infrastructure and environment that supported this research.
We also thank the São Paulo Research Foundation (FAPESP), the Innovation Agency USP (AUSPIN), and the USP Unified Scholarship Program (PUB) for funding this research.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
- Pauwels R. A brief introduction to concepts and applications of artificial intelligence in dental imaging. Oral Radiol. 2021;37:153-60.
- Panetta K, Rajendran R, Ramesh A, et al. Tufts Dental Database: a multimodal panoramic X-Ray dataset for benchmarking diagnostic systems. IEEE J Biomed Health Inform. 2022;26:1650-9.
- Jader G, Fontineli J, Ruiz M, et al. Deep instance segmentation of teeth in panoramic x-ray images. 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). New York: IEEE; 2018. p. 400-7.
- Ministério da Saúde (Brasil). Brasil Sorridente [Internet]. Brasília (DF): Ministério da Saúde; 2025 [cited 2025 Nov 6]. Available from: https://www.gov.br/saude/pt-br/composicao/saps/brasil-sorridente
- Carvalho DF, Camacho-Guerrero JA, Marques PM, Macedo AA. Lyria PACS: a case study saves ten million dollars in a Brazilian hospital. In: 28th IEEE International Symposium on Computer-Based Medical Systems; 2015; São Carlos, Brazil. p. 326–9. doi: 10.1109/CBMS.2015.87.
- I-medsys. Lyria PACS RT [Internet]. Ribeirão Preto (BR): I-medsys; [cited 2025 Nov 6]. Available from: https://i-medsys.com/lyriaRTusa.html
- Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: common objects in context. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T, editors. Computer Vision: ECCV 2014. Cham: Springer; 2014. p. 740–55. (Lecture Notes in Computer Science; vol. 8693). doi: 10.1007/978-3-319-10602-1_48.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files. In addition, users must have individual studies reviewed by the contributor.
License (for files):
PhysioNet Contributor Review Health Data License 1.5.0
Data Use Agreement:
PhysioNet Contributor Review Health Data Use Agreement 1.5.0
Required training:
No training required
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/r5nt-we67
DOI (latest version):
https://doi.org/10.13026/85hv-ct26
Project Website:
https://inredd.com.br/en/solutions/open-data
Files
To access the files, you must:
- be a credentialed user
- submit a request to the authors to use the data for your project