Database Credentialed Access
RadGraph-XL: A Large-Scale Expert-Annotated Dataset for Entity and Relation Extraction from Radiology Reports
Published: Sept. 12, 2025. Version: 1.0.0
When using this resource, please cite:
Delbrouck, J. (2025). RadGraph-XL: A Large-Scale Expert-Annotated Dataset for Entity and Relation Extraction from Radiology Reports (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/j8e7-pr22
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
Radiology reports are essential for clinical care but pose challenges for automated processing due to their unstructured nature. Existing datasets like RadGraph-1.0 focus narrowly on chest X-rays (CXR), limiting their applicability. We introduce RadGraph-XL, a large-scale, expert-annotated dataset of 2,300 radiology reports with more than 400,000 labeled entities and relations, spanning four anatomy–modality pairs: chest computed tomography (CT), abdomen/pelvis CT, brain magnetic resonance imaging (MR), and CXR.
Each report is annotated by board-certified radiologists using a detailed schema that captures observations, anatomical references, and their relationships. A novel post-processing step identifies measurement-related entities, a clinically valuable category. Models trained on RadGraph-XL outperform prior methods and GPT-4, and generalize well to out-of-domain data such as deep vein thrombosis (DVT) ultrasound reports.
RadGraph-XL is released publicly with models and annotations to support applications in clinical natural language processing (NLP), medical imaging artificial intelligence, and foundation model evaluation, setting a new benchmark for structured information extraction in radiology.
Background
Traditionally, extracting structured data from radiology reports has been difficult because these reports are written in free-text form and contain specialized medical terminology. Prior efforts, such as RadGraph-1.0 [1], focused on chest X-ray (CXR) data and provided a valuable framework for labeling clinical entities (e.g., anatomies and observations) and their relationships (e.g., "located at," "modify," "suggestive of"). However, RadGraph-1.0 was limited to just one imaging modality (CXR) and thus could not meet the increasing need for fine-grained, structured information across a broader range of anatomical regions and imaging techniques.
With radiology research expanding to modalities such as computed tomography (CT) and magnetic resonance imaging (MRI) and to anatomical regions such as the chest, abdomen/pelvis, and brain, and with tasks like clinical monitoring, disease tracking, and artificial intelligence (AI)-driven image analysis relying on richer annotations, there was a clear gap. RadGraph-XL was therefore created to provide a large-scale, expert-labeled dataset encompassing multiple anatomy–modality pairs (chest CT, abdomen/pelvis CT, brain MRI, and CXR). By significantly increasing the data volume and the complexity of annotations, RadGraph-XL seeks to advance automated radiology report analysis, improve model performance, and enable new research on measurement extraction and structured information retrieval.
Methods
Report Selection
A total of 2,300 radiology reports were curated from two large institutional sources: Medical Information Mart for Intensive Care Chest X-ray (MIMIC-CXR) and Stanford Health Care. Rather than using all available reports, we employed a targeted sampling strategy to ensure clinical diversity and semantic coverage across different imaging contexts. Specifically, we focused on four modality–anatomy pairs:
- Chest computed tomography (CT)
- Abdomen/Pelvis CT
- Brain magnetic resonance imaging (MRI)
- Chest X-rays (CXR)
The report selection process involved three steps:
- Condition coverage: Reports were filtered to ensure a balanced representation of disease types and imaging findings.
- Semantic clustering: We used Universal Sentence Encoder (USE) embeddings and t-distributed stochastic neighbor embedding (t-SNE) projection to cluster the reports by content similarity, and sampled uniformly across clusters to maintain topical diversity (a sketch follows this list).
- Length stratification: Reports were grouped by sentence length and sampled proportionally to represent both concise and complex narratives.
This approach ensured that the dataset includes a wide range of diagnostic content, report styles, and anatomical references.
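As an illustration of the clustering-based sampling in step 2, a minimal sketch is shown below. It assumes the public TF-Hub Universal Sentence Encoder module and scikit-learn; the cluster count and per-cluster sample size are illustrative values, not the settings used to build RadGraph-XL.

```python
# Sketch of diversity-oriented report sampling: embed with the Universal
# Sentence Encoder, project with t-SNE, cluster, sample uniformly per cluster.
# Hyperparameters here are illustrative, not the dataset's actual settings.
import numpy as np
import tensorflow_hub as hub
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def sample_diverse_reports(reports, n_clusters=20, per_cluster=5, seed=0):
    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    vectors = embed(reports).numpy()                      # (n_reports, 512)
    coords = TSNE(n_components=2, random_state=seed).fit_transform(vectors)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(coords)
    rng = np.random.default_rng(seed)
    picked = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        picked.extend(rng.choice(members, size=min(per_cluster, len(members)),
                                 replace=False))
    return [reports[i] for i in picked]
```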
Annotation Schema
We adopted and extended the RadGraph-1.0 schema, which defines entities and relations within the text of radiology reports. Each entity is a span of text labeled according to both clinical type and certainty:
Entity Labels:
- Anatomy: Definitely Present – A body structure that is clearly present or referenced in the report (e.g., "right lung").
- Anatomy: Definitely Absent – A body structure that is noted as missing or removed (e.g., "absent gallbladder").
- Anatomy: Uncertain – Unclear presence or visibility of a body part (e.g., "possible adrenal gland").
- Observation: Definitely Present – A radiologic finding, diagnosis, or visual feature confidently stated (e.g., "pleural effusion").
- Observation: Definitely Absent – A finding that is explicitly negated (e.g., "no pneumothorax").
- Observation: Uncertain – Findings described with uncertainty or ambiguity (e.g., "could represent a mass").
In addition, we introduced a post-processing step to detect measurement-related entities, such as "4.6 cm" or "less than 6 mm", which are important for quantitative assessment in radiology.
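Such spans are well suited to rule-based detection. The pattern below is an illustration of how measurement mentions can be matched; it is not the actual post-processing rule set used for RadGraph-XL, which may differ.

```python
# Illustrative regex for measurement spans such as "4.6 cm", "less than 6 mm",
# or "2.5 x 1.5 cm". Not the dataset's actual post-processing rules.
import re

MEASUREMENT = re.compile(
    r"(?:(?:less|greater)\s+than\s+)?"            # optional qualifier
    r"\d+(?:\.\d+)?"                              # leading number
    r"(?:\s*[x×]\s*\d+(?:\.\d+)?)*"               # optional extra dimensions
    r"\s*(?:mm|cm|millimeters?|centimeters?)\b",  # unit
    re.IGNORECASE,
)

def find_measurements(text):
    """Return (start, end, span) tuples for measurement mentions."""
    return [(m.start(), m.end(), m.group(0)) for m in MEASUREMENT.finditer(text)]

print(find_measurements("A nodule measuring 4.6 cm, previously less than 6 mm."))
# [(19, 25, '4.6 cm'), (38, 52, 'less than 6 mm')]
```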
Relation Labels:
- Modify – One entity changes or qualifies another (e.g., "small mass" where "small" modifies "mass").
- Located At – An observation is associated with an anatomical site (e.g., "effusion" located at "left pleural space").
- Suggestive Of – One observation implies another (e.g., "consolidation" suggestive of "pneumonia").
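Put together, the schema turns a sentence into a small graph. The hand-built toy example below shows one plausible annotation of a short finding; exact span boundaries and relation choices follow annotator conventions and may differ in the released data.

```python
# Hand-built toy annotation for "Small left pleural effusion." Labels follow
# the schema above; span and relation choices are illustrative only.
annotation = {
    "text": "Small left pleural effusion.",
    "entities": [
        {"id": 1, "span": "Small",    "label": "Observation::definitely present"},
        {"id": 2, "span": "left",     "label": "Anatomy::definitely present"},
        {"id": 3, "span": "pleural",  "label": "Anatomy::definitely present"},
        {"id": 4, "span": "effusion", "label": "Observation::definitely present"},
    ],
    "relations": [
        {"source": 1, "target": 4, "type": "modify"},      # "Small" qualifies "effusion"
        {"source": 2, "target": 3, "type": "modify"},      # "left" qualifies "pleural"
        {"source": 4, "target": 3, "type": "located_at"},  # effusion sits at the pleura
    ],
}
```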
Expert Review Process
Each report was double-annotated by board-certified radiologists, with a required minimum inter-annotator agreement rate of 50%. Disagreements were reviewed and resolved by an adjudicating radiologist. This rigorous process resulted in 406,141 validated annotations covering a broad and balanced distribution of entities and relations.
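Agreement between the two annotators can be scored in several ways; a simple option is exact-match F1 over labeled entity spans, sketched below. This is one plausible formulation, not necessarily the metric behind the 50% threshold.

```python
# Exact-match span F1 between two annotators. Each annotation set contains
# (start, end, label) tuples. One plausible agreement metric; the project's
# actual definition may differ.
def span_f1(ann_a, ann_b):
    a, b = set(ann_a), set(ann_b)
    if not a and not b:
        return 1.0                     # trivially agree on an empty report
    tp = len(a & b)                    # spans both annotators produced
    precision = tp / len(a) if a else 0.0
    recall = tp / len(b) if b else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```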
Data Description
Scope and Coverage
The dataset includes 2,300 expert-annotated radiology reports across four modality–anatomy pairs:
- Chest computed tomography (CT)
- Abdomen/Pelvis CT
- Brain magnetic resonance imaging (MRI)
- Chest X-ray (CXR)
Reports originate from two sources, MIMIC-CXR and Stanford Health Care, to ensure stylistic and clinical diversity. This repository release contains only the MIMIC subset.
Dataset Structure and Variables
Each report is annotated with:
- Entities
  - Anatomy: present, absent, or uncertain
  - Observation: present, absent, or uncertain
  - Measurements: spans expressing sizes or dimensions (e.g., "5 mm", "2.5 × 1.5 cm")
- Relations
  - modify
  - located at
  - suggestive of
Annotations are stored in structured formats (e.g., JSON), containing:
- Full report text
- List of entities (text span, label, type)
- List of relations (source entity, target entity, type)
Descriptive Statistics
| Statistic | Value |
| --- | --- |
| Total Reports | 2,300 |
| Sources | MIMIC-CXR, Stanford Health Care |
| Modality–Anatomy Pairs | 4 (Chest CT, Abdomen/Pelvis CT, Brain MR, Chest X-ray) |
| Average Report Length | ~410 words |
| Length Range (min–max) | ~100 to 600+ words |
| Total Entities | 226,563 |
| — Anatomy Entities | 113,121 |
| — Observation Entities | 113,442 |
| Entity Certainty Breakdown | |
| — Definitely Present | 82,522 (observations), 113,114 (anatomy) |
| — Definitely Absent | 22,882 (observations), 4 (anatomy) |
| — Uncertain | 8,038 (observations), 3 (anatomy) |
| Total Relations | 179,578 |
| — Modify | 113,679 (63.3%) |
| — Located At | 59,154 (32.9%) |
| — Suggestive Of | 6,745 (3.8%) |
| Measurements | 3,297 entities annotated post hoc |
| — Most common in Abdomen/Pelvis CT | 1,421 mentions |
| Unique Entity Types | 19,772 (text, label) combinations |
| Unique Relation Triplets | 67,323 (source entity, target entity, label) |
| Average Agreement (double annotation) | ≥ 50% across all modalities |
Data Splits
For reproducible experiments and fair comparisons, the dataset is divided into:
- Training set: 2,320 reports
- Validation set: 290 reports
- Test set: 290 reports (used for official benchmarking)
Use Cases
This dataset is particularly suited for:
- Clinical Named Entity Recognition (NER)
- Relation extraction between anatomical and pathological concepts
- Developing and evaluating medical information extraction pipelines
- Benchmarking generalist or task-specific language models in healthcare NLP
- Exploring cross-modality generalization
Sample
```
{
  "dataset": "mimic-chest-ct",
  "doc_key": 0,
  "sentences": [
    ["STUDY", ":", "CT", "torso", ".", "HISTORY", ":", "Metastatic", "breast", "cancer", "..."]
  ],
  "ner": [
    [
      [77, 77, "Anatomy::definitely present"],
      [78, 78, "Observation::definitely present"],
      ...
    ]
  ],
  "relations": [
    [
      [78, 78, 77, 77, "located_at"],
      [84, 84, 83, 83, "modify"],
      ...
    ]
  ]
}
```
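The token indices in "ner" and "relations" refer to positions in the flattened token sequence of the document. A minimal reader is sketched below; the file name is hypothetical, but the field names match the sample above.

```python
# Minimal reader for the document format shown above: flatten the sentences
# and resolve (start, end) token indices back to text spans.
import json

def resolve_entities(doc):
    tokens = [tok for sent in doc["sentences"] for tok in sent]
    entities = []
    for sentence_ner in doc["ner"]:
        for start, end, label in sentence_ner:
            entities.append((" ".join(tokens[start:end + 1]), label))
    return entities

with open("radgraph_xl.jsonl") as f:   # hypothetical file name
    for line in f:
        doc = json.loads(line)
        print(doc["doc_key"], resolve_entities(doc)[:3])
```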
Usage Notes
Potential Applications
RadGraph-XL is a versatile resource designed to advance clinical natural language processing (NLP) in radiology. Key applications include:
- Entity and Relation Extraction: Training models to identify and link anatomical structures, clinical observations, and measurements in radiology reports.
- Measurement Understanding: Specialized support for analyzing size and length descriptors (e.g., "2.5 centimeters (cm)", "less than 6 millimeters (mm)"), which are critical for monitoring disease progression.
- Structured Reporting: Enabling automated conversion of free-text radiology findings into structured data formats to support electronic medical record (EMR) integration and clinical workflows.
- Summarization & Report Generation: Facilitating high-quality summarization and generation of radiology findings from raw text or imaging data.
- Model Evaluation: Providing a standardized benchmark for comparing clinical NLP models, including domain-specific transformer architectures and large language models (LLMs).
- Clinical Decision Support: Laying the foundation for downstream tasks like diagnostic assistance and patient trajectory analysis.
Resources and Tooling
The official RadGraph-XL GitHub repository provides:
- Downloadable data files, annotation schema, and data splits.
- Pretrained DyGIE++ [2] models (using the Biomedical Vision-Language Pretraining Chest X-ray BERT encoder, BiomedVLP-CXR-BERT) and Span-based Entity and Relation Transformer (SpERT) [3] models for entity–relation extraction.
- A detailed model card describing training parameters, performance metrics, and usage instructions.
- Python utilities for:
  - Parsing and preprocessing the .jsonl data
  - Evaluating predictions using official metrics
  - Visualizing entity–relation graphs
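As one way to inspect the annotations without the repository's own tooling, the sketch below builds a directed graph with networkx; doc is a parsed document in the format shown in the Sample section.

```python
# Build a directed entity-relation graph from one annotated document using
# networkx; nodes are (start, end) spans, edges carry the relation type.
import networkx as nx

def build_graph(doc):
    tokens = [tok for sent in doc["sentences"] for tok in sent]
    graph = nx.DiGraph()
    for sentence_ner in doc["ner"]:
        for start, end, label in sentence_ner:
            graph.add_node((start, end),
                           text=" ".join(tokens[start:end + 1]), label=label)
    for sentence_rel in doc["relations"]:
        for s1, e1, s2, e2, relation in sentence_rel:
            graph.add_edge((s1, e1), (s2, e2), relation=relation)
    return graph
```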
Software Requirements:
- Python ≥ 3.7
- PyTorch ≥ 1.7
- Hugging Face Transformers library
These tools enable rapid experimentation and seamless integration into research pipelines.
Known Limitations
While RadGraph-XL introduces several innovations, users should be aware of the following limitations:
- Limited modality coverage: Only four modality–anatomy pairs are covered; modalities like ultrasound, mammography, or positron emission tomography (PET) are not included.
- Institutional bias: Although sourced from two institutions (Medical Information Mart for Intensive Care Chest X-ray, MIMIC-CXR, and Stanford Health Care), institutional language and documentation style may not generalize globally.
- Measurement handling: Measurement entities were added via post-processing and not manually validated in the same way as core annotations, which may introduce minor inconsistencies.
- Annotation agreement: Inter-rater agreement was ≥50%, indicating variability in complex cases despite expert review.
- Data availability: Only the MIMIC subset is publicly released due to data-sharing restrictions.
Ethics
RadGraph-XL enables structured data extraction from radiology reports, supporting medical AI and clinical research. The dataset is de-identified and IRB-approved, ensuring patient privacy. While our models improve information retrieval, potential biases and misinterpretations must be carefully monitored. We encourage fairness audits across demographics. RadGraph-XL is released for research purposes and should not be used in clinical care without further validation.
Conflicts of Interest
We do not report any financial or personal relationships that could be construed as conflicts of interest. The dataset is made publicly available for non-commercial research purposes, and all contributing institutions have approved its release under these terms.
References
1. Jain S, Agrawal A, Saporta A, Truong S, Duong DN, Bui T, Chambon P, Zhang Y, Lungren MP, Ng AY, Langlotz C, Rajpurkar P. RadGraph: Extracting Clinical Entities and Relations from Radiology Reports. In: Proceedings of the Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1); 2021.
2. Wadden D, Wennberg U, Luan Y, Hajishirzi H. Entity, Relation, and Event Extraction with Contextualized Span Representations. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019. p. 5784–9.
3. Eberts M, Ulges A. Span-based Joint Entity and Relation Extraction with Transformer Pre-training. In: Proceedings of the European Conference on Artificial Intelligence (ECAI 2020). IOS Press; 2020. p. 2006–13.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/j8e7-pr22
DOI (latest version):
https://doi.org/10.13026/6tw7-rq96
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project