Database Credentialed Access

Published: June 3, 2021. Version: 1.0.0

Jain, S., Agrawal, A., Saporta, A., Truong, S. Q., Nguyen Duong, D., Bui, T., Chambon, P., Lungren, M., Ng, A., Langlotz, C., & Rajpurkar, P. (2021). RadGraph: Extracting Clinical Entities and Relations from Radiology Reports (version 1.0.0). PhysioNet. https://doi.org/10.13026/hm87-5p47.

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

## Abstract

RadGraph is a dataset of entities and relations in full-text radiology reports. We designed a novel information extraction (IE) schema to structure clinical information in a radiology report with four entities and three relations. Our train set consists of 500 MIMIC-CXR radiology reports annotated according to our schema by board-certified radiologists. Our test set consists of 50 MIMIC-CXR and 50 CheXpert reports, which are independently annotated by two board-certified radiologists. Additionally, we release annotations generated by a benchmark deep learning model that achieves a micro F1 of 0.82 (MIMIC-CXR test set) and 0.73 (CheXpert test set) on an evaluation metric for end-to-end relation extraction, where entity boundaries, entity types, and relation type must be correct. We use our model to automatically generate entity and relation labels across 220,763 MIMIC-CXR reports and 500 CheXpert reports, where annotations can be mapped to associated chest radiographs in the MIMIC-CXR and CheXpert datasets, respectively. The dataset, which includes reports, entities, and relations, is de-identified according to the US Health Insurance Portability and Accountability Act (HIPAA). This dataset is intended to support the development of natural language processing (NLP) methods for entity and relation extraction in radiology as well as enable multi-modal use cases that can leverage entities, relations, and associated radiographs.

## Background

Clinically relevant information from radiology text reports can be used for a variety of purposes, ranging from large-scale training of medical imaging models to population-level analysis. However, radiology reports largely consist of unstructured text written by radiologists for referring clinicians, which can be challenging to process for various applications. As such, numerous approaches have been developed to extract information from radiology reports.

Large-scale datasets, such as CheXpert [1] and MIMIC-CXR [2], use automated radiology report labelers [1][3][4][5] to extract common medical conditions from radiology reports. These labels do not capture fine-grained information, such as the specific entities and relations present in a radiology report. Other approaches capture more specific information in radiology reports by adopting entity extraction schemas [6][7] or other schemas that focus on facts [8] and spatial relations [9].

We release the first dataset with dense annotations of both entities and relations in radiology reports for an information extraction schema designed to capture more clinically relevant information within medical reports.

## Methods

The creation of RadGraph required annotating natural language in the form of free-text radiology reports. Our dataset was annotated by three board-certified radiologists, each with at least eight years of experience, according to our information extraction schema using a text labeling platform (Datasaur.ai, Sunnyvale, CA). Our schema was developed with a board-certified radiologist, Dr. Curt Langlotz, based on prior experience [7] designing a radiology report schema and observing the practical difficulties of labeling according to the schema. Our schema was then iteratively improved to further support ease and consistency of annotation over the course of multiple pilot labeling initiatives. To maintain higher quality annotations, our dataset does not include any of the annotations obtained during the pilots. For the train and dev sets, simple annotation mistakes were manually corrected.

For train and dev sets, we use radiology reports from the MIMIC-CXR [2] dataset. Our train set consists of 425 reports, and our dev set consists of 75 reports. Each report in both the train and dev sets was labeled by one of three board-certified radiologists. The patients associated with reports in the train set do not overlap with any patients associated with reports in the dev set.

For a test set, we use radiology reports from both the MIMIC-CXR and CheXpert [1] datasets to test generalization across institutions. Our test set consists of 100 reports, 50 from the MIMIC-CXR dataset and 50 from the CheXpert dataset. Each report in the test set was independently labeled by two board-certified radiologists. The patients associated with reports in the test set do not overlap with any patients associated with the rest of the reports in our dataset.

Additionally, we train a DYGIE++ model for joint entity and relation extraction [10] on report texts and annotations in our train set using pretrained weights from PubMedBERT [11]. Our model achieves a micro F1 score of 0.82 (MIMIC-CXR test set) and 0.73 (CheXpert test set) on an evaluation metric for end-to-end relation extraction, where entity boundaries, entity types, and relation type must be correct. We then use this model to automatically generate entity and relation annotations for 220,763 MIMIC-CXR reports and 500 CheXpert reports, which do not overlap with the train, dev, and test sets.

MIMIC-CXR reports were already de-identified [2]. We de-identified CheXpert reports using an automated, transformer-based de-identification algorithm followed by manual review. For the CheXpert reports, we used a hiding-in-plain-sight (HIPS) approach, which replaced personal health information (PHI) with fake PHI. PHI in MIMIC-CXR reports was replaced with three consecutive underscores.

## Data Description

### Data Schema Overview

Our schema defines two broad entity types: Observation and Anatomy. The Observation entity type includes three uncertainty levels: Definitely Present, Uncertain, and Definitely Absent. Thus, in total, we have four entities, which are labeled as “ANAT-DP”, “OBS-DP”, “OBS-U”, and “OBS-DA”. Our schema defines three relations between entities, which are labeled as “suggestive_of”, “located_at”, and “modify”.

### Detailed Data Schema Description

An entity is a continuous span of text.

• The Anatomy entity refers to an anatomical body part that occurs in a radiology report, such as a “lung”.

• The Observation entities refer to observations made when referring to the associated radiology image. Observations are associated with visual features, identifiable pathophysiologic processes, or diagnostic disease classifications. For example, an Observation could be “effusion” or more general phrases like “increased”. Each Observation has an associated level of uncertainty.

Relations are directed arrows from one entity to another that describe relationships between two entities. We define the types of relations in the form of “relation type (entity_type, entity_type)”.

• suggestive_of (Observation, Observation) is a relation between two Observation entities indicating that the status of the second Observation is inferred from that of the first Observation.

• located_at (Observation, Anatomy) is a relation between an Observation entity and an Anatomy entity indicating that the Observation is related to the Anatomy. While located_at often refers to location, it can also be used to describe other relations between an Observation and an Anatomy.

• modify (Observation, Observation) or (Anatomy, Anatomy) is a relation between two Observation entities or two Anatomy entities indicating that the first entity modifies the scope of, or quantifies the degree of, the second entity. As a result, all Observation modifiers are annotated as Observation entities, and all Anatomy modifiers are annotated as Anatomy entities.
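The four entity labels and three relation types above can be mirrored in a small set of helper types. The following is a hypothetical sketch (the class names and fields are ours, chosen to match the keys used in the released JSON files), not code shipped with the dataset:

```python
from dataclasses import dataclass, field

# Labels defined by the RadGraph schema.
ENTITY_LABELS = {"ANAT-DP", "OBS-DP", "OBS-U", "OBS-DA"}
RELATION_TYPES = {"suggestive_of", "located_at", "modify"}

@dataclass
class Relation:
    relation_type: str  # one of RELATION_TYPES
    object_id: str      # id of the entity this relation points to

@dataclass
class Entity:
    tokens: str    # text span of the entity
    label: str     # one of ENTITY_LABELS
    start_ix: int  # zero-based index of the entity's first token
    end_ix: int    # zero-based index of the entity's last token
    relations: list[Relation] = field(default_factory=list)
```

Because relations are directed, each `Relation` lives on its subject entity and points to the object entity by id, matching the structure of the released files.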

### File Overview

The dataset consists of the following:

• train.json: File containing annotations obtained by board-certified radiologists for 425 radiology reports (MIMIC-CXR)

• dev.json: File containing annotations obtained by board-certified radiologists for 75 radiology reports (MIMIC-CXR)

• test.json: File containing annotations obtained by 2 different board-certified radiologists for 100 radiology reports (50 MIMIC-CXR, 50 CheXpert)

• MIMIC-CXR_graphs.json: File containing annotations obtained by our deep learning model for 220,763 radiology reports (MIMIC-CXR)

• CheXpert_graphs.json: File containing annotations obtained by our deep learning model for 500 radiology reports (CheXpert)

• models/model_checkpoint: Folder containing the saved model parameters used to automatically generate the annotations in MIMIC-CXR_graphs.json and CheXpert_graphs.json

• models/README.txt: File containing instructions for performing inference using model checkpoint

• models/inference.py: File containing code to extract entities and relations given a directory containing reports in txt format

### Detailed File Structure Description

Each json file holds a dictionary. The dev.json, train.json, MIMIC-CXR_graphs.json, and CheXpert_graphs.json files are organized as follows:

• The keys of the dictionary are unique identifiers for each report in the dataset split. For MIMIC-CXR reports, the unique identifier is the path to the relevant file in the MIMIC-CXR dataset. For CheXpert reports, the unique identifier is the path to the relevant chest radiograph study in the CheXpert dataset (this mapping applies only to the inference split; for the test split, the unique identifier has no such mapping). The keys map to a nested dictionary containing information about the report.

• The keys of the nested dictionary are “data_source”, “data_split”, “text”, and “entities”.

• “data_source” maps to the dataset that contains the report, which is either MIMIC-CXR or CheXpert.

• “data_split” maps to either train, dev, test, or inference. The inference split indicates that the annotations were automatically generated by a model.

• “text” maps to a string that holds a report, where each token in the string is separated by a space. To support entity extraction, punctuation characters (periods, colons, etc.) are distinct tokens that have been separated by spaces.

• “entities” maps to a dictionary of entities labeled in the report. Each entity has an “entity_id”, which is a unique identifier of the entity in the report. “entity_id” maps to a dictionary with the following keys:

• “tokens” maps to the string of one or more tokens that make up the entity.

• “label” maps to one of the four entity labels defined by the schema.

• “start_ix” maps to the index of the entity’s first token, using zero-based indexing.

• “end_ix” maps to the index of the entity’s last token, using zero-based indexing.

• “relations” maps to a list of relations for which the entity is the subject. Each relation is a tuple of (“relation_type”, “object_id”). The “relation_type” is one of the three relations defined by the schema. The “object_id” is the id of the other entity in the relation.
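As a sketch of how these files can be consumed, the snippet below iterates over every relation in a RadGraph-style annotation file and resolves the object entity's text. The function name is ours, and the file path in the usage comment is a placeholder:

```python
import json

def iter_relations(path):
    """Yield (report_id, subject_tokens, relation_type, object_tokens)
    for every relation in a RadGraph-style annotation file."""
    with open(path) as f:
        reports = json.load(f)
    for report_id, report in reports.items():
        entities = report["entities"]
        for entity in entities.values():
            # Each relation is a ("relation_type", "object_id") pair.
            for relation_type, object_id in entity["relations"]:
                yield (report_id, entity["tokens"],
                       relation_type, entities[object_id]["tokens"])

# Example usage (path is a placeholder):
# for report_id, subj, rel, obj in iter_relations("train.json"):
#     print(f"{report_id}: {subj} --{rel}--> {obj}")
```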

The test.json file is organized as follows:

• The keys of the dictionary are unique identifiers for each report in the test set, following the same convention defined for dev.json and train.json above. The keys likewise map to a nested dictionary containing information about the report.

• The keys of the nested dictionary are “data_source” (defined above), “data_split” (defined above), “text” (defined above), “labeler_1”, and “labeler_2”.

• Since there are two labelers, “labeler_1” and “labeler_2” both map to a nested dictionary (defined above). The only difference from the nested dictionary defined above is that for each labeler, the “text” key is not included, as it is shared across labelers.
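Since each test report carries two independent annotation sets, a natural first analysis is span-level agreement between the labelers. The sketch below treats an entity as the tuple (start_ix, end_ix, label) and computes Jaccard overlap of the two annotators' spans; this matching criterion is our illustration, not the evaluation metric used for the benchmark model:

```python
def entity_agreement(report):
    """Jaccard overlap of labeled entity spans between the two
    annotators of a single test.json report."""
    def spans(annotation):
        return {(e["start_ix"], e["end_ix"], e["label"])
                for e in annotation["entities"].values()}
    a = spans(report["labeler_1"])
    b = spans(report["labeler_2"])
    if not a and not b:
        return 1.0  # neither annotator marked anything: perfect agreement
    return len(a & b) / len(a | b)
```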

We provide an example for a particular report, where the text for the report is: “The lungs are clear. Cardiomediastinal and hilar contours are normal. There are no pleural effusions or pneumothorax.” This sample with annotated entities and relations is visualized in the following file: Example_Annotation_Figure.png.

We provide the corresponding dictionary for this example following the same formatting as examples in train.json, dev.json, CheXpert_graphs.json, and MIMIC-CXR_graphs.json:

{
  'data_source': 'MIMIC-CXR',
  'data_split': 'train',
  'entities': {
    '1': {'tokens': 'lungs', 'label': 'ANAT-DP',
          'start_ix': 1, 'end_ix': 1,
          'relations': []},
    '2': {'tokens': 'clear', 'label': 'OBS-DP',
          'start_ix': 3, 'end_ix': 3,
          'relations': [['located_at', '1']]},
    '3': {'tokens': 'Cardiomediastinal', 'label': 'ANAT-DP',
          'start_ix': 5, 'end_ix': 5,
          'relations': []},
    '4': {'tokens': 'hilar', 'label': 'ANAT-DP',
          'start_ix': 7, 'end_ix': 7,
          'relations': []},
    '5': {'tokens': 'contours', 'label': 'ANAT-DP',
          'start_ix': 8, 'end_ix': 8,
          'relations': [['modify', '3'], ['modify', '4']]},
    '6': {'tokens': 'normal', 'label': 'OBS-DP',
          'start_ix': 10, 'end_ix': 10,
          'relations': [['located_at', '3'], ['located_at', '4']]},
    '7': {'tokens': 'pleural', 'label': 'ANAT-DP',
          'start_ix': 15, 'end_ix': 15,
          'relations': []},
    '8': {'tokens': 'effusions', 'label': 'OBS-DA',
          'start_ix': 16, 'end_ix': 16,
          'relations': [['located_at', '7']]},
    '9': {'tokens': 'pneumothorax', 'label': 'OBS-DA',
          'start_ix': 18, 'end_ix': 18,
          'relations': []}
  },
  'text': 'The lungs are clear . Cardiomediastinal and hilar contours are normal . There are no pleural effusions or pneumothorax .'
}
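Because “text” is space-tokenized and start_ix/end_ix are inclusive zero-based token indices, an entity's tokens can be recovered by slicing the token list. Below is a minimal sketch of a consistency check one might run over a report dictionary (the function name is ours):

```python
def check_entities(report):
    """Verify that each entity's start_ix..end_ix slice of the space-tokenized
    text reproduces its 'tokens' string; return the number of entities checked."""
    tokens = report["text"].split(" ")
    checked = 0
    for entity in report["entities"].values():
        # end_ix is inclusive, so extend the slice by one.
        span = " ".join(tokens[entity["start_ix"]:entity["end_ix"] + 1])
        assert span == entity["tokens"], (span, entity["tokens"])
        checked += 1
    return checked
```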

## Usage Notes

Use of the dataset is free to all researchers after signing a data use agreement which stipulates, among other items, that (1) the user will not share the data, (2) the user will make no attempt to re-identify individuals, and (3) any publication that makes use of the data will also make the relevant code available.

The data has been used to train and test an entity and relation extraction model, which is also included in this release. We anticipate that researchers will use the data in the following ways: (1) develop NLP models for entity and relation extraction in radiology, (2) use our pre-trained model to label radiology report datasets, (3) develop multi-modal models that leverage our graphs (entities/relations) generated from radiology reports and the associated chest radiographs.

The data has the following limitations. (1) Our annotations do not capture the clinical context in a radiology report, such as information included in the Comparison or History section of the report. We focus on extracting the clinically relevant information in a report associated with the image being examined. (2) Our annotations are limited to chest X-ray radiology reports from MIMIC-CXR for our train set and MIMIC-CXR / CheXpert for our test set.

## Release Notes

Version 1.0.0: Initial upload of dataset

## Acknowledgements

We would like to acknowledge Datasaur.ai for generously providing us access to their labeling platform. We would like to acknowledge Leo Anthony Celi and Tom Pollard from the MIMIC-CXR team and Ethan Chi from Stanford University for their support.

## Conflicts of Interest

No conflicts of interest to declare.

## References

1. Irvin, Jeremy, et al. "Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. No. 01. 2019.
2. Johnson, Alistair EW, et al. "MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports." Scientific data 6.1 (2019): 1-8.
3. Smit, Akshay, et al. "CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT." arXiv preprint arXiv:2004.09167 (2020).
4. Jain, Saahil, et al. "VisualCheXbert: addressing the discrepancy between radiology report labels and image labels." Proceedings of the Conference on Health, Inference, and Learning. 2021.
5. Peng, Yifan, et al. "Negbio: a high-performance tool for negation and uncertainty detection in radiology reports." AMIA Summits on Translational Science Proceedings 2018 (2018): 188.
6. Sugimoto, Kento, et al. "Extracting clinical terms from radiology reports with deep learning." Journal of Biomedical Informatics 116 (2021): 103729.
7. Hassanpour, Saeed, and Curtis P. Langlotz. "Information extraction from multi-institutional radiology reports." Artificial intelligence in medicine 66 (2016): 29-39.
8. Steinkamp, Jackson M., et al. "Toward complete structured information extraction from radiology reports using machine learning." Journal of digital imaging 32.4 (2019): 554-564.
9. Datta, Surabhi, et al. "Understanding spatial language in radiology: representation framework, annotation, and spatial relation extraction from chest X-ray reports using deep learning." Journal of biomedical informatics 108 (2020): 103473.
10. Wadden, David, et al. "Entity, relation, and event extraction with contextualized span representations." arXiv preprint arXiv:1909.03546 (2019).
11. Gu, Yu, et al. "Domain-specific language model pretraining for biomedical natural language processing." arXiv preprint arXiv:2007.15779 (2020).

##### Parent Projects
RadGraph: Extracting Clinical Entities and Relations from Radiology Reports was derived from one or more parent projects; please cite them when using this project.
##### Access

Access Policy:
Only credentialed users who sign the DUA can access the files.