Database Credentialed Access

Chest ImaGenome Dataset

Published: July 13, 2021. Version: 1.0.0

Wu, J., Agu, N., Lourentzou, I., Sharma, A., Paguio, J., Yao, J. S., Dee, E. C., Mitchell, W., Kashyap, S., Giovannini, A., Celi, L. A., Syeda-Mahmood, T., & Moradi, M. (2021). Chest ImaGenome Dataset (version 1.0.0). PhysioNet. https://doi.org/10.13026/wv01-y230.

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

In recent years, with the release of multiple large datasets, automatic interpretation of chest X-ray (CXR) images with deep learning models has become feasible for specific abnormalities or for generating preliminary reports. However, despite reports of performance reaching levels similar to that of radiologists, quantitative evaluation of the explainability of these models is hampered by the lack of locally labeled datasets for different findings. With the exception of a few human-labeled small-scale datasets for specific findings, such as pneumonia and pneumothorax, most CXR deep learning models to date are trained on global "weak" labels extracted from text reports, or trained via a joint image and unstructured text learning strategy. In our work, a joint rule-based natural language processing (NLP) and CXR atlas-based bounding box detection pipeline is used to automatically label 242,072 frontal MIMIC CXRs locally. Inspired by the Visual Genome effort in the computer vision community [20], we constructed the first Chest ImaGenome dataset, which uses a scene graph data structure to describe the data. Through a radiologist-constructed CXR ontology, the annotations for each CXR are connected as an anatomy-centered scene graph, useful for image-level reasoning and multimodal fusion applications. Overall, our dataset contributes significantly to the research community by providing 1) 1,256 combinations of relation annotations between 29 CXR anatomical locations (objects with bounding box coordinates) and their attributes, structured as a scene graph per image, 2) over 670,000 localized comparison relations (improved, worsened, or no change) between the anatomical locations across sequential exams, as well as 3) a manually annotated gold standard scene graph dataset from 500 unique patients.

Background

Recently, to accelerate radiology workflows, multiple large chest X-ray (CXR) datasets [1-4] have been released by the research community, which can be used to develop automatic abnormality detection or report generation algorithms. For detecting specific abnormalities from images, natural language processing (NLP) algorithms have been used to extract "weak" global labels (CXR abnormalities) from the associated reports [2, 5, 6]. For automatic report generation, self-supervised joint text and image architectures [7-11], first inspired by the image captioning related work in the non-medical domain [12-16], have been used to produce preliminary free-text radiology reports. However, both approaches lack rigorous localization assessment for explainability - namely, whether the model attended to the relevant anatomical location(s) (i.e. plausible causal features) for their predictions, which is critical for clinical applications. The latter joint image and text learning strategy is also known to learn heavy language priors from the text reports without having truly learned to interpret the imaging features [17,18]. Furthermore, even though architectures suitable for comparing imaging changes are available [33, 34], limited work has used NLP to derive change relations between exams from large datasets for training imaging models that can track progression for a wider variety of CXR findings or diseases.

To the best of our knowledge, no prior CXR dataset has attempted to automatically 1) extract relations between CXR attributes (labels) from reports and their anatomical locations (objects with bounding box coordinates) on the images as documented by the reporting radiologists, or 2) annotate localized comparison relations between sequential CXR exams. Recent work presents a smaller manually annotated dataset that has some similarity to our efforts, though their dataset did not include comparison relations or imaging bounding box annotations [19]. Research on these two topics is valuable because radiology reports are in effect records of radiologists' complex clinical reasoning processes, where the anatomical location of observed imaging abnormalities is often used to narrow down potential diagnoses, as well as to integrate information from other clinical modalities (e.g. CT findings, labs, etc.) at the anatomical level. Sequential exams are also routinely used by bedside clinicians to track patients' clinical progress after they are started on different management paths. Therefore, documentation comparing sequential exams is prevalent in CXR reports, and these are clinically meaningful relations to learn about. Automatically structuring this type of documented radiology knowledge, as well as disease progression descriptions from reports, will help improve explainability evaluation and widen downstream clinical applications for CXR imaging algorithm development.

Furthermore, advanced algorithms for object detection and domain-knowledge-driven reasoning in the non-medical domain require a starting dataset that has localized labels on the images and meaningful relationships between them to learn about. In the non-medical domain, large locally labeled graph datasets (e.g., Visual Genome dataset [20]) enabled the development of algorithms that can integrate both visual and textual information and derive relationships between observed objects in images [21-23], as well as spurring a whole domain of research in visual question answering (VQA) and visual dialogue (VD) with the aim of developing interactive AI algorithms capable of reasoning over information from multiple sources [24-26]. These location, relation and semantic aware systems aim to capture important elements in the images in relation with complex human languages, in order to conversationally interact with humans about the visual content.

Our dataset makes an important step towards addressing this missing link in the medical imaging domain, starting with a large scene graph dataset for chest X-rays, which is one of the most commonly ordered imaging exams. The goal for releasing this dataset is to spur the development of algorithms that more closely reflect radiology experts’ reasoning processes. In addition, automatically describing localized imaging features in recognized medical semantics is the first step towards connecting potentially predictive pixel-level features from medical images with the rest of the digitalized patient records and external medical ontologies. These connections could aid both the development of anatomically relevant multimodal fusion models and the discovery of localized imaging fingerprints, i.e., patterns predictive of patient outcomes. With this PhysioNet contribution, we make the first Visual Genome-like graph dataset in the CXR domain accessible for the research community.

Methods

We describe the construction of the Chest ImaGenome dataset and the curation effort for the gold standard dataset in more detail in our accompanying paper, currently under submission and peer review. Here we describe what is essential to understand and use the dataset.

Dataset Construction

The Chest ImaGenome dataset was automatically constructed from the MIMIC-CXR dataset [1] by borrowing several ideas from the construction of the Visual Genome dataset [20] in the non-medical domain. Whereas Visual Genome utilized web-based and crowd-sourced methods to manually collect all annotations, Chest ImaGenome harnessed NLP and image segmentation techniques to structure and add value to existing CXR images and their free-text reports, which were collected from radiologists in their routine workflow. We used atlas-based bounding box extraction techniques to structure the anatomies on frontal CXR images (AP or PA view) and used a rule-based text-analysis pipeline to relate the anatomies to various CXR attributes (findings, diseases, technical assessments, devices, etc.) extracted from 217,013 reports. Altogether, we automatically annotated 242,072 scene graphs that locally and graphically describe the frontal images associated with these reports (one report can have one or more frontal images). Our goal is not only to locally label attributes relevant for key anatomical locations on the CXR images, but also to extract radiology knowledge from a large corpus of CXR reports to aid future semantics-driven and multimodal clinical reasoning work.

The construction of the Chest ImaGenome dataset builds on the works of [5, 27]. In summary, the text pipeline [2] first sections the report and retains only the finding and impression sentences. Then it uses a CXR concept dictionary (lexicons) to spot and detect the context (negated or affirmed) of 271 different CXR-related named entities in each retained sentence. The lexicons were curated in advance by two radiologists in consensus using a concept expansion and vocabulary grouping engine [29]. A set of sentence-level filtering rules is applied to disambiguate some of the target concepts (e.g., 'collapse' mentioned in a CXR report can be about lung 'collapse' or related to spinal fracture, as in vertebral body 'collapse'). Then the named entities for CXR labels (attributes) are associated with the named entities for the anatomical location(s) described in the same sentence with a natural language parser, SpaCy [28]. Using a CXR ontology constructed by radiologists, the pipeline corrects obvious attribute-to-anatomy assignment errors (e.g. lung opacity wrongly assigned to mediastinum). Finally, the attributes for each of the target anatomical regions from repeated sentences are grouped to the exam level. The result is that, from each CXR report, we extract a radiology knowledge graph where CXR anatomical locations are related to the different documented CXR attribute(s). The "reason for exam" sentence(s) from each report, which contain free-text information about prior patient history, are kept separately in the final scene graph JSONs. Patient history information is critical for clinical reasoning but is not technically part of the "scene" for each CXR.

For detecting the anatomical "objects" on the CXR images that are associated with the extracted report knowledge graph, a separate anatomy atlas-based bounding box pipeline extracts the coordinates of those anatomies from each frontal image. This pipeline is an extension of [27] that covers additional anatomical locations in this dataset. In addition, we manually validated or corrected the bounding boxes for 1,071 CXR images (with and without disease, excluding gold standard subjects) to train a Faster R-CNN CXR bounding box detection model, which we used to correct failed bounding boxes (too small or missing) from the initial bounding box extraction pipeline (~7%). Finally, for quality assurance, we manually annotated 303 images that had missing bounding boxes for key CXR anatomies (lungs and mediastinum).

Extracting comparison relations between sequential exams at the anatomical level is another goal for the Chest ImaGenome dataset. After checking with the MIMIC team and reviewing their dataset documentation, we assume that the timestamps in the original MIMIC CXR dataset can be used to chronologically order the exams for each patient. We then correlated all report descriptions of changes (grouped as improved, worsened, or no change) between sequential exams with the anatomical locations described at the sentence level. To extract these comparison descriptions, we used a concept expansion engine [29] to curate and group relevant comparison vocabularies used in CXR reports. These comparison relations extracted between anatomical locations from sequential CXRs are only added to the final scene graphs for every patient's second or later CXR exam(s) -- i.e., comparison relations described in the first study of each patient in the MIMIC-CXR dataset are not added to the Chest ImaGenome dataset.

Finally, we have mapped all object and attribute nodes and comparison relations in the Chest ImaGenome dataset to a Concept Unique Identifier (CUI) in the Unified Medical Language System (UMLS) [30]. The UMLS ontology has incorporated the concepts from the Radlex ontology [31], which is constructed for the radiology domain. Choosing UMLS to index the Chest ImaGenome dataset widens its future applications in clinical reasoning tasks, which would invariably require medical concepts and relations outside the radiology domain.

Gold Standard Evaluation Dataset Curation

Finally, working with clinical collaborators from multiple academic institutions, we created a Gold Standard Dataset to evaluate the quality of the automatically derived annotations in the Chest ImaGenome dataset. We sampled 500 random patients who have two or more CXR exams from the Chest ImaGenome dataset. Due to the sheer number of different types of annotations we were after and the limited resources we had, we pursued a validation-plus-correction-if-needed strategy to annotate the gold standard dataset. Records of the annotation process are documented in the annotation_utils directory.

The text-related annotations were collected at the per-sentence level within Excel. In the Excel tabular format, we presented one sentence and one extracted relation on each row to the annotators and asked them to decide whether an annotation is correct (i.e., a true positive) or not (i.e., a false positive). All sentences from each report were shown to the annotators in the original report order so that they had the context of both the sentence and the whole report. This is important for localizing attribute and comparison descriptions, which can occasionally cross sentence boundaries. For recall, the annotators were instructed to mark sentences with any object, attribute or comparison descriptions that were missed by the NLP pipeline so that they could come back and manually annotate them as a second step (i.e., false negatives). To evaluate the object-to-attribute relationships in the scene graphs, we annotated the anatomical locations and the attributes they contain for the first CXR report for all 500 patients. Separately, to evaluate the comparison relations, we annotated all the comparison relations described in the report for the second CXR exam for all 500 patients. We kept the correct annotations (i.e., true positives and false negatives) for the gold standard ground truth files. Details of the annotation process are explained in our accompanying paper.

For anatomical object annotation, we used the jupyter-innotater package [35, 36] to create simple plug-ins in Jupyter Notebooks to annotate the bounding boxes: Correct_lung_bboxes_template.ipynb and Correct_mediastinum_bboxes_template.ipynb. We dual-annotated the frontal CXR images from the first and second exams of the same 500 unique patients (i.e., altogether 1,000 images). However, we only managed to annotate 29 different anatomical objects in CXRs. Therefore, more anatomical locations were annotated from texts than from images. The images were resized to 224x224 for display in the notebook user interface. The manually annotated bounding boxes are rescaled back to the original image sizes in the ground truth files. The annotators (all M.D.s) reviewed sample annotations from a radiologist to calibrate before starting annotation.

Data Description

There are many parallels between the Chest ImaGenome dataset and the Visual Genome [20] dataset as described in Table 1 below. The key differences are in the construction methodology, the currently much smaller range of possible objects and attributes (due to having only the CXR imaging modality), and the introduction of comparison relations between sequential images in the Chest ImaGenome dataset. Our expectation is that, in collaboration with more researchers in this field, we could expand the scope of anatomy based scene-graph-like datasets in the future, to support more intelligent modeling in the medical domain.

Table 1 - Dataset comparisons
• Scene
  Chest ImaGenome: One frontal CXR image in the current dataset.
  Visual Genome [20]: One (non-medical) everyday life image.

• Questions
  Chest ImaGenome: For now, there is only one question per CXR, taken from the patient history (i.e. reason for exam) section of each CXR report.
  Visual Genome [20]: One or more questions that the crowd-sourced annotators decided to ask about the image, where the information from each question and the image should allow another annotator to answer it.

• Answers
  Chest ImaGenome: N/A currently. However, report sentences are biased towards answering the question asked in the "reason for exam" sentence; hence, the knowledge graph we extract from each report should contain the answer(s).
  Visual Genome [20]: Collected as answer(s) to the corresponding question(s) asked about the image.

• Sentences (Region descriptions)
  Chest ImaGenome: Sentences from the finding and impression sections of a CXR report describing the image, as collected from radiologists in their routine radiology workflow.
  Visual Genome [20]: True natural language descriptive sentences about the image, collected from crowd-sourced everyday annotators.

• Objects
  Chest ImaGenome: Anatomical structures or locations that have bounding box coordinates on the associated CXR image and are indexed to the UMLS ontology [30].
  Visual Genome [20]: The people and physical objects with bounding box coordinates on the image, indexed to the WordNet ontology [32].

• Attributes
  Chest ImaGenome: Descriptions that are true for different anatomical structures visualized on the CXR image, e.g. "There is a right upper lung (object) opacity (attribute)", indexed to the UMLS ontology [30]. No bounding box coordinates.
  Visual Genome [20]: Various descriptive properties of the objects in the image, e.g. "The shirt (object) is blue (attribute)", indexed to the WordNet ontology [32]. No bounding box coordinates.

• Relations: object and attribute
  Chest ImaGenome: The relationship(s) between an anatomical object and its attribute(s) from the same CXR image, e.g. "There is a (relation) right upper lung (object) opacity (attribute)". Most objects have multiple attributes due to the report language.
  Visual Genome [20]: The relationship(s) between an object and its attribute(s) from the same image, e.g. "The shirt (object) is (relation) blue (attribute)".

• Relations: object and object
  Chest ImaGenome: The comparison relationship (indexed to UMLS [30]) between the same anatomical object from two sequential CXR images for the same patient, e.g. "There is a new (relation) right lower lobe (current & previous anatomical objects) atelectasis (attribute)".
  Visual Genome [20]: The relationship (indexed to WordNet [32]) between objects in the same image, e.g. "The boy (object 1) is beside (relation) the bus (object 2)".

• Relations: parent and child
  Chest ImaGenome: The graph for each image should be consistent and correct as learnable and consumable radiology knowledge. Therefore, affirmed parent-child relations between nodes are embedded in the scene graphs: if a child attribute is related to an object, then its parent is too, i.e. if the right lung has consolidation (child), then it also has lung opacity (parent).
  Visual Genome [20]: N/A due to a different graph construction strategy and goals. The annotators were asked to describe any (but not all) relations they observed in an image.

• Scene graph
  Chest ImaGenome: Constructed from the objects, the attributes and the relationships between them for the image.
  Visual Genome [20]: Same, but the nodes and edges are overall more varied than in Chest ImaGenome.

• Sequence
  Chest ImaGenome: A super-graph for a chronologically ordered series of exams for the same patient.
  Visual Genome [20]: N/A (but would be a graph for a video in the non-medical context).

More detailed dataset characteristics are calculated in generate_scenegraph_statistics.ipynb.

Chest ImaGenome Scene Graph JSONs

The Chest ImaGenome dataset is stored in two main directories: one for the scene graphs that are automatically generated ("silver_dataset"), and another for the subset that was manually validated and corrected ("gold_dataset"). The scene_graph.zip file in the silver_dataset directory contains one JSON file for each scene graph. Each scene graph describes one frontal chest X-ray image.

The structure of each scene graph JSON is described in the following example, which has been decomposed into its components for easier explanation. The first level of the JSON below describes the patient- or study-level information that may not be available in the image. The fields are: 'image_id' (dicom_id in MIMIC-CXR), 'viewpoint' (AP or PA view), 'patient_id' (subject_id in MIMIC-CXR), 'study_id' (study_id in MIMIC-CXR), 'gender' and 'age_decile' demographics (from MIMIC-CXR's metadata), 'reason_for_exam' (patient history sentence(s) from the CXR reports with age removed), 'StudyOrder' (the order of the CXR study for the patient, derived from chronologically ordering the DICOM timestamps), and 'StudyDateTime' (from MIMIC's DICOM metadata, which has been previously de-identified into the future).


{
  'image_id': '10cd06e9-5443fef9-9afbe903-e2ce1eb5-dcff1097',
  'viewpoint': 'AP',
  'patient_id': 10063856,
  'study_id': 56759094,
  'gender': 'F',
  'age_decile': '50-60',
  'reason_for_exam': '___F with hypotension.  Evaluate for pneumonia.',
  'StudyOrder': 2,
  'StudyDateTime': '2178-10-05 15:05:32 UTC',
  'objects': [ <...list of {} for each object...> ],
  'attributes': [ <...list of {} for each object...> ],
  'relationships': [ <...list of {} of comparison relationships between objects from sequential exams for the same patient...> ]
}


For each scene graph, there are 3 separate nested fields to describe the "objects" on the CXR images, the "attributes" related to the different objects as extracted from the corresponding reports, and "relationships" to describe comparison relations between sequential CXR images for the same patient. These 3 fields are a list of dictionaries, where the format of each dictionary is modeled after the respective JSONs in the Visual Genome dataset [20].
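As an illustration, once one of the scene graph JSONs has been loaded (e.g. with Python's json module after extracting scene_graph.zip), the study-level fields and the sizes of the three nested lists can be read directly. This is a minimal sketch; the summary function and its field names are ours, not part of the dataset:

```python
import json

def summarize_scene_graph(sg):
    """Return study-level info and component counts for one scene graph dict."""
    return {
        'image_id': sg['image_id'],
        'study_order': sg['StudyOrder'],
        'n_objects': len(sg.get('objects', [])),
        'n_attribute_groups': len(sg.get('attributes', [])),
        'n_comparisons': len(sg.get('relationships', [])),
    }

# Each scene graph is one JSON file inside silver_dataset/scene_graph.zip, e.g.:
# with open(path_to_one_scene_graph_json) as f:  # file names per the archive listing
#     sg = json.load(f)
# summarize_scene_graph(sg)
```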

For objects, each dictionary has the format below. The 'object_id' is unique across the whole dataset for the anatomical location on the particular image. Fields 'x1', 'y1', 'x2', 'y2', 'width' and 'height' are for a padded and resized 224x224 CXR frontal image, where coordinates 'x1', 'y1' are for the top left corner of the bounding box and 'x2', 'y2' are for the bottom right corner. The bounding box coordinates in the original image are denoted with 'original_*'. The remaining fields: 'bbox_name' is the name given to the anatomical location within the Chest ImaGenome dataset, and is useful for lookups in other parts of the scene graph JSON; 'synsets' contain the UMLS CUI for the anatomical location concept; and the 'name' is the UMLS name for that CUI [30].


{
  'object_id': '10cd06e9-5443fef9-9afbe903-e2ce1eb5-dcff1097_right upper lung zone',
  'x1': 48,
  'y1': 39,
  'x2': 111,
  'y2': 93,
  'width': 63,
  'height': 54,
  'bbox_name': 'right upper lung zone',
  'synsets': ['C0934570'],
  'name': 'Right upper lung zone',
  'original_x1': 395,
  'original_y1': 532,
  'original_x2': 1255,
  'original_y2': 1268,
  'original_width': 860,
  'original_height': 736
}
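For example, the per-object dictionaries can be collected into a simple lookup from 'bbox_name' to coordinates. This sketch is illustrative (the function name is ours) and assumes only the fields documented above:

```python
def object_boxes(scene_graph, original=True):
    """Map each bbox_name to its bounding box (x1, y1, x2, y2).
    Pass original=False to get the coordinates on the padded/resized
    224x224 image instead of the original image."""
    prefix = 'original_' if original else ''
    return {obj['bbox_name']: tuple(obj[prefix + k] for k in ('x1', 'y1', 'x2', 'y2'))
            for obj in scene_graph['objects']}
```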

Each attribute dictionary aims to summarize all the CXR attribute descriptions for one anatomical location ('bbox_name'). This means that, for a particular CXR anatomical location, all the sentences describing attributes related to it have been grouped into the 'phrases' field, where the order of the sentences in the original report has been maintained. However, an anatomical location may not always be described or implied in the report; in that case, the value stored under the anatomical location's own key (e.g., 'right lung' in the example below) will be False. The fields 'synsets' and 'name' are the same as in the objects' dictionaries, where they describe the UMLS CUI information for the anatomical location concept.


{
  'right lung': True,
  'bbox_name': 'right lung',
  'synsets': ['C0225706'],
  'name': 'Right lung',
  'attributes': [['anatomicalfinding|no|lung opacity',
                  'anatomicalfinding|no|pneumothorax',
                  'nlp|yes|normal'],
                 ['anatomicalfinding|no|pneumothorax']],
  'attributes_ids': [['CL556823', 'C1963215;;C0032326', 'C1550457'],
                     ['C1963215;;C0032326']],
  'phrases': ['Right lung is clear without pneumothorax.',
              'No pneumothorax identified.'],
  'phrase_IDs': ['56759094|10', '56759094|14'],
  'sections': ['finalreport', 'finalreport'],
  'comparison_cues': [[], []],
  'temporal_cues': [[], []],
  'severity_cues': [[], []],
  'texture_cues': [[], []],
  'object_id': '10cd06e9-5443fef9-9afbe903-e2ce1eb5-dcff1097_right lung'
}

The 'attributes' field contains the relations between the anatomical location and the CXR attributes extracted from the respective sentences. Note that multiple attributes can be extracted from each sentence; therefore, the 'attributes' field is a list of lists. Each attribute follows the pattern < categoryID | relation | label_name >, where categoryID is the radiology semantic category the authors gave to the CXR concept in consultation with multiple radiologists, and relation is the NLP context relating the label_name to the anatomical location. If the relation is 'no', then the label_name is specifically negated in the sentence. If the relation is 'yes', then the label_name is affirmed in the sentence. The lists in the 'attributes_ids' field follow the order of the lists in the 'attributes' field and map each label_name to UMLS CUIs. Thus, in the way the Chest ImaGenome dataset is formulated, one can interpret a statement such as the 'right lung' <has no> 'lung opacity' as true in the extracted radiology knowledge graph, where each node has been mapped to an externally recognized ontology.
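A minimal parser for this three-part pattern might look as follows (the function and the returned field names are our own, introduced only for illustration):

```python
def parse_attribute(attr):
    """Split a '<categoryID|relation|label_name>' string into its parts.
    'affirmed' is True for relation 'yes' and False for relation 'no'."""
    category, relation, label = attr.split('|')
    return {'category': category, 'affirmed': relation == 'yes', 'label': label}
```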

The certainty of each relation in the CXR knowledge graph can be optionally further modified by the cues from the 'severity_cues' and 'temporal_cues' fields in each attribute dictionary. The severity cues can include 'hedge', 'mild', 'moderate' or 'severe', which are only assigned by co-occurrence at the sentence level. These extractions can benefit from future NLP improvement. Similarly, the temporal cues can modify the relation as either 'acute' or 'chronic' depending on clinical use cases.

The categoryIDs in the Chest ImaGenome dataset can be used to differentiate the use case for different attributes. They include:

• 'anatomicalfinding' - findings of anatomies, where there is some subjectivity in the grouping of the phrases used to extract the labels.
• 'disease' - descriptions that are more diagnostic-level, often require patient information outside the image, and are the most subjective to the reading radiologist's inference/impression.
• 'nlp' - normal or abnormal descriptions of different anatomical locations; can be somewhat subjective.
• 'technicalassessment' - image quality issues that affect the radiologic interpretation of imaging observations.
• 'tubesandlines' - medical support devices for which radiologists need to report any placement issues.
• 'devices' - medical devices where placement issues are less relevant.
• 'texture' - present only in the 'texture_cues' field; a set of highly non-specific attributes (e.g. opacity, lucency, interstitial, airspace) that tend to form the initial, most objective descriptions of what radiologists observe in the images.

Finally, for comparison relationships, each dictionary has the format below. Each relationship dictionary describes the comparison relation(s) relevant for only one anatomical location ('bbox_name'). The 'relationship_id' uniquely identifies each comparison relationship between the object ('subject_id') on the current exam and the object ('object_id', the same anatomical location) from the previous exam. The 'predicate' holds the UMLS concept name(s) and 'synsets' the corresponding UMLS CUI(s) for 'relationship_names', which is a list of usually one (but possibly more) comparison relation type from ['comparison|yes|improved', 'comparison|yes|worsened', 'comparison|yes|no change']. The 'attributes' field records the attributes related to the anatomical location as per the sentence from the original report (kept in the 'phrase' field) that describes the comparison relationship.


{
  'relationship_id': '56759094|7_54814005_C0929215_10cd06e9_4bb710ab',
  'predicate': "['No status change']",
  'synsets': ['C0442739'],
  'relationship_names': ['comparison|yes|no change'],
  'relationship_contexts': [1.0],
  'phrase': 'Compared with the prior radiograph, there is a persistent veil -like opacity\n over the left hemithorax, with a crescent of air surrounding the aortic arch,\n in keeping with continued left upper lobe collapse.',
  'attributes': ['anatomicalfinding|yes|atelectasis',
                 'anatomicalfinding|yes|lobar/segmental collapse',
                 'anatomicalfinding|yes|lung opacity',
                 'nlp|yes|abnormal'],
  'bbox_name': 'left upper lung zone',
  'subject_id': '10cd06e9-5443fef9-9afbe903-e2ce1eb5-dcff1097_left upper lung zone',
  'object_id': '4bb710ab-ab7d4781-568bcd6e-5079d3e6-7fdb61b6_left upper lung zone'
}
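To illustrate, the comparison relationships in a scene graph can be flattened into (anatomical location, change, attributes) rows; this sketch assumes only the fields documented above and takes the change type from the last segment of each relationship name:

```python
def comparison_changes(scene_graph):
    """Yield (bbox_name, change, attributes) for each comparison relationship,
    where change is 'improved', 'worsened', or 'no change'."""
    for rel in scene_graph.get('relationships', []):
        for name in rel['relationship_names']:
            yield rel['bbox_name'], name.split('|')[-1], rel['attributes']
```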

Not all the sentences in the MIMIC-CXR v2.0.0 reports have made it into the Chest ImaGenome dataset, which only contains sentences that have the specific objects, attributes or relations targeted by version 1.0.0 of the dataset. We provide the preprocessing steps (Preprocess_mimic_cxr_v2.0.0_reports.ipynb) done to index the sentences from the original text reports in the "utils" directory, the output of which is cxr-mimic-v2.0.0-processed-sentences_all.txt.

CXR Scene Graphs Rendered in an Enriched RDF Format

Radiology report sentences are fairly repetitive. Therefore, in the scene graph JSONs, similar information may be described multiple times for a study. In addition, in the MIMIC reports we worked with, each report could also have a preliminary read section (recorded by trainee radiologists, i.e. resident M.D.s) that comes before the final report section (approved by a fully trained and experienced radiologist). Therefore, occasionally, the extraction from sentences near the beginning of a CXR report can differ from the conclusion sentences later in the report. To make the scene graphs easier for downstream utilization, we also provide post-processing utils (scenegraph_postprocessing.py) to roll the annotations up to the report level for each relation. This is done by taking the last relation extracted for each anatomical location and attribute combination in a report. The processing utils can either render the scene graphs in a tabular format or represent the information in a simpler enriched RDF format, which we used to generate the graph visualizations.
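The last-relation-wins roll-up can be sketched as follows; this is an illustrative re-implementation for one report's attribute dictionaries, not the released scenegraph_postprocessing.py:

```python
def rollup_attributes(attribute_dicts):
    """Keep only the last relation extracted for each
    (anatomical location, label_name) pair across a report's sentences.
    Assumes the per-sentence lists are in original report order."""
    final = {}
    for att in attribute_dicts:                  # one dict per anatomical location
        for sentence_attrs in att['attributes']:  # one inner list per sentence
            for triple in sentence_attrs:         # 'category|relation|label'
                _category, relation, label = triple.split('|')
                final[(att['bbox_name'], label)] = relation
    return final
```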

The enriched RDFs have the following format:

{
  <study_id_i>: [
    [ [node_id_1, node_type_1], [node_id_2, node_type_2], relation_name_A ],
    [ [node_id_1, node_type_1], [node_id_3, node_type_3], relation_name_B ],
    ...
  ],
  <study_id_i+1>: [
    [ [node_id_1, node_type_1], [node_id_2, node_type_2], relation_name_A ],
    [ [node_id_1, node_type_1], [node_id_3, node_type_3], relation_name_B ],
    ...
  ],
}
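These nested lists are straightforward to flatten into rows, e.g. before loading them into a graph engine; a minimal sketch (the function name is ours):

```python
def iter_triples(enriched_rdf):
    """Flatten the enriched RDF dict into
    (study_id, node_id_1, node_type_1, node_id_2, node_type_2, relation) rows."""
    for study_id, triples in enriched_rdf.items():
        for (n1, t1), (n2, t2), relation in triples:
            yield study_id, n1, t1, n2, t2, relation
```

Each flattened row maps directly onto an edge with a relation label, which is the form the neo4j and NetworkX visualizations described below consume.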

CXR Scene Graph Visualization

To visualize the report knowledge graph extracted from the CXR reports, we used a graph engine, neo4j [37], to plot the enriched RDF format of an example scene graph in Figure 1 (cxr_knowledge_graph_neo4j.pdf) and Figure 2 (chest_imaGenome_graph_sample_fig1.pdf). Querying with neo4j is powerful but requires more setup on the local machine. For an interactive demonstration, we also used NetworkX [38] to give users the option of exploring the Chest ImaGenome scene graphs interactively in a Jupyter Notebook: visualization.ipynb.

Consistent Dataset Splits

For comparable results in the future, we included CSVs for our train, valid and test sets in the "splits" directory. The random data split was done at the patient level. We also included a CSV (images_to_avoid.csv) with image IDs ('dicom_id') and 'study_id's for patients in the gold standard dataset, all of which should be excluded from model training and validation. This file actually contains images from 1,000 patients in total, as we plan to expand the gold dataset so that future benchmark reporting can be done on a sufficiently large manually annotated gold standard dataset.
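For instance, a split can be filtered against images_to_avoid.csv once both files are read (e.g. with csv.DictReader); the 'dicom_id' column name follows the description above, but verify it against the released CSV headers:

```python
def filter_split(split_rows, avoid_rows):
    """Drop rows from a train/valid/test split whose 'dicom_id' appears in
    images_to_avoid.csv (gold standard patients). Rows are dicts, e.g. as
    produced by csv.DictReader."""
    avoid = {row['dicom_id'] for row in avoid_rows}
    return [row for row in split_rows if row['dicom_id'] not in avoid]
```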

Gold Standard Dataset

We curated a manually annotated gold standard evaluation dataset to measure the quality of the automatically derived annotations in the Chest ImaGenome dataset and for model benchmarking. Here we describe the gold standard ground truth files in the 'gold_dataset' directory; they are in tabular format for comparison purposes.

1. gold_attributes_relations_500pts_500studies1st.txt - this is the manually annotated ground truth file for all the object-to-attribute relations from the first CXR study for 500 unique patients. The notebook object-attribute-relation_evaluation.ipynb explains in detail how we calculated the performance of object-to-attribute relation extraction.
2. gold_comparison_relations_500pts_500studies2nd.txt - this is the manually annotated ground truth for object-object comparison relations from the second CXR study for the same 500 unique patients. The notebook object-object-comparison-relation_evaluation.ipynb uses it to calculate the performance for object-to-object-comparison relation extraction.
3. bbox_coordinate_annotations_1_C.csv, bbox_coordinate_annotations_1_W.csv, bbox_coordinate_annotations_2_J.csv, and bbox_coordinate_annotations_2_S.csv - these files contain the manually annotated bounding box coordinates for the objects on the corresponding CXR images. The notebook object-object-comparison-relation_evaluation.ipynb calculates the bounding box object detection performance using these ground truth files and consolidates them into the final gold_bbox_coordinate_annotations_1000images.csv.
4. Lastly, final_merging_report_and_bbox_ground_truth.ipynb combines the manual text and anatomical bbox annotations as gold_object_attribute_with_coordinates.txt and gold_object_comparison_with_coordinates.txt.

There are also a few supporting files for measuring the performance of the silver Chest ImaGenome dataset against this gold standard. They are:

• gold_all_sentences_500pts_1000studies.txt - this contains all the sentences tokenized from the original MIMIC-CXR reports that were used to create the gold standard dataset. We include this file because sentences with no relevant object, attribute, or relation descriptions did not make it into the gold standard dataset. We renamed 'subject_id' from the MIMIC-CXR dataset to 'patient_id' in the Chest ImaGenome dataset to avoid confusion with field names for relationships in the scene graphs; otherwise, the IDs are unchanged. Each sentence in the tokenized file is assigned to 'history', 'prelimread', or 'finalreport' in the 'section' column, and the 'sent_loc' column records the sentence's order in the original report. Minimal tokenization has been done to the sentences.
• gold_bbox_scaling_factors_original_to_224x224.csv - this contains the scaling 'ratio' and the paddings ('left', 'right', 'top', and 'bottom') added to square the image after resizing the original MIMIC-CXR DICOMs to 224x224. These values were used to rescale coordinates annotated on the 224x224 images back to the original CXR image sizes.
• auto_bbox_pipeline_coordinates_1000_images.txt - this contains the bounding box coordinates automatically extracted by the Bbox pipeline for the different objects in the gold standard images. It is in the same tabular format as the ground truth files for easier evaluation.
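One plausible inverse mapping for these scaling factors, assuming the 224x224 image was produced by scaling the original by 'ratio' and then padding with 'left'/'top' pixels to square it (a sketch only; verify the exact convention against the released CSV):

```python
def bbox_224_to_original(x1, y1, x2, y2, ratio, left, top):
    """Map a bounding box annotated on the 224x224 image back to
    original-image pixel coordinates (assumed convention: undo padding,
    then undo scaling)."""
    return (
        (x1 - left) / ratio,
        (y1 - top) / ratio,
        (x2 - left) / ratio,
        (y2 - top) / ratio,
    )

# Example with made-up numbers: ratio 0.5, 12px left padding, no top padding.
print(bbox_224_to_original(22, 50, 122, 150, ratio=0.5, left=12, top=0))
# -> (20.0, 100.0, 220.0, 300.0)
```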

Dataset Evaluation

The notebooks for generating the Chest ImaGenome evaluation results in the tables below are provided under the analysis directory.

Table 2 (generated via object-attribute-relation_evaluation.ipynb) measures the NLP pipeline's precision, recall, and F1 scores for extracting the relationships between objects (anatomical locations) and CXR attributes (findings, diseases, technical assessments, etc.) in the scene graphs. Since the annotations are, at their most granular, sentence-level, we report both sentence-level and report-level results for 500 reports from the first exam of each patient. For most downstream uses, however, the report-level annotations are the most suitable.

Table 2 - CXR report knowledge graph evaluation results from 500 reports
| Object-Attribute Relations | Sentence-level | Report-level |
| Number of annotations      | 21593          | 16569        |
| Precision                  | 0.932          | 0.938        |
| Recall                     | 0.945          | 0.939        |
| F1-score                   | 0.939          | 0.939        |
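As a sanity check, the F1 scores in Table 2 are the harmonic mean of the precision and recall columns; plugging in the rounded table values reproduces the reported F1 to within rounding of the third decimal place:

```python
# F1 is the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.932, 0.945))  # sentence-level, close to the table's 0.939
print(f1(0.938, 0.939))  # report-level, close to the table's 0.939
```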

Table 3 (generated via object-object-comparison-relation_evaluation.ipynb) shows the NLP results for comparison relations (improved, worsened, no change) between various anatomical locations described for the current study as compared to the patient's previous study. The results (attribute-sensitive / attribute-blind) are again shown at both sentence-level and report-level for 500 reports from the second exam of each patient.

Table 3 - Localized comparison relationships evaluation results from 500 reports
| Object-object Comparison Relations | Sentence-level | Report-level |
| Number of annotations              | 5154 / 1787    | 3993 / 1374  |
| Precision                          | 0.831 / 0.856  | 0.832 / 0.858 |
| Recall                             | 0.590 / 0.663  | 0.762 / 0.790 |
| F1-score                           | 0.690 / 0.747  | 0.796 / 0.823 |

Lastly, Table 4 below shows the dataset evaluation at the anatomical location (object) level. The F1 scores are calculated for relations extracted between objects and attributes from the 500 gold standard reports, which is a breakdown of report-level results in Table 2 for the bounding boxes (Bboxes) shown. Using the 1000 CXR images in the gold standard dataset, we also calculated the intersection over union (IoU) between the automatically extracted Bboxes and the validated and corrected Bboxes (object-bbox-coordinates_evaluation.ipynb). Since we used an agree-or-correct annotation strategy for more efficient annotation, we also show the percentage of bounding boxes requiring manual correction in the gold dataset and the percentage missing in the final Chest ImaGenome dataset. Missing bounding boxes could be due to Bbox extraction failure or the anatomical location genuinely not being visible in the image (i.e., cut off or not in field of view), which is not uncommon for the costophrenic angles and apical zones.
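For reference, the IoU metric used here can be computed for axis-aligned boxes in (x1, y1, x2, y2) format as follows (a minimal sketch, not the evaluation notebook's exact code):

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # ~0.333 (half-overlapping boxes)
```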

Table 4 - CXR image object detection evaluation results
| Bbox name (object) | Object-attribute relations frequency (in 500 reports) | Relationships F1 score (in 500 reports) | Bbox IoU (over 1000 images) | % Bboxes corrected (1000 images) | % Relations missing Bbox coordinates (over whole dataset) |
| left lung | 1453 | 0.933 | 0.976 | 9.90% | 0.03% |
| right lung | 1436 | 0.937 | 0.983 | 6.30% | 0.04% |
| cardiac silhouette | 633 | 0.966 | 0.967 | 9.70% | 0.01% |
| mediastinum | 601 | 0.952 | ** | ** | 0.02% |
| left lower lung zone | 609 | 0.932 | 0.955 | 8.60% | 2.36% |
| right lower lung zone | 580 | 0.902 | 0.968 | 6.00% | 2.27% |
| right hilar structures | 572 | 0.934 | 0.976 | 4.10% | 1.91% |
| left hilar structures | 571 | 0.944 | 0.971 | 4.30% | 2.28% |
| upper mediastinum | 359 | 0.940 | 0.994 | 1.40% | 0.12% |
| left costophrenic angle | 298 | 0.908 | 0.929 | 9.60% | 0.63% |
| right costophrenic angle | 286 | 0.918 | 0.944 | 6.90% | 0.39% |
| left mid lung zone | 173 | 0.940 | 0.967 | 5.70% | 2.79% |
| right mid lung zone | 169 | 0.830 | 0.968 | 5.30% | 2.31% |
| aortic arch | 144 | 0.965 | 0.991 | 1.40% | 0.62% |
| right upper lung zone | 117 | 0.873 | 0.972 | 5.80% | 0.04% |
| left upper lung zone | 83 | 0.811 | 0.968 | 6.40% | 0.22% |
| right hemidiaphragm | 78 | 0.947 | 0.955 | 7.90% | 0.15% |
| right clavicle | 71 | 0.615 | 0.986 | 2.80% | 0.50% |
| left clavicle | 67 | 0.642 | 0.983 | 3.00% | 0.51% |
| left hemidiaphragm | 65 | 0.930 | 0.944 | 11.30% | 0.14% |
| right apical zone | 58 | 0.852 | 0.969 | 5.40% | 1.99% |
| trachea | 57 | 0.983 | 0.995 | 0.90% | 0.24% |
| left apical zone | 47 | 0.938 | 0.963 | 6.20% | 2.40% |
| carina | 41 | 0.975 | 0.994 | 0.80% | 1.47% |
| svc | 19 | 0.973 | 0.995 | 0.70% | 0.66% |
| right atrium | 14 | 0.963 | 0.979 | 4.00% | 0.18% |
| cavoatrial junction | 5 | 1.000 | 0.977 | 4.30% | 0.25% |
| abdomen | 80 | 0.904 | * | * | 0.26% |
| spine | 132 | 0.824 | * | * | 0.10% |

* These anatomical locations are extracted by the Bbox pipeline but they are not manually annotated in the gold standard dataset due to resource constraints.

** The mediastinum bounding boxes were not directly annotated due to resource constraints. The mediastinum's bounding box boundary can be derived from the ground truth for the upper mediastinum and the cardiac silhouette.

Usage Notes

The Chest ImaGenome dataset was automatically generated and is therefore limited by the performance of the NLP and Bbox extraction pipelines. Furthermore, we cannot assume that the reporting radiologists always describe every clinically relevant CXR attribute on every exam. In fact, we have observed many implied object-attribute relation descriptions that are documented only in the form of comparisons (e.g., no change from previous) in short CXR reports. As such, even with perfect NLP extraction of object and attribute relations from individual reports, the report knowledge graph constructed for some images would still be missing information. These areas are worth improving in future research with more powerful NLP, image processing, and graph-based techniques, and addressing missing relations will certainly improve this dataset as well. Regardless, version 1.0.0 of the Chest ImaGenome dataset serves as a vision for a richer radiology imaging dataset.

Release Notes

This is version 1.0.0 of the Chest ImaGenome dataset. We plan to build version 2.0.0 of this dataset with clinically relevant question-answer pairs in the near future. If you are interested in collaborating, please contact research@joytywu.net and mmoradi@us.ibm.com.

Acknowledgements

This work was supported by the Rensselaer-IBM (http://airc.rpi.edu) AI Research Collaboration, part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons), and the IBM-MIT Critical Data Collaboration. We would also like to thank Dr. Tom Pollard for his help with the PhysioNet submission.

Conflicts of Interest

The authors have no conflicts of interest to declare.

References

1. Johnson, A. E., Pollard, T. J., Berkowitz, S. J., Greenbaum, N. R., Lungren, M. P., Deng, C. Y., ... & Horng, S. (2019). MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1), 1-8.
2. Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., ... & Ng, A. Y. (2019, July). Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 590-597).
3. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., & Summers, R. M. (2017). Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2097-2106).
4. Demner-Fushman, D., Kohli, M. D., Rosenman, M. B., Shooshan, S. E., Rodriguez, L., Antani, S., ... & McDonald, C. J. (2016). Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2), 304-310.
5. Wu, J. T., Syed, A., Ahmad, H., Pillai, A., Gur, Y., Jadhav, A., ... & Syeda-Mahmood, T. (2020). AI Accelerated Human-in-the-loop Structuring of Radiology Reports. In AMIA Annual Symposium Proceedings (Vol. 2020, p. 1305). American Medical Informatics Association.
6. Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A. Y., & Lungren, M. P. (2020). CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv preprint arXiv:2004.09167.
7. Wang, X., Peng, Y., Lu, L., Lu, Z., & Summers, R. M. (2018). Tienet: Text-image embedding network for common thorax disease classification and reporting in chest x-rays. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9049-9058).
8. Li, C. Y., Liang, X., Hu, Z., & Xing, E. P. (2018). Hybrid retrieval-generation reinforced agent for medical image report generation. arXiv preprint arXiv:1805.08298.
9. Zhang, Y., Ding, D. Y., Qian, T., Manning, C. D., & Langlotz, C. P. (2018). Learning to summarize radiology findings. arXiv preprint arXiv:1809.04698.
10. Liu, G., Hsu, T. M. H., McDermott, M., Boag, W., Weng, W. H., Szolovits, P., & Ghassemi, M. (2019, October). Clinically accurate chest x-ray report generation. In Machine Learning for Healthcare Conference (pp. 249-269). PMLR.
11. Zhang, Y., Wang, X., Xu, Z., Yu, Q., Yuille, A., & Xu, D. (2020, April). When radiology report generation meets knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp. 12910-12917).
12. Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3156-3164).
13. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., ... & Bengio, Y. (2015, June). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048-2057). PMLR.
14. Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3128-3137).
15. Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., & Lazebnik, S. (2015). Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision (pp. 2641-2649).
16. Gan, Z., Gan, C., He, X., Pu, Y., Tran, K., Gao, J., ... & Deng, L. (2017). Semantic compositional networks for visual captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5630-5639).
17. Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., & Saenko, K. (2018). Object hallucination in image captioning. arXiv preprint arXiv:1809.02156.
18. Agrawal, A., Batra, D., & Parikh, D. (2016). Analyzing the behavior of visual question answering models. arXiv preprint arXiv:1606.07356.
19. Datta, S., & Roberts, K. (2020). A dataset of chest X-ray reports annotated with Spatial Role Labeling annotations. Data in Brief, 32, 106056.
20. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., ... & Fei-Fei, L. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1), 32-73.
21. Xu, D., Zhu, Y., Choy, C. B., & Fei-Fei, L. (2017). Scene graph generation by iterative message passing. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5410-5419).
22. Li, Y., Ouyang, W., Zhou, B., Wang, K., & Wang, X. (2017). Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1261-1270).
23. Yang, J., Lu, J., Lee, S., Batra, D., & Parikh, D. (2018). Graph r-cnn for scene graph generation. In Proceedings of the European conference on computer vision (ECCV) (pp. 670-685).
24. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision (pp. 2425-2433).
25. Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M., ... & Batra, D. (2017). Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 326-335).
26. De Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., & Courville, A. (2017). Guesswhat?! visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5503-5512).
27. Wu, J., Gur, Y., Karargyris, A., Syed, A. B., Boyko, O., Moradi, M., & Syeda-Mahmood, T. (2020, April). Automatic bounding box annotation of chest x-ray data for localization of abnormalities. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI) (pp. 799-803). IEEE.
28. Honnibal, M., Montani, I., Van Landeghem, S. & Boyd, A. (2020). spaCy: Industrial-strength Natural Language Processing in Python. Zenodo. URL https://spacy.io
29. Coden, A., Gruhl, D., Lewis, N., Tanenblatt, M., & Terdiman, J. (2012, September). Spot the drug! an unsupervised pattern matching method to extract drug names from very large clinical corpora. In 2012 IEEE second international conference on healthcare informatics, imaging and systems biology (pp. 33-39). IEEE.
30. Bodenreider, O. (2004). The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research, 32(suppl_1), D267-D270.
31. Langlotz, C. P. (2006). RadLex: a new method for indexing online educational materials.
32. Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39-41.
33. Li, M. D., Chang, K., Bearce, B., Chang, C. Y., Huang, A. J., Campbell, J. P., ... & Kalpathy-Cramer, J. (2020). Siamese neural networks for continuous disease severity evaluation and change detection in medical imaging. NPJ digital medicine, 3(1), 1-9.
34. Li, M. D., Arun, N. T., Gidwani, M., Chang, K., Deng, F., Little, B. P., ... & Kalpathy-Cramer, J. (2020). Automated assessment and tracking of COVID-19 pulmonary disease severity on chest radiographs using convolutional siamese neural networks. Radiology: Artificial Intelligence, 2(4), e200079.
35. Inline data annotator for Jupyter notebooks. https://github.com/ideonate/jupyter-innotater [Accessed: 7 July 2021]
37. Neo4j Graph Data Platform. https://neo4j.com/ [Accessed: 9 July 2021]
38. NetworkX, Network Analysis in Python. https://networkx.org/ [Accessed: 9 July 2021]

Parent Projects
The Chest ImaGenome Dataset was derived from MIMIC-CXR [1]. Please cite the parent project when using this dataset.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.