Database Credentialed Access

CAD-Chest: Comprehensive Annotation of Diseases based on MIMIC-CXR Radiology Report

Mengliang Zhang Xinyue Hu Lin Gu Tatsuya Harada Kazuma Kobayashi Ronald Summers Yingying Zhu

Published: Dec. 8, 2023. Version: 1.0


When using this resource, please cite:
Zhang, M., Hu, X., Gu, L., Harada, T., Kobayashi, K., Summers, R., & Zhu, Y. (2023). CAD-Chest: Comprehensive Annotation of Diseases based on MIMIC-CXR Radiology Report (version 1.0). PhysioNet. https://doi.org/10.13026/44pd-vz36.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Most extant chest X-ray (CXR) datasets provide only binary disease labels and lack comprehensive disease-related information. Crucial facets of disease management, including disease severity, diagnostic uncertainty, and precise localization, are often absent from these datasets, yet they hold substantial clinical significance. In this work, we present a comprehensive annotation of diseases (CAD) on CXR images, named the CAD-Chest dataset. We have leveraged radiology reports authored by medical professionals to devise label extraction protocols that capture essential disease-related attributes, including disease name, severity grading, and additional pertinent details. The dataset offers researchers and practitioners a holistic perspective on diseases, transcending mere binary presence-or-absence classification.


Background

In recent years, substantial advancements [1-5] have been achieved in the realm of computer-aided diagnosis utilizing chest X-ray (CXR) images. Predominantly, extant applications have stemmed from the classification task, aimed at discerning the presence of specific diseases, or the detection task, geared towards localizing pathological conditions. Consequently, assigning multiple disease labels to each image or delineating bounding boxes around symptomatic regions becomes imperative for creating a comprehensive CXR image dataset. However, the annotation process of CXR images demands a substantial degree of expertise, rendering the establishment of such datasets a non-trivial endeavor.

Certain datasets extract disease labels from accompanying radiologist reports to mitigate the expenses associated with the disease labeling process. These textual documents are inherently structured and encompass detailed descriptions of pulmonary conditions meticulously recorded by radiologists following the acquisition of CXR images. In contrast to the labor-intensive alternative of relying on additional medical professionals for annotation, the extraction of disease labels from text not only conserves invaluable medical resources but also attains a level of accuracy commensurate with that of seasoned medical practitioners.

Furthermore, it is noteworthy that disease diagnosis in the realm of CXR images transcends the relatively straightforward binary classification task of disease presence or absence. Radiologists are often tasked with providing a comprehensive assessment, encompassing disease type, severity, precise localization, and other intricate details. In certain instances, owing to factors such as image quality and limited clinical experience, radiologists may not be able to definitively ascertain the presence of a disease, leading to the utilization of terminology denoting uncertainty in their descriptions. This phenomenon is commonly referred to as "uncertain label" and is evident in select CXR datasets, including but not limited to CheXpert [5] and MIMIC-CXR [1].

Various approaches have been proposed to address uncertain labels. One straightforward approach is to treat all uncertain labels as negative or as positive. CheXpert [5] also explores a three-class methodology, treating uncertain labels as a distinct category and constraining the cumulative probability of the negative, positive, and uncertain labels to equal 1. While these methods offer a reasonable means of handling uncertain labels, their principal objective remains the binary classification of disease presence, and they do not afford comprehensive disease-specific information.

In light of this, we leverage the extensive MIMIC-CXR [1] and MIMIC-CXR-JPG [2] datasets to exhaustively extract and analyze disease-related information, extending our scope beyond conventional disease label classification. Our proposed dataset makes two contributions:

1. We design a rule-based method to extract disease labels from text reports.

2. We consider comprehensive information on disease diagnosis and extract labels such as severity and uncertainty related to the disease.


Methods

Annotation Extraction

We constructed the CAD-Chest dataset from the free-text reports released by the MIMIC-CXR [1] dataset and obtain the desired disease information by applying a set of extraction rules. Specifically, we utilized the NLTK [6] and spaCy [7] packages for biomedical text processing to extract entities from the reports. We define a set of disease-related words; when these words are detected in a report, we treat the report as containing a description of the corresponding disease. Below, we introduce how each type of label is extracted:

  1. Disease Extraction

The same disease may be described differently: doctors may use different words for the same condition in their reports. To extract disease labels from the reports, we provide a file `disease_list.txt` containing the words that represent each disease. In this file, for example, atelectasis has several representations: "atelectasis, collapse". When a word from this dictionary appears in a report, we consider the report to describe the corresponding disease, and we then search the text for the disease's severity, uncertainty, and location information.
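A minimal sketch of this detection step is below. It assumes, for illustration only, that each line of `disease_list.txt` maps a canonical disease name to comma-separated synonyms; the released file may be formatted differently.

    # Sketch of disease detection via keyword matching. Assumes each line of
    # disease_list.txt looks like "atelectasis: atelectasis, collapse";
    # the actual file format may differ.

    def load_disease_synonyms(path="disease_list.txt"):
        synonyms = {}
        with open(path) as f:
            for line in f:
                if ":" not in line:
                    continue
                disease, words = line.strip().split(":", 1)
                synonyms[disease.strip()] = [w.strip() for w in words.split(",")]
        return synonyms

    def detect_diseases(sentence, synonyms):
        """Return canonical disease names whose synonyms appear in the sentence.

        Naive substring matching; a production version should match on word
        boundaries or tokenized text.
        """
        text = sentence.lower()
        return [name for name, words in synonyms.items()
                if any(w in text for w in words)]

    detect_diseases("There is mild cardiomegaly.",
                    {"cardiomegaly": ["cardiomegaly"]})  # -> ["cardiomegaly"]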

  2. Severity Extraction

After detecting that a sentence in the report describes a disease, we extract the severity of the disease from that sentence. For example, given the sentence "There is mild cardiomegaly.", we detect the disease "cardiomegaly" in the first step, then use grammatical analysis to find words related to the disease and filter out those that indicate severity. We provide a file `severity_words.txt` containing the words that describe severity. In the above example, "mild" describes "cardiomegaly", so it is extracted as the severity of the disease. The same procedure applies to extracting disease uncertainty and location.
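The grammatical analysis can be sketched with a spaCy dependency parse. The snippet below uses the general-purpose `en_core_web_sm` model for illustration (the exact biomedical pipeline used by the authors is not specified here) and a hand-picked subset of `severity_words.txt`.

    # Sketch of severity extraction: find words grammatically attached to the
    # disease token and keep those in the severity vocabulary.
    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    SEVERITY_WORDS = {"mild", "small", "trace", "minor", "minimal", "minimally",
                      "subtle", "mildly", "moderate", "moderately", "severe",
                      "acute", "massive"}  # illustrative subset

    def extract_severity(sentence, disease):
        doc = nlp(sentence)
        for token in doc:
            if token.text.lower() == disease:
                # Modifiers (e.g. adjectives) that depend on the disease token.
                return [child.text for child in token.children
                        if child.text.lower() in SEVERITY_WORDS]
        return []

    extract_severity("There is mild cardiomegaly.", "cardiomegaly")  # ["mild"]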

Severity keywords in the `severity_words.txt` file are listed in Table I. We merge keywords with similar meanings and divide them into four severity levels.

Table I. Keywords of disease severity.

Merged Level | Extracted Words
Mild | mild, small, trace, minor, minimal, minimally, subtle, mildly
Moderate | moderate, moderately, mild to moderate
Severe | severe, acute, massive, moderate to severe, moderate to large
No | no, without, clear of, negative, exclude, lack of, rule out
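One way to apply this merging in code is a word-to-level lookup built from Table I (a sketch; the released files may encode the mapping differently).

    # Map each extracted severity word to its merged level from Table I.
    SEVERITY_LEVELS = {
        "Mild": ["mild", "small", "trace", "minor", "minimal", "minimally",
                 "subtle", "mildly"],
        "Moderate": ["moderate", "moderately", "mild to moderate"],
        "Severe": ["severe", "acute", "massive", "moderate to severe",
                   "moderate to large"],
        "No": ["no", "without", "clear of", "negative", "exclude", "lack of",
               "rule out"],
    }
    WORD_TO_LEVEL = {word: level for level, words in SEVERITY_LEVELS.items()
                     for word in words}

    WORD_TO_LEVEL["minimal"]  # -> "Mild"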

  3. Uncertainty Extraction

For disease uncertainty, we provide a file `uncertainty_words.txt` containing words related to uncertainty. Owing to unclear images or limited clinical experience, doctors are sometimes uncertain when describing a disease, as in "The left lower lung may have mild pleural effusion.". "May" expresses uncertainty, indicating that the doctor is not entirely sure the disease exists. We determine the uncertainty of "pleural effusion" by filtering "may" out of the words related to the disease, based on the provided word list.

Uncertainty keywords in the `uncertainty_words.txt` file are listed in Table II. Unlike other datasets, which assign binary disease labels, we assign different label values to these keywords according to the degree of uncertainty: a larger value indicates higher certainty, and a smaller value indicates higher uncertainty.

Table II. Keywords of uncertainty.

Uncertain Words | Label Value
positive, change in | 1.0
probable, likely, may, could, potential | 0.7
might, possible | 0.5
not exclude, difficult exclude, cannot be assessed, cannot be identified, impossible exclude, not rule out, cannot be evaluated | 0.3
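A sketch of applying the Table II mapping is below; it checks multi-word cues before single words so that, e.g., "not exclude" is not misread as a certain mention. The default value and matching strategy are illustrative assumptions.

    # Assign an uncertainty value to a disease description per Table II.
    UNCERTAINTY_SCORES = [
        (["not exclude", "difficult exclude", "cannot be assessed",
          "cannot be identified", "impossible exclude", "not rule out",
          "cannot be evaluated"], 0.3),                       # most uncertain
        (["might", "possible"], 0.5),
        (["probable", "likely", "may", "could", "potential"], 0.7),
        (["positive", "change in"], 1.0),                     # most certain
    ]

    def uncertainty_score(sentence, default=1.0):
        text = sentence.lower()
        for cues, score in UNCERTAINTY_SCORES:
            if any(cue in text for cue in cues):  # naive substring matching
                return score
        return default  # no cue found: treat as a certain mention

    uncertainty_score("The left lower lung may have mild pleural effusion.")  # 0.7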

  4. Location Extraction

The disease location is also important. We provide a file `location_words.txt` containing words that describe disease location, such as "left", "right", and "lower". For example, in the report text "The left lower lung may have mild pleural effusion.", we find the disease label "pleural effusion" together with "left", "lower", and "lung" indicating location, so we take these words as the location of "pleural effusion".
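Location extraction can be sketched as filtering the disease-related words through a location vocabulary (an illustrative subset of `location_words.txt`).

    # Keep only the disease-related words that denote a location.
    LOCATION_WORDS = {"left", "right", "lower", "upper", "lung", "basal",
                      "bilateral"}  # illustrative subset of location_words.txt

    def extract_location(related_words):
        return [w for w in related_words if w.lower() in LOCATION_WORDS]

    # Words linked to "pleural effusion" by the dependency parse:
    extract_location(["left", "lower", "lung", "mild", "may"])
    # -> ["left", "lower", "lung"]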

Evaluation

We invited doctors to evaluate the annotations extracted from the radiology reports.

The evaluation covered 500 randomly sampled radiology reports and their corresponding extracted abnormality, severity, and uncertainty labels. Note that a single report may contain descriptions of several abnormalities. We counted the occurrences of five error conditions: missed abnormality, missed severity, wrong abnormality, wrong severity, and wrong uncertainty. Each condition is described below:

  1. Missed abnormality: the abnormality label is not extracted by our method.
  2. Missed severity: the abnormality is extracted correctly, but the severity is not extracted.
  3. Wrong abnormality: abnormality that is not in the report is extracted.
  4. Wrong severity: the abnormality is extracted correctly, but the corresponding severity extracted is wrong.
  5. Wrong uncertainty: the abnormality is extracted correctly, but the corresponding uncertainty extracted is wrong.

We extracted approximately 2645 abnormality descriptions from these 500 samples. We used this value as the total count when calculating the percentage of each error condition, reflecting the accuracy of the extracted labels. The statistics are given in Table III.

Table III. Evaluation of the 500 samples.

Condition | Occurrence | Percentage
Missed abnormality | 59 | 2.23%
Missed severity | 16 | 0.60%
Wrong abnormality | 7 | 0.26%
Wrong severity | 2 | 0.08%
Wrong uncertainty | 54 | 2.20%

These statistics show that the extracted labels are correct in most cases; the errors are concentrated in missed abnormalities and wrong uncertainty levels, each occurring at a rate of around 2.2%.


Data Description

Data Example

The labels extracted from the radiology reports are stored in the file `cad_disease.json`. For a single CXR study of a patient, the annotation structure is as follows:

{
        "study_id": "55088298",
        "subject_id": "18936629",
        "dicom_id": "61976388-5e534624-f6465079-76ea9caf-116f9938",
        "view": "antero-posterior",
        "study_order": 5,
        "entity": {
            "pneumothorax": {
                "id": 0,
                "entity_name": "pneumothorax",
                "report_name": "pneumothorax",
                "location": [
                    "left"
                ],
                "level": [
                    "minimal"
                ],
                "location2": null,
                "level2": null,
                "probability": "positive",
                "probability_score": 3
            },
            "atelectasis": {
                "id": 1,
                "entity_name": "atelectasis",
                "report_name": "atelectasis",
                "location": [
                    "left",
                    "basal"
                ],
                "level": [
                    "mild"
                ],
                "location2": null,
                "level2": null,
                "probability": "positive",
                "probability_score": 3
            },
            "edema": {
                "id": 2,
                "entity_name": "edema",
                "report_name": "pulmonary edema",
                "location": null,
                "level": null,
                "location2": null,
                "level2": null,
                "probability": "without",
                "probability_score": -3
            }
        }
}
  • `study_id`: an integer unique for an individual study (i.e. an individual radiology report with one or more associated images).
  • `subject_id`: an integer unique for an individual patient.
  • `dicom_id`: an identifier for the `DICOM` file. The stem of each .jpg image filename is equal to the `dicom_id`.
  • `view`: the orientation in which the chest radiograph was taken ("AP", "PA", "LATERAL", etc).
  • `entity`: the comprehensive annotation of diseases we extracted from radiologist reports. The entity dictionary includes disease annotations; each disease name is the key, and the annotation is the value.
  • `entity_name`: disease name.
  • `report_name`: disease name extracted from radiologist report.
  • `location`: disease location extracted from the radiologist report. Sometimes one disease may occur in more than one place; the dataset provides `location2` and `level2` for this scenario.
  • `level`: words extracted from radiologist reports to describe disease severity.
  • `probability`: words extracted from radiologist reports to describe disease uncertainty.
  • `probability_score`: a score expressing the degree of uncertainty; the larger the score, the more certain the finding. Users can remap the `probability_score` according to their own needs.
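As a minimal sketch of consuming the file, the snippet below assumes `cad_disease.json` is a list of study records shaped like the example above, and that positive findings carry a `probability_score` greater than 0, as in that example.

    # Load the annotations and list the positively reported diseases per study.
    import json

    with open("cad_disease.json") as f:
        studies = json.load(f)  # assumed: a list of records as shown above

    for study in studies:
        positives = [name for name, ann in study["entity"].items()
                     if ann["probability_score"] > 0]
        print(study["study_id"], study["view"], positives)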

Compared with the 14 disease labels provided by previous datasets, we expanded the number of disease labels and extracted as much disease information as possible from the reports. These diseases include atelectasis, cardiomegaly, edema, pneumonia, and others, listed in the file `disease_list.txt`. The extraction of disease labels, severity, uncertainty, and location information is described in the Methods section.

Data Distribution

We present the disease distribution in the extracted annotations. In CXR imaging, AP (anteroposterior) and PA (posteroanterior) are the two most common views, accounting for the vast majority of the MIMIC-CXR dataset. Therefore, we counted the annotation distribution in reports corresponding to CXR images with AP and PA views.

There are 65379 `subject_id` and 227827 `study_id` values in the dataset. For images and reports with AP and PA views, there are 63903 `subject_id` and 217982 `study_id` values. We compute the uncertainty distribution of 18 diseases. For each uncertainty level, we use the value assignment in Table II and count disease occurrences under that level. The results are listed in Table IV below, where columns 2 to 6 correspond to the different uncertainty values. For example, if "atelectasis" is described with "may" in a report, it is assigned a label value of 0.7 and counted in column 3.
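The counting itself can be sketched as follows. The word-to-value mapping is a small illustrative subset of Table II, and the view strings follow the example record in the Data Description section; verify both against the released files.

    # Count (disease, uncertainty value) pairs over AP/PA studies, as in Table IV.
    from collections import Counter
    import json

    # Illustrative subset of the Table II mapping; negative words fall in column 0.
    WORD_TO_VALUE = {"positive": 1.0, "may": 0.7, "possible": 0.5,
                     "not exclude": 0.3, "no": 0.0, "without": 0.0}

    with open("cad_disease.json") as f:
        studies = json.load(f)

    counts = Counter()
    for study in studies:
        if study["view"] not in ("antero-posterior", "postero-anterior"):
            continue  # keep AP/PA views only
        for name, ann in study["entity"].items():
            counts[(name, WORD_TO_VALUE.get(ann["probability"]))] += 1

    counts[("atelectasis", 0.7)]  # atelectasis described with e.g. "may"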

Table IV. Distribution of diseases with different uncertainty values. Columns 2 to 6 correspond to the label values defined in Table II.

Disease | 1 | 0.7 | 0.5 | 0.3 | 0
atelectasis | 59328 | 16811 | 803 | 2136 | 1284
blunting of the costophrenic angle | 2746 | 36 | 36 | 93 | 44
calcification | 8201 | 200 | 37 | 12 | 98
cardiomegaly | 38699 | 463 | 66 | 192 | 707
consolidation | 13943 | 1623 | 285 | 1362 | 56336
edema | 28866 | 4669 | 558 | 513 | 34857
emphysema | 5529 | 398 | 44 | 13 | 129
fracture | 7838 | 352 | 103 | 30 | 5469
granuloma | 1705 | 522 | 34 | 5 | 87
hernia | 2641 | 421 | 53 | 19 | 65
lung opacity | 65234 | 735 | 225 | 2088 | 6176
pleural effusion | 59853 | 8072 | 1750 | 1965 | 112325
pleural thickening | 2757 | 759 | 75 | 22 | 70
pneumonia | 16339 | 9938 | 1120 | 6082 | 31163
pneumothorax | 10225 | 538 | 171 | 204 | 137301
scoliosis | 2637 | 46 | 3 | 1 | 18
tortuosity of the thoracic aorta | 1771 | 4 | 0 | 0 | 13
vascular congestion | 12826 | 614 | 235 | 76 | 9770


Usage Notes

This dataset provides the extracted annotations; the CXR images and reports themselves can be found in the MIMIC-CXR-JPG and MIMIC-CXR datasets. MIMIC-CXR-JPG provides the CXR images, and MIMIC-CXR provides the image metadata, disease labels, and radiology reports. You can use `subject_id` and `study_id` to find the CXR image and corresponding report.
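A sketch of resolving an annotation record to its image file, assuming the standard MIMIC-CXR-JPG directory layout `files/pXX/p<subject_id>/s<study_id>/<dicom_id>.jpg`, where `pXX` is the first two digits of the `subject_id` (verify against the MIMIC-CXR-JPG documentation for your version):

    # Build the expected MIMIC-CXR-JPG image path for an annotation record.
    from pathlib import Path

    def image_path(root, subject_id, study_id, dicom_id):
        return (Path(root) / "files" / f"p{str(subject_id)[:2]}"
                / f"p{subject_id}" / f"s{study_id}" / f"{dicom_id}.jpg")

    # IDs taken from the example record in the Data Description section:
    image_path("mimic-cxr-jpg", "18936629", "55088298",
               "61976388-5e534624-f6465079-76ea9caf-116f9938")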


Ethics

The dataset is derived from the MIMIC-CXR database, which is a de-identified dataset that we have been granted access to via the PhysioNet Credentialed Health Data Use Agreement (v1.5.0).


Acknowledgements

We would like to acknowledge the MIMIC-CXR dataset for providing the radiology reports.


Conflicts of Interest

No conflicts.


References

  1. Johnson, A. E., Pollard, T. J., Berkowitz, S. J., Greenbaum, N. R., Lungren, M. P., Deng, C. Y., ... & Horng, S. (2019). MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1), 317.
  2. Johnson, A. E., Pollard, T. J., Greenbaum, N. R., Lungren, M. P., Deng, C. Y., Peng, Y., ... & Horng, S. (2019). MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042.
  3. Moukheiber D, Mahindre S, Moukheiber L, Moukheiber M, Wang S, Ma C, Shih G, Peng Y, Gao M. Few-Shot Learning Geometric Ensemble for Multi-label Classification of Chest X-Rays. In MICCAI Workshop on Data Augmentation, Labelling, and Imperfections 2022 Sep 16 (pp. 112-122). Cham: Springer Nature Switzerland.
  4. Wang R, Chen LC, Moukheiber L, Seastedt KP, Moukheiber M, Moukheiber D, Zaiman Z, Moukheiber S, Litchman T, Trivedi H, Steinberg R. Enabling chronic obstructive pulmonary disease diagnosis through chest X-rays: A multi-site and multi-modality study. International Journal of Medical Informatics. 2023 Oct 1;178:105211.
  5. Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., ... & Ng, A. Y. (2019, July). Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, No. 01, pp. 590-597).
  6. Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc.
  7. Explosion AI. (2021). spaCy 3.0: Industrial-strength Natural Language Processing in Python. https://spacy.io.

Parent Projects
CAD-Chest was derived from the MIMIC-CXR and MIMIC-CXR-JPG datasets. Please cite them when using this project.

Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
