Database Credentialed Access

CAD-Chest: Comprehensive Annotation of Diseases based on MIMIC-CXR Radiology Report

Mengliang Zhang Xinyue Hu Lin Gu Tatsuya Harada Kazuma Kobayashi Ronald Summers Yingying Zhu

Published: Dec. 8, 2023. Version: 1.0


When using this resource, please cite:
Zhang, M., Hu, X., Gu, L., Harada, T., Kobayashi, K., Summers, R., & Zhu, Y. (2023). CAD-Chest: Comprehensive Annotation of Diseases based on MIMIC-CXR Radiology Report (version 1.0). PhysioNet. https://doi.org/10.13026/44pd-vz36.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Most extant chest X-ray (CXR) datasets provide only binary disease labels and lack comprehensive disease-related information. Crucial facets of disease management, including disease severity, diagnostic uncertainty, and precise localization, are often absent from these datasets, yet they hold substantial clinical significance. In this work, we present a comprehensive annotation of diseases (CAD) on CXR images, named the CAD-Chest dataset. We have leveraged radiology reports authored by medical professionals to devise label extraction protocols that capture essential disease-related attributes, including disease name, severity grading, and additional pertinent details. The dataset offers researchers and practitioners a holistic perspective on diseases, transcending mere binary presence-or-absence classification.


Background

In recent years, substantial advancements [1-5] have been achieved in the realm of computer-aided diagnosis utilizing chest X-ray (CXR) images. Predominantly, extant applications have stemmed from the classification task, aimed at discerning the presence of specific diseases, or the detection task, geared towards localizing pathological conditions. Consequently, assigning multiple disease labels to each image or delineating bounding boxes around symptomatic regions becomes imperative for creating a comprehensive CXR image dataset. However, the annotation process of CXR images demands a substantial degree of expertise, rendering the establishment of such datasets a non-trivial endeavor.

Certain datasets extract disease labels from accompanying radiologist reports to mitigate the expenses associated with the disease labeling process. These textual documents are inherently structured and encompass detailed descriptions of pulmonary conditions meticulously recorded by radiologists following the acquisition of CXR images. In contrast to the labor-intensive alternative of relying on additional medical professionals for annotation, the extraction of disease labels from text not only conserves invaluable medical resources but also attains a level of accuracy commensurate with that of seasoned medical practitioners.

Furthermore, it is noteworthy that disease diagnosis in the realm of CXR images transcends the relatively straightforward binary classification task of disease presence or absence. Radiologists are often tasked with providing a comprehensive assessment, encompassing disease type, severity, precise localization, and other intricate details. In certain instances, owing to factors such as image quality and limited clinical experience, radiologists may not be able to definitively ascertain the presence of a disease, leading to the utilization of terminology denoting uncertainty in their descriptions. This phenomenon is commonly referred to as "uncertain label" and is evident in select CXR datasets, including but not limited to CheXpert [5] and MIMIC-CXR [1].

Various approaches have been proposed to address uncertain labels. One straightforward approach is to treat all uncertain labels as negative or as positive. CheXpert [5] also explores a three-class methodology, treating uncertain labels as a distinct category and constraining the cumulative probability of the negative, positive, and uncertain labels to equal 1. While these methods offer a reasonable means of handling uncertain labels, their principal objective remains the binary classification of disease presence, and they do not afford comprehensive disease-specific information.

In light of this, we leverage the extensive MIMIC-CXR [1] and MIMIC-CXR-JPG [2] datasets to exhaustively extract and analyze disease-related information, extending our scope beyond conventional disease label classification. Our proposed dataset makes two contributions:

1. We design a rule-based method to extract disease labels from text reports.

2. We consider comprehensive information on disease diagnosis and extract labels such as severity and uncertainty related to the disease.


Methods

Annotation Extraction

We constructed the CAD-Chest dataset from the free-text reports released by the MIMIC-CXR [1] dataset and obtain the desired disease information by applying a set of extraction rules. Specifically, we utilized the NLTK [6] and spaCy [7] packages for biomedical text processing to extract entities from the reports. We define a set of disease-related words; when these words are detected in a report, we treat the report as containing a description of the corresponding disease. Below, we introduce how each type of label is extracted:

  1. Disease Extraction

The same disease may be described differently: doctors may use different words for the same condition in their reports. To extract disease labels from the reports, we provide a file `disease_list.txt` containing the words that represent each disease. In this file, for example, atelectasis has several representations: "atelectasis, collapse". When a word from this dictionary appears in a report, we consider the report to describe the corresponding disease, and we then search the text for the disease's severity, uncertainty, and location information.
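A minimal sketch of this detection step is below. It assumes, for illustration only, that each line of `disease_list.txt` maps a canonical disease name to comma-separated synonyms; the released file may be formatted differently.

    # Sketch of disease detection via keyword matching. Assumes each line of
    # disease_list.txt looks like "atelectasis: atelectasis, collapse";
    # the actual file format may differ.

    def load_disease_synonyms(path="disease_list.txt"):
        synonyms = {}
        with open(path) as f:
            for line in f:
                if ":" not in line:
                    continue
                disease, words = line.strip().split(":", 1)
                synonyms[disease.strip()] = [w.strip() for w in words.split(",")]
        return synonyms

    def detect_diseases(sentence, synonyms):
        """Return canonical disease names whose synonyms appear in the sentence.

        Naive substring matching; a production version should match on word
        boundaries or tokenized text.
        """
        text = sentence.lower()
        return [name for name, words in synonyms.items()
                if any(w in text for w in words)]

    detect_diseases("There is mild cardiomegaly.",
                    {"cardiomegaly": ["cardiomegaly"]})  # -> ["cardiomegaly"]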

  2. Severity Extraction

After detecting that a sentence in the report describes a disease, we extract the severity of the disease from that sentence. For example, given the sentence "There is mild cardiomegaly.", we detect the disease "cardiomegaly" in the first step, then use grammatical analysis to find words related to the disease and filter out those that indicate severity. We provide a file `severity_words.txt` containing the words that describe severity. In the above example, "mild" describes "cardiomegaly", so it is extracted as the severity of the disease. The same procedure applies to extracting disease uncertainty and location.
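The grammatical analysis can be sketched with a spaCy dependency parse. The snippet below uses the general-purpose `en_core_web_sm` model for illustration (the exact biomedical pipeline used by the authors is not specified here) and a hand-picked subset of `severity_words.txt`.

    # Sketch of severity extraction: find words grammatically attached to the
    # disease token and keep those in the severity vocabulary.
    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    SEVERITY_WORDS = {"mild", "small", "trace", "minor", "minimal", "minimally",
                      "subtle", "mildly", "moderate", "moderately", "severe",
                      "acute", "massive"}  # illustrative subset

    def extract_severity(sentence, disease):
        doc = nlp(sentence)
        for token in doc:
            if token.text.lower() == disease:
                # Modifiers (e.g. adjectives) that depend on the disease token.
                return [child.text for child in token.children
                        if child.text.lower() in SEVERITY_WORDS]
        return []

    extract_severity("There is mild cardiomegaly.", "cardiomegaly")  # ["mild"]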

Severity keywords in the `severity_words.txt` file are listed in Table I. We merge keywords with similar meanings and divide them into four severity levels.

Table I. Keywords of disease severity.

Merged Level | Extracted Words
Mild | mild, small, trace, minor, minimal, minimally, subtle, mildly
Moderate | moderate, moderately, mild to moderate
Severe | severe, acute, massive, moderate to severe, moderate to large
No | no, without, clear of, negative, exclude, lack of, rule out
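One way to apply this merging in code is a word-to-level lookup built from Table I (a sketch; the released files may encode the mapping differently).

    # Map each extracted severity word to its merged level from Table I.
    SEVERITY_LEVELS = {
        "Mild": ["mild", "small", "trace", "minor", "minimal", "minimally",
                 "subtle", "mildly"],
        "Moderate": ["moderate", "moderately", "mild to moderate"],
        "Severe": ["severe", "acute", "massive", "moderate to severe",
                   "moderate to large"],
        "No": ["no", "without", "clear of", "negative", "exclude", "lack of",
               "rule out"],
    }
    WORD_TO_LEVEL = {word: level for level, words in SEVERITY_LEVELS.items()
                     for word in words}

    WORD_TO_LEVEL["minimal"]  # -> "Mild"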

  3. Uncertainty Extraction

For disease uncertainty, we provide a file `uncertainty_words.txt` containing words related to uncertainty. Owing to unclear images or limited clinical experience, doctors are sometimes uncertain when describing a disease, as in "The left lower lung may have mild pleural effusion.". "May" expresses uncertainty, indicating that the doctor is not entirely sure the disease exists. We determine the uncertainty of "pleural effusion" by filtering "may" out of the words related to the disease, based on the provided word list.

Uncertainty keywords in the `uncertainty_words.txt` file are listed in Table II. Unlike other datasets, which assign binary disease labels, we assign different label values to these keywords according to the degree of uncertainty: a larger value indicates higher certainty, and a smaller value indicates higher uncertainty.

Table II. Keywords of uncertainty.

Uncertain Words | Label Value
positive, change in | 1.0
probable, likely, may, could, potential | 0.7
might, possible | 0.5
not exclude, difficult exclude, cannot be assessed, cannot be identified, impossible exclude, not rule out, cannot be evaluated | 0.3
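A sketch of applying the Table II mapping is below; it checks multi-word cues before single words so that, e.g., "not exclude" is not misread as a certain mention. The default value and matching strategy are illustrative assumptions.

    # Assign an uncertainty value to a disease description per Table II.
    UNCERTAINTY_SCORES = [
        (["not exclude", "difficult exclude", "cannot be assessed",
          "cannot be identified", "impossible exclude", "not rule out",
          "cannot be evaluated"], 0.3),                       # most uncertain
        (["might", "possible"], 0.5),
        (["probable", "likely", "may", "could", "potential"], 0.7),
        (["positive", "change in"], 1.0),                     # most certain
    ]

    def uncertainty_score(sentence, default=1.0):
        text = sentence.lower()
        for cues, score in UNCERTAINTY_SCORES:
            if any(cue in text for cue in cues):  # naive substring matching
                return score
        return default  # no cue found: treat as a certain mention

    uncertainty_score("The left lower lung may have mild pleural effusion.")  # 0.7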

  4. Location Extraction

The disease location is also important. We provide a file `location_words.txt` containing words that describe disease location, such as "left", "right", and "lower". For example, in the report text "The left lower lung may have mild pleural effusion.", we find the disease label "pleural effusion" together with "left", "lower", and "lung" indicating location, so we take these words as the location of "pleural effusion".
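Location extraction can be sketched as filtering the disease-related words through a location vocabulary (an illustrative subset of `location_words.txt`).

    # Keep only the disease-related words that denote a location.
    LOCATION_WORDS = {"left", "right", "lower", "upper", "lung", "basal",
                      "bilateral"}  # illustrative subset of location_words.txt

    def extract_location(related_words):
        return [w for w in related_words if w.lower() in LOCATION_WORDS]

    # Words linked to "pleural effusion" by the dependency parse:
    extract_location(["left", "lower", "lung", "mild", "may"])
    # -> ["left", "lower", "lung"]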

Evaluation

We invited doctors to evaluate the annotations extracted from the radiology reports.

The evaluation covered 500 randomly sampled radiology reports and their corresponding extracted abnormality, severity, and uncertainty labels. Note that a single report may contain descriptions of several abnormalities. We counted the occurrences of five error conditions: missed abnormality, missed severity, wrong abnormality, wrong severity, and wrong uncertainty. Each condition is described below:

  1. Missed abnormality: the abnormality label is not extracted by our method.
  2. Missed severity: the abnormality is extracted correctly, but the severity is not extracted.
  3. Wrong abnormality: abnormality that is not in the report is extracted.
  4. Wrong severity: the abnormality is extracted correctly, but the corresponding severity extracted is wrong.
  5. Wrong uncertainty: the abnormality is extracted correctly, but the corresponding uncertainty extracted is wrong.

We extracted approximately 2645 abnormality descriptions from these 500 samples. We used this value as the total count when calculating the percentage of each error condition, reflecting the accuracy of the extracted labels. The statistics are given in Table III.

Table III. Evaluation of the 500 samples.

Condition | Occurrence | Percentage
Missed abnormality | 59 | 2.23%
Missed severity | 16 | 0.60%
Wrong abnormality | 7 | 0.26%
Wrong severity | 2 | 0.08%
Wrong uncertainty | 54 | 2.20%

These statistics show that the extracted labels are correct in most cases; the errors are concentrated in missed abnormalities and wrong uncertainty levels, each occurring at a rate of around 2.2%.


Data Description

Data Example

The labels extracted from the radiology reports are stored in the file `cad_disease.json`. For a single CXR study of a patient, the annotation structure is as follows:

{
        "study_id": "55088298",
        "subject_id": "18936629",
        "dicom_id": "61976388-5e534624-f6465079-76ea9caf-116f9938",
        "view": "antero-posterior",
        "study_order": 5,
        "entity": {
            "pneumothorax": {
                "id": 0,
                "entity_name": "pneumothorax",
                "report_name": "pneumothorax",
                "location": [
                    "left"
                ],
                "level": [
                    "minimal"
                ],
                "location2": null,
                "level2": null,
                "probability": "positive",
                "probability_score": 3
            },
            "atelectasis": {
                "id": 1,
                "entity_name": "atelectasis",
                "report_name": "atelectasis",
                "location": [
                    "left",
                    "basal"
                ],
                "level": [
                    "mild"
                ],
                "location2": null,
                "level2": null,
                "probability": "positive",
                "probability_score": 3
            },
            "edema": {
                "id": 2,
                "entity_name": "edema",
                "report_name": "pulmonary edema",
                "location": null,
                "level": null,
                "location2": null,
                "level2": null,
                "probability": "without",
                "probability_score": -3
            }
        }
}
  • `study_id`: an integer unique for an individual study (i.e. an individual radiology report with one or more associated images).
  • `subject_id`: an integer unique for an individual patient.
  • `dicom_id`: an identifier for the `DICOM` file. The stem of each .jpg image filename is equal to the `dicom_id`.
  • `view`: the orientation in which the chest radiograph was taken ("AP", "PA", "LATERAL", etc).
  • `entity`: the comprehensive annotation of diseases we extracted from radiologist reports. The entity dictionary includes disease annotations; each disease name is the key, and the annotation is the value.
  • `entity_name`: disease name.
  • `report_name`: disease name extracted from radiologist report.
  • `location`: disease location extracted from the radiologist report. Sometimes one disease may occur in more than one place; the dataset provides `location2` and `level2` for this scenario.
  • `level`: words extracted from radiologist reports to describe disease severity.
  • `probability`: words extracted from radiologist reports to describe disease uncertainty.
  • `probability_score`: a score expressing the degree of uncertainty; the larger the score, the more certain the finding. Users can remap the `probability_score` according to their own needs.
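As a minimal sketch of consuming the file, the snippet below assumes `cad_disease.json` is a list of study records shaped like the example above, and that positive findings carry a `probability_score` greater than 0, as in that example.

    # Load the annotations and list the positively reported diseases per study.
    import json

    with open("cad_disease.json") as f:
        studies = json.load(f)  # assumed: a list of records as shown above

    for study in studies:
        positives = [name for name, ann in study["entity"].items()
                     if ann["probability_score"] > 0]
        print(study["study_id"], study["view"], positives)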

Compared with the 14 disease labels provided by previous datasets, we expanded the number of disease labels and extracted as much disease information as possible from the reports. These diseases include atelectasis, cardiomegaly, edema, pneumonia, and others, listed in the file `disease_list.txt`. The extraction of disease labels, severity, uncertainty, and location information is described in the Methods section.

Data Distribution

We present the disease distribution in the extracted annotations. In CXR imaging, AP (anteroposterior) and PA (posteroanterior) are the two most common views, accounting for the vast majority of the MIMIC-CXR dataset. Therefore, we counted the annotation distribution in reports corresponding to CXR images with AP and PA views.

There are 65379 `subject_id` and 227827 `study_id` values in the dataset. For images and reports with AP and PA views, there are 63903 `subject_id` and 217982 `study_id` values. We compute the uncertainty distribution of 18 diseases. For each uncertainty level, we use the value assignment in Table II and count disease occurrences under that level. The results are listed in Table IV below, where columns 2 to 6 correspond to the different uncertainty values. For example, if "atelectasis" is described with "may" in a report, it is assigned a label value of 0.7 and counted in column 3.
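The counting itself can be sketched as follows. The word-to-value mapping is a small illustrative subset of Table II, and the view strings follow the example record in the Data Description section; verify both against the released files.

    # Count (disease, uncertainty value) pairs over AP/PA studies, as in Table IV.
    from collections import Counter
    import json

    # Illustrative subset of the Table II mapping; negative words fall in column 0.
    WORD_TO_VALUE = {"positive": 1.0, "may": 0.7, "possible": 0.5,
                     "not exclude": 0.3, "no": 0.0, "without": 0.0}

    with open("cad_disease.json") as f:
        studies = json.load(f)

    counts = Counter()
    for study in studies:
        if study["view"] not in ("antero-posterior", "postero-anterior"):
            continue  # keep AP/PA views only
        for name, ann in study["entity"].items():
            counts[(name, WORD_TO_VALUE.get(ann["probability"]))] += 1

    counts[("atelectasis", 0.7)]  # atelectasis described with e.g. "may"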

Table IV. Distribution of diseases with different uncertainty values. Columns 2 to 6 correspond to the label values defined in Table II.

Disease | 1 | 0.7 | 0.5 | 0.3 | 0
atelectasis | 59328 | 16811 | 803 | 2136 | 1284
blunting of the costophrenic angle | 2746 | 36 | 36 | 93 | 44
calcification | 8201 | 200 | 37 | 12 | 98
cardiomegaly | 38699 | 463 | 66 | 192 | 707
consolidation | 13943 | 1623 | 285 | 1362 | 56336
edema | 28866 | 4669 | 558 | 513 | 34857
emphysema | 5529 | 398 | 44 | 13 | 129
fracture | 7838 | 352 | 103 | 30 | 5469
granuloma | 1705 | 522 | 34 | 5 | 87
hernia | 2641 | 421 | 53 | 19 | 65
lung opacity | 65234 | 735 | 225 | 2088 | 6176
pleural effusion | 59853 | 8072 | 1750 | 1965 | 112325
pleural thickening | 2757 | 759 | 75 | 22 | 70
pneumonia | 16339 | 9938 | 1120 | 6082 | 31163
pneumothorax | 10225 | 538 | 171 | 204 | 137301
scoliosis | 2637 | 46 | 3 | 1 | 18
tortuosity of the thoracic aorta | 1771 | 4 | 0 | 0 | 13
vascular congestion | 12826 | 614 | 235 | 76 | 9770


Usage Notes

This dataset provides the extracted annotations; the CXR images and reports themselves can be found in the MIMIC-CXR-JPG and MIMIC-CXR datasets. MIMIC-CXR-JPG provides the CXR images, and MIMIC-CXR provides the image metadata, disease labels, and radiology reports. You can use `subject_id` and `study_id` to find the CXR image and corresponding report.
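A sketch of resolving an annotation record to its image file, assuming the standard MIMIC-CXR-JPG directory layout `files/pXX/p<subject_id>/s<study_id>/<dicom_id>.jpg`, where `pXX` is the first two digits of the `subject_id` (verify against the MIMIC-CXR-JPG documentation for your version):

    # Build the expected MIMIC-CXR-JPG image path for an annotation record.
    from pathlib import Path

    def image_path(root, subject_id, study_id, dicom_id):
        return (Path(root) / "files" / f"p{str(subject_id)[:2]}"
                / f"p{subject_id}" / f"s{study_id}" / f"{dicom_id}.jpg")

    # IDs taken from the example record in the Data Description section:
    image_path("mimic-cxr-jpg", "18936629", "55088298",
               "61976388-5e534624-f6465079-76ea9caf-116f9938")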


Ethics

The dataset is derived from the MIMIC-CXR database, which is a de-identified dataset that we have been granted access to via the PhysioNet Credentialed Health Data Use Agreement (v1.5.0).


Acknowledgements

We would like to acknowledge the MIMIC-CXR dataset for providing the radiology reports.


Conflicts of Interest

No conflicts.


References

  1. Johnson, A. E., Pollard, T. J., Berkowitz, S. J., Greenbaum, N. R., Lungren, M. P., Deng, C. Y., ... & Horng, S. (2019). MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1), 317.
  2. Johnson, A. E., Pollard, T. J., Greenbaum, N. R., Lungren, M. P., Deng, C. Y., Peng, Y., ... & Horng, S. (2019). MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042.
  3. Moukheiber D, Mahindre S, Moukheiber L, Moukheiber M, Wang S, Ma C, Shih G, Peng Y, Gao M. Few-Shot Learning Geometric Ensemble for Multi-label Classification of Chest X-Rays. In MICCAI Workshop on Data Augmentation, Labelling, and Imperfections 2022 Sep 16 (pp. 112-122). Cham: Springer Nature Switzerland.
  4. Wang R, Chen LC, Moukheiber L, Seastedt KP, Moukheiber M, Moukheiber D, Zaiman Z, Moukheiber S, Litchman T, Trivedi H, Steinberg R. Enabling chronic obstructive pulmonary disease diagnosis through chest X-rays: A multi-site and multi-modality study. International Journal of Medical Informatics. 2023 Oct 1;178:105211.
  5. Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., ... & Ng, A. Y. (2019, July). Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, No. 01, pp. 590-597).
  6. Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc.
  7. Explosion AI. (2021). spaCy 3.0: Industrial-strength Natural Language Processing in Python. https://spacy.io.

Parent Projects
CAD-Chest was derived from the MIMIC-CXR and MIMIC-CXR-JPG datasets. Please cite them when using this project.

Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
