Database Open Access
Image-derived cardiomegaly biomarker values for 96K chest X-rays in MIMIC-CXR/MIMIC-CXR-JPG
Benjamin Duvieusart , Felix Krones , Guy Parsons , Lionel Tarassenko , Bartlomiej W Papiez , Adam Mahdi
Published: Aug. 23, 2024. Version: 1.0.0
When using this resource, please cite:
Duvieusart, B., Krones, F., Parsons, G., Tarassenko, L., Papiez, B. W., & Mahdi, A. (2024). Image-derived cardiomegaly biomarker values for 96K chest X-rays in MIMIC-CXR/MIMIC-CXR-JPG (version 1.0.0). PhysioNet. https://doi.org/10.13026/kfpv-zm25.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Cardiomegaly is a condition characterized by an abnormal enlargement of the heart. Its identification is of paramount importance as it is associated with a wide range of cardiac conditions. It is primarily identified via the cardiothoracic ratio (CTR); however, this metric can be inaccurate as it is affected by external factors such as breathing and body position. Multimodal approaches could mitigate these limitations by integrating non-imaging data, but reliable and explainable integration of imaging and non-imaging data remains a significant challenge. While this database does not directly use multimodal data, it aims to tackle this challenge by extracting cardiomegaly biomarkers (CTR and cardiopulmonary area ratio) from chest X-rays, thus encapsulating the relevant imaging information into individual datapoints and allowing easy integration of ‘imaging’ data with non-imaging data for more reliable diagnostic tools. The values were extracted from over 93,000 posterior-anterior MIMIC-CXR scans using detection and segmentation neural networks, tuned for cardiac and pulmonary identification.
Background
Cardiomegaly is a cardiac condition characterized by an abnormal enlargement of the heart, and is associated with a range of cardiopulmonary diseases (e.g. coronary artery disease, congenital heart disorders) [1]. The cardiothoracic ratio (CTR) is the most common measurement used by doctors to identify patients with cardiomegaly; it is defined as the ratio of the cardiac width (horizontal distance between extremes of the cardiac shadow) to the thoracic width (horizontal distance between the inner side of the ribs at the level of the hemidiaphragms) [2]. Cardiopulmonary area ratio (CPAR) is a novel measurement proposed in [3]; it is defined as the ratio of the cardiac shadow area to the area of the lungs. Both biomarkers are measured from posterior-anterior (PA) chest X-rays. While these biomarkers are based on medical practice, they have inherent inaccuracies as they are affected by dilation of the cardiac chambers, respiratory phase, and body posture [2]. However, the use of multimodal data can curb these inaccuracies by incorporating external data into the diagnosis and mimicking clinicians’ processes.
This database presents automatically extracted CTR and CPAR values for more than 93,000 PA chest X-rays in MIMIC-CXR [4] / MIMIC-CXR-JPG [5]. It was originally generated in the context of [3], a study developing a multimodal approach to cardiomegaly diagnosis. Specifically, a subset of these CTR and CPAR values was used in an XGBoost model with other relevant cardiac data from MIMIC-CXR [4]. We hope this database will facilitate multimodal approaches to the cardiomegaly identification challenge by extracting the relevant information from the chest X-rays, thus simplifying the integration of imaging and non-imaging data. Furthermore, we hope this database will encourage continued research into the use of CTR and CPAR biomarkers by making them readily available.
Methods
Data Preparation
To extract CTR and CPAR values from chest X-rays, a total of four deep learning models are used: two Mask R-CNNs and two Faster R-CNNs developed in [3] for the segmentation and detection of the heart and lungs (see Usage Notes). Each model was pretrained on ImageNet and then tuned on 585 posterior-anterior chest X-rays and their ground truth segmentation masks. It is important to note that 200 of these chest X-rays were sourced from MIMIC-CXR-JPG; the other 385 samples were sourced from the JSRT [6] and Montgomery County Tuberculosis [7,8] databases.
A CTR value is calculated as the ratio of the width of the cardiac bounding box to the width of the pulmonary bounding box, where the width of the pulmonary bounding box serves as a surrogate for thoracic width due to the wider availability of lung ground truths for model training. To retrieve the bounding boxes, ensemble models with Intersection over Union (IoU) scores of 0.836 and 0.903 for cardiac and pulmonary detection, respectively, are used. IoU is a simple metric used to evaluate segmentation masks; it is defined as the ratio of the intersection of the predicted and ground truth masks to the union of the two, so a perfect mask has an IoU of 1. These ensemble models pass each X-ray scan through tuned Faster R-CNN and Mask R-CNN models, and the models’ predictions are combined to get a final bounding box.
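The two definitions in this paragraph (CTR as a ratio of bounding-box widths, and IoU as intersection over union) can be illustrated with a minimal sketch. The `Box` type, its coordinate layout, and the function names are hypothetical and are not taken from the authors' pipeline; IoU is shown here for boxes rather than masks, and the way the ensemble combines predictions is not reproduced.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self) -> float:
        return self.x_max - self.x_min

    @property
    def area(self) -> float:
        # Clamp at zero so an empty (non-overlapping) box has no area.
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


def cardiothoracic_ratio(heart: Box, lungs: Box) -> float:
    """CTR as cardiac box width divided by pulmonary box width."""
    if lungs.width <= 0:
        raise ValueError("pulmonary bounding box has no width")
    return heart.width / lungs.width


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two boxes; 1.0 means a perfect match."""
    overlap = Box(max(a.x_min, b.x_min), max(a.y_min, b.y_min),
                  min(a.x_max, b.x_max), min(a.y_max, b.y_max))
    union = a.area + b.area - overlap.area
    return overlap.area / union if union > 0 else 0.0


# Illustrative coordinates only, not values from the database.
heart_box = Box(300, 400, 700, 800)
lung_box = Box(100, 150, 900, 850)
ctr = cardiothoracic_ratio(heart_box, lung_box)
```

With these illustrative boxes the cardiac width is 400 pixels and the pulmonary width is 800 pixels, so `ctr` sits exactly at the 0.5 level.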
CPAR values are calculated as the ratio of the area of the cardiac segmentation mask to the area of the pulmonary segmentation mask. To retrieve the relevant segmentations, each X-ray is passed to a cardiac and a pulmonary Mask R-CNN model. Then Otsu thresholding [9] is applied to produce binary masks from which cardiac and pulmonary areas are estimated for CPAR. Otsu thresholding is a binary thresholding technique which selects the threshold that maximizes the inter-class variance between the two classes.
Further details on the pipeline used to generate CTR and CPAR biomarker values can be found in Multimodal Cardiomegaly Classification with Image-Derived Digital Biomarkers, 2022 [3]. This includes validation of the models completed against both the MIMIC-CXR labels and against clinical gold standard labels manually completed by a clinician.
For samples where the pipeline failed to extract a CPAR or CTR value, there are 3 possible error markings, as defined in the table below.
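As a rough illustration of this step, the sketch below binarizes a soft mask with Otsu's method and takes the ratio of the resulting areas. It assumes the soft masks have been scaled to integers in 0..255 and are given as nested lists; both choices, and all function names, are assumptions made for the example rather than details of the authors' code.

```python
from typing import List, Sequence


def otsu_threshold(values: Sequence[int], levels: int = 256) -> int:
    """Return the threshold t that maximizes between-class variance.

    Pixels with intensity > t are treated as foreground.
    """
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    total_sum = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    count_low = 0  # number of pixels with intensity <= t
    sum_low = 0    # summed intensity of those pixels
    for t in range(levels):
        count_low += hist[t]
        sum_low += t * hist[t]
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, variance is undefined
        mean_low = sum_low / count_low
        mean_high = (total_sum - sum_low) / count_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = count_low * count_high * (mean_low - mean_high) ** 2
        if variance > best_var:
            best_var, best_t = variance, t
    return best_t


def mask_area(soft_mask: List[List[int]]) -> int:
    """Binarize a soft mask with Otsu's threshold and count foreground pixels."""
    flat = [pixel for row in soft_mask for pixel in row]
    threshold = otsu_threshold(flat)
    return sum(1 for pixel in flat if pixel > threshold)


def cardiopulmonary_area_ratio(heart_mask: List[List[int]],
                               lung_mask: List[List[int]]) -> float:
    """CPAR as cardiac mask area divided by pulmonary mask area."""
    lung_area = mask_area(lung_mask)
    if lung_area == 0:
        raise ValueError("pulmonary mask is empty")
    return mask_area(heart_mask) / lung_area


# Tiny 4x4 toy masks: 4 bright heart pixels, 8 bright lung pixels.
heart = [[10, 10, 10, 10], [10, 200, 200, 10], [10, 200, 200, 10], [10, 10, 10, 10]]
lungs = [[220, 220, 5, 5], [220, 220, 5, 5], [220, 220, 5, 5], [220, 220, 5, 5]]
cpar = cardiopulmonary_area_ratio(heart, lungs)
```

On real images a library routine for Otsu thresholding would normally be preferable; the explicit loop is kept here only to make the variance criterion visible.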
VALUE | CAUSE OF ERROR |
---|---|
2 | Unable to locate heart |
3 | Unable to locate lungs |
4 | Unable to locate either |
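When reading the files, the error markings in the table above need to be separated from genuine biomarker values. The sketch below assumes the marker is stored in place of the biomarker value, as the table suggests, and that genuine ratios never equal 2, 3, or 4; the enum and function names are hypothetical.

```python
from enum import IntEnum
from typing import Optional, Tuple


class ExtractionError(IntEnum):
    """Error markings, following the table above."""
    NO_HEART = 2           # unable to locate heart
    NO_LUNGS = 3           # unable to locate lungs
    NO_HEART_OR_LUNGS = 4  # unable to locate either


def decode(raw: str) -> Tuple[Optional[float], Optional[ExtractionError]]:
    """Split a raw CSV cell into (biomarker value, error marking).

    Exactly one element of the returned pair is None.
    """
    value = float(raw)
    for error in ExtractionError:
        if value == error:
            return None, error
    return value, None


valid_value, no_error = decode("0.483")
missing_value, lung_error = decode("3")
```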
Data Sources
This database contains:
- CTR values for 96,161 posterior-anterior chest X-rays available in the MIMIC-CXR/MIMIC-CXR-JPG database. Summary stats of CTR values comparing all samples to cardiomegaly positive samples are shown below:
Statistic | All Samples | Cardiomegaly Positive Samples* |
---|---|---|
Total number of samples | 96,161 | 1,942 |
Non-erroneous samples** (%) | 95,203 (99.0%) | 1,913 (98.5%) |
Mean | 0.483 | 0.565 |
Std dev | 0.065 | 0.066 |
25th percentile | 0.438 | 0.527 |
75th percentile | 0.512 | 0.603 |
- CPAR values for 96,161 posterior-anterior chest X-rays available in the MIMIC-CXR database. Summary stats of CPAR values comparing all samples to cardiomegaly positive samples are shown below:
Statistic | All Samples | Cardiomegaly Positive Samples* |
---|---|---|
Total number of samples | 96,161 | 1,942 |
Non-erroneous samples** (%) | 93,879 (97.6%) | 1,881 (96.9%) |
Mean | 0.341 | 0.459 |
Std dev | 0.089 | 0.108 |
25th percentile | 0.280 | 0.386 |
75th percentile | 0.383 | 0.520 |
As expected, cardiomegaly positive samples* have higher biomarker values for both CTR and CPAR. In clinical situations, a CTR above 0.5 is considered to be pathological [1]; this aligns with the results here, with pathological samples having a mean CTR of 0.565 compared to 0.483 for the whole dataset. CPAR shows a similar discrepancy between the whole dataset and the cardiomegaly positive cohort, with the pathological cohort having a mean CPAR of 0.459, about 35% larger than the mean CPAR for the whole dataset (0.341). Due to the novelty of the CPAR biomarker, there is no clinically established threshold value which can be used to diagnose cardiomegaly, but the data suggest values around 0.35-0.4 are appropriate.
* The term "cardiomegaly positive samples" refers to scans with a positive cardiomegaly label from both the NegBio and CheXpert automatic labelers in MIMIC-CXR-JPG; this criterion was used to establish the presence of cardiomegaly in a scan with a sufficiently high degree of confidence.
** Non-erroneous samples refers to samples for which none of the 3 possible error values are raised.
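The kind of summary rows reported in the tables above (sample counts, mean, standard deviation, quartiles) can be recomputed from a column of values along the following lines. This is a sketch under assumptions: the error markers are taken to be the values 2, 3, and 4 stored in place of the biomarker, and the percentile interpolation and the choice of sample standard deviation are guesses, since the method used for the published figures is not stated.

```python
import statistics
from typing import Dict, Iterable

ERROR_MARKERS = {2.0, 3.0, 4.0}  # assumed to be stored in the value column


def summarize(values: Iterable[float]) -> Dict[str, float]:
    """Summary statistics over the non-erroneous values of one biomarker."""
    all_values = list(values)
    valid = [v for v in all_values if v not in ERROR_MARKERS]
    # quantiles(n=4) returns the three quartile cut points.
    p25, _median, p75 = statistics.quantiles(valid, n=4, method="inclusive")
    return {
        "total": len(all_values),
        "valid": len(valid),
        "valid_pct": 100.0 * len(valid) / len(all_values),
        "mean": statistics.mean(valid),
        "std": statistics.stdev(valid),  # sample standard deviation
        "p25": p25,
        "p75": p75,
    }


# Toy values only; two of the six entries are error markers.
summary = summarize([0.4, 0.5, 0.6, 0.7, 2, 3])
```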
Data Description
The dataset consists of two CSV files - CPARs.csv and CTRs.csv - which hold biomarker values, and a README text file.
Detailed Description
- CPARs.csv : file linking CPAR values to DICOM filenames for 96,161 posterior-anterior chest X-rays in MIMIC-CXR/MIMIC-CXR-JPG. Valid (i.e. non-erroneous) CPAR values available for 93,879 (97.6%) of scans.
- CTRs.csv : file linking CTR values to DICOM filenames for 96,161 posterior-anterior chest X-rays in MIMIC-CXR/MIMIC-CXR-JPG. Valid CTR values available for 95,208 (99.0%) of scans.
- README.txt : text file describing the meaning of error values.
Folder Structure
<base>
├── CPARs.csv
├── CTRs.csv
└── README.txt
Usage Notes
By design, this database is built to be used with the MIMIC-IV, MIMIC-CXR, and MIMIC-CXR-JPG databases to evaluate alternative approaches to multimodal AI. However, it can also be used in a unimodal context with labels from MIMIC-CXR-JPG to test the usefulness of the CTR and CPAR biomarkers with regard to cardiomegaly as well as any other cardiac condition.
Code showing extraction and use of CTR and CPAR biomarker values, including existing trained primary models, training of new primary models, construction of ensemble models, extraction of CTR and CPAR from MIMIC-CXR, and their implementation for cardiomegaly classification can be found in the CardiomegalyBiomarkers GitHub repository [10].
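For a quick start without the full repository, the two CSV files can be joined on the DICOM identifier and paired with labels. The sketch below is illustrative only: the header names `dicom_id`, `CTR`, and `CPAR` are assumptions (the actual headers should be checked in the files), the error markers are assumed to be the values 2, 3, and 4, and the labels are assumed to have already been mapped to DICOM identifiers. In-memory text stands in for the real files.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple

ERROR_MARKERS = {2.0, 3.0, 4.0}  # assumed error values in the value column


def read_values(handle: TextIO, id_column: str, value_column: str) -> Dict[str, float]:
    """Map each DICOM identifier to its biomarker value."""
    return {row[id_column]: float(row[value_column]) for row in csv.DictReader(handle)}


def build_feature_rows(ctr: Dict[str, float],
                       cpar: Dict[str, float],
                       labels: Dict[str, int]) -> List[Tuple[str, float, float, int]]:
    """Keep scans that have a label and valid values for both biomarkers."""
    rows = []
    for dicom_id in sorted(labels):
        c = ctr.get(dicom_id)
        p = cpar.get(dicom_id)
        if c is None or p is None or c in ERROR_MARKERS or p in ERROR_MARKERS:
            continue  # missing scan or failed extraction
        rows.append((dicom_id, c, p, labels[dicom_id]))
    return rows


# Toy stand-ins for CTRs.csv and CPARs.csv; identifiers and values are made up.
ctr_file = io.StringIO("dicom_id,CTR\na1,0.47\nb2,0.58\nc3,2\n")
cpar_file = io.StringIO("dicom_id,CPAR\na1,0.31\nb2,0.46\nc3,0.40\n")
labels = {"a1": 0, "b2": 1, "c3": 1}  # hypothetical cardiomegaly labels

rows = build_feature_rows(read_values(ctr_file, "dicom_id", "CTR"),
                          read_values(cpar_file, "dicom_id", "CPAR"),
                          labels)
```

For the real files, the `io.StringIO` objects would be replaced by `open("CTRs.csv", newline="")` and `open("CPARs.csv", newline="")`, with the column names adjusted to match the actual headers. The resulting rows can then be passed to any tabular classifier.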
Ethics
The authors declare no ethics concerns.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
- Siddiqui, W., Amin, H. (2021). Cardiomegaly. StatPearls [Internet]. Available from: https://www.ncbi.nlm.nih.gov/books/NBK542296/.
- Chaisangmongkon, W. et al. (2021) External validation of deep learning algorithms for cardiothoracic ratio measurement. IEEE Access, vol. 9, pp. 110287-110298. https://doi.org/10.1109/ACCESS.2021.3101253
- Duvieusart, B. et al. (2022). Multimodal Cardiomegaly Classification with Image-Derived Digital Biomarkers. Medical Image Understanding and Analysis. MIUA 2022: Lecture Notes in Computer Science, vol. 13413, pp. 13-27. Springer, Cham. https://doi.org/10.1007/978-3-031-12053-4_2
- Johnson, A. et al. (2019). MIMIC-CXR Database (version 2.0.0). PhysioNet. https://doi.org/10.13026/C2JT1Q.
- Johnson, A. et al. (2019). MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.0.0). PhysioNet. https://doi.org/10.13026/8360-t248.
- Shiraishi, J. et al. (2000). Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules. American Journal of Roentgenology, vol. 174(1), pp. 71-74. https://doi.org/10.2214/ajr.174.1.1740071
- Candemir, S. et al. (2014). Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE Transactions on Medical Imaging, vol. 33(2), pp. 577-590. https://doi.org/10.1109/TMI.2013.2290491
- Jaeger, S. et al. (2014). Automatic tuberculosis screening using chest radiographs. IEEE Transactions on Medical Imaging, vol. 33(2), pp. 233-245. https://doi.org/10.1109/TMI.2013.2284099
- Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, vol. 9(1), pp. 62-66. https://doi.org/10.1109/TSMC.1979.4310076
- Duvieusart, B., Krones, F. (2022). CardiomegalyBiomarkers GitHub Repository. GitHub. Retrieved on 7 July 2023 from https://github.com/benduvi20/CardiomegalyBiomarkers
Access
Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.
License (for files):
Open Data Commons Attribution License v1.0
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/kfpv-zm25
DOI (latest version):
https://doi.org/10.13026/4evw-jd69
Topics:
biomarkers
mimic-cxr
cpar
ctr
cardiomegaly
Files
Total uncompressed size: 11.9 MB.
Access the files
- Download the ZIP file (6.7 MB)
- Download the files using your terminal:
wget -r -N -c -np https://physionet.org/files/cxr-cardiomegaly/1.0.0/
Name | Size | Modified |
---|---|---|
CPARs.csv | 5.9 MB | 2022-10-26 |
CTRs.csv | 5.9 MB | 2022-10-26 |
LICENSE.txt | 19.9 KB | 2024-06-24 |
README.txt | 515 B | 2022-07-29 |
SHA256SUMS.txt | 302 B | 2024-08-26 |
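After downloading, the files can be checked against SHA256SUMS.txt. The sketch below assumes that file follows the conventional layout of one `<hex digest> <filename>` pair per line, which has not been verified for this dataset; the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(directory: Path, sums_name: str = "SHA256SUMS.txt") -> Dict[str, bool]:
    """Return, for each listed file, whether its digest matches the listing."""
    results = {}
    for line in (directory / sums_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # "*" marks binary mode in sha256sum output
        results[name] = sha256_of(directory / name) == expected.lower()
    return results
```

Calling `verify_checksums(Path("physionet.org/files/cxr-cardiomegaly/1.0.0"))` on the directory created by the `wget` command above would return a mapping such as `{"CTRs.csv": True, ...}` if the assumed layout holds.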