Database Credentialed Access

MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing

Benedikt Boecking Naoto Usuyama Shruthi Bannur Daniel Coelho de Castro Anton Schwaighofer Stephanie Hyland Maria Teodora Wetscherek Tristan Naumann Aditya Nori Javier Alvarez Valle Hoifung Poon Ozan Oktay

Published: May 16, 2022. Version: 0.1

When using this resource, please cite:
Boecking, B., Usuyama, N., Bannur, S., Coelho de Castro, D., Schwaighofer, A., Hyland, S., Wetscherek, M. T., Naumann, T., Nori, A., Alvarez Valle, J., Poon, H., & Oktay, O. (2022). MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing (version 0.1). PhysioNet.

Additionally, please cite the original publication:

Boecking B, Usuyama N, Bannur S, Castro D.C., Schwaighofer A, Hyland S, Wetscherek M, Naumann T, Nori A, Alvarez-Valle J, Poon H, and Oktay O. 2022. Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, Oct 23–27, 2022, Proceedings, Part XXXVI. Springer-Verlag 1–21.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


We release a new dataset, MS-CXR, with locally-aligned phrase grounding annotations by board-certified radiologists to facilitate the study of complex semantic modelling in biomedical vision–language processing. The MS-CXR dataset provides 1162 image–sentence pairs of bounding boxes and corresponding phrases, collected across eight different cardiopulmonary radiological findings, with an approximately equal number of pairs for each finding. This dataset complements the existing MIMIC-CXR v2 dataset and comprises: (1) reviewed and edited bounding boxes and phrases (1026 bounding-box/sentence pairs); and (2) bounding boxes manually annotated from scratch (136 bounding-box/sentence pairs).

This large, well-balanced phrase grounding benchmark contains carefully curated image regions annotated with descriptions of eight radiology findings, as verified by radiologists. Unlike existing chest X-ray benchmarks, this challenging phrase grounding task evaluates joint, local image–text reasoning while requiring real-world language understanding, e.g. to parse domain-specific location references, complex negations, and bias in reporting style. These data accompany work showing that principled textual semantic modelling can improve contrastive learning in self-supervised vision–language processing.


Presently, no datasets exist that allow for phrase grounding of radiology findings, but some enable other forms of local image evaluation. The VinDr [2], RSNA Pneumonia [3], and NIH Chest X-ray [4] datasets provide bounding-box annotations but lack free-text descriptions. REFLACX [1] provides gaze locations (ellipses) captured with an eye tracker and dictated reports, but no full matches between phrases and image regions. The phrase annotations for MIMIC-CXR data released in [5] cover only a small number of studies (350), contain only two abnormalities, and for some samples use shortened phrases that were adapted to simplify the task. The ground-truth set of ImaGenome [6] contains only 500 studies, its bounding boxes annotate anatomical regions rather than radiological findings, and its sentence annotations are not curated for grounding evaluation.


We first parse the original MIMIC-CXR reports and REFLACX [1] radiology transcripts, extracting sentences to form a large pool of text descriptions of pathologies. These candidates are then filtered with the CheXbert [9] text classifier to keep only phrases associated with the target pathologies, whilst ensuring the following two criteria: (I) for a given study, only one sentence describes the target pathology; and (II) the sentence does not mention multiple findings that are unrelated to each other. The extracted text descriptions are then paired with image annotations at the study level. At the final stage, two board-certified radiologists review the pairs, primarily to verify the match between each text and bounding-box candidate. In this review process, we also assessed the suitability of each annotation pair for the phrase grounding task whilst ensuring clinical accuracy.

In detail, the phrase-image samples are filtered out if at least one of the following conditions is met: 

  1. The finding described in the text is not present in the image
  2. Phrase/sentence does not describe a clinical finding or describes multiple unrelated abnormalities that appear in different lung regions. 
  3. There is a mismatch between the bounding box and phrase, such as image annotations are placed incorrectly or do not capture the true extent of the abnormality. 
  4. High uncertainty is expressed regarding reported findings, e.g. “there is questionable right lower lobe opacity”.
  5. Chest X-ray is not suitable for assessment of the finding or has poor image quality.
  6. Text contains differential diagnosis or longitudinal information that prohibits correct grounding via the single paired image.
  7. Sentence is long (>30 tokens); such sentences often contain patient meta-information that is not shared between the two modalities (e.g. de-identified tokens).

Note that we only filter out phrases containing multiple findings, not images with multiple findings. For instance, if an image contains both pneumonia and atelectasis, with separate descriptions for each in the report, then we create two instances of phrase-bounding box pairs. 
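
The per-study sentence selection described above can be sketched as follows. This is an illustrative sketch only: the `select_candidates` function, its `related` argument, and the toy label sets standing in for CheXbert predictions are hypothetical, not the authors' actual pipeline or the classifier's real interface.

```python
def select_candidates(study_sentences, target, related=()):
    """Apply criteria (I) and (II) to one study.

    study_sentences: list of (sentence, set_of_predicted_findings) pairs.
    Returns the single sentence describing `target`, or None if the study
    has zero or multiple candidates, or the candidate mentions an
    unrelated finding.
    """
    hits = [(s, labels) for s, labels in study_sentences if target in labels]
    if len(hits) != 1:            # criterion (I): exactly one describing sentence
        return None
    sentence, labels = hits[0]
    extras = labels - {target} - set(related)
    if extras:                    # criterion (II): no unrelated co-mentioned findings
        return None
    return sentence

study = [
    ("Right lower lobe consolidation.", {"Consolidation"}),
    ("Stable cardiomegaly.", {"Cardiomegaly"}),
]
print(select_candidates(study, "Consolidation"))  # prints: Right lower lobe consolidation.
```

A second sentence mentioning the same target in the study, or a candidate that also mentions an unrelated finding, would make the function return None and drop the study for that pathology.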

To further increase the size of our dataset, and to balance samples across classes, additional CXR studies are sampled at random, conditioned on the underrepresented pathologies. The following procedure is applied to create the image and text annotation pairs for these selected studies: text descriptions are extracted with the same methodology outlined above, using the MIMIC-CXR and ImaGenome [6] datasets, where the latter provides sentence extracts for clinical findings from a subset of MIMIC-CXR. However, unlike in the initial step, the corresponding bounding-box annotations (one or more per sentence) are created from scratch by radiologists for the finding described in the text, and the same filtering as above is applied by the annotator to discard candidates if the image and/or sentence is found unsuitable for the grounding task.

Data Description

We provide bounding box and sentence pair annotations describing clinical findings visible in a given chest X-ray image. Each sentence describes a single pathology present in the image, and there could be multiple manually annotated bounding boxes corresponding to the description of the single radiological finding. Additionally, an image may have more than one pathology present, and we provide separate sets of bounding boxes for each phrase describing a unique pathology associated with an image. The annotations were collected on a subset of MIMIC-CXR images, which additionally contains labels across eight different pathologies: atelectasis, cardiomegaly, consolidation, edema, lung opacity, pleural effusion, pneumonia, and pneumothorax. These pathologies were chosen based on the overlap between pathology classes present in the existing datasets and the CheXbert classifier [9].

Folder structure

This project contains 3 files:

  • MS_CXR_Local_Alignment_v1.0.0.json: Phrase grounding annotations in MS-COCO JSON format.
  • MS_CXR_Local_Alignment_v1.0.0.csv: The same annotations in a tabular format.
  • Python script used to read and convert the COCO annotations.
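
The COCO-to-CSV conversion performed by that script can be sketched with the standard library alone. The sample records and output column names below are illustrative; the script shipped with the dataset may differ.

```python
import csv
import io

# Toy in-memory sample mirroring the COCO schema; in practice the dict would
# come from json.load() on MS_CXR_Local_Alignment_v1.0.0.json.
coco = {
    "categories": [{"id": 0, "name": "Pneumothorax"}],
    "images": [{"id": 16, "file_name": "c436cddb-4126f15e-59c0733c-34b5a4b5-bbda7ffd.jpg"}],
    "annotations": [{"id": 18, "image_id": 16, "category_id": 0}],
}

# Build lookup tables so each annotation row can be joined to its image and category.
images = {img["id"]: img for img in coco["images"]}
categories = {c["id"]: c["name"] for c in coco["categories"]}

out = io.StringIO()  # replace with open("annotations.csv", "w", newline="") to write a file
writer = csv.DictWriter(out, fieldnames=["annotation_id", "file_name", "category"])
writer.writeheader()
for ann in coco["annotations"]:
    writer.writerow({
        "annotation_id": ann["id"],
        "file_name": images[ann["image_id"]]["file_name"],
        "category": categories[ann["category_id"]],
    })
print(out.getvalue())
```

Flattening the nested COCO structure into one row per annotation is what makes the CSV convenient for spreadsheet tools and dataframe libraries.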

Annotation schema

The dataset annotations are provided in MS-COCO JSON format. We also provide the annotations in CSV format for convenience. The documents contain the following fields:

  • Categories: List of conditions/pathologies
  • Images: Metadata of the original chest X-ray images. The images need to be separately downloaded from MIMIC-CXR / MIMIC-CXR-JPG projects.
  • Annotations: Each entry in the annotations field represents a bounding box with an associated sentence describing a condition/pathology. Images may have multiple associated annotations.

An example annotation in MS-COCO JSON format is shown below:

    "info": {
        "year": "2022,
        "version": "1.0.0"
        "description": "MS-CXR Locally Aligned Phrase Grounding Annotations",
        "contributor": "Microsoft",
        "date_created": "2022-04-21",
        "url": ""
    "licenses": [
            "url": "",
            "id": 1,
            "name": "PhysioNet Credentialed Health Data License 1.5.0"
    "categories": [
            "id": 0,
            "name": "Pneumothorax",
            "supercategory": "disease"
    "images": [
            "id": 16,
            "file_name": "c436cddb-4126f15e-59c0733c-34b5a4b5-bbda7ffd.jpg",
            "width": 2539,
            "height": 3050,
            "num_annotations": 3,
            "path": "/datasetdrive/MIMIC-CXR-V2/",
    "annotations": [
            "id": 18,
            "image_id": 16,

Patient Demographics

The average age of subjects in MS-CXR is higher than the average across all subjects in MIMIC-CXR. This is concordant with prior work [10] and is explained by the fact that we do not sample studies from healthy subjects, who display no anomalous findings and are statistically likely to be younger. Similarly, we do not expect our sampling to introduce gender bias, as none of the pathologies we sample is gender-specific. Overall, MS-CXR does not deviate far from the MIMIC-CXR distribution.

Distribution of the annotation pairs (image bounding box and sentence) across different clinical findings. The demographic statistics (e.g., gender, age) of the subjects are collected from the MIMIC-IV dataset, for MS-CXR and for all of MIMIC-CXR.

Findings                      Gender - F (%)     Avg Age (std)
Atelectasis                   28 (45.90%)        64.52 (15.95)
Cardiomegaly                  135 (47.87%)       68.10 (14.81)
Consolidation                 40 (36.70%)        60.08 (17.67)
Edema                         18 (42.86%)        68.79 (14.04)
Lung opacity                  33 (40.24%)        62.07 (17.20)
Pleural effusion              41 (43.16%)        66.36 (15.29)
Pneumonia                     65 (44.52%)        64.32 (17.17)
Pneumothorax                  66 (43.71%)        60.71 (18.04)
All MS-CXR                    382 (44.89%)       64.37 (16.61)
Background (all MIMIC-CXR)    34134 (52.39%)     56.85 (19.47)

Usage Notes

We are releasing the MS-CXR dataset to encourage reproducible evaluation of joint latent semantics learnt by biomedical image-text models. Accurate local alignment between these two modalities is an essential characteristic of successful joint image-text training in healthcare since image and report samples often contain multiple clinical findings. In an associated paper, we provide comprehensive evaluations of current state-of-the-art multi-modal models and a promising approach to improve the models further. 

The dataset annotations are provided in MS-COCO format. Any library/API (e.g. cocoapi) supporting the MS-COCO format can be used to load the annotations. The annotations are also provided in CSV format for convenience.
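
For local evaluation against these annotations, a common score between a predicted region and a ground-truth box is intersection over union (IoU). The following is a generic sketch assuming the standard COCO [x, y, width, height] box convention, not necessarily the exact metric reported in the accompanying paper:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in COCO [x, y, width, height] format."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah   # convert to corner coordinates
    bx2, by2 = bx1 + bw, by1 + bh
    # Intersection rectangle; zero area if the boxes do not overlap.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

print(iou([0, 0, 100, 100], [50, 0, 100, 100]))  # prints: 0.3333333333333333
```

When a phrase has several ground-truth boxes, one option is to score a prediction against the union of the boxes rather than each box individually; which aggregation is appropriate depends on the evaluation protocol being reproduced.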


MS-CXR is a research artifact of the corresponding work, Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing. In this capacity, MS-CXR facilitates reproducibility and serves as an addition to the benchmarking landscape. The dataset is released with instances chosen from the public MIMIC-CXR v2 image–text dataset; as such, the ethical considerations of that project should be taken into consideration in addition to those provided below.

MS-CXR contains a large number of samples covering 8 findings, balanced to ensure coverage of all findings and curated to ensure gold-standard evaluation of phrase grounding. To ensure a high-quality, consistent benchmark, phrase–image samples that do not adhere to the guidelines (detailed in the corresponding work) are filtered out, including phrases containing multiple abnormalities in distinct lung regions.

In concordance with existing research [10], the application of filters results in a dataset that is both slightly older (average age 64.37 vs 56.85 in all MIMIC-CXR v2) and slightly less female (percentage female 44.89% vs 52.39% in all MIMIC-CXR). While these are relatively small shifts and the primary intention of this dataset is to facilitate reproducibility as a benchmark, we have disclosed this both alongside the dataset and in the corresponding work. 


The authors would also like to thank Hannah Murfet for the guidance offered as part of the compliance review of the datasets used in this study, and Dr Maria Wetscherek and Dr Matthew Lungren for their clinical input and data annotations provided to this study. 

Lastly, the released MS-CXR dataset has been built upon the following public data and benchmarks, and the authors would like to thank the contributors of these datasets: 

Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Ricardo Bigolin Lanfredi, Mingyuan Zhang, William F Auffermann, Jessica Chan, Phuong-Anh T Duong, Vivek Srikumar, Trafton Drew, Joyce D Schroeder, and Tolga Tasdizen. REFLACX, a dataset of reports and eye-tracking data for localization of abnormalities in chest x-rays. arXiv preprint arXiv:2109.14187, 2021.
  2. Ha Q Nguyen, Khanh Lam, Linh T Le, Hieu H Pham, Dat Q Tran, Dung B Nguyen, Dung D Le, Chi M Pham, Hang TT Tong, Diep H Dinh, et al. VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations. arXiv preprint arXiv:2012.15029, 2020.
  3. George Shih, Carol C Wu, Safwan S Halabi, Marc D Kohli, Luciano M Prevedello, Tessa S Cook, Arjun Sharma, Judith K Amorosa, Veronica Arteaga, Maya Galperin-Aizenberg, et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology: Artificial Intelligence, 1(1):e180041, 2019.
  4. Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. ChestX-Ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2097–2106. IEEE Computer Society, 2017.
  5. L.K. Tam, X. Wang, E. Turkbey, K. Lu, Y. Wen, and D. Xu. Weakly supervised one-stage vision and language disease detection using large scale pneumonia and pneumothorax studies. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2020, March 2020.
  6. Joy T Wu, Nkechinyere Nneka Agu, Ismini Lourentzou, Arjun Sharma, Joseph Alexander Paguio, Jasper Seth Yao, Edward Christopher Dee, William G Mitchell, Satyananda Kashyap, Andrea Giovannini, et al. Chest imagenome dataset for clinical reasoning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  7. Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23): e215–e220, 2000.
  8. Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2019). MIMIC-CXR Database (version 2.0.0). PhysioNet.
  9. Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y Ng, and Matthew Lungren. Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1500–1519. Association for Computational Linguistics, 2020.
  10. Weber GM, Adams WG, Bernstam EV, Bickel JP, Fox KP, Marsolo K, Raghavan VA, Turchin A, Zhou X, Murphy SN, Mandl KD. Biases introduced by filtering electronic health records for patients with "complete data". J Am Med Inform Assoc. 2017 Nov 1;24(6):1134-1141. doi: 10.1093/jamia/ocx071. PMID: 29016972; PMCID: PMC6080680.

Parent Projects
MS-CXR: Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing was derived from the following parent projects; please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
