Database Credentialed Access

Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual Question Answering on Chest X-Ray Images

Xinyue Hu Lin Gu Qiyuan An Mengliang Zhang Liangchen Liu Kazuma Kobayashi Tatsuya Harada Ronald Summers Yingying Zhu

Published: Sept. 15, 2023. Version: 1.0.0

When using this resource, please cite:
Hu, X., Gu, L., An, Q., Zhang, M., Liu, L., Kobayashi, K., Harada, T., Summers, R., & Zhu, Y. (2023). Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual Question Answering on Chest X-Ray Images (version 1.0.0). PhysioNet.

Additionally, please cite the original publication:

Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, Ronald M. Summers, and Yingying Zhu. 2023. Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23). Association for Computing Machinery, New York, NY, USA, 4156–4165.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


The task of Difference Visual Question Answering involves answering questions about the differences between a pair of main and reference images. This process mirrors radiologists' diagnostic practice of comparing the current image with a reference before concluding the report. We have assembled a new dataset, called Medical-Diff-VQA, for this purpose. Unlike previous medical VQA datasets, ours is the first designed specifically for the Difference Visual Question Answering task, with questions crafted to suit the Assessment-Diagnosis-Intervention-Evaluation treatment procedure employed by medical professionals. The Medical-Diff-VQA dataset, a derivative of the MIMIC-CXR dataset, consists of questions in seven categories: abnormality (145,421), location (84,193), type (27,478), level (67,296), view (56,265), presence (155,726), and difference (164,324). The 'difference' questions are specifically for comparing two images. In total, the Medical-Diff-VQA dataset contains 700,703 question-answer pairs derived from 164,324 pairs of main and reference images.


The medical informatics community has been working to feed data-hungry deep learning algorithms by fully exploiting hospital databases of invaluable, loosely labeled imaging data. Among diverse attempts, chest X-ray datasets such as MIMIC-CXR [1], NIH14 [2], and CheXpert [3] have received particular attention. In this arduous effort on the vision-language (VL) modality, the community either mines per-image common disease labels through Natural Language Processing (NLP) or works on generating reports directly from images. Despite significant progress on these tasks, the heterogeneity, systemic biases, and subjective nature of reports still pose many technical challenges. For example, automatically mined labels are problematic because rule-based extraction approaches do not handle all uncertainties and negations well [4]. Training an automatic radiology report generation system to match the report appears to avoid the inevitable bias in the standard NLP-mined thoracic pathology labels. However, radiologists tend to write impressions with abstract logic: a radiology report often excludes many diseases (either commonly diagnosed or specifically queried by the physician) using negation expressions, e.g., "no", "free of", "without". An artificial report generator can hardly guess which diseases the radiologist intended to exclude.

Instead of exhaustively generating all descriptions, Visual Question Answering (VQA) is more plausible, as it only answers the specific question asked. However, the questions in the existing VQA dataset ImageCLEF-VQA-Med [5] concentrate on a few general ones, such as "Is there something wrong in the image?" and "What is the primary abnormality in this image?", lacking the specificity to capture the heterogeneity and subjective nature of reports. This not only degrades VQA into classification but, more unexpectedly, also provides little helpful information for clinics. While VQA-RAD [6] has more heterogeneous questions covering 11 question types, its dataset of 315 images is relatively small.

To bridge the aforementioned gap in visual language models, we propose a novel medical image difference VQA task that is more consistent with radiologists' practice. When making diagnoses, radiologists compare current and previous images of the same patient to check the progress of a disease. Actual clinical practice follows a patient treatment process (assessment - diagnosis - intervention - evaluation). A baseline medical image is used as an assessment tool to diagnose a clinical problem, usually followed by therapeutic intervention. Then, a follow-up medical image is taken to evaluate the effectiveness of the intervention in comparison with the past baseline. In this framework, every medical image has its purpose of clarifying the doctor's clinical hypothesis, depending on the unique clinical course (e.g., whether the pneumothorax is mitigated after therapeutic intervention). However, existing methods cannot provide a straightforward answer to the clinical hypothesis since they do not compare past and present images. Therefore, we present a chest X-ray image difference Visual Question Answering (VQA) dataset, Medical-Diff-VQA, to fulfill the needs of the medical image difference task. This enables the development of a diagnostic support system that realizes the inherently interactive nature of radiology reports in clinical practice.


We constructed the Medical-Diff-VQA dataset from free-text reports in the source database MIMIC-CXR. To ensure the availability of a second image for differential comparison, we excluded patients with only one radiology visit before constructing our dataset. The overall process of dataset construction involves three steps: collecting keywords, building the Intermediate KeyInfo dataset, and generating questions and answers.

Collecting keywords

Under the guidance of professional clinicians, we have collected a list of common abnormality keywords and their corresponding attribute keywords, such as location, level, and type. These keywords have been compiled into library files for use in the next step of building the intermediate KeyInfo dataset. Please refer to the data description section for the details of the keywords.

Specifically, we utilized ScispaCy, a SpaCy model for biomedical text processing, to extract entities from randomly sampled reports. Then, we manually reviewed all the extracted entities to identify common, frequently occurring keywords that align with radiologists' interests and added them to our lists of abnormality keywords and attribute keywords. We also considered different variants of the same abnormality during this process.

Intermediate KeyInfo dataset

We followed an Extract-Check-Fix cycle to construct the intermediate KeyInfo dataset and ensure the quality of our dataset through extensive manual verification.


To begin, we employed regular expressions to identify abnormality/disease keywords in the free-text reports for each patient visit. These keywords were used as anchor words to divide the sentences, and the neighboring text snippets were searched for the corresponding attribute keywords. Due to the unique characteristics of each attribute, certain attribute keywords need to be found before the anchor words (such as "large" or "right"), while others need to be found after the anchor words (such as "in the lower lobe"). We devised specific rules for these attribute keywords accordingly. Ultimately, we established the relationship between the abnormality/disease keywords and their corresponding attribute keywords.
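The anchor-word idea above can be sketched in a few lines of Python. This is a minimal illustration, not the actual extraction code: the keyword lists, the pre/post search rules, and the output fields are simplified stand-ins for the library files and rules described in this section.

```python
import re

# Hypothetical keyword lists; the real ones live in the libs/ directory.
ABNORMALITIES = ["pleural effusion", "pneumothorax", "atelectasis"]
PRE_ATTRIBUTES = ["small", "large", "left", "right"]        # searched before the anchor
POST_LOCATIONS = ["at the left base", "in the lower lobe"]  # searched after the anchor

def extract_findings(sentence):
    """Find abnormality anchor words, then search their neighborhoods for attributes."""
    findings = []
    for anchor in ABNORMALITIES:
        m = re.search(r"\b" + re.escape(anchor) + r"\b", sentence, re.IGNORECASE)
        if not m:
            continue
        before, after = sentence[:m.start()], sentence[m.end():]
        finding = {"entity_name": anchor, "level": None,
                   "location": None, "post_location": None}
        # Some attributes appear before the anchor ("large", "right")...
        for attr in PRE_ATTRIBUTES:
            if re.search(r"\b" + re.escape(attr) + r"\b", before, re.IGNORECASE):
                key = "level" if attr in ("small", "large") else "location"
                finding[key] = attr
        # ...while others appear after it ("in the lower lobe").
        for loc in POST_LOCATIONS:
            if loc in after.lower():
                finding["post_location"] = loc
        findings.append(finding)
    return findings

print(extract_findings("There is a small left pleural effusion at the left base."))
```

Real reports need many more rules (negation, uncertainty, sentence splitting), which is exactly what the Extract-Check-Fix cycle below iterates on.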


To guarantee the accuracy and completeness of the extracted information, we carried out both manual and automated checks, using tools such as Part-of-Speech tagging and entity detection with ScispaCy, with the MIMIC-CXR-JPG labels as a reference. These checks identified missing or potentially incorrect extractions, and we refined the rules accordingly.


Finally, we addressed errors and repeated the Extract-Check-Fix cycle until minimal errors were detected.

As a result, we created the KeyInfo dataset, consisting of individual study details. For each study, the KeyInfo dataset includes information on all positive findings, their attributes, and all negative findings.

Study pairing and question generation

After constructing the KeyInfo dataset, we were able to obtain all the information necessary to generate questions based on the clinicians' interests. We generated seven types of questions: abnormality, location, type, level, view, presence, and difference, as shown in the table below:

Question type Example
Abnormality what abnormalities are seen in the image?
  what abnormalities are seen in the <location>?
  is there any evidence of any abnormalities?
  is this image normal?
Presence is there any evidence of <abnormality>?
  is there <abnormality>?
  is there <abnormality> in the <location>?
View which view is this image taken?
  is this PA view?
  is this AP view?
Location where in the image is the <abnormality> located?
  where is the <abnormality>?
  is the <abnormality> located on the left side or right side?
  is the <abnormality> in the <location>?
Level what level is the <abnormality>?
Type what type is the <abnormality>?
Difference what has changed compared to the reference image?
  what has changed in the <location> area?
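Template-based generation of this kind can be sketched as follows. The template dictionary below is an illustrative subset of the table above, not the full rule set used to build the dataset.

```python
import random

# Hypothetical subset of the question templates listed in the table above.
TEMPLATES = {
    "presence": ["is there <abnormality>?",
                 "is there any evidence of <abnormality>?"],
    "location": ["where is the <abnormality>?",
                 "where in the image is the <abnormality> located?"],
}

def make_question(question_type, abnormality, rng=random):
    """Pick a template for the question type and fill in the placeholder."""
    template = rng.choice(TEMPLATES[question_type])
    return template.replace("<abnormality>", abnormality)

print(make_question("presence", "pleural effusion"))
```

Because a template is picked at random, repeated runs can phrase the same underlying question differently, which is why the released CSV fixes one generated version (see Usage Notes).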

Each image pair consists of a main (current) image and a reference (past) image, extracted from different studies of the same patient. The earlier visit always serves as the reference and the later visit as the main image. Among the seven question types, the first six apply to the main image only, while the "difference" questions apply to both images.
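The pairing rule can be sketched as below. The `study_date` values are invented, and pairing every pair of consecutive visits is an illustrative simplification of the actual selection procedure.

```python
# Sketch: order a patient's studies chronologically, then take the earlier
# visit of each consecutive pair as the reference and the later as the main.
studies = [
    {"study_id": "s3", "study_date": "2180-05-01"},
    {"study_id": "s1", "study_date": "2179-11-20"},
    {"study_id": "s2", "study_date": "2180-01-15"},
]

def make_pairs(studies):
    ordered = sorted(studies, key=lambda s: s["study_date"])
    return [(later["study_id"], earlier["study_id"])  # (main, reference)
            for earlier, later in zip(ordered, ordered[1:])]

print(make_pairs(studies))  # [('s2', 's1'), ('s3', 's2')]
```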

To further verify the reliability of our constructed dataset, we had three human verifiers evaluate 1,700 randomly sampled question-answer pairs along with the associated reports. Each verifier annotated each sample as "correct" or "incorrect". The evaluation resulted in an overall correctness rate of 97.4%, which is considered acceptable for training neural networks. The table below shows the evaluation results for each verifier. These results demonstrate that constructing the dataset through an Extract-Check-Fix cycle is effective in minimizing mistakes.

Validation results by human verifiers
Verifier # examples # correct Correctness rate
Verifier 1 500 475 95.0%
Verifier 2 1000 989 98.9%
Verifier 3 200 193 96.5%
Total 1700 1657 97.4%

Data Description

The Medical-Diff-VQA dataset comprises 700,703 question-answer pairs, categorized into seven types: 145,421 for abnormality, 84,193 for location, 27,478 for type, 67,296 for level, 56,265 for view, 155,726 for presence, and 164,324 for difference. These pairs were extracted from a total of 164,324 study pairs.


The ‘libs’ directory contains seven library CSV files that are utilized for extracting the intermediate KeyInfo dataset. By modifying the library keywords, the final dataset can be customized to focus on different abnormality, type, location, and level keywords. These files are listed below:

  • disease_lib.csv: library of disease/abnormality keywords. It defines the disease/abnormality keywords to be extracted for the intermediate KeyInfo dataset. The columns include the following:
    • id: Integer assigned in sequence
    • report_name: Possible disease names that appear in the reports. Variants with the same meanings are separated by a ‘;’.
    • official_name: The standardized names that were assigned to the variants.
    • location: The anatomical structure where the disease could appear.
  • location_lib.csv: library of location keywords. It defines the location keywords to be extracted for the intermediate KeyInfo dataset.
  • postlocation_lib.csv: library of location keywords that appear after the anchor word, which is the target disease keyword.
  • type_lib.csv: library of type keywords. It defines the type keywords to be extracted for the intermediate KeyInfo dataset.
  • level_lib.csv: library of level keywords. It defines the level keywords to be extracted for the intermediate KeyInfo dataset.
  • parts_of_speech.csv: library of parts of speech. It is used for reference only.
  • position_change.csv: A table used to standardize keywords. It includes two columns: "from" and "to". The "from" column lists the original word, and the "to" column lists the standardized word.
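Loading a library file such as disease_lib.csv mostly means mapping every report-name variant to its standardized official name. A minimal sketch, assuming the column layout described above; the two sample rows are invented for illustration:

```python
import csv
import io

# Two illustrative rows in the disease_lib.csv layout described above;
# the real file ships in the libs/ directory.
SAMPLE = """id,report_name,official_name,location
1,effusion;pleural effusion,pleural effusion,pleural
2,ptx;pneumothorax,pneumothorax,lung
"""

def load_disease_lib(fileobj):
    """Map each ';'-separated report_name variant to its official_name."""
    variant_to_official = {}
    for row in csv.DictReader(fileobj):
        for variant in row["report_name"].split(";"):
            variant_to_official[variant.strip()] = row["official_name"]
    return variant_to_official

lib = load_disease_lib(io.StringIO(SAMPLE))
print(lib["ptx"])  # pneumothorax
```

Editing the library rows (adding variants or new diseases) is how the extraction can be customized, as noted above.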


We provide three data files, namely mimic_pair_questions.csv, mimic_all.csv, and all_diseases.json.


mimic_all.csv

This file includes the metadata from MIMIC-CXR and labels from MIMIC-CXR-JPG. The labels are for reference purposes only. The columns are:

  • subject_id: subject id in MIMIC-CXR
  • study_id: study_id in MIMIC-CXR
  • labels: the labels extracted from MIMIC-CXR-JPG, provided for reference purposes only:
    • Atelectasis
    • Cardiomegaly
    • Consolidation
    • Edema
    • Enlarged Cardiomediastinum
    • Fracture
    • Lung Lesion
    • Lung Opacity
    • Pleural Effusion
    • Pneumonia
    • Pneumothorax
    • Pleural Other
    • Support Devices
    • No Finding
  • dicom_id: dicom_id in MIMIC-CXR
  • view: ‘PA’ or ‘AP’ view of the image
  • split: train/val/test split used in MIMIC-CXR-JPG. It is only for reference purposes.
  • study_date: StudyDate in MIMIC-CXR-JPG
  • study_order: an integer that indicates the order number of a patient's entire visit history. Each patient may have multiple visits, but we select only two visits for pair comparison in order to create the final difference VQA dataset.


all_diseases.json

This is the intermediate KeyInfo dataset, which comprises key information such as abnormalities, locations, types, and levels needed for the extraction of the final difference VQA dataset. For "view" questions, the information is retrieved directly from mimic_all.csv. For "difference" questions, answers are acquired by comparing the KeyInfo of two studies (visits). The fields include the following:

  • study_id: study id in MIMIC-CXR
  • subject_id: subject id in MIMIC-CXR
  • entity: the identified positive findings, keyed by disease name. Each disease name maps to an entry with the following fields:
    • entity_name: same as the disease name
    • Location: the word indicating the location of the disease
    • Type: the word describing the type or category of the disease
    • Level: the word indicating the level or severity of the disease
    • post_location: the word indicating the location of the disease that appears after the disease name
    • Location2: same as "Location"; provided as a backup in case there are multiple locations
    • Type2: same as "Type"; provided as a backup in case there are multiple types or categories
    • Level2: same as "Level"; provided as a backup in case there are multiple levels or severities
    • post_location2: same as "post_location"; provided as a backup in case there are multiple location words appearing after the disease name
  • no_entity: a list containing the names of all diseases identified as not present.
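A record in this layout might be traversed as sketched below. The field names follow the list above, but the concrete values and the top-level list structure are assumptions for illustration only.

```python
import io
import json

# One hypothetical study record in the all_diseases.json layout described
# above; all values are invented for illustration.
SAMPLE = json.dumps([{
    "study_id": "50000000",
    "subject_id": "10000000",
    "entity": {
        "pleural effusion": {
            "entity_name": "pleural effusion",
            "Location": "left", "Type": "", "Level": "small",
            "post_location": "", "Location2": "", "Type2": "",
            "Level2": "", "post_location2": ""
        }
    },
    "no_entity": ["pneumothorax", "consolidation"]
}])

records = json.load(io.StringIO(SAMPLE))
for study in records:
    # Positive findings with their attributes, then explicitly absent diseases.
    for name, attrs in study["entity"].items():
        print(name, attrs["Level"], attrs["Location"])
    print("absent:", study["no_entity"])
```

Comparing the `entity`/`no_entity` sets of two such records is what produces the answers to "difference" questions.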


mimic_pair_questions.csv

This file contains the final generated question-answer pairs of the difference VQA dataset. The columns include the following:

  • study_id: main study_id in MIMIC-CXR.
  • subject_id: subject id in MIMIC-CXR.
  • ref_id: reference study_id in MIMIC-CXR.
  • question_type: abnormality/location/level/view/type/presence/difference.
  • question
  • answer
  • split: train/val/test split.
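Reading the question file and selecting one split can be done with the standard library alone. The two sample rows below are invented, matching the column list above:

```python
import csv
import io

# Two illustrative rows in the mimic_pair_questions.csv column layout above.
SAMPLE = """study_id,subject_id,ref_id,question_type,question,answer,split
s2,p1,s1,difference,what has changed compared to the reference image?,the pleural effusion has improved.,train
s3,p1,s2,presence,is there pneumothorax?,no,test
"""

def load_split(fileobj, split):
    """Return only the question-answer rows belonging to the given split."""
    return [row for row in csv.DictReader(fileobj) if row["split"] == split]

train = load_split(io.StringIO(SAMPLE), "train")
print(len(train), train[0]["question_type"])  # 1 difference
```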

Folder structure

Here, we present the structure of all the files that have been uploaded.

├── libs/
│   ├── parts_of_speech.csv
│   ├── disease_lib.csv
│   ├── level_lib.csv
│   ├── location_lib.csv
│   ├── position_change.csv
│   ├── postlocation_lib.csv
│   └── type_lib.csv
├── mimic_pair_questions.csv
├── mimic_all.csv
└── all_diseases.json

Usage Notes

Please refer to [7] for the code used to generate the Medical-Diff-VQA dataset. We have provided a step-by-step guide for generating the Medical-Diff-VQA dataset. To prepare the dataset, you will need the MIMIC-CXR and MIMIC-CXR-JPG datasets. The MIMIC-CXR dataset is used for extracting KeyInfo from the reports, while only the metadata in the MIMIC-CXR-JPG dataset is needed.

Additionally, it is worth noting that for each study pair, we performed question extraction for each question type once. The selected question template and the target information being asked are chosen randomly and can vary each time the code is run. Therefore, we provide here the exact extracted version of the dataset that we used.

Please note that despite our efforts to ensure accuracy, the dataset still contains errors in scenarios involving uncommon expressions for particular abnormalities, locations, levels, types, and negations that cannot be accurately extracted. Moreover, the answers are generated solely from the intermediate KeyInfo dataset, so its content determines the richness of the questions and answers in the final difference VQA dataset.


The dataset is derived from the MIMIC-CXR database, which is a de-identified dataset that we have been granted access to via the PhysioNet Credentialed Health Data Use Agreement (v1.5.0).


This research received support from the JST Moonshot R&D Grant Number JPMJMS2011 and the Japan Society for the Promotion of Science Grant Number 22K07681. Additionally, it was partially supported by the Intramural Research Program of the National Institutes of Health Clinical Center.

Conflicts of Interest

The authors declare no conflicts of interest.


  1. Johnson, A. E., Pollard, T. J., Berkowitz, S. J., Greenbaum, N. R., Lungren, M. P., Deng, C. Y., ... & Horng, S. (2019). MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1), 317.
  2. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., & Summers, R. M. (2017). Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2097-2106).
  3. Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., ... & Ng, A. Y. (2019, July). Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, No. 01, pp. 590-597).
  4. Johnson, A. E., Pollard, T. J., Greenbaum, N. R., Lungren, M. P., Deng, C. Y., Peng, Y., ... & Horng, S. (2019). MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042.
  5. Ben Abacha, A., Sarrouti, M., Demner-Fushman, D., Hasan, S. A., & Müller, H. (2021). Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. In Proceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-working notes. 21-24 September 2021.
  6. Lau, J. J., Gayen, S., Ben Abacha, A., & Demner-Fushman, D. (2018). A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1), 1-10.
  7. Hu, X. (n.d.). MIMIC-Diff-VQA. GitHub. [Accessed 8/23/2023]

Parent Projects
Medical-Diff-VQA: A Large-Scale Medical Dataset for Difference Visual Question Answering on Chest X-Ray Images was derived from MIMIC-CXR. Please cite it when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
