Database Credentialed Access

RaDialog Instruct Dataset

Chantal Pellegrini, Ege Özsoy, Benjamin Busam, Nassir Navab, Matthias Keicher

Published: March 25, 2024. Version: 1.0.0


When using this resource, please cite:
Pellegrini, C., Özsoy, E., Busam, B., Navab, N., & Keicher, M. (2024). RaDialog Instruct Dataset (version 1.0.0). PhysioNet. https://doi.org/10.13026/zecj-bh52.

Additionally, please cite the original publication:

Chantal Pellegrini, Ege Özsoy, Benjamin Busam, Nassir Navab, & Matthias Keicher. (2023). RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance. arXiv preprint arXiv:2311.18681.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Conversational AI tools that can generate and discuss clinically correct radiology reports for a given medical image have the potential to transform radiology. Such a human-in-the-loop radiology assistant could facilitate a collaborative diagnostic process, thus saving time and improving the quality of reports. Towards this goal, we introduce RaDialog, the first thoroughly evaluated and publicly available large vision-language model for radiology report generation and interactive dialog. To keep the conversational abilities of the underlying LLM, we propose a comprehensive, semi-automatically labeled, image-grounded instruct dataset for chest X-ray radiology tasks. The dataset includes a variety of tasks, such as report correction, summarization or finding prediction. By training with this dataset, our method achieves state-of-the-art clinical correctness in report generation and shows impressive abilities in interactive tasks such as correcting reports and answering questions, serving as a foundational step toward clinical dialog systems.


Background

Radiology plays a key role in clinical decision-making, with radiology reports acting as the major way of communication between radiologists and other clinicians [1]. Within radiology, chest X-rays are the most frequent imaging exam and are crucial for diagnosing thoracic diseases [2]. However, writing accurate and concise chest X-ray reports is time-intensive and demands significant expertise, while the daily amount of images to be examined is rising [3]. In this context, automated report generation is a potential solution to reduce radiologists' workload and support fast and accurate diagnostic decision-making [4]. Further, with the rise of conversational chatbots, there lies an unexplored potential beyond mere report generation: interactive conversational assistance. Such interactivity could revolutionize the radiology workflow, enabling a collaborative diagnostic process between expert radiologists and AI-based tools.

Large Vision-Language models (VLLMs) aim to pair powerful large language models with image information, building a bridge between the visual and the textual domain [5-8]. As medical imaging forms a core part of diagnosis and treatment, the potential for VLLMs in radiology is immense. However, applying such models to medical images poses unique challenges due to the domain shift from natural images. While some very recent works propose medical VLLMs [9-12], they are either private models trained with proprietary data or focus on general medical visual question answering (VQA). 

At the same time, while state-of-the-art radiology report generation methods perform well in generating coherent reports, their factual correctness is limited, and no conversational assistance is possible [13-18]. We hypothesize that LLM-based interactive dialog systems can improve factual correctness in report generation and enhance the radiology workflow through quick clarifications, report refinements, collaborative insights for complex cases, and reduced mental load for routine tasks. Moreover, such a model could also be used for more general tasks, such as asking knowledge questions or explaining a report to a patient with limited medical knowledge.

In order to build and train such an interactive and domain-specific VLLM, we need a versatile dataset focusing not only on radiology report generation but also on conversational downstream tasks. To this end, together with our RaDialog model, we propose an X-ray-specific instruct dataset.

RaDialog integrates both image features and structured pathology findings with an LLM, significantly improving the clinical correctness of generated reports over previous methods. Furthermore, our model can provide interactive assistance and human-AI collaboration, which we demonstrate on a wide range of downstream tasks. We achieve this through parameter-efficient fine-tuning on our proposed instruct dataset. The diverse dataset allows us to retain the general capabilities of LLMs while learning radiology-specific knowledge and style.


Methods

Overview of RaDialog model

We propose to adapt a large language model for interactive radiology report generation and conversational assistance using the proposed instruct dataset. Our model architecture consists of several components:

Image Encoder

Given a chest X-ray image as input, we first extract patch-wise image embeddings using BioViL-T, a pre-trained domain-specific X-ray encoder. BioViL-T is pre-trained using contrastive language-image learning on chest X-rays paired with radiology reports, making it a useful foundation model for understanding X-ray images.

Alignment module

The patch-based features are passed to an alignment module, transforming them into 32 embedded language model tokens. Inspired by BLIP-2 [6], we use a BERT [24] model as an alignment module to get text-aligned image features.
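The following minimal sketch illustrates this idea with learned query tokens and a single cross-attention layer in place of the full BERT-based module; the module name and dimensions are illustrative assumptions, not the exact RaDialog implementation:

import torch
import torch.nn as nn

class AlignmentSketch(nn.Module):
    """Maps patch-wise image embeddings to 32 tokens in the LLM embedding space."""
    def __init__(self, patch_dim=768, llm_dim=4096, num_queries=32):
        super().__init__()
        # 32 learned query tokens, analogous to the queries of a BLIP-2-style module
        self.queries = nn.Parameter(torch.randn(num_queries, patch_dim))
        self.cross_attn = nn.MultiheadAttention(patch_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(patch_dim, llm_dim)  # project into the LLM embedding space

    def forward(self, patch_feats):  # patch_feats: (batch, num_patches, patch_dim)
        queries = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        tokens, _ = self.cross_attn(queries, patch_feats, patch_feats)
        return self.proj(tokens)  # (batch, 32, llm_dim), spliced into the prompt as image tokens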

CheXpert Classifier

Unlike the visual feature encoder, our CheXpert Classifier is specifically designed to provide structured findings for the medical image, supporting our model's clinical accuracy. Concretely, the classifier solves a multi-label classification task on the chest X-ray input image, where each class corresponds to one pathology. We train this model separately, using CheXbert [21] labels predicted on the findings section of the MIMIC-CXR reports as ground truth.
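A minimal sketch of this multi-label setup is shown below; the backbone choice and helper functions are illustrative assumptions rather than the exact classifier used in RaDialog:

import torch
import torch.nn as nn
import torchvision

NUM_FINDINGS = 14  # CheXpert defines 14 finding classes

# Illustrative backbone; any image classifier producing one logit per pathology works here.
model = torchvision.models.resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_FINDINGS)

criterion = nn.BCEWithLogitsLoss()  # multi-label: an independent sigmoid per pathology

def training_step(images, chexbert_labels):
    # chexbert_labels: (batch, 14) binary targets extracted by CheXbert from the findings section
    logits = model(images)
    return criterion(logits, chexbert_labels.float())

def predict_findings(images, class_names, threshold=0.5):
    # Returns the predicted pathology names per image, later inserted into the LLM prompt.
    probs = torch.sigmoid(model(images))
    return [[c for c, p in zip(class_names, row) if p > threshold] for row in probs]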

LLM

We utilize an LLM to process the prompt and produce an instruction-specific response. As the training data of generalist LLMs usually contains only limited medical information, we fine-tune our language model on radiology reports as well as instructions, improving its medical knowledge and aligning its writing style with that of radiologists. Furthermore, this fine-tuning teaches it to work with image features and structured finding labels. We use the vicuna-7b model, initialize it with pre-trained weights, and fine-tune it on our instruct dataset, which is described in the following section.
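As a rough illustration of the parameter-efficient fine-tuning mentioned above, the following sketch assumes LoRA adapters via the Hugging Face peft library; the checkpoint name, rank, and target modules are illustrative assumptions, not the published training configuration:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "lmsys/vicuna-7b-v1.5"  # illustrative vicuna-7b checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,                                 # illustrative adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections of the LLaMA-style model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small adapter weights are updated during fine-tuning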

Overview of the RaDialog dataset creation

To ensure our model is capable of many diverse downstream tasks and retains general conversational abilities, we design a new instruct dataset consisting of eight tasks: report generation and the seven instruct tasks described below. For each of the seven instruct tasks, we formulate ten different prompts, from which we sample randomly when generating training examples. These ten prompts were defined manually and can be found in our GitHub repository [22] under "data/instruct_prompts". Examples of the prompts used are given in the Data Description.
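A minimal sketch of this sampling step, assuming each task's ten prompts are stored one per line in a text file (the file naming is an illustrative assumption):

import random
from pathlib import Path

PROMPT_DIR = Path("data/instruct_prompts")  # prompt files from the RaDialog repository

def sample_prompt(task_name: str) -> str:
    # Pick one of the ten manually defined prompts for the given instruct task.
    lines = (PROMPT_DIR / f"{task_name}.txt").read_text().splitlines()  # illustrative file name
    return random.choice([line for line in lines if line.strip()])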

We use two different ways to define the ground-truth answer for a given prompt.

1) Dataset-based

For the tasks of report generation, complete and binary CheXpert QA, and natural language explanations, we utilize existing datasets to retrieve the corresponding ground truth. For report generation and CheXpert QA, we use the MIMIC-CXR dataset, which consists of radiology reports paired with X-ray images and structured finding labels. For natural language explanations, we use the MIMIC-NLE dataset [19], which consists of question-answer pairs asking for the reasoning behind certain diagnoses.

2) LLM-based

For the remaining tasks, namely correction, summarization, easy language, and region QA, we use a non-fine-tuned vicuna-13b model, which we prompt in a zero-shot manner to generate pseudo ground-truth answers, similar to replay-based continual learning [20]. Each answer is generated from a free-text radiology report and a sampled instruction prompt that can be answered using only that report. The model is prompted with the following format:

<SYSTEM_PROMPT>
USER: Report: <RADIOLOGY_REPORT>
<INSTRUCTION_PROMPT>
ASSISTANT:

<RADIOLOGY_REPORT> is usually a ground-truth report from the MIMIC-CXR training set. For the correction task, it is a predicted report containing errors, as described in the Data Description.
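A minimal sketch of assembling this prompt for pseudo ground-truth generation; the system prompt wording and the generation call are illustrative assumptions (the original pipeline uses a non-fine-tuned vicuna-13b model):

SYSTEM_PROMPT = ("A chat between a curious user and an artificial intelligence assistant. "
                 "The assistant gives helpful answers to the user's questions.")  # illustrative wording

def build_generation_prompt(report: str, instruction: str) -> str:
    # Zero-shot prompt for the non-fine-tuned LLM that produces the pseudo ground-truth answer.
    return (f"{SYSTEM_PROMPT}\n"
            f"USER: Report: {report}\n"
            f"{instruction}\n"
            "ASSISTANT:")

# pseudo_answer = llm.generate(build_generation_prompt(gt_report, sampled_instruction))  # hypothetical call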


Data Description

Prompt Construction

The image features, structured findings, and instructions are converted into a single prompt that serves as input for the LLM. First, the 32 image tokens from the alignment module are added to the prompt as "Image Information: <IMG_FEATURES>", providing the LLM with contextual image features. Next, the structured findings from our CheXpert Classifier are introduced with "Predicted Findings: <LIST_OF_FINDINGS>". This gives the LLM a clear understanding of the image's key observations, improving clinical accuracy. The prompt concludes with an instruction, such as "Write a radiology report.", to specify the expected output. This construction ensures that the generated answer is relevant, precise, and meets user specifications.
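A minimal sketch of this prompt assembly; the placeholder handling and findings formatting are illustrative assumptions:

NUM_IMG_TOKENS = 32

def build_radialog_prompt(predicted_findings, instruction):
    # The <IMG> placeholders are later replaced by the 32 embedded image tokens
    # produced by the alignment module.
    img_placeholder = "<IMG>" * NUM_IMG_TOKENS
    findings = ", ".join(predicted_findings) if predicted_findings else "No Finding"
    return (f"Image Information: {img_placeholder}. "
            f"Predicted Findings: {findings}. "
            f"{instruction}")

print(build_radialog_prompt(["Cardiomegaly", "Pleural Effusion"], "Write a radiology report."))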

In the provided dataset, all instruct tasks (except report generation) are constructed as multi-turn conversations as follows:  

<SYSTEM_PROMPT> 
USER: <REPORT_GENERATION_PROMPT>
ASSISTANT: <MIMIC-CXR_GROUND-TRUTH_REPORT>
USER: <TASK_INSTRUCTION> 
ASSISTANT: 

As the first assistant answer to the report generation prompt, we include the ground-truth report from the MIMIC-CXR dataset instead of a predicted report, in order to reduce noise during training. With this setup, the model learns to interpret the image together with the report and to answer follow-up questions. At inference time, the first answer is also predicted by the model instead of being taken from a ground-truth report.

In the following, all tasks are detailed, and example prompts are provided. For the instruction tasks, we show the final user instructions. In the dataset, they all follow the format described above, where the user instruction follows a simulated report generation conversation. 

Tasks included in the dataset

Dataset-based tasks

  • Report Generation (RG): Produce a free-text radiology report given an X-ray image. We use the image-report pairs from the MIMIC-CXR dataset as ground truth. The instruction for report generation includes a placeholder for the embedded image features, a list of pathology labels predicted by the CheXpert Classifier module for the current X-ray image, and a natural language prompt. The natural language prompt was created and selected manually by testing different report generation prompts in a zero-shot manner on the original vicuna-7b language model. The final prompt looks as follows:
    <SYSTEM_PROMPT>
    USER: Image Information: <IMAGE_FEATURES>. Predicted Findings: <LIST_OF_FINDINGS>. You are to act as a radiologist and write the finding section of a chest x-ray radiology report for this X-ray image and the given predicted findings. Write in the style of a radiologist, write one fluent text without enumeration, be concise and do not provide explanations or reasons. 
    ASSISTANT:
    
  • Findings QA (FQA): Answer a question about the pathology labels as defined in the CheXpert dataset [23]. The ground-truth labels for this task are generated by the CheXbert labeler [21], which extracts pathology labels from the findings section of the reports. We include two variants of this task: in the first variant, the model is asked to list all findings in the image (complete mode), while in the second variant the model answers binary questions about a specific finding with a straightforward yes/no (binary mode). Example: complete: "List all the findings in this report." binary: "Is there evidence of <PATHOLOGY> in the report?"
  • Natural Language Explanation / Reasoning (RE): Clarify and explain which part of the report indicates a specific pathology. We utilize the MIMIC-NLE dataset [19] as ground truth. The questions ask for the reasoning behind a certain diagnosis, while the answers specify the part of the report that provides this reasoning. Example: Why do you think the patient has <PATHOLOGY>?

LLM-based tasks

  • Region QA (RQA): Answer a question about a specific region, such as the heart or lung, which can be binary as well as open-ended. The supervision signal is LLM-generated. Example: Is the patient’s heart healthy?
  • Easy Language (EL): Reformulate the produced report into simpler, more understandable language. The supervision signal is LLM-generated. Example: Explain this report in very easy terms, such that a child would understand.
  • Summarization (SU): Summarize the report as bullet points or a short text. The supervision signal is LLM-generated. Example: Summarize this report with bullet points.
  • Correction (CO): Correct an error in the produced report. The training prompts are generated by detecting wrongly predicted CheXpert labels on reports predicted by the non-fine-tuned LLM. This is therefore the only instruct task in which a predicted report, rather than the ground-truth report, is integrated into the conversation. The supervision signal is LLM-generated by tasking the non-fine-tuned vicuna model to predict a radiology report given only the predicted finding labels of the image; no image features are used to predict these incorrect reports, as the original vicuna model cannot handle image input. The CheXbert labeler is then applied to the predicted report, and its labels are automatically compared to the ground truth to determine which pathologies the model missed or wrongly added. This comparison is used to construct the correction instruction (see the sketch after this list). The complete prompt looks as follows:
    Example:
    <SYSTEM_PROMPT>
    USER: <REPORT_GENERATION_PROMPT> 
    ASSISTANT: <PREDICTED_REPORT_WITH_MISTAKES> 
    USER: I disagree with the generated report, I think the patient has <MISSING_PATHOLOGIES>, but does not have <ADDED_PATHOLOGIES>. Please adapt the report.
    ASSISTANT: 
    
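The following minimal sketch shows how such a correction instruction could be derived by comparing the CheXbert labels of the predicted and ground-truth reports; the label sets are assumed to come from running the CheXbert labeler on both reports, and the helper function is illustrative:

def build_correction_instruction(pred_labels: set, gt_labels: set) -> str:
    missing = sorted(gt_labels - pred_labels)  # pathologies the model missed
    added = sorted(pred_labels - gt_labels)    # pathologies the model wrongly added
    if missing and added:
        claim = f"the patient has {', '.join(missing)}, but does not have {', '.join(added)}"
    elif missing:
        claim = f"the patient has {', '.join(missing)}"
    else:
        claim = f"the patient does not have {', '.join(added)}"
    return f"I disagree with the generated report, I think {claim}. Please adapt the report."

# Example with hypothetical label sets:
print(build_correction_instruction({"Atelectasis"}, {"Atelectasis", "Pleural Effusion"}))
# -> I disagree with the generated report, I think the patient has Pleural Effusion. Please adapt the report.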

The entire dataset is constructed using only the train set of the MIMIC-CXR and MIMIC-NLE datasets and can thus safely be used to train models that will be evaluated on the test sets of these datasets or on other non-overlapping datasets.

The dataset is provided as a compressed JSON file. It can be opened as follows:

import bz2
import json

with bz2.open('mimic_cxr_instruct_stratified.json.gz', 'rt', encoding='UTF-8') as f:
    data = json.load(f)

Each element consists of the following fields (an access example follows the list):

  • "output": A natural language ground-truth answer (either LLM-generated or from existing datasets)
  • "instruction": A natural language prompt, which can contain a predicted report and a question or instruction. For the task of report generation, the prompt only contains instructions on how to write a report. This field holds the instruction for one of the tasks defined above; for all tasks, it includes <IMG> placeholders at the position where the embedded image features will be placed.
  • "dicom": The dicom_id of the corresponding X-ray image from the MIMIC-CXR-JPG dataset
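For example, assuming the loaded JSON is a list of such elements, an individual sample can be inspected as follows:

# Continuing from the loading snippet above; `data` is assumed to be a list of dicts.
sample = data[0]
print(sample["instruction"][:200])  # task prompt, includes <IMG> placeholders for the image tokens
print(sample["output"][:200])       # ground-truth answer (dataset-based or LLM-generated)
print(sample["dicom"])              # dicom_id linking the sample to MIMIC-CXR-JPG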

The provided dataset includes report generation samples for a stratified set of images with a non-empty findings section from the training set of the MIMIC-CXR-JPG dataset. Further, it includes the same number of instruct samples in order to ensure a balanced data distribution between report generation and instruction tasks during training. Therefore, for every image, only one instruct task is included. We randomly split the training samples among all instruct tasks such that each task receives the same number of samples. The samples of the different tasks are randomly mixed in the dataset.

Stratification

The findings distribution of the MIMIC-CXR test set differs significantly from that of the MIMIC-CXR train and validation sets. In particular, the fraction of healthy patients in the train and validation sets is much higher than in the test set. Therefore, we sample a more balanced subset of the training set for dataset construction. In detail, we include all samples with positive findings but reduce the number of samples without any findings to one quarter of the resulting dataset, approximately matching the test set distribution.
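A minimal sketch of this subsampling step, assuming a per-sample flag indicating whether any positive finding is present (the field name and seed are illustrative assumptions):

import random

def stratify(samples, seed=0):
    # Keep all samples with positive findings; subsample the healthy ones so that
    # they make up roughly one quarter of the resulting set.
    positives = [s for s in samples if s["has_positive_finding"]]      # illustrative field
    healthy = [s for s in samples if not s["has_positive_finding"]]
    rng = random.Random(seed)
    n_healthy = len(positives) // 3  # healthy / (healthy + positives) ≈ 1/4
    subset = positives + rng.sample(healthy, min(n_healthy, len(healthy)))
    rng.shuffle(subset)
    return subset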


Usage Notes

The RaDialog dataset was partially automatically generated and is therefore limited by the performance of the large language model (vicuna) used for generation. Further, some tasks build upon the results of the CheXbert labeler [21], which automatically extracts pathology labels from free-text radiology reports and can therefore contain errors. For these reasons, we do not claim the instruct data to be fully medically correct; rather, the main purpose of the provided dataset is to mitigate catastrophic forgetting in language models by providing a high variability of tasks that rehearse and refine the conversational abilities of the model. As such, the RaDialog dataset can serve as an important step toward conversational radiology assistance. This dataset was used to train the RaDialog model, which shows strong abilities in clinically correct report generation while retaining conversational skills, allowing it to reply to radiology-specific follow-up questions and prompts.


Release Notes

This is version 1.0.0 of the RaDialog dataset. The code for generating this dataset and the corresponding model can be found on GitHub [22]. If you have any questions, feel free to reach out to chantal.pellegrini@tum.de.


Ethics

The authors declare no ethics concerns.


Acknowledgements

The authors gratefully acknowledge the financial support by the Federal Ministry of Education and Research of Germany (BMBF) under project DIVA (13GW0469C) and the Bavarian Ministry of Economic Affairs, Regional Development and Energy (StMWi) under project ThoraXAI (DIK-2302-0002).


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Goergen, S. K., Pool, F. J., Turner, T. J., Grimm, J. E., Appleyard, M. N., Crock, C., ... & Wriedt, C. (2013). Evidence‐based guideline for the written radiology report: Methods, recommendations and implementation challenges. Journal of medical imaging and radiation oncology, 57(1), 1-7.
  2. Wang R, Chen LC, Moukheiber L, Seastedt KP, Moukheiber M, Moukheiber D, Zaiman Z, Moukheiber S, Litchman T, Trivedi H, Steinberg R. Enabling chronic obstructive pulmonary disease diagnosis through chest X-rays: A multi-site and multi-modality study. International Journal of Medical Informatics. 2023 Oct 1;178:105211.
  3. Rimmer, A. (2017). Radiologist shortage leaves patient care at risk, warns royal college. BMJ: British Medical Journal (Online), 359.
  4. Kaur, N., Mittal, A., & Singh, G. (2022). Methods for automatic generation of radiological reports of chest radiographs: a comprehensive survey. Multimedia Tools and Applications, 81(10), 13409-13439.
  5. Alayrac, J. B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., ... & Simonyan, K. (2022). Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716-23736.
  6. Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
  7. Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
  8. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning. arXiv preprint arXiv:2304.08485.
  9. Tu, T., Azizi, S., Driess, D., Schaekermann, M., Amin, M., Chang, P. C., ... & Natarajan, V. (2023). Towards generalist biomedical AI. arXiv preprint arXiv:2307.14334.
  10. Xu, S., Yang, L., Kelly, C., Sieniek, M., Kohlberger, T., Ma, M., ... & Sellergren, A. (2023). ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders. arXiv preprint arXiv:2308.01317.
  11. Moor, M., Huang, Q., Wu, S., Yasunaga, M., Zakka, C., Dalmia, Y., ... & Leskovec, J. (2023). Med-flamingo: a multimodal medical few-shot learner. arXiv preprint arXiv:2307.15189.
  12. Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., ... & Gao, J. (2023). Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890.
  13. Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory-driven transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1439–1449, 2020.
  14. Farhad Nooralahzadeh, Nicolas Perez Gonzalez, Thomas Frauenfelder, Koji Fujimoto, and Michael Krauthammer. Progressive transformer-based generation of radiology reports. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2824–2832, 2021.
  15. An Yan, Zexue He, Xing Lu, Jiang Du, Eric Chang, Amilcare Gentili, Julian McAuley, and Chun-nan Hsu. Weakly supervised contrastive learning for chest x-ray report generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4009–4015, 2021.
  16. Wang, L., Ning, M., Lu, D., Wei, D., Zheng, Y., & Chen, J. (2022, September). An inclusive task-aware framework for radiology report generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 568-577). Cham: Springer Nature Switzerland.
  17. Wang, Z., Liu, L., Wang, L., & Zhou, L. (2023). METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11558-11567).
  18. Huang, Z., Zhang, X., & Zhang, S. (2023). KiUT: Knowledge-injected U-Transformer for Radiology Report Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 19809-19818).
  19. Kayser, M., Emde, C., Camburu, O. M., Parsons, G., Papiez, B., & Lukasiewicz, T. (2022, September). Explaining chest x-ray pathologies in natural language. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 701-713). Cham: Springer Nature Switzerland.
  20. Robins, A. (1995). Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2), 123-146.
  21. Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A. Y., & Lungren, M. P. (2020). CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv preprint arXiv:2004.09167.
  22. https://github.com/ChantalMP/RaDialog
  23. Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., ... & Ng, A. Y. (2019, July). Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, No. 01, pp. 590-597).
  24. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Parent Projects
The RaDialog Instruct Dataset was derived from other projects; please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
