Model Credentialed Access
RadVLM model
Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruiperez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas Sutter, Julia Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Bluethgen, Farhad Nooralahzadeh, Michael Krauthammer
Published: Oct. 8, 2025. Version: 1.0.0
When using this resource, please cite:
Deperrois, N., Matsuo, H., Ruiperez-Campillo, S., Vandenhirtz, M., Laguna, S., Ryser, A., Fujimoto, K., Nishio, M., Sutter, T., Vogt, J., Kluckert, J., Frauenfelder, T., Bluethgen, C., Nooralahzadeh, F., & Krauthammer, M. (2025). RadVLM model (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/50kn-p490
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
We present RadVLM, a compact (7B) multitask conversational foundation model designed for chest X-ray (CXR) interpretation. Its development relies on the curation of a large-scale instruction dataset comprising over 1 million image-instruction pairs, covering both single-turn tasks (report generation, abnormality classification, and visual grounding) and multi-turn, multi-task conversational interactions. Our experiments show that RadVLM, fine-tuned on this instruction dataset, achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks (report generation, classification). Ablation studies further highlight the benefit of joint training across multiple tasks, particularly in scenarios with limited annotated data. Together, these findings underscore the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.
Background
The shortage of trained personnel for CXR interpretation has led to the exploration of automated agents to assist physicians in diagnostic tasks. In recent years, various transformer-based models have shown promise in providing preliminary drafts summarizing key observations from the CXR, offering a potential enhancement to the diagnostic workflow. However, there is a need to expand the scope of these tools beyond report generation, toward the capability to answer questions about the CXR technique, findings in a region of interest, location of specific abnormalities, and definitions of medical terms. In addition, physicians should be allowed to formulate their queries flexibly and in any order, potentially within a multi-turn conversational interaction with the assistant [1]. In this direction, models such as CheXagent [2], RaDialog [3], and MAIRA-2 [4] were developed, extending beyond report generation to tasks such as observation grounding and visual question answering, covering a larger part of the clinical workflow. However, their capacity to handle diverse and complex user queries, or to respond accurately to multiple prompts within an arbitrary conversational framework, remains limited. Adding these capabilities is critical for comprehensively supporting clinicians’ daily work.
Conversational vision-language models hold promise for improving clinical workflows by enabling interactive, context-aware CXR interpretation. Such systems can assist radiologists and trainees by verifying findings, clarifying anatomical details, and explaining medical terminology on demand, potentially reducing reporting workload and diagnostic variability. Moreover, integrating multimodal dialogue into radiology practice could enhance accessibility in low-resource settings and facilitate training through natural question-answer exchanges. Related developments in the broader medical domain, such as LLaVA-Med [5] and Med-Gemma [6], have demonstrated the potential of conversational multimodal models to improve clinical reasoning and information accessibility. RadVLM builds on these principles while focusing specifically on chest X-ray interpretation.
Model Description
In this project, we build upon state-of-the-art visual instruction-tuning techniques inspired by general-domain applications [7-8] to construct a compact, multitask conversational foundation model specialized in CXR interpretation, named RadVLM. To achieve this aim, we create comprehensive CXR datasets, each featuring diverse modalities including free-text reports, abnormality labels, and visual coordinates, and organize them into a unified instruction dataset. This dataset is composed of single-turn image-instruction pairs for different tasks and image-conversation pairs designed for more flexible and multi-turn interactions. We then fine-tune a vision-language architecture [9] on this instruction dataset, naming the resulting model RadVLM, and develop an evaluation pipeline to assess its performance across multiple tasks, systematically comparing it to state-of-the-art generalist and CXR-specific foundation models. Our results show that, despite its relatively compact size, RadVLM achieves competitive performance on individual tasks relevant to clinical practice, providing conversational capabilities within a simple and flexible interface and offering a reliable and user-friendly tool for physicians.
RadVLM is a single-encoder, single-decoder vision-language model that chats about a frontal chest X-ray. Starting from the open-weight LLaVA-OneVision-7B checkpoint [10], we fine-tuned all 7B parameters so the model can:
- draft a concise, free-text radiology report,
- list the presence or absence of 14 common abnormalities,
- return bounding-box coordinates for a requested structure or pathology,
- follow a multi-turn conversation that mixes the above skills.
The model treats every task as next-token prediction, so no task-specific heads are required; bounding boxes are emitted as the literal string "[x1, y1, x2, y2]", with values normalized to [0, 1].
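As an illustration of this output convention, the snippet below parses such a box string from a response and rescales it to pixel coordinates; the regular expression and helper name are ours and not part of the released code.

import re

def parse_box_string(text, width, height):
    """Extract the first "[x1, y1, x2, y2]" box from a model response and
    rescale it from normalized [0, 1] coordinates to pixel coordinates."""
    match = re.search(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]", text)
    if match is None:
        return None
    x1, y1, x2, y2 = (float(v) for v in match.groups())
    return [x1 * width, y1 * height, x2 * width, y2 * height]

# Example: a response such as "The right lung is located at [0.05, 0.10, 0.48, 0.85]."
print(parse_box_string("... located at [0.05, 0.10, 0.48, 0.85].", width=512, height=512))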
Inputs and Outputs
- Input:
- Image: A frontal chest X-ray (PIL Image or NumPy array).
- Text: A user prompt (free-text query about the image).
- Chat History (optional): Multi-turn interaction history.
- Output:
- Text Response: A natural language answer to the user's query.
- Bounding Boxes (if applicable): Coordinates indicating the location of anatomical structures or abnormalities.
Supporting Data
We constructed an approximately 1M-turn synthetic instruction corpus by pairing public CXR images with reports, labels, bounding-box annotations, and conversations drawn from MIMIC-CXR [11], VinDr-CXR [12], Chest Imagenome [13], MS-CXR [14], PadChest-GR [15], and CheXpert [16]. A subset of this dataset is deposited as a separate PhysioNet entity [17].
Data Preprocessing
- Views: frontal CXRs only; lateral images removed.
- Reports: taken from MIMIC-CXR and CheXpert-Plus. We keep the Findings section (or the Impression if Findings is empty). Mentions of prior studies are removed using an Azure-hosted GPT-4o workflow compliant with PhysioNet guidance.
- Labels: extracted from CheXpert and MIMIC-CXR reports via CheXbert; uncertain labels are treated as absent.
- Grounding:
- Chest Imagenome anatomy boxes; one anatomical region is sampled per image for each data point.
- VinDr-CXR abnormalities merged across annotators via Weighted Box Fusion before use.
- MS-CXR and PadChest-GR for phrase grounding (text spans ↔ boxes).
- All boxes expressed as normalized [x1, y1, x2, y2] in [0,1], emitted as text.
- Conversations: multi-turn image-dialog pairs generated by prompting a text LLM (GPT-4o) with CXR attributes (report, labels, boxes, view), producing standard and grounded dialogues that reference prior turns.
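As a concrete sketch of the grounding preprocessing described above, the snippet below merges annotator boxes with weighted box fusion and formats a normalized box as the text string the model is trained on. The ensemble-boxes package and the IoU threshold are assumptions for illustration; the project's exact tooling may differ.

# pip install ensemble-boxes (one common weighted-box-fusion implementation)
from ensemble_boxes import weighted_boxes_fusion

def merge_annotator_boxes(boxes_per_annotator, labels_per_annotator, iou_thr=0.5):
    """Merge [x1, y1, x2, y2] boxes (already normalized to [0, 1]) from several
    annotators into a single consensus set."""
    # VinDr-CXR annotations carry no confidence scores, so use 1.0 everywhere
    scores = [[1.0] * len(b) for b in boxes_per_annotator]
    boxes, _, labels = weighted_boxes_fusion(
        boxes_per_annotator, scores, labels_per_annotator, iou_thr=iou_thr
    )
    return boxes, labels

def box_to_text(box):
    """Format a normalized box as the literal string emitted by the model."""
    return "[" + ", ".join(f"{v:.2f}" for v in box) + "]"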
Model Files
- config.json: Model architecture and hyperparameters
- model-00001-of-00004.safetensors ... model-00004-of-00004.safetensors: Sharded weight files (together they form the full model weights)
- tokenizer.json, tokenizer_config.json, special_tokens_map.json, vocab.json: Tokenizer vocabulary and settings
- generation_config.json: Default text generation parameters
- preprocessor_config.json: Image preprocessing (resize, normalization, etc.)
- processor_config.json: Bundles the image and text processors for convenience
- added_tokens.json: Custom tokens added beyond the base vocabulary
- chat_template.json: Defines the dialogue formatting (user/assistant roles, system prompts)
- model.safetensors.index.json: Index file pointing to the individual sharded weight files
- README.md: Documentation
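Because these files follow the standard Hugging Face Transformers layout (config.json, processor_config.json, chat_template.json, sharded safetensors), they can presumably be loaded as sketched below with a recent transformers release that includes LLaVA-OneVision support. The released load_radvlm helper is the supported entry point; the class names here are our assumption, not taken from the submission.

import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

FOLDER_PATH_TO_MODEL = "your/local/folder/with/RadVLM/weights"

# Processor bundles the image preprocessor, tokenizer, and chat template
processor = AutoProcessor.from_pretrained(FOLDER_PATH_TO_MODEL)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    FOLDER_PATH_TO_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the accelerate package
)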
Technical Implementation
Pretraining (inherited)
We adopt the LLaVA-OneVision-7B backbone [10]: a SigLIP vision encoder [18] connected to a Qwen-2 language model [19] via a two-layer MLP projector. The base model is pretrained and instruction-tuned in the general domain. It also follows Higher AnyRes [20], encoding multiple resolution patches (plus the full image) and concatenating vision features before the LLM.
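For intuition, a two-layer MLP projector of this kind can be sketched as follows; the feature dimensions are illustrative placeholders rather than RadVLM's actual sizes.

import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features into the LLM embedding space.
    Dimensions are illustrative placeholders, not RadVLM's actual sizes."""
    def __init__(self, vision_dim=1152, llm_dim=3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):   # (batch, num_patches, vision_dim)
        return self.proj(vision_features)  # (batch, num_patches, llm_dim)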
RadVLM fine-tuning (ours)
We fine-tune the full architecture end-to-end on a CXR instruction dataset (>1M image-instruction pairs) spanning report generation, abnormality classification (14 labels), visual grounding (anatomy/abnormality/phrase), and multi-turn conversations (including grounded turns). Training minimizes the standard autoregressive next-token loss over assistant tokens, incorporating the chat history in multi-turn settings. Learning rates: 2e-6 for the vision encoder and 1e-5 for the projector and LLM. We train for one epoch with full fine-tuning on 128 GH200 (Grace Hopper) GPUs (96 GB each) for approximately 12 hours [21]. Code and scripts are provided at [22].
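To make the training objective concrete, the generic sketch below restricts the next-token loss to assistant tokens by setting all other labels to the ignore index; this illustrates the standard recipe rather than the project's training code.

import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss

def assistant_only_labels(input_ids, assistant_mask):
    """Copy input_ids as labels, but ignore every token that is not part of an
    assistant turn (system prompt, user turns, image tokens)."""
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX
    return labels

def next_token_loss(logits, labels):
    """Standard causal LM loss: predict token t+1 from tokens up to t."""
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )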
Quantitative evaluation
We evaluate on held-out sets with standard metrics and compare against reimplemented baselines (LLaVA-OneVision, LLaVA-Med, RaDialog, CheXagent, MAIRA-2) under a unified pipeline.
Report generation (MIMIC-CXR, single image):
- RadVLM: BERTScore 51.9, ROUGE-L 25.4, RadGraph-F1 18.2, GREEN 27.7
- Across the compared models, RadVLM leads on the lexical metrics, while its clinical metrics are close to those of CheXagent (20.1 / 29.9).
Abnormality classification (CheXpert 14-label, macro-F1):
- RadVLM achieves the highest macro-F1 overall, with strong per-class gains (e.g., atelectasis, edema, lung opacity, pneumonia, pleural effusion, pneumothorax).
Visual grounding (mAP @ IoU 0.5):
- Anatomy: 85.8
- Abnormality: 34.6
- Phrase: 81.8
RadVLM outperforms MAIRA-2 and CheXagent across all three grounding tasks.
Conversations (LLM-as-judge, 0–10):
- Standard: 6.66
- Grounded: 6.60
Substantially above conversational baselines under the same prompts.
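For reference, the sketch below illustrates two of the reported quantities: the IoU >= 0.5 criterion underlying the grounding mAP, and the macro-F1 used for 14-label classification (here via scikit-learn on toy arrays). It is a generic illustration, not the project's evaluation pipeline.

import numpy as np
from sklearn.metrics import f1_score

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes (normalized or pixel)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# Grounding: a predicted box counts as a hit at IoU >= 0.5 (the criterion behind mAP@0.5)
print(iou([0.05, 0.10, 0.48, 0.85], [0.06, 0.12, 0.50, 0.83]) >= 0.5)

# Classification: macro-F1 over the 14 abnormality labels, shown here with toy binary
# arrays (uncertain labels are already mapped to absent during preprocessing)
y_true = np.random.randint(0, 2, size=(100, 14))
y_pred = np.random.randint(0, 2, size=(100, 14))
print(f1_score(y_true, y_pred, average="macro"))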
Installation and Requirements
To perform inference with RadVLM, create a conda (or virtual) environment and install the required dependencies.
Download the dependency file requirements.txt and run the following commands:
conda create -n radvlm python=3.10
conda activate radvlm
pip install -r requirements.txt
Tested on: Ubuntu 22.04 LTS with CUDA 12.4.
Then, ensure that you have downloaded the RadVLM model files (Model_files) to a local directory. This directory path should be assigned to the variable FOLDER_PATH_TO_MODEL in the code:
FOLDER_PATH_TO_MODEL = "your/local/folder/with/RadVLM/weights"
Usage Notes
Intended Use
- Primary Use Cases
- Medical Education: Supporting radiology trainees in learning CXR interpretation through interactive Q&A.
- Preliminary Findings: Generating structured observations from CXRs to complement radiology reports.
- Out-of-Scope Uses
- Clinical Decision Making: RadVLM is not a replacement for a licensed radiologist and should not be used as the sole basis for medical decisions.
- Automated Diagnosis: The model does not provide definitive diagnoses and should be used as a supplementary tool.
- Use Outside of CXR Interpretation: The model has been trained specifically for chest X-rays and is not designed for other medical imaging modalities.
Helper Scripts
radvlm_helpers.py
- load_radvlm(model_path, device=None, torch_dtype="auto"): Loads the RadVLM model and processor from the local folder where the PhysioNet submission files are stored.
- inference_radvlm(model, processor, image, prompt, chat_history=None, max_new_tokens=512, do_sample=False): Runs single-turn or multi-turn inference. Keeps track of the chat history so follow-up questions use context. Accepts a PIL image or NumPy array and returns a plain-text response.
demo_radvlm.py
- Minimal command-line interface for multi-turn Q&A on one chest X-ray image. Calls the helper functions above. Intended as a quick and reproducible demo.
Quick Start
1) Ensure the PhysioNet RadVLM model files are available locally under FOLDER_PATH_TO_MODEL.
2) Provide the path to that folder and an image path:
python demo_radvlm.py --model-path $FOLDER_PATH_TO_MODEL --image /path/to/example_cxr.png
Minimal Python Usage
from PIL import Image
from radvlm_helpers import load_radvlm, inference_radvlm

# Load from the local PhysioNet folder
model, processor, device = load_radvlm("/path/to/RadVLM", torch_dtype="float16")

image = Image.open("example_cxr.png").convert("RGB")
chat_history = []

resp, chat_history = inference_radvlm(model, processor, image, "What are the main findings?", chat_history)
print(resp)

# Follow-up question
resp, chat_history = inference_radvlm(model, processor, image, "Any signs of pleural effusion?", chat_history)
print(resp)
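Grounding queries use the same interface, continuing the example above; the model embeds normalized box coordinates as text in its reply (see the parsing example in the Model Description). The prompt wording below is our own illustration, not a fixed command syntax.

# Hypothetical grounding query, continuing the session above
resp, chat_history = inference_radvlm(
    model, processor, image,
    "Please provide the bounding box of the right lung.",
    chat_history,
)
print(resp)  # e.g. a sentence containing a normalized box such as "[0.05, 0.10, 0.48, 0.85]"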
Release Notes
Version 1.0.0: Initial public release of the model.
Ethics
RadVLM was developed exclusively from publicly available, fully de-identified chest radiograph resources (MIMIC-CXR, CheXpert / CheXpert-Plus, Chest Imagenome, VinDr-CXR, MS-CXR, and PadChest-GR), together with synthetic dialogues generated by GPT-4o.
- Human-subjects oversight. Access to MIMIC datasets was obtained under an active PhysioNet credential and Data Use Agreement. These data were created under Beth Israel Deaconess Medical Center and MIT IRB approvals, so no new review was required. CheXpert, PadChest, and related derivatives were each released with local ethics board clearance and contain no protected health information (PHI).
- Synthetic content. Dialogue and filtered report examples were produced with GPT-4o via a secured Azure OpenAI endpoint configured in accordance with the PhysioNet “Responsible Use of GPT” guidance.
- Privacy risk. Large language models can memorize fragments of their training data. Although no verbatim leakage was found during spot checks, absolute prevention cannot be guaranteed.
- Intended use. RadVLM is a research prototype for computational imaging, machine learning benchmarking, and medical education scenarios. It is not a regulated medical device and must not be used for primary diagnosis or clinical decision-making without rigorous, task-specific validation and the requisite regulatory approvals. Model outputs are synthetic, may be inaccurate, and should always be interpreted by qualified professionals.
- Licensing and liability. The RadVLM model follows the Attribution-NonCommercial 4.0 International license. It is a derivative work of LLaVA-OneVision-7B and Qwen-2, redistributed under their original Apache-2.0 terms. Users must comply with those licenses and with the data-use restrictions of each underlying dataset. The authors and their institutions disclaim all liability for any direct, indirect, or consequential damages arising from use of the software.
The authors declare no additional ethics concerns.
Acknowledgements
This work was supported as part of the Swiss AI Initiative by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID a02 on Alps, and by the LOOP Zurich as part of the application driver project supporting the LOOP Zurich Biomedical Informatics Platform (BMIP). ND and FN received research support from the Digitalization Initiative of the Zurich Higher Education Institutions (DIZH) Rapid Action Call, under the TRUST-RAD project. CB received research support from the Promedica Foundation, Chur, CH. TS is supported by grant #2021-911 of the Strategic Focal Area “Personalized Health and Related Technologies (PHRT)” of the ETH Domain (Swiss Federal Institutes of Technology). HM, MN and KF are supported by JSPS KAKENHI (Grant Number: 23KK0148). AR is supported by the StimuLoop grant #1-007811-002 and the Vontobel Foundation. MV and SL are supported by the Swiss State Secretariat for Education, Research, and Innovation (SERI) under contract number MB22.00047. MK is supported by the UZH Global Strategy and Partnerships Funding Scheme and a Research Partnership Grant with China, Japan, South Korea and the ASEAN region (RPG 072023 18).
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
- Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80.
- Chen Z, Varma M, Xu J, Paschali M, Van Veen D, Johnston A, et al. A vision–language foundation model to enhance efficiency of chest X-ray interpretation. arXiv preprint arXiv:2401.12208. 2024. Available from: https://arxiv.org/abs/2401.12208
- Pellegrini C, Özsoy E, Busam B, Navab N, Keicher M. RaDialog: A large vision–language model for radiology report generation and conversational assistance. arXiv preprint arXiv:2311.18681. 2023.
- Bannur S, Bouzid K, Castro DC, Schwaighofer A, Thieme A, Bond-Taylor S, et al. MAIRA-2: Grounded radiology report generation. arXiv preprint arXiv:2406.04449. 2024.
- Li C, Wong C, Zhang S, Usuyama N, Liu H, Yang J, et al. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890. 2023.
- Sellergren A, Kazemzadeh S, Jaroensri T, Kiraly A, Traverse M, Kohlberger T, et al. MedGemma technical report. arXiv preprint arXiv:2507.05201. 2025.
- Liu H, Li C, Wu Q, Lee YJ. Visual instruction tuning. arXiv preprint arXiv:2304.08485. 2023.
- Wang P, Bai S, Tan S, Wang S, Fan Z, Bai J, et al. Qwen2-VL: Enhancing vision–language models’ perception of the world at any resolution. arXiv preprint arXiv:2409.12191. 2024.
- Li B, Zhang Y, Guo D, Zhang R, Li F, Zhang H, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326. 2024.
- LLaVA-OneVision Qwen2-7B SI model. Available at: https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-si (Accessed: October 6, 2025).
- Johnson AE, Pollard TJ, Greenbaum NR, Lungren MP, Deng CY, Peng Y, et al. MIMIC-CXR-JPG: A large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042. 2019.
- Nguyen HQ, Lam K, Le LT, Pham HH, Tran DQ, Nguyen DB, et al. VinDr-CXR: An open dataset of chest X-rays with radiologists’ annotations. Scientific Data. 2022;9(1):429.
- Wu JT, Agu NN, Lourentzou I, Sharma A, Paguio JA, Yao JS, et al. Chest ImaGenome dataset for clinical reasoning. arXiv preprint arXiv:2108.00316. 2021.
- Boecking B, Usuyama N, Bannur S, Castro DC, Schwaighofer A, Hyland S, et al. Making the most of text semantics to improve biomedical vision–language processing. In: European Conference on Computer Vision. Springer; 2022. p. 1–21.
- Castro DC, Bustos A, Bannur S, Hyland SL, Bouzid K, Wetscherek MT, et al. PadChest-GR: A bilingual chest X-ray dataset for grounded radiology report generation. arXiv preprint arXiv:2411.05085. 2024.
- Chambon P, Delbrouck JB, Sounack T, Huang SC, Chen Z, Varma M, et al. CheXpert Plus: Augmenting a large chest X-ray dataset with text radiology reports, patient demographics, and additional image formats. arXiv preprint arXiv:2405.19538. 2024. Available from: https://arxiv.org/abs/2405.19538
- PhysioNet. RadVLM Instruction Dataset (version 1.0.0). Available at: https://physionet.org/content/radvlm-instruction-dataset/1.0.0/ (Accessed: October 3, 2025).
- Zhai X, Mustafa B, Kolesnikov A, Beyer L. Sigmoid loss for language–image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023. p. 11975–86.
- Yang A, Yang B, Hui B, Zheng B, Yu B, Zhou C, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671. 2024.
- Chai L, Gharbi M, Shechtman E, Isola P, Zhang R. Any-resolution training for high-resolution image synthesis. In: European Conference on Computer Vision. Springer; 2022. p. 170–88.
- Fusco L, Khalilov M, Chrapek M, Chukkapalli G, Schulthess T, Hoefler T. Understanding data movement in tightly coupled heterogeneous systems: A case study with the Grace Hopper superchip. arXiv preprint arXiv:2408.11556. 2024.
- RadVLM GitHub Repository. Available at: https://github.com/uzh-dqbm-cmi/RadVLM (Accessed: October 3, 2025).
Parent Projects
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/50kn-p490
DOI (latest version):
https://doi.org/10.13026/wg4j-wp53
Project Website:
https://huggingface.co/KrauthammerLab/RadVLM-info
Corresponding Author
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project