Resources


Database Contributor Review

COVID Data for Shared Learning (CDSL): A comprehensive, multimodal COVID-19 dataset from HM Hospitales

Álvaro Ritoré, Andreea M Oprescu, Alberto Estirado Bronchalo, Miguel Ángel Armengol de la Hoz

COVID Data for Shared Learning (CDSL) is a multimodal database comprising de-identified structured health data and radiological images from 4,479 patients with COVID-19, intended as a comprehensive toolkit for developing predictive models.

covid-19 multimodal database radiological images open data healthcare data machine learning and ai

Published: Oct. 25, 2024. Version: 1.0.0


Database Credentialed Access

AMR-UTI: Antimicrobial Resistance in Urinary Tract Infections

Michael Oberst, Soorajnath Boominathan, Helen Zhou, Sanjat Kanjilal, David Sontag

AMR-UTI is a freely accessible dataset, derived from electronic health record (EHR) information on over 100,000 urinary tract infections (UTI) treated at Massachusetts General Hospital and Brigham & Women's Hospital in Boston, MA, USA.

antibiotic resistance causal inference policy learning antimicrobial resistance urinary tract infection clinical decision support machine learning

Published: Nov. 4, 2020. Version: 1.0.0


Database Credentialed Access

MIMIC-Ext-MIMIC-CXR-VQA: A Complex, Diverse, And Large-Scale Visual Question Answering Dataset for Chest X-ray Images

Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei JI, Eric Chang, Tackeun Kim, Edward Choi

We introduce MIMIC-Ext-MIMIC-CXR-VQA, a complex, diverse, and large-scale dataset designed for Visual Question Answering (VQA) tasks within the medical domain, focusing primarily on chest radiographs.

question answering machine learning evaluation chest x-ray radiology benchmark electronic health records multimodal deep learning visual question answering

Published: July 19, 2024. Version: 1.0.0


Database Credentialed Access

Predictors of Hospital Onset Infection: A Matched Retrospective Cohort Dataset

Ziming Wei, Luke Sagers, Caroline McKenna, Ted Pak, Chanu Rhee, Michael Klompas, Sanjat Kanjilal

NPA-CP is a freely accessible dataset derived from electronic health record (EHR) information at MGB between 2015 and 2024. The dataset includes 11 different pathogens and can be used to predict hospital-onset infections for these pathogens.

electronic health records infection control clinical machine learning infectious diseases hospital onset infection colonization pressure

Published: Nov. 4, 2025. Version: 1.0.0


Database Credentialed Access

MIMIC-IV-Ext Triage Instruction Corpus

Qingyang Shen, Quan Guo

MIMIC-IV-Ext Triage Instruction Corpus includes 9,629 emergency department (ED) triage cases organized by the five-level Emergency Severity Index (ESI), enabling LLMs to improve triage accuracy. It provides CSV data, generation prompts, expert validation samples, and SQL QC scripts.

nlp clinical decision support machine learning large language models emergency severity index emergency triage

Published: March 4, 2025. Version: 1.0.0


Database Open Access

Synthetic Mention Corpora for Disease Entity Recognition and Normalization

Kuleen Sasse, John David Osborne

We present the Synthetic Mention Corpora for Disease Entity Recognition and Normalization, containing 128,000 disease mentions from the UMLS disorder group, generated by an LLM. This corpus aims to improve disease entity recognition and normalization in biomedical and clinical texts.

nlp machine learning named entity recognition data augmentation entity normalization

Published: Feb. 3, 2025. Version: 1.0.0


Database Credentialed Access

MIMIC-IV-ECG-Ext-ICD: Diagnostic labels for MIMIC-IV-ECG

Nils Strodthoff, Juan Miguel Lopez Alcaraz, Wilhelm Haverkamp

A dataset that links ECG records from MIMIC-IV-ECG to ED discharge and hospital discharge diagnoses, which enables training of general ECG prediction models based on clinical labels and facilitates the retrieval of further clinical metadata from MIMIC-IV.

machine learning electrocardiography mimic

Published: Aug. 30, 2024. Version: 1.0.1


Database Credentialed Access

EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images

Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei JI, Eric Chang, Tackeun Kim, Edward Choi

We present EHRXQA, the first multi-modal EHR QA dataset combining structured patient records with aligned chest X-ray images. EHRXQA contains a comprehensive set of QA pairs covering image-related, table-related, and image+table-related questions.

question answering machine learning evaluation chest x-ray multi-modal question answering ehr question answering semantic parsing benchmark electronic health records deep learning visual question answering

Published: July 23, 2024. Version: 1.0.0

