Resources


Database Credentialed Access

BOLD, a blood-gas and oximetry linked dataset

João Matos, Tristan Struja, Jack Gallifant, Luis Filipe Nakayama, Marie Charpignon, Xiaoli Liu, Jaime dos Santos Cardoso, Leo Anthony Celi, An Kwok Wong

An open-source pulse oximetry and arterial blood gas dataset, derived from MIMIC-III, MIMIC-IV, and eICU-CRD

pulse oximetry intensive care unit health equity electronic health records

Published: Nov. 8, 2023. Version: 1.0


Database Credentialed Access

PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Daeun Kyung, Hyunseung Chung, Seongsu Bae, Jiho Kim, Jae Ho Sohn, Taerim Kim, Soo Kim, Edward Choi

PatientSim is a patient simulator that models realistic and diverse personas for clinical scenarios, enabling robust training and evaluation of doctor-patient interactions in multi-turn dialogues.

electronic health records multi-turn dialogue llm simulation doctor-patient consultation

Published: Oct. 18, 2025. Version: 1.0.0


Database Credentialed Access

Annotated Social Determinants of Health Dataset for Adverse Pregnancy Outcomes

Nidhi Soley, MaKhaila Bentil, Jash Shah, Masoud Rouhizadeh, Casey Taylor

This project provides a manually annotated dataset of social determinants of health—social support, occupation, and substance use—linked to pregnancy outcomes, extracted from MIMIC-III and MIMIC-IV discharge summary notes.

Published: Aug. 4, 2025. Version: 1.0.0


Model Credentialed Access

Shareable Artificial Intelligence to Extract Cancer Outcomes from Electronic Health Records for Precision Oncology Research

Kenneth Kehl, Pavel Trukhanov, Christopher Fong, Justin Jee, Karl Pichotta, Morgan Paul, Chelsea Nichols, Michele Waters, Nikolaus Schultz, Deborah Schrag

The DFCI-imaging-student and DFCI-medonc-student AI models for extracting cancer outcomes from imaging reports and medical oncologist notes from electronic health records.

Published: Oct. 24, 2024. Version: 1.0.0


Database Credentialed Access

EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems

Konstantin Kotschenreuther

Dataset consisting of question-and-answer pairs synthetically generated from medical discharge summaries, designed to facilitate the training and development of large language models specifically tailored for healthcare applications.

mimic-iv clinical question-answering medical discharge summaries large language models

Published: Jan. 11, 2024. Version: 1.0.0


Database Credentialed Access

EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images

Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei JI, Eric Chang, Tackeun Kim, Edward Choi

We present EHRXQA, the first multi-modal EHR QA dataset combining structured patient records with aligned chest X-ray images. EHRXQA contains a comprehensive set of QA pairs covering image-related, table-related, and image+table-related questions.

question answering chest x-ray electronic health records multi-modal question answering ehr question answering semantic parsing machine learning deep learning evaluation visual question answering benchmark

Published: July 23, 2024. Version: 1.0.0


Model Credentialed Access

Characterization of Stigmatizing Language in Medical Records

Keith Harrigian, Ayah Zirikly, Brant Chee, Alya Ahmad, Anne Links, Somnath Saha, Mary Catherine Beach, Mark Dredze

A suite of classifiers for detecting three types of stigmatizing language in electronic medical records. Trained on MIMIC-IV discharge notes.

clinical natural language processing domain transfer bias stigmatizing language large language models mimic

Published: Nov. 6, 2023. Version: 1.0.0


Challenge Credentialed Access

SNOMED CT Entity Linking Challenge

Will Hardman, Mark Banks, Rory Davidson, Donna Truran, Nindya Widita Ayuningtyas, Hoa Ngo, Alistair Johnson, Tom Pollard

272 discharge notes from the MIMIC-IV-Note dataset annotated with SNOMED CT concepts.

snomed entity linking clinical annotation

Published: July 22, 2025. Version: 1.1.0


Database Credentialed Access

MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters

Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb Smith, Jens Kleesiek, Julian Friedrich

MeDiSumQA is a dataset of patient-oriented QA pairs from MIMIC-IV discharge summaries, designed to evaluate LLMs in generating safe, patient-friendly medical responses for clinical QA and healthcare communication.

Published: May 5, 2025. Version: 1.0.0


Database Credentialed Access

A Temporal Dataset for Respiratory Support in Critically Ill Patients

Mira Moukheiber, Lama Moukheiber, Dana Moukheiber, Sicheng Hao, Leo Anthony Celi, Hyung-Chul Lee

A benchmark dataset offering hourly records over a 90-day period for 50,920 ICU subjects, including dynamic pulmonary function data and a spectrum of covariates for respiratory intervention analyses.

observational data time-series

Published: April 15, 2025. Version: 1.1.0