Database Contributor Review
ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room
Mel Molina , Nikita Mehandru , Niloufar Golchini , Ahmed Alaa
Published: Oct. 23, 2025. Version: 1.0.0
When using this resource, please cite:
Molina, M., Mehandru, N., Golchini, N., & Alaa, A. (2025). ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/55s7-3c27
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
The ER-Reason dataset is a benchmark designed to evaluate LLM-based clinical reasoning and decision-making in the emergency room (ER), a high-stakes setting where clinicians make rapid, consequential decisions across diverse patient presentations and medical specialties under time pressure. This longitudinal collection of de-identified clinical notes encompasses 3,437 patients admitted to the ER at a large academic medical center between March 1, 2022, and March 31, 2024. ER-Reason contains 25,174 notes spanning discharge summaries, progress notes, history and physical exams, consults, echocardiography reports, imaging notes, and ER provider documentation across 3,984 encounters. The benchmark includes evaluation tasks drawn from key stages of the ER workflow: triage intake, initial assessment, treatment selection, disposition planning, and final diagnosis, each structured to reflect core clinical reasoning processes such as differential diagnosis via rule-out reasoning. We also collected 72 physician-authored rationales that explain the reasoning process, mirroring the teaching approach used in residency training; such explicit rationales are typically absent from ER documentation. This retrospective dataset captures unstructured, multi-encounter clinical notes reflecting the real-world complexity of ER patient care.
Background
As more advanced LLMs designed for complex reasoning emerge, such as DeepSeek-R1 [1], OpenAI o1 [2], and o3-mini [3], there is a need for new benchmark datasets that enable rigorous evaluation of how well LLMs emulate clinician reasoning in real-world clinical settings.
We introduce ER-Reason, a benchmark designed to evaluate LLM-based clinical reasoning and decision-making in the emergency room (ER). The ER provides a unique setting for evaluating LLMs for multiple reasons. First, emergency physicians must make rapid, high-stakes decisions based on fragmented, heterogeneous clinical documentation, a setting where LLMs could support clinicians by synthesizing and contextualizing disparate sources of information. Second, ER decision-making follows a structured workflow that involves multiple interdependent stages—triage, assessment, treatment, and disposition—each influenced by both patient-specific factors and systemic constraints such as hospital capacity and acuity levels [4]. Third, unlike notes from many other clinical settings, ER provider notes often contain explicit justifications for diagnostic and treatment choices. Due to the episodic nature of emergency care, clinicians adopt a "worst-first" approach that prioritizes the exclusion of life-threatening conditions over comprehensive care, and must clearly communicate their reasoning to downstream providers [5, 6]. The ER provider note thus serves as a communication tool that reflects clinical reasoning, including which differential diagnoses were considered, which were ruled out, and the rationale behind key decisions. However, due to the intense pace and workload of the ER, these notes can also under-document the full scope of clinician reasoning [7].
ER-Reason captures the full trajectory of patients in the ER at a large academic medical center, with data reflecting the unique characteristics of ER care described above. The benchmark includes:
- LLM evaluation tasks aligned with key stages of the ER workflow, including triage intake, EHR review, initial assessment, treatment selection, disposition planning, and final diagnosis. Each task is grounded in realistic input derived from the patients’ longitudinal clinical documentation.
- Annotations capturing the clinical reasoning behind decision-making at each stage of the ER workflow. In addition to the clinical rationale documented in ER provider notes, the benchmark includes 72 expert-authored rationales curated from practicing ER residents and physicians, with structured annotations of reasoning that mimic the teaching processes found in residency training.
Methods
Patient and Encounter Selection:
The dataset includes all adult patients (ages 18–85) who were admitted to the Emergency Room (ER) at UCSF between March 1, 2022, and March 31, 2024. Only encounters with a specified primary ED diagnosis were included (i.e., encounters with PrimaryEdDiagnosisName not equal to *Unspecified).
Note Selection and Filtering:
For each included encounter, the most recent note of each type created before the encounter’s discharge date was included in the dataset. The note types included were:
- ED Provider Notes
- Discharge Summaries
- Progress Notes
- History & Physical (H&P) Notes
- Consults
- Imaging Reports
- Echocardiography (Echo) Reports
- Electrocardiogram (ECG) Reports
Social work notes and other note types were excluded to provide additional privacy protection. On average, there are seven notes per patient encounter.
Encounter-Level Inclusion Criteria:
Encounters were included in the final dataset only if at least one Discharge Summary and one ED Provider Note were present. Some patients may have multiple encounters within the study period.
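For readers who want to reproduce this selection logic on their own extracts, a minimal pandas sketch is shown below. The raw-table layout and column names (`note_type`, `note_date`, `discharge_date`, file names) are assumptions for illustration only and do not correspond to the actual UCSF source schema.

```python
import pandas as pd

# Hypothetical raw extracts; column names are illustrative, not the UCSF source schema.
notes = pd.read_csv("raw_notes.csv", parse_dates=["note_date"])
encounters = pd.read_csv("raw_encounters.csv", parse_dates=["discharge_date"])

# Keep only encounters with a specified primary ED diagnosis.
encounters = encounters[encounters["PrimaryEdDiagnosisName"] != "*Unspecified"]

# For each encounter, keep the most recent note of each type created before discharge.
merged = notes.merge(encounters[["encounterkey", "discharge_date"]], on="encounterkey")
merged = merged[merged["note_date"] < merged["discharge_date"]]
latest = merged.sort_values("note_date").groupby(["encounterkey", "note_type"]).tail(1)

# Encounter-level inclusion: require both a discharge summary and an ED provider note.
required = {"Discharge Summary", "ED Provider Note"}
keep = latest.groupby("encounterkey")["note_type"].agg(lambda t: required.issubset(set(t)))
latest = latest[latest["encounterkey"].isin(keep[keep].index)]
```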
Demographics and Clinical Data:
For each patient, demographic information includes sex, race, age at encounter (calculated from birthdate and encounter departure date), preferred language, highest level of education, and marital status. Chief complaint and admission/discharge information (year only) are also included, with dates shifted to preserve privacy.
Expert Rationales:
To support clinical reasoning evaluation, the dataset includes 72 expert-authored rationales from emergency medicine residents and attending physicians. Participants followed a structured workflow in a custom application to document step-by-step reasoning, including rule-out logic, relevant medical factors, and treatment planning. IRB approval was obtained, and participants were compensated.
Benchmark Tasks:
The dataset supports five benchmark tasks aligned with ER workflow stages (an illustrative input-construction sketch follows the list):
- Triage Intake: Assess patient acuity based on initial clinical notes.
- Patient Case Summarization: Generate a one-line summary of the patient’s case using prior discharge notes and the current chief complaint.
- Treatment Planning: Predict differential diagnoses, identify relevant medical factors (labs, imaging, clinical findings), and propose treatment plans.
- Final Diagnosis: Assign definitive diagnoses using ICD-10 codes and HCC categories based on all available clinical information.
- Disposition Decision: Recommend patient disposition (e.g., discharge, admit, transfer) informed by clinical presentation and history.
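To make the task formats concrete, the sketch below assembles an input for the Patient Case Summarization task directly from the released columns. The prompt wording is an illustrative assumption; the benchmark's own task templates (see the GitHub repository [8]) may differ.

```python
import pandas as pd

df = pd.read_csv("er_reason.csv")
row = df.iloc[0]

# Illustrative prompt; not the benchmark's official template.
prompt = (
    "You are an emergency medicine physician.\n"
    f"Chief complaint: {row['primarychiefcomplaintname']}\n"
    f"Prior discharge summary:\n{row['Discharge_Summary_Text']}\n\n"
    "Write a one-line summary of this patient's case."
)

# Gold one-liner extracted from the current ED provider note, used as the reference.
reference = row["One_Sentence_Extracted"]
```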
Evaluation Metrics:
Each task specifies inputs, outputs, and evaluation metrics. Metrics include ROUGE/F1 for summaries, ICD-10 and Hierarchical Condition Category (HCC) accuracy for diagnoses, and clinical concept recall using Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). No preprocessing beyond de-identification was applied before release.
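A minimal scoring sketch is shown below. It assumes the `rouge_score` package for ROUGE and a user-supplied mapping from text to UMLS CUIs (e.g., via a named-entity recognizer with a UMLS linker); the function names are illustrative and not part of the released benchmark code.

```python
from rouge_score import rouge_scorer


def summary_scores(reference: str, prediction: str) -> dict:
    """ROUGE-1 and ROUGE-L F-measures for a generated one-line summary."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    return {name: s.fmeasure for name, s in scorer.score(reference, prediction).items()}


def cui_recall(reference_cuis: set, predicted_cuis: set) -> float:
    """Fraction of reference UMLS CUIs recovered in the model output.
    Extracting CUIs from free text (e.g., with a UMLS entity linker) is left to the user."""
    if not reference_cuis:
        return 0.0
    return len(reference_cuis & predicted_cuis) / len(reference_cuis)
```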
De-Identification:
All notes were de-identified by the University of California, San Francisco (UCSF) Information Commons using both HIPAA Safe Harbor and HIPAA Expert Determination methods. Safe Harbor standards were applied to remove or mask standard identifiers, while the Expert Determination method was additionally employed to allow for date-shifting.
Data Description
Unlike datasets limited to standardized note formats such as SOAP (subjective, objective, assessment, and plan), ER-REASON contains a variety of note types in unstructured formats, which adds complexity to clinical reasoning tasks. The dataset includes 395 unique chief complaints among admitted patients, with the most common being abdominal pain, shortness of breath, and chest pain. For each admitted ER patient, we include clinical notes from their previous hospital encounters, capturing the complexity and temporal progression of patient care across multiple visits. To our knowledge, ER-REASON is the first publicly released dataset providing longitudinal, multi-encounter clinical notes for ER patients, enabling evaluation of LLMs on realistic clinical reasoning tasks.
Each row in er_reason.csv represents a single patient encounter. Multiple notes from the same encounter are linked via unique identifiers, and historical notes (discharge summaries, progress notes, H&P, imaging, ECGs, consults, echocardiograms) are included alongside the current visit’s ED provider note. Expert-authored rationales are captured in the Rule_Out, Decision_Factors, and Treatment_Plan columns, providing step-by-step clinical reasoning related to the chief complaint, demographics, and one-line summary (One_Sentence_Extracted). Chief complaints are standardized and stored in primarychiefcomplaintname.
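A minimal sketch for loading the CSV and assembling one encounter's longitudinal record is shown below; the record structure is an illustrative choice, not a required format.

```python
import pandas as pd

df = pd.read_csv("er_reason.csv")

# Columns holding historical note text from prior encounters.
HISTORY_COLS = [
    "Discharge_Summary_Text", "Progress_Note_Text", "HP_Note_Text",
    "Consult_Text", "Imaging_Text", "Echo_Text", "ECG_Text",
]

def encounter_record(row: pd.Series) -> dict:
    """Assemble one encounter's longitudinal record from a single CSV row."""
    return {
        "patient": row["patientdurablekey"],
        "encounter": row["encounterkey"],
        "chief_complaint": row["primarychiefcomplaintname"],
        "history": {c: row[c] for c in HISTORY_COLS if pd.notna(row[c])},
        "ed_note": row["ED_Provider_Notes_Text"],      # current-visit ED provider note
        "one_liner": row["One_Sentence_Extracted"],    # gold one-line summary
    }

records = [encounter_record(r) for _, r in df.iterrows()]
```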
| Column Name | Description |
|---|---|
| patientdurablekey | Unique patient identifier |
| encounterkey | Unique encounter identifier associated with the current ER visit |
| primarychiefcomplaintname | Chief complaint recorded when the patient arrived at the ER |
| primaryeddiagnosisname | Diagnosis assigned by the ER physician at the end of the current ER visit |
| sex | Patient's sex |
| firstrace | Patient's race |
| preferredlanguage | Patient's preferred language |
| highestlevelofeducation | Patient's highest level of education |
| maritalstatus | Patient's marital status |
| Age | Patient's age at the encounter |
| Discharge_Summary_Note_Key | Unique identifier linking to the historical discharge summary note |
| Progress_Note_Key | Unique identifier linking to the historical progress note |
| HP_Note_Key | Unique identifier linking to the historical history and physical note |
| Echo_Key | Unique identifier linking to the historical echocardiogram note |
| Imaging_Key | Unique identifier linking to the historical imaging note |
| Consult_Key | Unique identifier linking to the historical consult note |
| ED_Provider_Notes_Key | Unique identifier for the current visit's ED provider note |
| ECG_Key | Unique identifier linking to the historical ECG note |
| Discharge_Summary_Text | Historical: discharge summary text from the patient's previous hospital encounter |
| Progress_Note_Text | Historical: progress note text from the patient's previous hospital encounter |
| HP_Note_Text | Historical: history and physical note from the patient's previous hospital encounter |
| Echo_Text | Historical: echocardiogram results and interpretation from the patient's previous hospital encounter |
| Imaging_Text | Historical: imaging reports and findings from the patient's previous hospital encounter |
| Consult_Text | Historical: specialist consultation notes from the patient's previous hospital encounter |
| ECG_Text | Historical: electrocardiogram results and interpretation from the patient's previous hospital encounter |
| ED_Provider_Notes_Text | Current visit: ED provider note from the current ER visit (associated with this encounter, patient, chief complaint, and diagnosis) |
| One_Sentence_Extracted | Key one-liner summary extracted from the current ED provider note |
| note_count | Number of notes associated with the patient in this dataset (minimum of 2: ED provider note and discharge summary; higher when additional note types are available) |
| acuitylevel | Emergency Severity Index (ESI) level assigned at triage when the patient arrived at the ER |
| eddisposition | Disposition assigned when the patient left the ER (e.g., discharged, admitted, transferred) |
| ArrivalYearKey | Year the patient arrived at the ER for the current visit |
| DepartureYearKeyValue | Year the patient departed from the ER for the current visit |
| DepartureYearKey | Year the patient departed from the ER (key format) |
| DispositionYearKeyValue | Year the disposition was assigned |
| birthYear | Year the patient was born |
| Discharge_Summary_Year | Year the historical discharge summary was created |
| Progress_Note_Year | Year the historical progress note was created |
| HP_Note_Year | Year the historical history and physical note was created |
| Echo_Year | Year the historical echocardiogram was performed |
| Imaging_Year | Year the historical imaging was performed |
| Consult_Year | Year the historical consult was completed |
| ED_Provider_Notes_Year | Year the current ED provider note was created |
| ECG_Year | Year the historical ECG was performed |
| Rule_Out | Differential diagnosis list made by the physician given the chief complaint, demographics, and one-liner (acts as the physician's pre-encounter mental model) |
| Decision_Factors | Factors the physician would use to narrow down the differential list |
| Treatment_Plan | Factors and treatment plan the physician would choose given the history and physical |
Special Notes on Year Fields
Some date-related fields (e.g., ArrivalYearKey, DepartureYearKey) may contain the value 1970. This value does not indicate an actual event year, but instead reflects a default or placeholder (commonly a Unix epoch fallback). Users should interpret 1970 as indicating missing or unavailable date information.
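When analyzing these fields, it may be convenient to convert the placeholder to an explicit missing value. A minimal sketch, assuming the placeholder appears as the integer 1970 in all year-like columns:

```python
import pandas as pd

df = pd.read_csv("er_reason.csv")

# Treat the 1970 epoch-fallback placeholder as missing in all year-like columns.
year_cols = [c for c in df.columns if "Year" in c]
df[year_cols] = df[year_cols].replace(1970, pd.NA)
```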
Usage Notes
The ER-REASON dataset can be used for research purposes such as:
- Benchmarking LLMs on ER note summarization and multi-encounter reasoning.
- Studying clinical decision-making workflows in the emergency setting.
- Training models to predict diagnoses, treatment plans, and patient disposition from sequential clinical notes (a minimal baseline sketch follows this list).
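As one illustration of the last use case, the sketch below trains a simple TF-IDF plus logistic-regression baseline that predicts ED disposition from pre-encounter information (chief complaint and prior discharge summary). It is a minimal sketch, assuming scikit-learn is installed, and is not part of the released benchmark code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("er_reason.csv").dropna(subset=["eddisposition"])

# Use only pre-encounter information to avoid leaking the outcome from the current ED note.
text = df["primarychiefcomplaintname"].fillna("") + " " + df["Discharge_Summary_Text"].fillna("")

X_train, X_test, y_train, y_test = train_test_split(
    text, df["eddisposition"], test_size=0.2, random_state=0
)

model = make_pipeline(TfidfVectorizer(max_features=50_000), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(f"Held-out disposition accuracy: {model.score(X_test, y_test):.3f}")
```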
Limitations:
- The dataset is intended for research use only and not for clinical decision-making.
- Users must comply with the dataset license and maintain data privacy standards.
- While the notes are de-identified, they should not be used in any context that could attempt to re-identify patients.
For code, examples, and discussion of the dataset, see the associated GitHub repository [8].
Release Notes
The current version of ER-Reason is v1.0.0. This is the stable release, and the schema and structure are not expected to change. ER-Reason v1.0.0 follows semantic versioning guidelines, and represents the finalized version of the project.
Ethics
The collection and rigorous de-identification of patient information were conducted by the Information Commons team at the University of California, San Francisco (UCSF). Approval to share this dataset was granted by the institution’s compliance team.
Acknowledgements
We would like to thank the University of California, San Francisco (UCSF) Information Commons for their continued support of the ER-Reason project. In particular, we are grateful to Albert Lee for his invaluable assistance, as well as to Helena Mezgova and Ariel Deardorff for their guidance and oversight on compliance matters.
Conflicts of Interest
None to declare.
References
- Guo D, Yang D, Zhang H, Song J, Zhang R, Xu R, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv [Preprint]. 2025 Jan 22. doi:10.48550/arXiv.2501.12948.
- Zhong T, Liu Z, Pan Y, Zhang Y, Zhou Y, Liang S, et al. Evaluation of OpenAI o1: Opportunities and Challenges of AGI. arXiv [Preprint]. 2025 Jul 7. doi:10.48550/arXiv.2409.18486.
- Mondillo G, Masino M, Colosimo S, Perrotta A, Frattolillo V. Evaluating AI reasoning models in pediatric medicine: A comparative analysis of O3-mini and O3-mini-high. medRxiv [Preprint]. 2025 Feb 27. doi:10.1101/2025.02.27.25323028
- Kanzaria HK, Brook RH, Probst MA, Harris D, Berry SH, Hoffman JR. Emergency physician perceptions of shared decision-making. Acad Emerg Med. 2015;22(4):399–405.
- Croskerry P. A universal model of diagnostic reasoning. Acad Med. 2009;84(8):1022–8.
- Kellermann AL, Hsia RY, Yeh C, Morganti KG. Emergency care: then, now, and next. Health Aff (Millwood). 2013;32(12):2069–74.
- Hill RG Jr, Sears LM, Melanson SW. 4000 clicks: a productivity analysis of electronic medical records in a community hospital ED. Am J Emerg Med. 2013;31(11):1591–4.
- AlaaLab. ER-Reason [Internet]. GitHub; 2025. Available from: https://github.com/AlaaLab/ER-Reason/ [Accessed 29 Sept 2025].
Access
Access Policy:
Only credentialed users who sign the DUA can access the files. In addition, users must have individual studies reviewed by the contributor.
License (for files):
PhysioNet Contributor Review Health Data License 1.5.0
Data Use Agreement:
PhysioNet Contributor Review Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/55s7-3c27
DOI (latest version):
https://doi.org/10.13026/jrvj-k081
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- submit a request to the authors to use the data for your project