
ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room

Mel Molina, Nikita Mehandru, Niloufar Golchini, Ahmed Alaa

Published: Oct. 23, 2025. Version: 1.0.0


When using this resource, please cite:
Molina, M., Mehandru, N., Golchini, N., & Alaa, A. (2025). ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/55s7-3c27

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

The ER-Reason dataset is a benchmark designed to evaluate LLM-based clinical reasoning and decision-making in the emergency room (ER), a high-stakes setting where clinicians make rapid, consequential decisions across diverse patient presentations and medical specialties under time pressure. This longitudinal collection of de-identified clinical notes encompasses 3,437 patients admitted to the ER at a large academic medical center between March 1, 2022, and March 31, 2024. ER-Reason contains 25,174 notes spanning discharge summaries, progress notes, history and physical exams, consults, echocardiography reports, imaging notes, and ER provider documentation across 3,984 encounters. The benchmark includes evaluation tasks from key stages of the ER workflow: triage intake, initial assessment, treatment selection, disposition planning, and final diagnosis, each structured to reflect core clinical reasoning processes such as differential diagnosis via rule-out reasoning. We also collected 72 physician-authored rationales that explain the reasoning behind these decisions, mirroring the teaching process used in residency training; such rationales are typically absent from ER documentation. This retrospective dataset captures unstructured, multi-encounter clinical notes reflecting the real-world complexity of ER patient care.


Background

As advanced LLMs designed for complex reasoning emerge, such as DeepSeek-R1 [1], OpenAI o1 [2], and o3-mini [3], there is a need for new benchmark datasets that enable rigorous evaluation of how well LLMs emulate clinician reasoning in real-world clinical settings.

We introduce ER-Reason, a benchmark designed to evaluate LLM-based clinical reasoning and decision-making in the emergency room (ER). The ER provides a unique setting for evaluating LLMs for multiple reasons. First, emergency physicians must make rapid, high-stakes decisions based on fragmented, heterogeneous clinical documentation, a setting where LLMs could support clinicians by synthesizing and contextualizing disparate sources of information. Second, ER decision-making follows a structured workflow that involves multiple interdependent stages—triage, assessment, treatment, and disposition—each influenced by both patient-specific factors and systemic constraints such as hospital capacity and acuity levels [4]. Third, unlike notes from many other clinical settings, ER provider notes often contain explicit justifications for diagnostic and treatment choices. Due to the episodic nature of emergency care, clinicians adopt a "worst-first" approach that prioritizes the exclusion of life-threatening conditions over comprehensive care, and must clearly communicate their reasoning to downstream providers [5, 6]. The ER provider note thus serves as a communication tool that reflects clinical reasoning, including which differential diagnoses were considered, which were ruled out, and the rationale behind key decisions. However, due to the intense pace and workload of the ER, these notes can also under-document the full scope of clinician reasoning [7].

ER-Reason captures the full trajectory of patients in the ER at a large academic medical center, with data reflecting the unique characteristics of ER care described above. The benchmark includes:

  • LLM evaluation tasks aligned with key stages of the ER workflow, including triage intake, EHR review, initial assessment, treatment selection, disposition planning, and final diagnosis. Each task is grounded in realistic input derived from the patients’ longitudinal clinical documentation.

  • Annotations capturing clinical reasoning behind decision-making at each stage of the ER workflow. In addition to clinical rationale documented in ER provider notes, the benchmark includes 72 expert-authored rationales curated from practicing ER residents and physicians, with structured annotations of reasoning that mimic the teaching processes found in residency training.


Methods

Patient and Encounter Selection:
The dataset includes all adult patients (ages 18–85) who were admitted to the Emergency Room (ER) at UCSF between March 1, 2022, and March 31, 2024. Only encounters with a specified primary ED diagnosis were included (i.e., encounters with PrimaryEdDiagnosisName not equal to *Unspecified).

Note Selection and Filtering:
For each included encounter, the most recent note of each type created before the encounter’s discharge date was included in the dataset. The note types included were:

  • ED Provider Notes
  • Discharge Summaries
  • Progress Notes
  • History & Physical (H&P) Notes
  • Consults
  • Imaging Reports
  • Echocardiography (Echo) Reports
  • Electrocardiography (ECG) Reports

Social work notes and other note types were excluded to provide additional privacy protection. On average, there are seven notes per patient encounter.
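
This selection rule amounts to keeping the latest note of each type per encounter. Below is a minimal pandas sketch of that logic, assuming a hypothetical raw note export with encounterkey, note_type, note_datetime, and discharge_datetime columns (these raw-export column names are illustrative and are not part of the released dataset):

```python
import pandas as pd

# Hypothetical raw export of all candidate notes; column names are assumptions.
notes = pd.read_csv("raw_notes_export.csv",
                    parse_dates=["note_datetime", "discharge_datetime"])

# Keep only notes created before the encounter's discharge date.
notes = notes[notes["note_datetime"] <= notes["discharge_datetime"]]

# Within each encounter and note type, keep the most recent note.
latest = (
    notes.sort_values("note_datetime")
         .groupby(["encounterkey", "note_type"])
         .tail(1)
)
```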

Encounter-Level Inclusion Criteria:
Encounters were included in the final dataset only if at least one Discharge Summary and one ED Provider Note were present. Some patients may have multiple encounters within the study period.
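
Combined with the cohort definition above, the inclusion criteria can be expressed as a simple row filter. The released file already satisfies these criteria, so the sketch below only illustrates the construction logic (column names follow the Data Description table; the "*Unspecified" sentinel and age bounds are taken from the Methods text):

```python
import pandas as pd

df = pd.read_csv("er_reason.csv")

included = df[
    df["Age"].between(18, 85)                            # adult cohort (ages 18-85)
    & (df["primaryeddiagnosisname"] != "*Unspecified")   # specified primary ED diagnosis
    & df["Discharge_Summary_Text"].notna()               # at least one discharge summary
    & df["ED_Provider_Notes_Text"].notna()               # and one ED provider note
]
```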

Demographics and Clinical Data:
For each patient, demographic information includes sex, race, age at encounter (calculated from birthdate and encounter departure date), preferred language, highest level of education, and marital status. Chief complaint and admission/discharge information (year only) are also included, with dates shifted to preserve privacy.

Expert Rationales:
To support clinical reasoning evaluation, the dataset includes 72 expert-authored rationales from emergency medicine residents and attending physicians. Participants followed a structured workflow in a custom application to document step-by-step reasoning, including rule-out logic, relevant medical factors, and treatment planning. IRB approval was obtained, and participants were compensated.

Benchmark Tasks:
The dataset supports five benchmark tasks aligned with ER workflow stages:

  1. Triage Intake: Assess patient acuity based on initial clinical notes.
  2. Patient Case Summarization: Generate a one-line summary of the patient’s case using prior discharge notes and the current chief complaint.
  3. Treatment Planning: Predict differential diagnoses, identify relevant medical factors (labs, imaging, clinical findings), and propose treatment plans.
  4. Final Diagnosis: Assign definitive diagnoses using ICD-10 codes and HCC categories based on all available clinical information.
  5. Disposition Decision: Recommend patient disposition (e.g., discharge, admit, transfer) informed by clinical presentation and history.
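
The official prompt templates for these tasks are distributed with the code repository [8]. As an illustration only, a triage-intake input can be assembled from the released columns roughly as follows (the prompt wording here is an assumption, not the benchmark's template):

```python
import pandas as pd

df = pd.read_csv("er_reason.csv")
row = df.iloc[0]

# Illustrative triage-intake prompt built from released columns; not the official template.
prompt = (
    "You are an emergency physician performing triage.\n"
    f"Chief complaint: {row['primarychiefcomplaintname']}\n"
    f"Age: {row['Age']}, Sex: {row['sex']}\n"
    "Assign an ESI acuity level from 1 (most urgent) to 5 (least urgent)."
)

reference_acuity = row["acuitylevel"]  # ground-truth ESI level recorded at triage
```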

Evaluation Metrics:
Each task specifies inputs, outputs, and evaluation metrics. Metrics include ROUGE/F1 for summaries, ICD-10 and Hierarchical Condition Category (HCC) accuracy for diagnoses, and clinical concept recall using Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). No preprocessing beyond de-identification was applied before release.
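
As an example of how two of these metrics might be computed, the sketch below uses the rouge-score package for ROUGE-L on one-line summaries and a simple set intersection for UMLS CUI recall (CUI extraction itself, e.g. with an entity linker, is not shown; the benchmark's reference implementation lives in the GitHub repository [8]):

```python
# pip install rouge-score
from rouge_score import rouge_scorer


def rouge_l(reference: str, prediction: str) -> float:
    """ROUGE-L F-measure between a reference one-liner and a model-generated summary."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, prediction)["rougeL"].fmeasure


def cui_recall(reference_cuis: set, predicted_cuis: set) -> float:
    """Fraction of reference UMLS CUIs that the model's output covers."""
    return len(reference_cuis & predicted_cuis) / len(reference_cuis) if reference_cuis else 0.0


print(rouge_l("Chest pain with elevated troponin",
              "Patient presenting with chest pain and troponin elevation"))
print(cui_recall({"C0000001", "C0000002"}, {"C0000001"}))  # placeholder CUI strings
```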

De-Identification:
All notes were de-identified by the University of California, San Francisco (UCSF) Information Commons using both HIPAA Safe Harbor and HIPAA Expert Determination methods. Safe Harbor standards were applied to remove or mask standard identifiers, while the Expert Determination method was additionally employed to allow for date-shifting.


Data Description

Unlike datasets that follow a standardized note format such as SOAP (subjective, objective, assessment, and plan), ER-REASON contains a variety of note types in unstructured formats, which adds complexity to clinical reasoning tasks. The dataset includes 395 unique chief complaints among admitted patients, with the most common being abdominal pain, shortness of breath, and chest pain. For each admitted ER patient, we include clinical notes from their previous hospital encounters, capturing the complexity and temporal progression of patient care across multiple visits. To our knowledge, ER-REASON is the first publicly released dataset providing longitudinal, multi-encounter clinical notes for ER patients, enabling evaluation of LLMs on realistic clinical reasoning tasks.

Each row in er_reason.csv represents a single patient encounter. Multiple notes from the same encounter are linked via unique identifiers, and historical notes (discharge summaries, progress notes, H&P, imaging, ECGs, consults, echocardiograms) are included alongside the current visit’s ED provider note. Expert-authored rationales are captured in the Rule_Out, Decision_Factors, and Treatment_Plan columns, providing step-by-step clinical reasoning related to the chief complaint, demographics, and one-line summary (One_Sentence_Extracted). Chief complaints are standardized and stored in primarychiefcomplaintname.

Column descriptions:

  • patientdurablekey: Unique patient identifier
  • encounterkey: Unique encounter identifier associated with the current ER visit
  • primarychiefcomplaintname: Chief complaint when the patient came into the ER
  • primaryeddiagnosisname: Diagnosis given by the ER physician at the end of the current ER visit
  • sex: Patient's sex
  • firstrace: Patient's race
  • preferredlanguage: Patient's preferred language
  • highestlevelofeducation: Patient's highest level of education
  • maritalstatus: Patient's marital status
  • Age: Patient's age
  • Discharge_Summary_Note_Key: Unique identifier linking to the historical discharge summary note
  • Progress_Note_Key: Unique identifier linking to the historical progress note
  • HP_Note_Key: Unique identifier linking to the historical history and physical note
  • Echo_Key: Unique identifier linking to the historical echocardiogram note
  • Imaging_Key: Unique identifier linking to the historical imaging note
  • Consult_Key: Unique identifier linking to the historical consult note
  • ED_Provider_Notes_Key: Unique identifier for the current visit's ED provider note
  • ECG_Key: Unique identifier linking to the historical ECG note
  • Discharge_Summary_Text: Historical: discharge summary text from the patient's previous hospital encounter
  • Progress_Note_Text: Historical: progress note text from the patient's previous hospital encounter
  • HP_Note_Text: Historical: history and physical note from the patient's previous hospital encounter
  • Echo_Text: Historical: echocardiogram results and interpretation from the patient's previous hospital encounter
  • Imaging_Text: Historical: imaging reports and findings from the patient's previous hospital encounter
  • Consult_Text: Historical: specialist consultation notes from the patient's previous hospital encounter
  • ECG_Text: Historical: electrocardiogram results and interpretation from the patient's previous hospital encounter
  • ED_Provider_Notes_Text: Current visit: ED provider note from the current ER visit (associated with this encounter, patient, chief complaint, and diagnosis)
  • One_Sentence_Extracted: Key one-liner summary extracted from the current ED provider note
  • note_count: Number of notes associated with the patient in this dataset (minimum 2: ED provider note and discharge summary; increases based on availability)
  • acuitylevel: ESI (Emergency Severity Index) level assigned at triage when the patient arrived at the ER
  • eddisposition: Disposition assigned when the patient left the ER (e.g., discharged, admitted, transferred)
  • ArrivalYearKey: Year the patient arrived at the ER for the current visit
  • DepartureYearKeyValue: Year the patient departed from the ER for the current visit
  • DepartureYearKey: Year the patient departed from the ER (key format)
  • DispositionYearKeyValue: Year the disposition was assigned
  • birthYear: Year the patient was born
  • Discharge_Summary_Year: Year the historical discharge summary was created
  • Progress_Note_Year: Year the historical progress note was created
  • HP_Note_Year: Year the historical history and physical note was created
  • Echo_Year: Year the historical echocardiogram was performed
  • Imaging_Year: Year the historical imaging was performed
  • Consult_Year: Year the historical consult was completed
  • ED_Provider_Notes_Year: Year the current ED provider note was created
  • ECG_Year: Year the historical ECG was performed
  • Rule_Out: Differential diagnosis list made by the physician given the chief complaint, demographics, and one-liner (acts as the physician's pre-encounter mental model)
  • Decision_Factors: Factors the physician would use to narrow down the differential list
  • Treatment_Plan: Factors and treatment plan the physician would choose given the history and physical
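
For orientation, a minimal sketch of pulling one encounter's longitudinal context and expert rationale out of these columns (column names per the table above; only a subset of encounters carries physician-authored rationales):

```python
import pandas as pd

df = pd.read_csv("er_reason.csv")
row = df.iloc[0]  # one row = one ER encounter

# Historical notes from previous encounters, keyed by note type.
history = {
    note_type: row[f"{note_type}_Text"]
    for note_type in ["Discharge_Summary", "Progress_Note", "HP_Note",
                      "Echo", "Imaging", "Consult", "ECG"]
    if pd.notna(row[f"{note_type}_Text"])
}

current_note = row["ED_Provider_Notes_Text"]  # ED provider note for the current visit

# Expert rationale fields, populated only for encounters with physician-authored rationales.
rationale = row[["Rule_Out", "Decision_Factors", "Treatment_Plan"]].dropna()
```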

Special Notes on Year Fields
Some date-related fields (e.g., ArrivalYearKey, DepartureYearKey) may contain the value 1970. This value does not indicate an actual event year, but instead reflects a default or placeholder (commonly a Unix epoch fallback). Users should interpret 1970 as indicating missing or unavailable date information.
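
A minimal way to handle this when loading the file is to map the placeholder to a missing value before any temporal analysis (an assumed convenience step, not an official preprocessing requirement):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("er_reason.csv")

# Treat the 1970 epoch fallback as missing in every year-valued column (names per the table above).
year_cols = [c for c in df.columns if "Year" in c]
df[year_cols] = df[year_cols].replace(1970, np.nan)
```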


Usage Notes

The ER-REASON dataset can be used for research purposes such as:

  • Benchmarking LLMs on ER note summarization and multi-encounter reasoning.
  • Studying clinical decision-making workflows in the emergency setting.
  • Training models to predict diagnoses, treatment plans, and patient disposition from sequential clinical notes.

Limitations:

  • The dataset is intended for research use only and not for clinical decision-making.
  • Users must comply with the dataset license and maintain data privacy standards.

While the notes are de-identified, they should not be used in any context that could attempt to re-identify patients.

For code, examples, and discussion of the dataset, see the associated GitHub repository [8].


Release Notes

The current version of ER-Reason is v1.0.0. This is the stable release, and the schema and structure are not expected to change. ER-Reason v1.0.0 follows semantic versioning guidelines, and represents the finalized version of the project.


Ethics

The collection and rigorous de-identification of patient information were conducted by the Information Commons team at the University of California, San Francisco (UCSF). Approval to share this dataset was granted by the institution’s compliance team.


Acknowledgements

We would like to thank the University of California, San Francisco (UCSF) Information Commons for their continued support of the ER-Reason project. In particular, we are grateful to Albert Lee for his invaluable assistance, as well as to Helena Mezgova and Ariel Deardorff for their guidance and oversight on compliance matters.


Conflicts of Interest

None to declare. 


References

  1. Guo D, Yang D, Zhang H, Song J, Zhang R, Xu R, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv [Preprint]. 2025 Jan 22. doi:10.48550/arXiv.2501.12948.
  2. Zhong T, Liu Z, Pan Y, Zhang Y, Zhou Y, Liang S, et al. Evaluation of OpenAI o1: Opportunities and Challenges of AGI. arXiv [Preprint]. 2025 Jul 7. doi:10.48550/arXiv.2409.18486.
  3. Mondillo G, Masino M, Colosimo S, Perrotta A, Frattolillo V. Evaluating AI reasoning models in pediatric medicine: A comparative analysis of O3-mini and O3-mini-high. medRxiv [Preprint]. 2025 Feb 27. doi:10.1101/2025.02.27.25323028.
  4. Kanzaria HK, Brook RH, Probst MA, Harris D, Berry SH, Hoffman JR. Emergency physician perceptions of shared decision-making. Acad Emerg Med. 2015;22(4):399–405.
  5. Croskerry P. A universal model of diagnostic reasoning. Acad Med. 2009;84(8):1022–8.
  6. Kellermann AL, Hsia RY, Yeh C, Morganti KG. Emergency care: then, now, and next. Health Aff (Millwood). 2013;32(12):2069–74.
  7. Hill RG Jr, Sears LM, Melanson SW. 4000 clicks: a productivity analysis of electronic medical records in a community hospital ED. Am J Emerg Med. 2013;31(11):1591–4.
  8. AlaaLab. ER-Reason [Internet]. GitHub; 2025. Available from: https://github.com/AlaaLab/ER-Reason/ [Accessed 29 Sept 2025].

Access

Access Policy:
Only credentialed users who sign the DUA can access the files. In addition, users must have individual studies reviewed by the contributor.

License (for files):
PhysioNet Contributor Review Health Data License 1.5.0

Data Use Agreement:
PhysioNet Contributor Review Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

Discovery

DOI (version 1.0.0):
https://doi.org/10.13026/55s7-3c27

DOI (latest version):
https://doi.org/10.13026/jrvj-k081

