Database Credentialed Access

EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings

Sunjun Kweon Jiyoun Kim Heeyoung Kwak Dongchul Cha Hangyul Yoon Kwang Hyun Kim Seunghyun Won Edward Choi

Published: April 3, 2024. Version: 1.0.0

When using this resource, please cite:
Kweon, S., Kim, J., Kwak, H., Cha, D., Yoon, H., Kim, K. H., Won, S., & Choi, E. (2024). EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings (version 1.0.0). PhysioNet.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Abstract

We introduce EHRNoteQA, a patient-specific question answering benchmark tailored for evaluating Large Language Models (LLMs) in clinical environments. Based on MIMIC-IV Electronic Health Records (EHR), a team of three medical professionals curated a dataset comprising 962 unique questions, each linked to a specific patient's EHR clinical notes. Our comprehensive evaluation of various LLMs showed that their scores on EHRNoteQA correlate more closely with their clinician-evaluated performance on real-world medical questions than their scores on other LLM benchmarks do. This emphasizes the importance of EHRNoteQA in assessing LLMs for medical purposes and underscores its contribution to incorporating LLMs into healthcare infrastructure.


Background

Existing benchmarks for question answering based on EHR clinical notes [1-5] primarily rely on extracting textual spans from the notes as answers, assessing models via F1 and Exact Match scores. This approach, while effective for extractive models like BERT [6], falls short for generative LLMs that produce more nuanced and detailed responses. Limiting answers to specific text spans also hinders the development of the complex questions vital in real medical contexts, which often require synthesizing information from multiple clinical notes. In this work, we propose EHRNoteQA, a patient-specific EHR QA dataset built on MIMIC-IV discharge summaries [1], inspected by clinicians and reflecting real-world medical scenarios. Our dataset is unique in requiring references to two or more clinical notes to answer a single question. Moreover, by employing a multiple-choice format, our dataset serves as a clinical benchmark that enables accurate and consistent automatic evaluation of LLMs.
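Because each question pairs five labeled choices with a single gold letter, scoring a model reduces to matching a predicted letter against the answer key. The sketch below illustrates this; the letter-extraction heuristic is our own illustrative assumption, not the benchmark's official scoring code:

```python
import re

def extract_choice(model_output: str):
    """Pull the first standalone uppercase answer letter (A-E) from a
    model's free-text reply; return None if no letter is found."""
    match = re.search(r"\b([A-E])\b", model_output)
    return match.group(1) if match else None

def accuracy(predictions, golds):
    """Fraction of questions where the extracted letter matches the gold answer."""
    correct = sum(extract_choice(p) == g for p, g in zip(predictions, golds))
    return correct / len(golds)
```

In practice, more robust extraction (e.g. prompting the model to answer with a single letter) may be preferable, since free-text replies can mention several choice letters.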


Methods

The EHRNoteQA dataset was constructed in three phases: Document Sampling, Question-Answer Generation, and Clinician Modification. During Document Sampling, we randomly selected patients' Electronic Health Record (EHR) clinical notes from the publicly available MIMIC-IV database, focusing on discharge summaries across multiple admissions. In Question-Answer Generation, leveraging the advanced capabilities of GPT-4 [7] (version gpt-4-0613), we generated a unique multiple-choice question-answering dataset tailored to each patient's specific EHR. This process was conducted on Azure's HIPAA-compliant platform [8], employing GPT-4 with a temperature setting of 1 and default settings for all other parameters. The final phase, Clinician Modification, involved a team of three medical professionals reviewing and refining the generated questions and answer choices for accuracy and relevance to the clinical context.
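The generation step can be pictured as assembling one prompt per patient from their discharge summaries and sending it to GPT-4. The prompt wording below is a hypothetical sketch (the actual prompt used for EHRNoteQA is not reproduced here); the Azure call itself is indicated only as a comment:

```python
def build_generation_prompt(discharge_summaries):
    """Assemble one prompt asking the model to write a multiple-choice
    question that draws on every supplied note for a single patient."""
    notes = "\n\n".join(
        f"[Discharge Summary {i}]\n{text}"
        for i, text in enumerate(discharge_summaries, start=1)
    )
    return (
        "You are given the discharge summaries of one patient.\n\n"
        f"{notes}\n\n"
        "Write one clinically relevant multiple-choice question about this "
        "patient with five choices (A-E), and indicate the correct answer. "
        "The question should require information from the notes above."
    )

# The resulting prompt would then be sent to GPT-4 (gpt-4-0613) on Azure's
# HIPAA-compliant endpoint with temperature=1, e.g. via the Azure OpenAI SDK.
```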

Data Description

The EHRNoteQA dataset is divided into two levels, Level1 and Level2, to accommodate the varying context lengths that current large language models (LLMs) can process when given lengthy clinical notes. Level1 targets models supporting up to a 4k context length, containing cases where the total token length of a patient's discharge summaries is below 3,500; each case covers one or two hospital admissions (one or two discharge summaries). Level2 targets models that can handle up to an 8k context length, containing cases where the total token length is below 7,500 and covering one to three admissions (one to three discharge summaries). Cases requiring models to manage significantly longer contexts are not included in this study. For the dataset distribution across levels, refer to the table below.

Category   # of Discharge Summaries per Patient   # of Questions   Total # of Discharge Summaries
Level1     1                                      264              264
Level1     2                                      265              530
Level2     1                                      145              145
Level2     2                                      144              288
Level2     3                                      144              432
Total                                             962              1,659
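A minimal sketch of how the two levels partition records by total token length, reading the ranges above as disjoint (Level2 spans 3,500 to 7,500 tokens). The tokenizer used to count tokens is not specified here, so the function takes a precomputed total:

```python
def assign_level(total_tokens: int):
    """Bucket a patient's record by the combined token length of their
    discharge summaries; returns None for cases excluded from EHRNoteQA."""
    if total_tokens < 3500:
        return "level1"   # evaluable by models with a 4k context window
    if total_tokens < 7500:
        return "level2"   # requires a model with an 8k context window
    return None           # significantly longer cases are excluded
```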

Usage Notes

The EHRNoteQA dataset file, EHRNoteQA.jsonl, contains 962 records, each representing a unique patient. Each record is a single JSON line with the following fields:

  • category : level1 or level2 (Indicates which category the record belongs to)
  • num_notes : 1, 2, or 3 (Specifies the total number of discharge summaries associated with the patient for this record)
  • patient_id : A unique patient identifier linked to the "subject_id" field in MIMIC-IV notes. To use the EHRNoteQA data, one must obtain the corresponding discharge summaries from MIMIC-IV.
  • clinician : a, b, or c (Identifies which clinician reviewed and edited the data)
  • question : The question provided for this record
  • choice_A, choice_B, choice_C, choice_D, choice_E : Five answer choices for the question
  • Answer : A, B, C, D, or E (the letter of the correct choice)
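Given the record structure above, loading the file and rendering a record as a multiple-choice prompt is straightforward. A minimal sketch (the patient's discharge summaries, fetched from MIMIC-IV via patient_id/subject_id, would be prepended to the rendered question before sending it to a model):

```python
import json

def load_ehrnoteqa(path):
    """Read EHRNoteQA.jsonl into a list of record dicts, one per patient."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def format_question(record):
    """Render one record's question and its five choices as plain text."""
    choices = "\n".join(
        f"{letter}. {record[f'choice_{letter}']}" for letter in "ABCDE"
    )
    return f"{record['question']}\n{choices}"
```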

For detailed usage, please refer to the GitHub repository [9].

Release Notes

1.0.0 - Initial Release


Ethics

The authors have no ethics statement to declare.

Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Johnson, A., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet.
  2. Raghavan, P., Patwardhan, S., Liang, J. J., & Devarakonda, M. V. (2018). Annotating electronic medical records for question answering. arXiv preprint arXiv:1805.06816.
  3. Pampari, A., Raghavan, P., Liang, J., & Peng, J. (2018, October). EMRQA: A large corpus for question answering on electronic medical records. In Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  4. Yue, X., Zhang, X. F., Yao, Z., Lin, S., & Sun, H. (2021, December). Cliniqg4qa: Generating diverse questions for domain adaptation of clinical question answering. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 580-587). IEEE.
  5. Soni, S., Gudala, M., Pajouhi, A., & Roberts, K. (2022, June). RadQA: A Question Answering Dataset to Improve Comprehension of Radiology Reports. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 6250-6259).
  6. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171-4186). Association for Computational Linguistics.
  7. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., ... & McGrew, B. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  8. Microsoft Azure: [Accessed 3/25/2024]
  9. Github: [Accessed 3/25/2024]

Parent Projects
EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings was derived from other PhysioNet projects. Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
