Database Credentialed Access

EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings

Sunjun Kweon Jiyoun Kim Heeyoung Kwak Dongchul Cha Hangyul Yoon Kwang Hyun Kim Seunghyun Won Edward Choi

Published: April 3, 2024. Version: 1.0.0


When using this resource, please cite:
Kweon, S., Kim, J., Kwak, H., Cha, D., Yoon, H., Kim, K. H., Won, S., & Choi, E. (2024). EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings (version 1.0.0). PhysioNet. https://doi.org/10.13026/kvca-f224.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

We introduce EHRNoteQA, a patient-specific question answering benchmark for evaluating Large Language Models (LLMs) in clinical settings. Building on the MIMIC-IV Electronic Health Record (EHR) database, a team of three medical professionals curated a dataset of 962 unique questions, each linked to a specific patient's EHR clinical notes. Our evaluation of a range of LLMs showed that their scores on EHRNoteQA correlate more closely with their clinician-rated performance on real-world medical questions than their scores on other LLM benchmarks. This underscores the value of EHRNoteQA for assessing LLMs for medical use and its contribution toward incorporating LLMs into healthcare infrastructure.


Background

Existing benchmarks for question answering over EHR clinical notes [2-5] primarily rely on extracting text spans from the notes as answers, assessing models via F1 and Exact Match scores. This approach, while effective for extractive models such as BERT [6], falls short for generative LLMs, which produce more nuanced and detailed responses. Limiting answers to specific text spans also hinders the development of the complex questions vital to real medical contexts, which often require synthesizing information from multiple clinical notes. In this work, we propose EHRNoteQA, a patient-specific EHR QA dataset built on MIMIC-IV discharge summaries [1], inspected by clinicians and reflecting real-world medical scenarios. Our dataset is unique in requiring reference to two or more clinical notes to answer a single question. Moreover, by employing a multiple-choice format, the dataset serves as a clinical benchmark that enables accurate and consistent automatic evaluation of LLMs.


Methods

The EHRNoteQA dataset was constructed in three phases: Document Sampling, Question-Answer Generation, and Clinician Modification. In the Document Sampling phase, we randomly selected patients' Electronic Health Record (EHR) clinical notes from the publicly available MIMIC-IV EHR database, focusing on discharge summaries across multiple admissions. In the Question-Answer Generation phase, we used GPT-4 [7] (version gpt-4-0613) to generate a unique multiple-choice question-answering instance tailored to each patient's EHR. This step was conducted on Azure's HIPAA-compliant platform [8], with GPT-4 set to a temperature of 1 and default values for all other parameters. In the final Clinician Modification phase, a team of three medical professionals reviewed and refined the generated questions and answer choices for accuracy and relevance to the clinical context.
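To make the generation step concrete, below is a minimal sketch of how a multiple-choice question could be drafted from a patient's discharge summaries with GPT-4 on Azure. Only the model version (gpt-4-0613) and the temperature of 1 come from the description above; the deployment name, API version, and prompt wording are illustrative assumptions, not the authors' exact setup.

```python
# Sketch of the Question-Answer Generation phase (assumptions noted inline).
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",                      # assumed API version
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

def generate_qa(discharge_summaries: list[str]) -> str:
    """Ask GPT-4 to draft one multiple-choice question grounded in the notes."""
    notes = "\n\n".join(
        f"[Discharge Summary {i + 1}]\n{note}"
        for i, note in enumerate(discharge_summaries)
    )
    response = client.chat.completions.create(
        model="gpt-4",    # Azure deployment of gpt-4-0613 (assumed name)
        temperature=1,    # as reported in the Methods section
        messages=[
            {"role": "system",
             "content": "You write patient-specific multiple-choice questions "
                        "(five options, one correct) from EHR discharge summaries."},
            {"role": "user", "content": notes},
        ],
    )
    return response.choices[0].message.content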


Data Description

The EHRNoteQA dataset is divided into two levels, Level1 and Level2, to accommodate the varying context lengths that current large language models (LLMs) can process for lengthy clinical notes. Level1 targets models supporting up to a 4k-token context, covering cases where the total token length of a patient's discharge summaries is below 3,500; these cases span one to two hospital admissions (one to two discharge summaries). Level2 targets models that can handle up to an 8k-token context, covering cases where the total token length is below 7,500 and spanning one to three admissions (one to three discharge summaries). Cases that would require models to manage significantly longer contexts are not included in this study. The table below details the dataset distribution across levels.

| Category | # of Discharge Summaries per Patient | # of Questions | Total # of Discharge Summaries |
|----------|--------------------------------------|----------------|--------------------------------|
| Level1   | 1                                    | 264            | 264                            |
| Level1   | 2                                    | 265            | 530                            |
| Level2   | 1                                    | 145            | 145                            |
| Level2   | 2                                    | 144            | 288                            |
| Level2   | 3                                    | 144            | 432                            |
| Total    |                                      | 962            | 1,659                          |
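The level assignment described above can be expressed as a short function. The sketch below uses tiktoken's cl100k_base tokenizer to count tokens, which is an assumption for illustration; the token thresholds and admission counts come from the description above.

```python
# Sketch of the Level1/Level2 assignment rule (tokenizer choice is assumed).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def assign_level(discharge_summaries: list[str]) -> str | None:
    """Map a patient's notes to a level by total token length."""
    total_tokens = sum(len(enc.encode(note)) for note in discharge_summaries)
    if total_tokens < 3500 and len(discharge_summaries) <= 2:
        return "level1"   # fits models with a 4k context window
    if total_tokens < 7500 and len(discharge_summaries) <= 3:
        return "level2"   # fits models with an 8k context window
    return None           # longer contexts are out of scope for this dataset
```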

Usage Notes

The EHRNoteQA dataset file, EHRNoteQA.jsonl, contains 962 records, each representing a unique patient. Each record is a JSON line with the following structure (a loading sketch follows the list):

  • category: "level1" or "level2" (the level to which the record belongs)
  • num_notes: 1, 2, or 3 (the number of discharge summaries associated with the patient for this record)
  • patient_id: a unique patient identifier linked to the "subject_id" field in MIMIC-IV notes; to use the EHRNoteQA data, you must obtain the corresponding discharge summaries from MIMIC-IV
  • clinician: a, b, or c (the clinician who reviewed and edited the record)
  • question: the question provided for this record
  • choice_A, choice_B, choice_C, choice_D, choice_E: the five answer choices for the question
  • Answer: A, B, C, D, or E (the correct choice)
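The following is a minimal sketch of loading EHRNoteQA.jsonl and attaching each record's discharge summaries from MIMIC-IV-Note. The local file paths are assumptions about your layout, and the discharge.csv.gz column names (subject_id, charttime, text) should be verified against your copy of MIMIC-IV-Note.

```python
# Sketch: join EHRNoteQA records with MIMIC-IV discharge summaries.
import json
import pandas as pd

# Load the 962 question records.
with open("EHRNoteQA.jsonl") as f:
    records = [json.loads(line) for line in f]

# Load discharge summaries from the MIMIC-IV-Note module (path is assumed).
notes = pd.read_csv("discharge.csv.gz")

def notes_for(record: dict) -> list[str]:
    """Return the patient's discharge summaries in chronological order."""
    rows = notes[notes["subject_id"] == record["patient_id"]]
    rows = rows.sort_values("charttime")
    # The number of rows should match the record's num_notes field.
    return rows["text"].tolist()

example = records[0]
print(example["question"])
for letter in "ABCDE":
    print(letter, example[f"choice_{letter}"])
print("Correct answer:", example["Answer"])
```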

For detailed usage, please refer to the GitHub repository [9].
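Because the benchmark is multiple-choice, model accuracy can be computed automatically. The sketch below assembles a prompt from a record and its notes and checks the model's answer letter; the prompt template, answer parsing, and the get_notes and query_llm callables are hypothetical stand-ins, not the authors' exact evaluation protocol.

```python
# Sketch: automatic multiple-choice scoring (prompt format is assumed).
import re

def build_prompt(record: dict, summaries: list[str]) -> str:
    """Assemble clinical notes, question, and choices into one prompt."""
    notes = "\n\n".join(summaries)
    choices = "\n".join(f"{c}. {record[f'choice_{c}']}" for c in "ABCDE")
    return (f"{notes}\n\nQuestion: {record['question']}\n{choices}\n"
            "Respond with a single letter (A-E).")

def accuracy(records, get_notes, query_llm) -> float:
    """get_notes(record) -> list[str]; query_llm(prompt) -> model reply (str)."""
    correct = 0
    for rec in records:
        reply = query_llm(build_prompt(rec, get_notes(rec)))
        match = re.search(r"\b([A-E])\b", reply)
        correct += bool(match and match.group(1) == rec["Answer"])
    return correct / len(records)
```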


Release Notes

1.0.0 - Initial Release


Ethics

The authors have no ethics statement to declare.


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Johnson, A., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet. https://doi.org/10.13026/1n74-ne17.
  2. Raghavan, P., Patwardhan, S., Liang, J. J., & Devarakonda, M. V. (2018). Annotating electronic medical records for question answering. arXiv preprint arXiv:1805.06816.
  3. Pampari, A., Raghavan, P., Liang, J., & Peng, J. (2018, October). emrQA: A large corpus for question answering on electronic medical records. In Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  4. Yue, X., Zhang, X. F., Yao, Z., Lin, S., & Sun, H. (2021, December). CliniQG4QA: Generating diverse questions for domain adaptation of clinical question answering. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 580-587). IEEE.
  5. Soni, S., Gudala, M., Pajouhi, A., & Roberts, K. (2022, June). RadQA: A Question Answering Dataset to Improve Comprehension of Radiology Reports. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 6250-6259).
  6. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics.
  7. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., ... & McGrew, B. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  8. Microsoft Azure: https://learn.microsoft.com/en-us/azure/compliance/offerings/offering-hipaa-us [Accessed 3/25/2024]
  9. GitHub: github.com/ji-youn-kim/EHRNoteQA [Accessed 3/25/2024]

Parent Projects
EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings was derived from the following parent projects; please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

