Database Credentialed Access

# Annotated Question-Answer Pairs for Clinical Notes in the MIMIC-III Database

Published: Jan. 15, 2021. Version: 1.0.0

Yue, X., Zhang, X. F., & Sun, H. (2021). Annotated Question-Answer Pairs for Clinical Notes in the MIMIC-III Database (version 1.0.0). PhysioNet. https://doi.org/10.13026/j0y6-bw05.

Yue, X., Zhang, X. F., Yao, Z., Lin, S., & Sun, H. (2020). CliniQG4QA: Generating Diverse Questions for Domain Adaptation of Clinical Question Answering. arXiv preprint arXiv:2010.16021.

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

## Abstract

Clinical question answering (QA) (or reading comprehension) aims to automatically answer questions from medical professionals based on clinical texts. We release this dataset, which contains 1287 annotated QA pairs on 36 sampled discharge summaries from MIMIC-III Clinical Notes, to facilitate the clinical question answering task. Questions in our dataset are either verified or directly generated by clinical experts.

Note that the primary purpose of this dataset is to test the generalizability of a QA model, i.e., whether a QA model that is trained on other datasets can answer questions on this dataset (which may have a different distribution compared with the training data), rather than to train a QA model. Hence the scale of our annotations is relatively small compared to some existing QA datasets.

## Background

Medical professionals often query over clinical notes in Electronic Medical Records (EMRs) to find information that can support their decision making. One way to facilitate such information-seeking activities is to build a natural language question-answering (QA) system that can extract precise answers from clinical notes.

Studies show that neural QA models trained on one corpus may not generalize well to new clinical texts from a different institution or a different patient group [1]. More specifically, as pointed out in [1], when a clinical QA model trained on the emrQA dataset [2] is deployed to answer questions on MIMIC-III clinical texts [3], its performance drops dramatically, by around 30%, even on questions similar to those seen during training. Poor generalizability severely limits the real-world use of clinical question answering systems.

The existing clinical QA dataset [2] was built for the in-domain testing setting. To facilitate generalizability (out-of-domain) testing, we release this dataset, which contains 1287 annotated QA pairs on sampled MIMIC-III clinical notes. We hope that the release of this test dataset can help build QA systems that are less error-prone in real scenarios.

## Methods

We first randomly sampled 36 MIMIC-III clinical texts (discharge summaries) and gave them to clinical experts (medical graduate students), who could ask any question as long as its answer could be extracted from the given context. To facilitate reproducibility, we include an ID for each clinical note, which corresponds to the ROW_ID field in MIMIC-III and is saved as the "title" of each "paragraph" (see the Data Description section for more details).

To save annotation effort, machine-generated QA pairs produced by neural question generation models (see our paper [4] for more details) were provided as references. The experts were highly encouraged to create new questions based on the given clinical text (marked as "human-generated"), but if they found a machine-generated question natural, sensible, and matched to the associated answer evidence, they could keep it (marked as "human-verified"). After obtaining the annotated questions, we asked another clinical expert to do a final pass over the questions to further guarantee the quality of the test set. The final test set consists of 1287 questions (975 "human-verified" and 312 "human-generated").

## Data Description

We follow the data format of the SQuAD dataset. Under the SQuAD format, each clinical note consists of one or more contexts, each of which can be associated with multiple QA pairs.

test.final.json
├── "version"
└── "data"
    └── [i]
        ├── "paragraphs"
        │   └── [j]
        │       ├── "context"
        │       │
        │       └── "qas"
        │           └── [k]
        │               ├── "answers"
        │               │   └── [l]
        │               │       └── "text"
        │               │
        │               ├── "id"
        │               │
        │               └── "question"
        │
        └── "title"
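As an illustration, the nested structure above can be traversed with a short Python sketch. The field names ("data", "title", "paragraphs", "context", "qas", "answers", "text", "id", "question") come from the layout above; the function names are our own.

```python
import json


def load_dataset(path):
    """Load a SQuAD-format JSON file such as test.final.json."""
    with open(path) as f:
        return json.load(f)


def iter_qa_pairs(dataset):
    """Yield (note_id, qa_id, question, answer_texts) from a SQuAD-format dict."""
    for article in dataset["data"]:
        note_id = article["title"]  # ROW_ID of the source MIMIC-III note
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = [a["text"] for a in qa["answers"]]
                yield note_id, qa["id"], qa["question"], answers
```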


A running clinical QA example is as follows:

Context: ... he was guaiac negative on admission. hematocrit remained stable overnight. 5. abd pain: suspect secondary to chronic pancreatitis. amylase unchanged from previous levels. ...

Question: Why did the patient get abd pain?

Answer: 5. abd pain: suspect secondary to chronic pancreatitis

The "answer" in this example appears in the "text" field of "answers" in the format above. Items (i.e., QA pairs) with an "id" between 0 and 974 (inclusive) are human-verified question-answer pairs. Note that there are no items with an "id" between 975 and 999; these ids are unused.

Items with an "id" between 1000 and 1311 (inclusive) are human-generated question-answer pairs.
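Under the documented id ranges, the two annotation sources can be recovered from an item's "id" alone. A minimal sketch (the helper name `annotation_source` is our own):

```python
def annotation_source(qa_id):
    """Map a QA item's "id" to its annotation source per the documented ranges:
    0-974 are human-verified, 975-999 are unused, 1000-1311 are human-generated."""
    qa_id = int(qa_id)
    if 0 <= qa_id <= 974:
        return "human-verified"
    if 1000 <= qa_id <= 1311:
        return "human-generated"
    return "undefined"
```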

Our test set is diverse, covering different types of questions, as partially reflected by the distribution of interrogative words (see below).

| what | when | has | was | why | how | is | did | can | any | does | others | total |
|------|------|-----|-----|-----|-----|-----|-----|-----|-----|------|--------|-------|
| 235  | 21   | 249 | 66  | 39  | 41  | 124 | 15  | 52  | 92  | 351  | 2      | 1287  |

*others: questions with rare interrogative words ("were" or "have")
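A distribution like the one above can be reproduced by bucketing each question by its leading word. A simple sketch (the function name and the grouping of rare words under "others" follow the table's convention):

```python
from collections import Counter


def interrogative_distribution(questions):
    """Count questions by their leading (lowercased) word,
    grouping rare interrogative words under "others"."""
    known = {"what", "when", "has", "was", "why", "how",
             "is", "did", "can", "any", "does"}
    counts = Counter()
    for q in questions:
        first = q.strip().lower().split()[0]
        counts[first if first in known else "others"] += 1
    return counts
```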

## Usage Notes

For the data structure, please consult the Data Description section. To use the data for testing and replicate the results reported in our paper [4], please check our GitHub repo [5] for more details. A brief summary:

• First, download the test set data and place it under the Data directory.
• For QA inference, point the deployed QA model at the data file by changing the data path argument.
• After inference, run the evaluation scripts against our test set data to obtain F1 and Accuracy.
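The official evaluation scripts are in the GitHub repo [5]. As a rough illustration only, a SQuAD-style token-overlap F1 between a predicted span and a gold span can be sketched as follows (simplified; real SQuAD evaluation also normalizes punctuation and articles):

```python
from collections import Counter


def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer span
    (simplified SQuAD-style metric: lowercase, whitespace tokenization)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```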

The purpose of this dataset is to serve as a target domain and help examine the generalizability of a QA model trained solely on its source domain. In other words, our dataset (due to its limited size) should be used only as a test set, without any exposure during the training stage.

## Acknowledgements

We thank our experts for the annotations. This research was sponsored in part by the Patient Centered Outcomes Research Institute Funding ME-2017C1-6413, the Army Research Office under cooperative agreements W911NF-17-1-0412, NSF Grant IIS1815674, NSF CAREER #1942980, and Ohio Supercomputer Center.

## Conflicts of Interest

The authors have no conflicts of interest to declare.

## References

1. Yue, X., Gutierrez, B. J., & Sun, H. (2020). Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset. ACL 2020.
2. Pampari, A., Raghavan, P., Liang, J., Peng, J. (2018). emrQA: A Large Corpus for Question Answering on Electronic Medical Records. EMNLP 2018.
3. Johnson, A. E., Pollard, T. J., Shen, L., Li-Wei, H. L., Feng, M., Ghassemi, M., ... & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific data, 3(1), 1-9.
4. Yue, X., Zhang, X. F., Yao, Z., Lin, S., & Sun, H. (2020). CliniQG4QA: Generating Diverse Questions for Domain Adaptation of Clinical Question Answering. arXiv preprint arXiv:2010.16021.
5. Yue, X., Zhang, X. F., Yao, Z., Lin, S., & Sun, H. (2020). Code repository for the CliniQG4QA project. Website. https://github.com/sunlab-osu/CliniQG4QA [Accessed on 30 Dec 2020]

##### Parent Projects
Annotated Question-Answer Pairs for Clinical Notes in the MIMIC-III Database was derived from the MIMIC-III Clinical Database [3]. Please cite it when using this project.
##### Access

Access Policy:
Only credentialed users who sign the DUA can access the files.