Database Credentialed Access
Annotated Question-Answer Pairs for Clinical Notes in the MIMIC-III Database
Published: Jan. 15, 2021. Version: 1.0.0
When using this resource, please cite:
Yue, X., Zhang, X. F., & Sun, H. (2021). Annotated Question-Answer Pairs for Clinical Notes in the MIMIC-III Database (version 1.0.0). PhysioNet. https://doi.org/10.13026/j0y6-bw05.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Clinical question answering (QA) (or reading comprehension) aims to automatically answer questions from medical professionals based on clinical texts. We release this dataset, which contains 1287 annotated QA pairs on 36 sampled discharge summaries from MIMIC-III Clinical Notes, to facilitate the clinical question answering task. Questions in our dataset are either verified or directly generated by clinical experts.
Note that the primary purpose of this dataset is to test the generalizability of a QA model, i.e., whether a QA model that is trained on other datasets can answer questions on this dataset (which may have a different distribution compared with the training data), rather than to train a QA model. Hence the scale of our annotations is relatively small compared to some existing QA datasets.
Medical professionals often query over clinical notes in Electronic Medical Records (EMRs) to find information that can support their decision making. One way to facilitate such information-seeking activities is to build a natural language question-answering (QA) system that can extract precise answers from clinical notes.
Studies show that neural QA models trained on one corpus may not generalize well to new clinical texts from a different institution or a different patient group. More specifically, as pointed out in prior work, when a clinical QA model trained on the emrQA dataset is deployed to answer questions based on MIMIC-III clinical texts, its performance drops dramatically, by around 30%, even on questions that are similar to those seen in training. Such poor generalizability severely limits the real-world use of clinical question answering systems.
The existing clinical QA dataset was built for an in-domain testing setting. To facilitate generalizability (or out-of-domain) testing, we release this dataset, which contains 1287 annotated QA pairs on sampled MIMIC-III clinical notes. We hope that the release of this test dataset can help build QA systems that are less error-prone in real scenarios.
We first randomly sample 36 MIMIC-III clinical texts (discharge summaries) and give them to clinical experts (medical graduate students), who may ask any question about a text as long as its answer can be extracted from the context. To facilitate reproducibility, we include an ID for each clinical note, which corresponds to the ROW_ID field in MIMIC-III and is saved as "title" for each "paragraph" (see the Data Description section for more details). To reduce annotation effort, machine-generated QA pairs produced by neural question generation models (see our paper for more details) are provided as references. However, the experts are strongly encouraged to create new questions based on the given clinical text (these are marked as "human-generated"). If they find that a machine-generated question makes sense, sounds natural, and matches the associated answer evidence, they can keep it (these are marked as "human-verified"). After obtaining the annotated questions, we ask another clinical expert to do a final pass over them to further ensure the quality of the test set. The final test set consists of 1287 questions, of which 975 are "human-verified" and 312 are "human-generated".
We follow the data format of the SQuAD dataset. Under the SQuAD format, each clinical note consists of one or more contexts, each of which can give rise to multiple QA pairs.
test.final.json
├── "version"
└── "data"
    └── [i]
        ├── "paragraphs"
        │   └── [j]
        │       ├── "context"
        │       └── "qas"
        │           └── [k]
        │               ├── "answers"
        │               │   └── [l]
        │               │       ├── "answer_start"
        │               │       └── "text"
        │               ├── "id"
        │               └── "question"
        └── "title"
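The nesting above can be walked with a few loops. The following is a minimal sketch, not the authors' loader: the helper names `load_dataset` and `iter_qa_pairs` are hypothetical, the toy record is made up from the example text in this description, and it assumes that "answer_start" is a character offset into "context", as in SQuAD.

```python
import json
from typing import Any, Dict, Iterator


def load_dataset(path: str) -> Dict[str, Any]:
    """Read a SQuAD-style JSON file such as test.final.json (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_qa_pairs(dataset: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per (question, answer) pair."""
    for note in dataset["data"]:                  # [i]: one clinical note
        for paragraph in note["paragraphs"]:      # [j]: one context
            context = paragraph["context"]
            for qa in paragraph["qas"]:           # [k]: one question
                for answer in qa["answers"]:      # [l]: one answer span
                    yield {
                        "title": note.get("title"),
                        "id": qa["id"],
                        "question": qa["question"],
                        "answer_text": answer["text"],
                        "answer_start": answer["answer_start"],
                        "context": context,
                    }


# Toy stand-in for the real file, which requires credentialed access.
toy_context = (
    "hematocrit remained stable overnight. "
    "5. abd pain: suspect secondary to chronic pancreatitis. "
    "amylase unchanged from previous levels."
)
toy_answer = "5. abd pain: suspect secondary to chronic pancreatitis"
toy_dataset = {
    "version": "toy",
    "data": [{
        "title": "0",  # made-up value; the real field holds a MIMIC-III ROW_ID
        "paragraphs": [{
            "context": toy_context,
            "qas": [{
                "id": 0,
                "question": "Why did the patient get abd pain?",
                "answers": [{
                    # Assumption: character offset into the context, as in SQuAD.
                    "answer_start": toy_context.find(toy_answer),
                    "text": toy_answer,
                }],
            }],
        }],
    }],
}

records = list(iter_qa_pairs(toy_dataset))
```

With the real file, `iter_qa_pairs(load_dataset("test.final.json"))` would produce the same kind of records, provided the file follows the tree shown above.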
A running clinical QA example is as follows:
Context: ... he was guaiac negative on admission. hematocrit remained stable overnight. 5. abd pain: suspect secondary to chronic pancreatitis. amylase unchanged from previous levels. ...
Question: Why did the patient get abd pain?
Answer: 5. abd pain: suspect secondary to chronic pancreatitis
The "answer" in this example is stored in the "text" field of "answers" in the format above. An item (i.e., one QA pair) with an "id" between 0 and 974 (inclusive) is a human-verified question-answer pair. No items have an "id" between 975 and 999; these values are unused.
An item with an "id" between 1000 and 1311 (inclusive) is a human-generated question-answer pair.
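The id ranges described here translate directly into a small lookup. This is an illustrative sketch; the function name `annotation_type` is hypothetical, and because it is not stated whether "id" is stored as a number or a string, the value is passed through `int()` first.

```python
def annotation_type(qa_id) -> str:
    """Map a QA pair's "id" to its annotation type using the documented ranges."""
    n = int(qa_id)  # assumption: ids may be stored as ints or as numeric strings
    if 0 <= n <= 974:
        return "human-verified"
    if 1000 <= n <= 1311:
        return "human-generated"
    # 975-999 are unused, and nothing outside 0-1311 is documented.
    raise ValueError(f"id {n} is outside the documented ranges")


# Sanity check against the stated totals: 975 human-verified, 312 human-generated.
verified_count = sum(1 for i in range(0, 975) if annotation_type(i) == "human-verified")
generated_count = sum(1 for i in range(1000, 1312) if annotation_type(i) == "human-generated")
```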
Our test set is diverse and covers different types of questions, as partially reflected by the distribution of interrogative words (see below).
*others: questions with the rare interrogative words “were” or “have”
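A distribution of this kind can be approximated by counting the first word of each question. The sketch below is only an approximation under stated assumptions: the set `COMMON_WORDS` is a hypothetical choice (the categories actually used by the authors are not listed in this text), and taking the first token misses questions whose interrogative word appears later in the sentence.

```python
from collections import Counter
from typing import Iterable

# Hypothetical set of common question openers; anything else is counted as "others".
COMMON_WORDS = {"what", "when", "where", "which", "who", "why", "how",
                "is", "was", "did", "does", "has", "any"}


def interrogative_distribution(questions: Iterable[str]) -> Counter:
    """Count questions by their lowercased first word, folding rare openers into "others"."""
    counts: Counter = Counter()
    for question in questions:
        tokens = question.strip().lower().split()
        if not tokens:
            continue  # skip empty strings
        first = tokens[0]
        counts[first if first in COMMON_WORDS else "others"] += 1
    return counts


# Made-up questions for illustration only.
sample_questions = [
    "Why did the patient get abd pain?",
    "What medication was the patient discharged on?",
    "Were there any complications?",
    "why was the patient admitted?",
]
distribution = interrogative_distribution(sample_questions)
```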
For the data structure, please consult the Data Description section. To use the data for testing and to replicate the results reported in our paper, please check our GitHub repo for more details. A brief note is as follows:
- First, download the test set data and place it under the Data directory.
- For QA inference, specify the path to the data file for the deployed QA model by changing the data path argument.
- After the inference is carried out, run the evaluation scripts against our test set data to obtain F1 and Accuracy.
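For readers who want a rough idea of what the final scoring step involves, the sketch below computes exact match and token-overlap F1 in the style commonly used for SQuAD-format data. It is not the evaluation script from the repository; the function names are hypothetical, and the normalization (lowercasing, stripping punctuation and the articles a/an/the) is an assumption borrowed from the usual SQuAD convention, so scores may differ from those of the authors' scripts.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold_answer = "5. abd pain: suspect secondary to chronic pancreatitis"
f1_partial = token_f1("chronic pancreatitis", gold_answer)
em_partial = exact_match("chronic pancreatitis", gold_answer)
```

Averaging these per-question scores over all QA pairs gives dataset-level numbers; when a question has several reference answers, the usual convention is to take the maximum score over the references.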
The purpose of this dataset is to serve as a target domain and help examine the generalizability of a QA model solely trained on its source domain. In other words, our provided dataset (due to its limited size) should only be considered a test set without any exposure to the training stage.
We thank our experts for the annotations. This research was sponsored in part by the Patient Centered Outcomes Research Institute Funding ME-2017C1-6413, the Army Research Office under cooperative agreement W911NF-17-1-0412, NSF Grant IIS1815674, NSF CAREER #1942980, and the Ohio Supercomputer Center.
Conflicts of Interest
The authors have no conflicts of interest to declare.
- Yue, X., Gutierrez, B. J., & Sun, H. (2020). Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset. ACL 2020.
- Pampari, A., Raghavan, P., Liang, J., & Peng, J. (2018). emrQA: A Large Corpus for Question Answering on Electronic Medical Records. EMNLP 2018.
- Johnson, A. E., Pollard, T. J., Shen, L., Li-Wei, H. L., Feng, M., Ghassemi, M., ... & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific data, 3(1), 1-9.
- Yue, X., Zhang, X. F., Yao, Z., Lin, S., & Sun, H. (2020). CliniQG4QA: Generating Diverse Questions for Domain Adaptation of Clinical Question Answering. arXiv preprint arXiv:2010.16021.
- Yue, X., Zhang, X. F., Yao, Z., Lin, S., & Sun, H. (2020). Code repository for the CliniQG4QA project. Website. https://github.com/sunlab-osu/CliniQG4QA [Accessed on 30 Dec 2020]
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
CITI Data or Specimens Only Research