Database Credentialed Access

# MedNLI for Shared Task at ACL BioNLP 2019

Published: Nov. 28, 2019. Version: 1.0.1

Shivade, C. (2019). MedNLI for Shared Task at ACL BioNLP 2019 (version 1.0.1). PhysioNet. https://doi.org/10.13026/gtv4-g455.

Abacha, A. B., Shivade, C., & Demner-Fushman, D. (2019, August). Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task (pp. 370-379).

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

## Abstract

Natural Language Inference (NLI) is the task of determining whether a given hypothesis can be inferred from a given premise. Also known as Recognizing Textual Entailment (RTE), this task has enjoyed popularity among researchers for some time. However, almost all datasets for this task have focused on open-domain data such as news texts and blogs. To address this gap, the MedNLI dataset was created for language inference in the medical domain. MedNLI is a derived dataset with data sourced from MIMIC-III v1.4. To stimulate research on this problem, a shared task on Medical Inference and Question Answering (MEDIQA) was organized at the workshop for biomedical natural language processing (BioNLP) 2019. The dataset provided here is a test set of 405 premise-hypothesis pairs for the NLI challenge in the MEDIQA shared task. Participants of the shared task were expected to use the MedNLI data to develop their models; this dataset served as an unseen test set for scoring each submission.

## Background

Natural Language Inference (NLI) is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”. NLI has been extremely popular among NLP researchers in the past few years. The Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015) is a large, high quality dataset and serves as a benchmark to evaluate NLI systems. However, it is restricted to a single text genre (Flickr image captions) and mostly consists of short and simple sentences. The MultiNLI corpus (Williams et al., 2018) introduced NLI corpora from multiple genres (e.g. fiction, travel) addressing this limitation. However, inferences in specialized domains such as medicine are more nuanced and require specialized knowledge. Owing to high costs of annotation and barriers in data access, the clinical NLP community lacks large labeled datasets to train modern data-intensive models for end-to-end tasks such as NLI.

## Methods

This dataset was created following the same annotation protocol as for MedNLI. Sentences from the Past Medical History section of clinical notes from MIMIC-III were segmented out using a simple rule-based program. Clinicians were then shown a premise sentence and asked to generate three sentences: (1) a hypothesis that is definitely true about the patient given the premise, (2) a hypothesis that is definitely false about the patient given the premise, and (3) a hypothesis that may be true about the patient given the premise. The inter-annotator agreement for MedNLI was a Cohen's kappa of 0.78 on a subset of 500 premise-hypothesis pairs. Additional details, such as the exact annotation prompt, can be found in Romanov and Shivade, 2018 [1]. The model implementations are also available on GitHub (https://github.com/jgc128/mednli).

## Data Description

This test set consists of 405 premise-hypothesis pairs curated by the same clinicians who created the original MedNLI dataset. It can be viewed as an additional test set for the MedNLI data, created for the BioNLP 2019 shared task. The premises in this dataset do not overlap with the premises in MedNLI. Participants interested in the shared task should register on the AIcrowd (https://www.aicrowd.com/) platform and follow the instructions for submitting a system. Each run will be evaluated using accuracy as the performance metric, following the evaluation script on GitHub (https://github.com/abachaa/MEDIQA2019/blob/master/Eval_Scripts/mediqa_evaluator_tasks_1_2.py).
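Since accuracy is simply the fraction of pairs classified correctly, submissions can be sanity-checked locally before uploading. A minimal sketch (the label strings follow the entailment/contradiction/neutral convention described below; the official evaluator linked above remains authoritative):

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of pairs where the predicted class matches the gold class."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label lists must be the same length")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)

# Example: 3 of 4 predictions agree with the gold labels.
gold = ["entailment", "contradiction", "neutral", "entailment"]
pred = ["entailment", "contradiction", "entailment", "entailment"]
print(accuracy(gold, pred))  # 0.75
```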

## Usage Notes

The clinical notes from the NOTEEVENTS table of MIMIC-III (v1.4) are the source of the premise statements in this dataset [2]. More specifically, each note was segmented into sections, and sentences from the "past medical history" section were randomly sampled. The dataset is in JSON lines format and follows exactly the same format as the SNLI and MultiNLI datasets. Each record of this test set is a JSON line with the following structure:

1. gold_label - entailment, contradiction, or neutral (redacted, since this is a test set)
2. sentence1 - the premise statement
3. sentence2 - the hypothesis statement
4. sentence1_parse - the constituency parse of the premise produced by the Stanford parser
5. sentence2_parse - the constituency parse of the hypothesis produced by the Stanford parser
6. sentence1_binary_parse - the binary parse of the premise produced by the Stanford parser
7. sentence2_binary_parse - the binary parse of the hypothesis produced by the Stanford parser

A sample record from the training set is shown below:

```json
{
  "sentence1": "Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.",
  "pairID": "23eb94b8-66c7-11e7-a8dc-f45c89b91419",
  "sentence1_parse": "(ROOT (S (NP (NNPS Labs)) (VP (VBD were) (ADJP (JJ notable) (PP (IN for) (NP (NP (NP (NN Cr) (CD 1.7)) (PRN (-LRB- -LRB-) (NP (NP (NN baseline) (CD 0.5)) (PP (IN per) (NP (JJ old) (NNS records)))) (-RRB- -RRB-))) (CC and) (NP (NN lactate) (CD 2.4)))))) (. .)))",
  "sentence1_binary_parse": "( Labs ( ( were ( notable ( for ( ( ( ( Cr 1.7 ) ( -LRB- ( ( ( baseline 0.5 ) ( per ( old records ) ) ) -RRB- ) ) ) and ) ( lactate 2.4 ) ) ) ) ) . ) )",
  "sentence2": " Patient has elevated Cr",
  "sentence2_parse": "(ROOT (S (NP (NN Patient)) (VP (VBZ has) (NP (JJ elevated) (NN Cr)))))",
  "sentence2_binary_parse": "( Patient ( has ( elevated Cr ) ) )",
  "gold_label": "entailment"
}
```

The goal of the task is to classify a given premise-hypothesis pair into one of the three classes: entailment, contradiction, or neutral.

## Acknowledgements

We would like to thank Adam Coy and Chanida Thammachart for their help in curating this dataset. We would also like to thank Vandana Mukherjee for supporting this project.

## Conflicts of Interest

The authors have no conflicts of interest to declare.

## References

1. Romanov, A., & Shivade, C. (2018). Lessons from natural language inference in the clinical domain. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 1586-1596).
2. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. MIMIC-III, a freely accessible critical care database. Scientific Data (2016). https://doi.org/10.1038/sdata.2016.35.

##### Parent Projects
MedNLI for Shared Task at ACL BioNLP 2019 was derived from MIMIC-III [2]. Please cite the parent project when using this dataset.
##### Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
