
ShARe/CLEF eHealth Evaluation Lab 2014 (Task 2): Disorder Attributes in Clinical Reports

Danielle Mowery

Published: Nov. 1, 2013. Version: 1.0


When using this resource, please cite:
Mowery, D. (2013). ShARe/CLEF eHealth Evaluation Lab 2014 (Task 2): Disorder Attributes in Clinical Reports (version 1.0). PhysioNet. https://doi.org/10.13026/0zgk-9j94.

Additionally, please cite the original publication:

Mowery DL, Velupillai S, South BR, Christensen L, Martinez D, Elhadad N, Pradhan S, Savova G, Chapman WW. Task 2: ShARe/CLEF eHealth Evaluation Lab 2014. CLEF 2014 Working Notes, 1180. pp. 31-42. ISSN 1613-0073. Sheffield, UK. 2014.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

This project focuses on Task 2 of the 2014 ShARe/CLEF eHealth evaluation lab, which extended Task 1 of the 2013 ShARe/CLEF eHealth evaluation lab by focusing on the identification and normalization of disorder attributes. The task comprises two subtasks: attribute normalization (Task 2a) and cue identification (Task 2b). Participants were asked to develop systems that either keep or update a default value for each attribute slot. Participant systems were evaluated against a blind reference standard of 133 discharge summaries using Accuracy (Task 2a) and F-score (Task 2b).

This is the second year of the ShARe/CLEF eHealth Evaluation Lab, a shared task focused on natural language processing (NLP) and information retrieval (IR) for clinical care. The task is co-organized by the Shared Annotated Resources (ShARe) project and the CLEF Initiative (Conference and Labs of the Evaluation Forum, formerly known as Cross-Language Evaluation Forum). The vision of ShARe/CLEF is two-fold: (1) to develop tasks that potentially impact patient understanding of medical information and (2) to provide the community with an increasingly sophisticated dataset of clinical narrative to advance the state-of-the-art in Natural Language Processing, Information Extraction and Information Retrieval in healthcare.


Objective

Healthcare initiatives such as the United States Meaningful Use program and European Union Directive 2011/24/EU have created policies and legislation to promote patients' involvement in, and understanding of, their personal health information. These policies have encouraged health care organizations to provide patients open access to their medical records and to advocate for more patient-friendly technologies. Technologies that could help patients understand their personal health information, e.g., clinical reports, include linking unfamiliar terms to patient-friendly websites and generating patient summaries that use consumer-friendly terms and simplified syntactic constructions. These summaries could also limit the semantic content to the most salient events, such as active disorder mentions and their related discharge instructions. Natural Language Processing (NLP) can help by filtering non-active disorder mentions using their semantic attributes, e.g., negated symptoms (negation) or uncertain diagnoses (certainty), and by identifying the discharge instructions using text segmentation.

In previous years, several NLP shared tasks have addressed related semantic information extraction tasks, such as automatically identifying concepts (for example, problems, treatments, and tests) and their related attributes (2010 i2b2/VA Challenge), as well as identifying temporal relationships between these clinical events (2012 i2b2/VA Challenge). The release of these semantically annotated datasets to the NLP community is important for promoting the development and evaluation of automated NLP tools. Such tools can identify, extract, filter, and generate information from clinical reports that assists patients and their families in understanding the patient's health status and continued care. The ShARe/CLEF eHealth 2014 shared task focused on facilitating understanding of information in narrative clinical reports, such as discharge summaries, by visualizing and interactively searching previous eHealth data (Task 1), identifying and normalizing disorder attributes (Task 2), and retrieving documents from health and medicine websites to address questions patients may have about the diseases/disorders in their clinical notes (Task 3). In this paper, we discuss Task 2: disorder template filling.

For Task 2, participants are provided with an empty template for each disease/disorder mention; each template consists of the mention's Unified Medical Language System concept unique identifier (CUI), the mention boundaries, and unfilled attribute:value slots (the modifiers described below). Participants are asked to develop attribute classifiers that predict the value for each attribute:value slot of the provided disease/disorder mention. There are two attribute:value slot types: normalization and cue. The ShARe annotation guidelines specify the assumptions, defaults, and examples expected for each attribute.

  • Task 2(a): Assign Normalization values to the ten attributes. Participants will keep or update the Normalization values. 
  • Task 2(b): Assign Cue values to the nine attributes with cues. Participants will keep or update the Cue values.

Participation

Participants in the Challenge were identified via announcements on listservs including the AMIA NLP Working Group, AISWorld, BioNLP, TREC, CLEF, Corpora, NTCIR, and Health Informatics World. After registering for Task 2 through the CLEF Evaluation Lab, each participant completed the following data access procedure: (1) obtaining a CITI or NIH training certificate in Human Subjects Research, (2) registering on PhysioNet, and (3) signing a Data Use Agreement to access the MIMIC-II data [1].

Timeline: All Tasks

  • 01 Nov 2013: Registration opens
  • 15 Nov 2013: Task data release begins
  • 01 May 2014: Participant submission deadline (https://www.easychair.org/conferences/?conf=clefehealth2014)
  • 01 Jun 2014: Results released
  • 03 Jun 2014: Submission deadline for participant working notes (internal review). Details on preparing working notes and a link to the working notes submission system are available at: http://clef2014.clef-initiative.eu/index.php?page=Pages/instructions_for_authors.html
  • 07 Jun 2014: Participant working notes due (camera-ready version).
  • 15-18 Sep 2014: CLEFeHealth one-day lab session at CLEF 2014 in Sheffield, UK. All participants are invited to present their work (i.e., the submission and related working notes) as a poster. Shortlisted working notes are chosen for oral presentations or design demonstrations. 

Timeline: Task 1

  • 01 Feb 2014: Participant submission deadline (optional): drafts for comments.
  • 01 Mar 2014: Comments sent to participants.
  • 15 May 2014: Participant survey deadline: responses to our online survey.

Timeline: Task 2

  • 09 Dec 2013: Example data set release (Small).
  • 10 Jan 2014: Training data set release (Full).
  • 23 Apr 2014: Test data set release.

Timeline: Task 3

  • 15 Dec 2013: Document set release.
  • 31 Jan 2014: Training queries and relevance assessments release.
  • 01 Apr 2014: Test queries release.

UPDATE: This challenge is no longer active. For Challenge results, see Mowery et al, 2014 [2].


Data Description

The ShARe dataset comprises 433 de-identified clinical reports sampled from over 30,000 ICU patients stored in the MIMIC-II (Multiparameter Intelligent Monitoring in Intensive Care II) database. The initial development set contained 300 documents of 4 clinical report types: discharge summaries, radiology reports, electrocardiograms, and echocardiograms. The unseen test set contained 133 documents, all discharge summaries. Participants were required to participate in Task 2a and had the option to participate in Task 2b. The notes are annotated for disorder mentions, each normalized to a UMLS Concept Unique Identifier (CUI) when possible. The corpus annotation guidelines contain more details and examples.

Data Format

For Tasks 2a and 2b, the dataset contained templates in a “|”-delimited format with: a) the disorder CUI assigned to the template, as well as the character boundaries of the named entity, and b) the default values for each of the 10 attributes of the disorder. Each template used the following format:

DD DocName|DD Spans|DD CUI|Norm NI|Cue NI|
Norm SC|Cue SC|Norm UI|Cue UI|Norm CC|Cue CC|
Norm SV|Cue SV|Norm CO|Cue CO|Norm GC|Cue GC|
Norm BL|Cue BL|Norm DT|Norm TE|Cue TE

For example, the following sentence, “The patient has an extensive thyroid history.”, was represented to participants with the following disorder template with default normalization and cue values:

09388-093839-DISCHARGE SUMMARY.txt|30-36|C0040128|*no|*NULL|
patient|*NULL|*no|*NULL|*false|*NULL|
unmarked|*NULL|*false|*NULL|*false|*NULL|
NULL|*NULL|*Unknown|*None|*NULL
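
To make the slot layout concrete, the following is a minimal Python parsing sketch (illustrative only, not part of the official distribution); the field names are informal labels taken from the layout above, and the record is assumed to be joined into a single pipe-delimited line:

FIELDS = [
    "DD DocName", "DD Spans", "DD CUI",
    "Norm NI", "Cue NI", "Norm SC", "Cue SC", "Norm UI", "Cue UI",
    "Norm CC", "Cue CC", "Norm SV", "Cue SV", "Norm CO", "Cue CO",
    "Norm GC", "Cue GC", "Norm BL", "Cue BL", "Norm DT",
    "Norm TE", "Cue TE",
]

def parse_template(line: str) -> dict:
    """Split one pipe-delimited disorder template into named slots."""
    values = line.strip().split("|")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

# The default template above, joined into a single record:
template = parse_template(
    "09388-093839-DISCHARGE SUMMARY.txt|30-36|C0040128|*no|*NULL|"
    "patient|*NULL|*no|*NULL|*false|*NULL|"
    "unmarked|*NULL|*false|*NULL|*false|*NULL|"
    "NULL|*NULL|*Unknown|*None|*NULL"
)
print(template["DD CUI"])  # C0040128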

For Task 2a: Normalization, participants were asked to either keep or update the normalization values for each attribute. For the example sentence, a Task 2a submission updates three normalization values: severe (Severity), C0040132 (Body Location), and Before (DocTime):

09388-093839-DISCHARGE SUMMARY.txt|30-36|C0040128|*no|*NULL|
patient|*NULL|*no|*NULL|*false|*NULL|
unmarked|*NULL|severe|*NULL|*false|*NULL|
C0040132|*NULL|Before|*None|*NULL

For Task 2b: Cue detection, participants were asked to either keep or update the cue values for each attribute. For the example sentence, a Task 2b submission adds two cue spans, 20-28 and 30-36:

09388-093839-DISCHARGE SUMMARY.txt|30-36|C0040128|*no|*NULL|
patient|*NULL|*no|*NULL|*false|*NULL|
unmarked|*NULL|severe|20-28|*false|*NULL|
C0040132|30-36|Before|*None|*NULL

In this example, the Subject Class cue span is not annotated in ShARe since *patient is an attribute default. A detailed data description can be found here: http://alt.qcri.org/semeval2015/task14/index.php?id=task-description
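
Building on the hypothetical parse_template sketch above, a system's keep-or-update decisions can be inspected by diffing its predicted template against the default one (illustrative only):

def changed_slots(default: dict, predicted: dict) -> dict:
    """Map each updated slot name to its (default, predicted) value pair."""
    return {name: (default[name], predicted[name])
            for name in default if default[name] != predicted[name]}

For the Task 2a example above, this would report the three slots whose default values were replaced.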

ShARe Annotation Schema

As part of the ongoing Shared Annotated Resources (ShARe) project, disorder annotations consisting of disorder mention span offsets, their SNOMED CT codes, and their contextual attributes were generated for community distribution. For the 2013 ShARe/CLEF eHealth Challenge Task 1, the disorder mention span offsets and SNOMED CT codes were released. For the 2014 ShARe/CLEF eHealth Challenge Task 2, we released the disorder templates with 10 attributes that represent a disorder's contextual description in a report: Negation Indicator, Subject Class, Uncertainty Indicator, Course Class, Severity Class, Conditional Class, Generic Class, Body Location, DocTime Class, and Temporal Expression. Each attribute contained two types of annotation values: a normalization value and a cue detection value. For instance, if a disorder is negated, e.g., “denies nausea”, the Negation Indicator attribute would represent nausea with a normalization value of yes, indicating the presence of a negation cue, and a cue value of start span-end span for “denies”. All attributes contained a slot for a cue value with the exception of the DocTime Class. Each note was annotated by two professional coders trained for this task, followed by an open adjudication step.

From the ShARe guidelines, an attribute cue is a text span indicating a non-default normalization value. Each attribute is listed below with its definition and possible normalization values (the default is marked with an asterisk), followed by an example; a minimal code sketch of the defaults appears after the list.

  • Negation Indicator (NI): def. indicates a disorder was negated: *no, yes. Example: “No cough.”
  • Subject Class (SC): def. indicates who experienced a disorder: *patient, family member, donor family member, donor other, null, other. Example: “Dad had MI.”
  • Uncertainty Indicator (UI): def. indicates a measure of doubt about the disorder: *no, yes. Example: “Possible pneumonia.”
  • Course Class (CC): def. indicates progress or decline of a disorder: *unmarked, changed, increased, decreased, improved, worsened, resolved. Example: “Bleeding abated.”
  • Severity Class (SV): def. indicates how severe a disorder is: *unmarked, slight, moderate, severe. Example: “Infection is severe.”
  • Conditional Class (CO): def. indicates existence of disorder under certain circumstances: *false, true. Example: “Return if nausea occurs.”
  • Generic Class (GC): def. indicates a generic mention of disorder: *false, true. Example: “Vertigo while walking.”
  • Body Location (BL): def. represents an anatomical location: *NULL, a CUI (e.g., C0015450), or CUI-less. Example: “Facial lesions.”
  • DocTime Class (DT): def. indicates temporal relation between a disorder and document authoring time: before, after, overlap, before-overlap, *unknown. Example: “Stroke in 1999.”
  • Temporal Expression (TE): def. represents any TIMEX (TimeML) temporal expression related to the disorder: *none, date, time, duration, set. Example: “Flu on March 10.”
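
As a minimal sketch, the asterisked defaults above can be transcribed into a Python mapping (illustrative only; the attribute abbreviations follow the list):

ATTRIBUTE_DEFAULTS = {
    "NI": "no",        # Negation Indicator
    "SC": "patient",   # Subject Class
    "UI": "no",        # Uncertainty Indicator
    "CC": "unmarked",  # Course Class
    "SV": "unmarked",  # Severity Class
    "CO": "false",     # Conditional Class
    "GC": "false",     # Generic Class
    "BL": "NULL",      # Body Location
    "DT": "unknown",   # DocTime Class
    "TE": "none",      # Temporal Expression
}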

Evaluation

For Tasks 2a and 2b, we determined system performance by comparing participating system outputs against reference standard annotations. We evaluated overall system performance and performance for each attribute type, e.g., Negation Indicator.

Evaluation Metric (Task 2a: Normalization)

Since we defined all possible normalized values for each attribute, we calculated system performance using Accuracy:

\textrm{Accuracy} = \frac{\textrm{count of correct normalized values}}{\textrm{total count of disorder templates}}
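
A minimal sketch of this metric, assuming system and gold normalization values are provided as parallel lists (one entry per disorder template and attribute slot):

def accuracy(system_values: list, gold_values: list) -> float:
    """Fraction of normalization values that match the reference standard."""
    assert len(system_values) == len(gold_values) and gold_values
    correct = sum(s == g for s, g in zip(system_values, gold_values))
    return correct / len(gold_values)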

Evaluation Metric (Task 2b: Cue Detection)

Since the number of strings not annotated as attribute cues (i.e., true negatives (TN)) is very large, we calculated the F1-score as a surrogate for kappa. The F1-score is the harmonic mean of recall and precision, calculated from true positive, false positive, and false negative annotations as follows:

\textrm{Recall} = \frac{TP}{TP + FN}
\textrm{Precision} = \frac{TP}{TP + FP}
\textrm{F1-score} = \frac{2 \cdot \textrm{Recall} \cdot \textrm{Precision}}{\textrm{Recall} + \textrm{Precision}}

Where: true positive (TP) = the annotation cue span from the participating system overlapped with the annotation cue span from the reference standard; false positive (FP) = an annotation cue span from the participating system did not exist in the reference standard annotations; and false negative (FN) = an annotation cue span from the reference standard did not exist in the participating system annotations.
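
The following is one reading of this overlap criterion as a Python sketch (the official evaluation script may differ in detail, e.g., when one span overlaps several others); spans are assumed to be (start, end) character-offset pairs:

def spans_overlap(a: tuple, b: tuple) -> bool:
    """True if two (start, end) character spans share any offsets."""
    return a[0] <= b[1] and b[0] <= a[1]

def cue_f1(system_spans: list, gold_spans: list) -> float:
    """Overlap-based F1: system spans overlapping a gold span are TPs,
    the rest are FPs; gold spans with no overlapping system span are FNs."""
    tp = sum(any(spans_overlap(s, g) for g in gold_spans) for s in system_spans)
    fp = len(system_spans) - tp
    fn = sum(not any(spans_overlap(g, s) for s in system_spans) for g in gold_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)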

Strict F-score: a predicted mention is considered a true positive if (i) its predicted span is exactly the same as for the gold-standard mention; and (ii) the predicted CUI is correct.  The predicted disorder is considered a false positive if the span is incorrect or the CUI is incorrect.

Relaxed F-score: a predicted mention is a true positive if (i) there is any word overlap between the predicted mention span and the gold-standard span (both in the case of contiguous and discontiguous spans); and (ii) the predicted CUI is correct. The predicted mention is a false positive if the span shares no words with the gold-standard span or the CUI is incorrect.
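
The two criteria differ only in the matching predicate; a sketch under the stated definitions, assuming spans are (start, end) pairs for the strict case and have been reduced to word sets for the relaxed case (which also covers discontiguous spans):

def strict_match(pred_span, pred_cui, gold_span, gold_cui) -> bool:
    # Strict: exact span boundaries and the correct CUI.
    return pred_span == gold_span and pred_cui == gold_cui

def relaxed_match(pred_words: set, pred_cui, gold_words: set, gold_cui) -> bool:
    # Relaxed: any word overlap between the spans and the correct CUI.
    return bool(pred_words & gold_words) and pred_cui == gold_cui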


Release Notes

This challenge is no longer active. Challenge results are available at: http://ceur-ws.org/Vol-1180/CLEF2014wn-eHealth-MoweryEt2014.pdf


Acknowledgements

We thank the ShARe/CLEF eHealth Challenge organizers and teams! Many people are or have been involved in CLEF eHealth (in alphabetical order):

Samir Abdelrahman, University of Utah, USA; Wendy W Chapman, University of Utah, USA; Noemie Elhadad, Columbia University, USA; Lorraine Goeuriot, Université Grenoble Alpes, France; Liadh Kelly, Trinity College Dublin, Ireland; David Martinez, NICTA and The University of Melbourne, Australia; Danielle L Mowery, University of Pittsburgh, USA; Guergana Savova, Harvard Medical School and Boston Children's Hospital, USA; Brett R South, University of Utah, USA; Hanna Suominen, NICTA, The Australian National University, University of Canberra, and University of Turku (Turku, Finland), Canberra, ACT, Australia; Sumithra Velupillai, DSV Stockholm University, Sweden.


Conflicts of Interest

We have no conflicts of interest to report.


References

  1. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. MIMIC-III, a freely accessible critical care database. Scientific Data (2016). DOI: 10.1038/sdata.2016.35. Available from: http://www.nature.com/articles/sdata201635
  2. Mowery DL, Velupillai S, South BR, Christensen L, Martinez D, Elhadad N, Pradhan S, Savova G, Chapman WW. Task 2: ShARe/CLEF eHealth Evaluation Lab 2014. CLEF 2014 Working Notes, 1180. pp. 31-42. ISSN 1613-0073. Sheffield, UK. 2014. http://ceur-ws.org/Vol-1180/CLEF2014wn-eHealth-MoweryEt2014.pdf

Parent Projects
ShARe/CLEF eHealth Evaluation Lab 2014 (Task 2): Disorder Attributes in Clinical Reports was derived from the MIMIC-II database. Please cite it when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
