Database Credentialed Access

Tasks 1 and 3 from Progress Note Understanding Suite of Tasks: SOAP Note Tagging and Problem List Summarization

Yanjun Gao John Caskey Timothy Miller Brihat Sharma Matthew Churpek Dmitriy Dligach Majid Afshar

Published: Sept. 30, 2022. Version: 1.0.0

When using this resource, please cite:
Gao, Y., Caskey, J., Miller, T., Sharma, B., Churpek, M., Dligach, D., & Afshar, M. (2022). Tasks 1 and 3 from Progress Note Understanding Suite of Tasks: SOAP Note Tagging and Problem List Summarization (version 1.0.0). PhysioNet.

Additionally, please cite the original publication:

Gao Y, Dligach D, Miller T, Tesch S, Laffin R, Churpek MM, Afshar M. Hierarchical annotation for building a suite of clinical natural language processing tasks: Progress note understanding. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC) 2022. 5484-5493

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Applying methods in natural language processing to electronic health record (EHR) data is a growing field. Existing corpora and annotations focus on modelling textual features and relation prediction [1]. However, there is a paucity of annotated corpora built to model clinical diagnostic reasoning, a process that involves text understanding, domain knowledge abstraction and reasoning, and clinical text generation. The datasets here support a hierarchical annotation schema, with two out of the three stages available, to address clinical text understanding and text generation. The datasets provided here are for the individual tasks in Stages 1 and 3. The task for Stage 2 was previously accepted as part of the National NLP Clinical Challenges (n2c2) and may be retrieved from the n2c2 challenge website.

The annotated corpus is based on an extensive collection of Intensive Care Unit progress notes, a type of EHR documentation that is collected in time series in a problem-oriented format. The progress notes were sourced from MIMIC-III. The conventional format for a progress note follows a Subjective, Objective, Assessment and Plan heading (SOAP). The novel suite of tasks was designed to train and evaluate future NLP models for clinical text understanding, clinical knowledge representation, inference, and summarization. The ultimate goal of these datasets is to advance the development and evaluation of NLP models for clinical applications that lead to AI-assisted clinical decision support and reduce medical errors.


Patients in the hospital have a multidisciplinary team of physicians, nurses, and support staff who attend to their care. As part of this care, providers input daily progress notes to update the diagnoses and treatment plan, and to document changes in the patient’s health status. The electronic health record (EHR) contains these daily progress notes, and they are one of the most frequent note types, carrying the most relevant and most viewed documentation of a patient’s care [1]. Applying methods in natural language processing to the EHR is a growing field with many potential applications in clinical decision support and augmented care [2]. However, few corpora have been built to model clinical thinking, especially clinical diagnostic reasoning: a process involving clinical evidence acquisition, hypothesis generation, integration and abstraction over medical knowledge, and synthesis of a conclusion in the form of a diagnosis and treatment plan [3].

The daily progress note follows a specific format with four major components: Subjective, Objective, Assessment, and Plan (SOAP). SOAP note documentation, developed by Larry Weed, MD, known as the father of the problem-oriented medical record and inventor of the ubiquitous SOAP daily progress note [4], is ingrained in medical school and other training curricula. The main purpose of SOAP documentation is to record the patient's information, including recent events in their care and active problems, in a readable and structured way, so the patient’s diagnoses are readily identified.

The Subjective section includes free text describing the patient’s symptoms, conditions, and daily changes in care. The Objective section contains structured data such as lab results, vital signs, and radiology reports. The Assessment and Plan sections are considered by providers to be the most important components of the SOAP note, synthesizing evidence from the Subjective and Objective sections and concluding with the diagnoses and treatment plans. Specifically, the Assessment describes the patient and establishes the main symptoms or problems for the encounter, and the Plan addresses each differential diagnosis/problem with an action plan or treatment plan for the day.

In the end, the SOAP note reflects the provider’s effort to collect the most recent and relevant data and synthesize the collected information into a coherent understanding of the patient’s condition for decision-making and to ensure coordination of care. This skill in documentation requires clear reasoning to link symptoms, lab and imaging results, and other observations into temporally relevant and problem-specific treatment plans.

In this work, we introduce a hierarchical annotation with three stages addressing clinical text understanding, reasoning and abstraction over evidence, and diagnosis summarization. The annotation guidelines were designed and developed by physicians with clinical informatics expertise and computational linguistic experts to model the healthcare provider decision-making process. Our annotations were built on top of the Medical Information Mart for Intensive Care-III (MIMIC-III) [5,6].


All progress notes were sourced from MIMIC-III, a publicly available dataset of de-identified EHR data from approximately 60,000 hospital ICU admissions at Beth Israel Deaconess Medical Center in Boston, Massachusetts. We randomly sampled progress notes that include the SOAP sections for Task 1. For Task 3, the goal of the annotation was to label lists of relevant problems/diagnoses from the Plan subsections. For each Plan subsection, the annotators marked the text span for the Problem, separating the diagnosis/problems from the treatment or action plans.

The progress note types from MIMIC-III included a total of 84 note types (DESCRIPTION header), including the following: Physician Resident Note, Intensivist Note (SICU, CVICU, TICU), PGY1 Progress Note, PGY1/Attending Daily Progress Note MICU, and MICU Resident/Attending Daily Progress Note. Other note types, such as Nursing Progress Note and Social Worker Progress Note, were excluded because these are not commonly structured in the SOAP format. We propose a hierarchical annotation schema consisting of three stages: (1) SOAP Section Tagging, organizing all sections of the progress note into a SOAP category; (2) Assessment and Plan Relation Labeling, specifying causal relations between symptoms and problems mentioned in the Assessment and diagnoses covered in each Plan subsection (not included here); and (3) Problem List Identification, highlighting the final diagnoses. Every stage of the annotation builds on top of the previous annotation.

The first stage of annotation is to segment the progress notes into sections, where each section belongs to a part of SOAP. Given a progress note, the annotator marked each line of text with one of the attributes using the NLP annotation tool INCEpTION. Most of the sections start with a section header, indicating that the lines below it fall into the same category of information until the next section. When there was no section header, the annotator assigned the attributes by the content expressed in the lines (e.g., a line such as "Last dose of Antibiotics" belongs to MEDICATIONS). We post-process the attribute labels and further categorize them as one of the SOAP sections.

The relevant Plan subsections include a problem/diagnosis with an associated treatment or action plan, stating how the provider will address the problem. At the third stage of the annotation, the goal was to highlight the problems/diagnoses mentioned in the Plan subsections separately from the treatment or action plans for the day. In identifying problems/diagnoses, the annotators only labelled the text spans covering the problem in each Plan subsection, using the label PROBLEM. Once the problem/diagnosis was labelled, annotators labelled the accompanying ACTIONPLAN for that PROBLEM and linked the two attributes with a PROBLEMAPPROACH relation.

The annotation guidelines and rules were initially developed and tested by two physicians with board certifications in critical care medicine, pulmonary medicine, and clinical informatics. The physicians practice in the same field as the authors of the source notes. Two medical students were recruited and had received training in their medical school curriculum in medical history taking and documentation (including SOAP format), anatomy, pathophysiology, and pharmacology. An additional three-week period with orientation and training was provided by one of the critical care physicians to the annotators. Each annotator met inter-rater reliability with a kappa score of > 0.80 with the adjudicator prior to independent review. More details on the annotation protocol and datasets designed for the hierarchy of tasks may be found in an associated paper published in LREC 2022 Proceedings [7].

Data Description

The first dataset is for the SOAP note section tagging task (Task 1), with labels designating each part of the progress note as its relevant SOAP section, a first step in understanding the components of the progress note. The second dataset is for a more complex summarization task: generating a list of relevant diagnoses/problems given the information in the Subjective, Objective, and Assessment sections of the note. Only diagnoses/problems that are available in the progress note were labelled for this task (Task 3). As mentioned above, the task for reasoning about the relationship between the Assessment and Plan sections was included as a task for the n2c2 2022 challenge and may be retrieved from the challenge website to complete our suite of tasks for NLP Clinical Diagnostic Reasoning [8].


In total, two annotators labelled 765 progress notes. We split the corpus into train/validation/test sets of 603, 75, and 87 notes, respectively. For Task 1, we post-processed section subheadings of the progress note into the larger section headings of SUBJECTIVE, OBJECTIVE, ASSESSMENT, and PLAN. For Task 3, we post-processed the annotation such that for every assessment, there was a list of direct problems and a list of indirect problems. Several progress notes were duplicated during inter-annotator agreement evaluations, and 3 of the duplicates were included in the original LREC paper; therefore, our final sample size in PhysioNet (n=765) is 3 lower than the original description in the LREC paper (n=768).

Usage Notes

We provide the full progress notes with section labels for the SOAP note Task 1. The samples with the raw text are in a .csv file format, with train/validation/test datasets built separately for Tasks 1 and 3.

Task 1

For Task 1, the final post-processed labels are at the line level, with the first label representing the Beginning (B) or Inner (I) part of the SOAP section, followed by a label S, O, A, or P to represent the relevant section. Another label serves as a row index, starting at the beginning of the note (0) and incrementing by line through the end of the note (0+i). Labels are delimited by single pipes. Task 1 contains the full progress notes. Ultimately, the files for Task 1 are organized as follows:

NOTE_ID | line_text | section_label | note_index.

An example row from the Task 1 data is shown below:

148567 | Patient able to wean off vent to trach collar in AM | BS | 0

The explanation for these fields is:

HADM_ID | Text input | B = Beginning S = Subjective | 0 = Index Row for Progress Note
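As a sketch of how these rows might be consumed (the function names are our own, and the split assumes the note text itself contains no pipe characters), the fields can be parsed and consecutive lines grouped back into SOAP sections using the B/I prefixes:

```python
def parse_task1_row(row):
    """Split a pipe-delimited Task 1 row into its four fields:
    NOTE_ID | line_text | section_label | note_index,
    where section_label combines a B/I prefix with S, O, A, or P.
    Naive split: assumes the line text contains no '|' characters."""
    note_id, line_text, section_label, note_index = [
        field.strip() for field in row.split("|")
    ]
    return {
        "note_id": note_id,
        "text": line_text,
        "bio": section_label[0],      # "B" (beginning) or "I" (inner)
        "section": section_label[1],  # "S", "O", "A", or "P"
        "index": int(note_index),
    }


def group_sections(rows):
    """Group consecutive parsed rows into (section, [lines]) spans,
    starting a new span whenever a B-prefixed label appears."""
    sections = []
    for r in rows:
        if r["bio"] == "B" or not sections:
            sections.append((r["section"], [r["text"]]))
        else:
            sections[-1][1].append(r["text"])
    return sections
```

For instance, the example row above parses to note_id "148567", text "Patient able to wean off vent to trach collar in AM", a B prefix, section S, and index 0.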

Task 3

For Task 3, the train/validation/test datasets are organized into the following comma-separated columns: 1. HADM ID; 2. ASSESSMENT Section extracted from the Progress Note; 3. Problems/Diagnoses extracted from each subsection of the PLAN section.

In identifying problems/diagnoses, the annotators only labelled the text spans covering the problem in each PLAN subsection, and only those labelled as 'Direct' or 'Indirect' Problems from Task 2, so they can be extracted or abstracted from the source note (other labels, like 'Neither' or 'Not Relevant', cannot be discovered from the source progress note and were excluded from the list of problems/diagnoses). Every list is considered a short summary, with distinct problems/diagnoses delimited by semicolons. Only the ASSESSMENT section text span is provided, so if you wish to access the full progress note for Task 3, it may be retrieved from Task 1 by using the unique key of NOTE_ID between tasks.

110437 , 53 year old female with hx of alcoholism presents with hepatic and renal failure. Transfered to ICU with new oxygen requirement. , Hypoxia; Hepatic failure; ARF; Alcoholism

The explanation for these fields is:

NOTE_ID , ASSESSMENT section input , List of Problems/Diagnoses from relevant PLAN section
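A minimal sketch of reading one such row (the function name is our own; we assume the ' , ' spacing shown in the example separates the three fields, splitting on the first and last occurrence so that any bare commas inside the assessment text are preserved):

```python
def parse_task3_row(row):
    """Split a Task 3 row into (note_id, assessment, problems).

    Splits on the first and last ' , ' delimiters, then splits the
    final field on semicolons into distinct problems/diagnoses."""
    note_id, rest = row.split(" , ", 1)
    assessment, problem_field = rest.rsplit(" , ", 1)
    problems = [p.strip() for p in problem_field.split(";")]
    return note_id.strip(), assessment.strip(), problems
```

Applied to the example row above, this yields note_id "110437", the assessment text, and the problem list ["Hypoxia", "Hepatic failure", "ARF", "Alcoholism"]. A robust loader would instead use the csv module if the files are strictly comma-separated.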

Release Notes

The datasets provided in PhysioNet are for individual tasks in Stages 1 and 3. The task for Stage 2 was previously accepted as part of the National NLP Clinical Challenges (n2c2) and may be retrieved online [9].


The use of the data in this research came from a fully de-identified dataset (contains no protected health information) that we received permission for use under a PhysioNet Credentialed Health Data Use Agreement (v1.5.0). The study was determined to be exempt from human subjects research. All experiments followed the PhysioNet Credentialed Health Data License Agreement. Medical charting by providers in the electronic health record is at-risk for multiple types of bias.

Our research focused on building a system to overcome the cognitive biases in medical decision-making by providers. However, statistical and social biases need to be addressed before integrating our work into any clinical decision support system for clinical trials or healthcare delivery. In particular, implicit bias towards vulnerable populations and stigmatizing language in certain medical conditions like substance use disorders are genuine concerns that can transfer into language model training. Therefore, it should be assumed that our corpus of notes for this task will carry social bias features that can affect fairness and equity during model training.

Before the deployment of any pre-trained language model, it is the responsibility of the scientists and health system to audit the model for fairness and equity in its performance across disparate health groups. Fairness and equity audits alongside model explanations are needed to ensure an ethical model trustworthy to all stakeholders, especially patients and providers.


We would like to acknowledge the hard work by our medical student annotators, Ryan Laffin and Samuel Tesch, who were supported by the University of Wisconsin Summer Shapiro Grant Program. 

Conflicts of Interest

No conflicts of interest to declare.


  1. Brown, P., Marquard, J., Amster, B., Romoser, M., Friderici, J., Goff, S., and Fisher, D. (2014). What do physicians read (and ignore) in electronic progress notes? Applied clinical informatics, 5(02):430–444.
  2. Gao Y, Dligach D, Christensen L, Tesch S, Laffin R, Xu D, Miller T, Uzuner O, Churpek MM, Afshar M. A scoping review of publicly available language tasks in clinical natural language processing. J Am Med Inform Assoc. 2022 [Accepted]
  3. Bowen, J. L. (2006). Educational strategies to promote clinical diagnostic reasoning. New England Journal of Medicine, 355(21):2217–2225.
  4. Weed, L. L. (1964). Medical records, patient care, and medical education. Irish Journal of Medical Science (1926-1967), 39(6):271–282.
  5. Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). PhysioNet.
  6. Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035.
  7. Gao Y, Dligach D, Miller T, Tesch S, Laffin R, Churpek MM, Afshar M. Hierarchical annotation for building a suite of clinical natural language processing tasks: Progress note understanding. Proceedings of the 13th Conference on Language Resources and Evaluation (LREC) 2022. 5484-5493
  8. n2c2 Challenge Website (Track 3). [Accessed: 30 September 2022]
  9. n2c2 NLP Research Data Sets. [Accessed: 30 September 2022]

Parent Projects
Tasks 1 and 3 from Progress Note Understanding Suite of Tasks: SOAP Note Tagging and Problem List Summarization was derived from the MIMIC-III Clinical Database. Please cite it when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
