
Discharge Me: BioNLP ACL'24 Shared Task on Streamlining Discharge Documentation

Justin Xu

Published: March 5, 2024. Version: 1.2

When using this resource, please cite:
Xu, J. (2024). Discharge Me: BioNLP ACL'24 Shared Task on Streamlining Discharge Documentation (version 1.2). PhysioNet.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


"Discharge Me!", part of the BioNLP workshop co-located with ACL 2024, seeks to alleviate the significant burden on clinicians who dedicate substantial time to crafting detailed discharge notes in the EHR. Participants in the task will explore approaches to generating "Brief Hospital Course" and "Discharge Instructions" sections of the discharge summary using a subset of MIMIC-IV-Note and MIMIC-IV-ED that have been compiled by the task organizers. The full dataset (comprised of a defined training, validation, phase 1 testing, and phase 2 testing sets) consists of 109,168 emergency department admissions. The competition is being hosted on the Codabench platform, which will manage team registration, results submission, and score evaluation.


Our objective is to encourage the development of new systems for the generation of discharge summaries and to disseminate preliminary findings to the medical natural language processing community.

Clinicians play a crucial role in documenting patient progress, but the creation of concise yet comprehensive hospital course summaries and discharge instructions often demands a significant amount of time. This contributes to clinician burnout and creates operational inefficiencies within hospital workflows. By streamlining the generation of these sections, we can help enhance the accuracy and completeness of clinical documentation.

Participants are given a dataset based on MIMIC-IV which includes 109,168 admissions from the Emergency Department (ED), split into training, validation, and test sets. Each admission includes chief complaints and diagnosis codes (either ICD-9 or ICD-10) documented by the ED, at least one radiology report, and a discharge summary with both "Brief Hospital Course" and "Discharge Instructions" sections. The goal is to generate these two critical sections in discharge summaries based on other inputs.

We hope that this challenge will bolster the efforts of the clinical natural language processing community in developing effective solutions for the generation of discharge summary sections. We believe this task could form a solid foundation for future work on generating the entire discharge summary including the other sections, which would significantly help reduce the time clinicians spend on administrative tasks, ultimately improving patient care quality.


Please visit the Codabench competition page to register for this shared task. Codabench [1] is the platform that we will use throughout the challenge, and an account is required to officially join the competition. All submissions and leaderboards will be available on that platform. Please direct any questions about the competition to the Codabench discussion forum. Deadlines and further participation information are available on the Shared Task Website below.

All participants will be invited to submit a paper describing their solution to be included in the Proceedings of the 23rd Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2024. If you do not wish to write a paper, you must at least provide a thorough description of your system, which will be included in the overview paper for this task. Otherwise, your submission (and reported scores) will not be taken into account.


  • Participants must comply with the PhysioNet Credentialed Health Data Use Agreement when using the data.
  • Participants may use any additional data to train (or pre-train) their systems. However, all data used for the submission must be in some way available to other researchers.
  • Participants may involve existing models trained on proprietary data in their systems. However, these models must also be accessible to other researchers in some capacity.
  • If participants employ LLMs, please ensure that the team clearly notes the expected outputs by the models or the prompting strategies used so that results can be reproduced. However, please note that sending data via an API to a third party is a violation of the DUA. Please consult the informational note provided by PhysioNet for further detail.
  • All submissions must be made through the Codabench competition page.

Shared Task Website 

Data Description

The dataset for this task is created from MIMIC-IV's submodules MIMIC-IV-Note [3] and MIMIC-IV-ED [4]. In order to download the data, you must have a PhysioNet [2] account with signed agreements for both datasets.

The dataset has been split into training (68,785 samples), validation (14,719 samples), phase I testing (14,702 samples), and phase II testing (10,962 samples) datasets. The phase II testing dataset will serve as the final test set and will be released on April 12th (Friday), 2024. All datasets and tables are derived from the MIMIC-IV submodules.

  • Code to re-create the data splits is available on Colab.
  • Participants are free to use all or part of the provided dataset to develop their systems. However, submissions on Codabench will be evaluated on the entirety of the testing datasets.

Discharge summaries are split into various sections and written under a variety of headings. However, each note in the dataset for this task includes a "Brief Hospital Course" and a "Discharge Instructions" section. The "Brief Hospital Course" section is usually located in the middle of the discharge summary, following information about patient history and treatments received during the current admission. The "Discharge Instructions" section is generally located at the end of the note as one of the last sections.
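The provided discharge_target.csv.gz table already contains the extracted target sections, so the following is only a rough sketch of how a section could be located by its heading, for example to remove the targets from the model input. It assumes each heading is followed by a colon; the helper extract_section, the toy note, and the neighbouring heading names are hypothetical and not taken from the organizers' code.

```python
import re

def extract_section(note, heading, next_headings):
    """Return the text after `heading` up to the first of `next_headings`, or None."""
    start = re.search(re.escape(heading) + r"\s*:", note, flags=re.IGNORECASE)
    if start is None:
        return None
    rest = note[start.end():]
    if next_headings:
        # Stop at whichever of the following headings appears first.
        stop_pattern = "|".join(re.escape(h) + r"\s*:" for h in next_headings)
        stop = re.search(stop_pattern, rest, flags=re.IGNORECASE)
        if stop is not None:
            rest = rest[:stop.start()]
    return rest.strip()

# Toy note: the text and the surrounding headings are made up for illustration.
note = (
    "History of Present Illness:\nToy text.\n\n"
    "Brief Hospital Course:\nPatient was admitted and treated.\n\n"
    "Medications on Admission:\nNone.\n\n"
    "Discharge Instructions:\nTake medications as prescribed.\n\n"
    "Followup Instructions:\n___\n"
)

bhc = extract_section(note, "Brief Hospital Course", ["Medications on Admission"])
di = extract_section(note, "Discharge Instructions", ["Followup Instructions"])
print(bhc)
print(di)
```

Because notes are written under a variety of headings, a fixed list of following headings will likely miss some cases; real notes should be checked before relying on a pattern like this.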

Each admission is defined by a unique hadm_id and is associated with a corresponding discharge summary and at least one radiology report. Most admissions in the dataset will have only one corresponding ED stay. However, a select few admissions may have more than one ED stay (i.e., multiple stay_id values). Each stay_id can have multiple ICD diagnoses, but will only have one chief complaint. Participants may use online resources for descriptions and details about ICD codes.
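The relationships between these identifiers can be sketched with pandas on toy rows. Only hadm_id and stay_id are named in this description; the table names, the chiefcomplaint and icd_code columns, and all values below are assumptions for illustration and should be checked against the actual files.

```python
import pandas as pd

# Toy rows; column names other than hadm_id and stay_id are assumptions.
edstays = pd.DataFrame({"hadm_id": [100, 100, 200], "stay_id": [1, 2, 3]})
triage = pd.DataFrame({
    "stay_id": [1, 2, 3],
    "chiefcomplaint": ["Chest pain", "Dyspnea", "Fall"],
})
diagnosis = pd.DataFrame({
    "stay_id": [1, 1, 2, 3],
    "icd_code": ["I209", "R079", "R0600", "W19XXXA"],
})

# One chief complaint per stay_id, so a one-to-one merge is expected to pass.
stays = edstays.merge(triage, on="stay_id", validate="one_to_one")

# Several ICD codes per stay_id: collect them into one list per stay first.
codes = diagnosis.groupby("stay_id")["icd_code"].agg(list).reset_index()
stays = stays.merge(codes, on="stay_id", how="left")

# Admissions that have more than one ED stay.
stays_per_admission = stays.groupby("hadm_id")["stay_id"].nunique()
multi_stay = stays_per_admission[stays_per_admission > 1].index.tolist()
print(multi_stay)
```

Aggregating diagnoses per stay before merging keeps one row per stay_id, so admissions with several ED stays remain easy to identify.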

Special Note:

If you are using pandas to read the .csv.gz tables, please ensure you set keep_default_na=False. For instance:

import pandas as pd
discharge_target = pd.read_csv('discharge_target.csv.gz', keep_default_na=False)

Otherwise, pandas will automatically convert certain strings, such as in cases where the discharge instruction is 'NA' or 'N/A', into the float NaN.
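The effect can be reproduced on a small in-memory CSV. This is a minimal sketch: the two rows are made up, and the column name discharge_instructions is assumed here for illustration.

```python
import io
import pandas as pd

# Toy CSV standing in for discharge_target.csv.gz (rows are made up).
csv_text = "hadm_id,discharge_instructions\n1,NA\n2,Follow up with your PCP.\n"

default = pd.read_csv(io.StringIO(csv_text))
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# With the default settings, the string 'NA' is read as a missing value (NaN).
print(default["discharge_instructions"].isna().tolist())
# With keep_default_na=False, the original string is preserved.
print(literal["discharge_instructions"].tolist())
```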

Dataset Statistics

The complete dataset contains the following items:

| Item | Total Count | Training | Validation | Phase I Testing | Phase II Testing |
|---|---|---|---|---|---|
| Admissions | 109,168 | 68,785 | 14,719 | 14,702 | 10,962 |
| Discharge Summaries | 109,168 | 68,785 | 14,719 | 14,702 | 10,962 |
| Radiology Reports | 409,359 | 259,304 | 54,650 | 54,797 | 40,608 |
| ED Stays & Chief Complaints | 109,403 | 68,936 | 14,751 | 14,731 | 10,985 |
| ED Diagnoses | 218,376 | 138,112 | 29,086 | 29,414 | 21,764 |
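As a quick consistency check, the per-split counts in the table above can be summed and compared with the totals. The sketch below copies the numbers from the table; the dictionary layout and variable names are only illustrative.

```python
# Counts copied from the table: (total, training, validation, phase I, phase II).
counts = {
    "Admissions": (109168, 68785, 14719, 14702, 10962),
    "Discharge Summaries": (109168, 68785, 14719, 14702, 10962),
    "Radiology Reports": (409359, 259304, 54650, 54797, 40608),
    "ED Stays & Chief Complaints": (109403, 68936, 14751, 14731, 10985),
    "ED Diagnoses": (218376, 138112, 29086, 29414, 21764),
}

# Every row's splits should add up to its total.
for item, (total, *splits) in counts.items():
    assert sum(splits) == total, item

# Share of admissions in each split, in percent.
total, *splits = counts["Admissions"]
shares = [round(100 * n / total, 1) for n in splits]
print(shares)
```

The admissions are split roughly 63% / 13.5% / 13.5% / 10% across the training, validation, phase I testing, and phase II testing datasets.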

Dataset Schemas

For consistency and ease-of-use, the schemas of the data tables have been kept the same as the ones originally provided in MIMIC-IV and its submodules. An additional table in discharge_target.csv.gz is provided, which includes extracted "Brief Hospital Course" and "Discharge Instructions" sections from the discharge summaries.


The evaluation metrics for this task are based on textual similarity and factual correctness of the generated text. Specifically, the following 8 metrics will be considered:

  • BLEU-4 [5]
  • ROUGE-1, -2, -L [6]
  • BERTScore [7]
  • METEOR [8]
  • AlignScore [9]
  • MEDCON [10]

Additionally, the submissions from the top-performing teams will be reviewed by clinicians at the end of the competition.

There will be two separate leaderboards on the Codabench competition page. One will be dedicated for the scores from the initial phase I testing dataset, and one will be dedicated for the scores from the phase II testing dataset which will be released on April 12th (Friday), 2024.

Submissions will first be scored on their performance for the two target sections separately. For $N$ test set samples, we define the score for a given measure $m$ as:

$$s_m = \frac{1}{2} \left( \frac{1}{N}\sum_{i=1}^{N} g(BHI_i) + \frac{1}{N}\sum_{i=1}^{N} g(DI_i) \right)$$

where $g(BHI_i)$ is the measure calculated on the "Brief Hospital Course" section for observation $i$ and $g(DI_i)$ is the measure calculated on the "Discharge Instructions" section for the same observation. Finally, the overall score is calculated as:

$$\mathrm{Overall} = \frac{1}{M}\sum_{m=1}^{M} s_m$$

where $M$ is the number of measures evaluated (8, as listed above). All scoring calculations will be done on Codabench with a Python 3.9 environment. The evaluation scripts are available on GitHub for reference.
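The aggregation can be illustrated with a few lines of Python. This is a sketch of the two formulas only, not the official evaluation script; the function names, the placeholder measure names, and the toy per-sample scores are assumptions.

```python
from statistics import mean

def measure_score(bhi_scores, di_scores):
    """s_m: mean over samples for each section, then the average of the two means."""
    if len(bhi_scores) != len(di_scores):
        raise ValueError("both sections must be scored on the same N samples")
    return 0.5 * (mean(bhi_scores) + mean(di_scores))

def overall_score(per_measure):
    """Overall: mean of s_m over all M measures."""
    return mean(measure_score(bhi, di) for bhi, di in per_measure.values())

# Toy per-sample scores for M = 2 placeholder measures and N = 2 samples.
toy = {
    "measure_a": ([0.5, 0.25], [0.75, 0.25]),
    "measure_b": ([0.25, 0.25], [0.5, 0.5]),
}

print(measure_score(*toy["measure_a"]))
print(overall_score(toy))
```

Each measure and each section carries equal weight in the final number, so a system that does well on only one of the two sections is capped at about half of the attainable score for every measure.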

For specific submission instructions and details on evaluation, please visit the Codabench competition page.

Release Notes

Version 1.2 - March 1st (Friday), 2024

  • Samples with target sections of fewer than 10 words were removed from the training and validation datasets.

Version 1.1 - February 20th (Tuesday), 2024

  • Samples with target sections of fewer than 10 words were removed from the phase I testing and phase II testing datasets.

Version 1.0 - February 6th (Tuesday), 2024

  • Original dataset released.

Currently, the dataset contains only the samples in the training, validation, and phase I testing datasets. The phase II testing dataset will be released on April 12th (Friday), 2024.

Additionally, the organizers may further update this dataset throughout the shared task to address issues raised by the participants.


All members of the organizing team have completed the required training and are credentialed users of MIMIC-IV.


Special thanks to Alistair Johnson and the PhysioNet team from the MIT Laboratory for Computational Physiology for managing the credentialing process and for hosting the data for this shared task.

Task Organizers:

  • Justin Xu
  • Jean-Benoit Delbrouck
  • Andrew Johnston
  • Louis Blankemeier
  • Curtis Langlotz

Conflicts of Interest

Nothing to declare.


References

  1. Zhen Xu, Sergio Escalera, Adrien Pavão, Magali Richard, Wei-Wei Tu, Quanming Yao, Huan Zhao, & Isabelle Guyon (2022). Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform. Patterns, 3(7), 100543.
  2. Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
  3. Johnson, A., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet.
  4. Johnson, A., Bulgarelli, L., Pollard, T., Celi, L. A., Mark, R., & Horng, S. (2023). MIMIC-IV-ED (version 2.2). PhysioNet.
  5. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  6. Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  7. T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” 2019.
  8. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
  9. Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada. Association for Computational Linguistics.
  10. W. Yim, Y. Fu, A. Ben Abacha, N. Snider, T. Lin, and M. Yetisgen, “Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation,” Scientific Data, vol. 10, no. 1, p. 586, Sep. 2023, doi:

Parent Projects
Discharge Me: BioNLP ACL'24 Shared Task on Streamlining Discharge Documentation was derived from MIMIC-IV-Note [3] and MIMIC-IV-ED [4]. Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

Versions:
  • 1.2 - March 5, 2024
  • 1.3 - April 12, 2024