Database Credentialed Access
MIMIC-IV-Note: Deidentified free-text clinical notes
Alistair Johnson, Tom Pollard, Steven Horng, Leo Anthony Celi, Roger Mark
Published: Jan. 5, 2023. Version: 2.1
When using this resource, please cite:
Johnson, A., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.1). PhysioNet. https://doi.org/10.13026/0p14-t007.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
The advent of large, open access text databases has driven advances in state-of-the-art model performance in natural language processing (NLP). The relatively limited amount of clinical data available for NLP has been cited as a significant barrier to the field's progress. Here we describe MIMIC-IV-Note: a collection of deidentified free-text clinical notes for patients included in the MIMIC-IV clinical database. MIMIC-IV-Note contains 357,289 deidentified discharge summaries from 161,403 patients admitted to the hospital and emergency department at the Beth Israel Deaconess Medical Center in Boston, MA, USA. The database also contains 2,471,881 deidentified radiology reports for 256,400 patients. All notes have had protected health information removed in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. All notes are linkable to MIMIC-IV, providing important context to the clinical data therein. The database is intended to stimulate research in clinical natural language processing and associated areas.
Free-text notes are integral to the care provided to patients in most health systems. Although digital information systems with structured forms for input have proliferated over the last few decades, providers continue to rely on free-text notes to communicate amongst themselves, relay important health-related information to patients, and document information related to care plans. For researchers, free-text notes are an important source of information for understanding a patient's clinical course. Investigations into public views on the use of patient data for research broadly suggest that there is a willingness to share data where it is for the common good. This trend holds when considering free-text notes in particular [3, 4]. Despite the desire of both the public and researchers to improve health through the processing of clinical notes, lack of access to shared clinical text has slowed progress. The cost and complexity of deidentifying clinical text is often the major barrier to the creation of clinical text datasets. In contrast, the broad use of MIMIC-III for clinical natural language processing demonstrates the potential of these datasets once deidentified. Recent advances in natural language processing have dramatically improved the capabilities of text processing models for a variety of tasks, including named entity recognition. These improvements have been demonstrated to carry over to the deidentification of free-text notes, which can be cast as a named entity recognition task. Advances in automatic deidentification of free-text clinical notes have made it possible to share clinical notes for research in a manner that protects patient privacy and maximizes public benefit.
All inclusion criteria for MIMIC-IV also apply to MIMIC-IV-Note. In general, only notes occurring within one year of a patient encounter are included in the database, where an encounter is defined as an emergency department or hospital stay. Free-text notes were acquired from the hospital system and deidentified using a custom rule-based approach combined with a neural network trained for deidentification. The annotations from the two approaches were combined by taking their union, and all identified instances of protected health information (PHI) were removed from the notes. Each instance of PHI was replaced with exactly three underscores. Previous work identified the sensitivity of the approach to be 99.9% for radiology reports. A manual review of discharge summaries was also conducted, with no PHI found.
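The union-and-replace step described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the span representation (character offsets), the function names merge_spans and redact, and the choice to merge overlapping or touching spans into a single placeholder are all assumptions made for illustration. The example note is invented.

```python
def merge_spans(spans):
    """Merge overlapping or touching (start, end) character spans."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Extend the previous span instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact(text, rule_spans, model_spans, placeholder="___"):
    """Replace every span flagged by either annotator with the placeholder."""
    spans = merge_spans(list(rule_spans) + list(model_spans))
    pieces, cursor = [], 0
    for start, end in spans:
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Invented example; offsets are computed rather than hard-coded.
note = "Seen by Dr. Jane Roe on 2020-01-02 at clinic."
name = (note.index("Jane Roe"), note.index("Jane Roe") + len("Jane Roe"))
date = (note.index("2020-01-02"), note.index("2020-01-02") + len("2020-01-02"))

rule_spans = [date]         # hypothetical rule-based annotator output
model_spans = [name, date]  # hypothetical neural annotator output
redacted = redact(note, rule_spans, model_spans)
print(redacted)
```

Taking the union favors sensitivity: a span is removed if either annotator flags it, at the cost of occasionally removing text that was not PHI.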
There are four tables in the dataset: discharge, discharge_detail, radiology, and radiology_detail. In general, the name of the table refers to the domain of the note, and tables with a _detail suffix are entity-attribute-value tables with additional information relating to the free-text notes. Each table contains a note_id which uniquely identifies a note and is composed of the subject_id, the abbreviated note domain, and a sequential integer.
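As a rough illustration of the note_id structure, the sketch below splits an identifier into its three parts. The text above does not state the separator or the domain abbreviations, so the hyphen separator, the "DS" abbreviation, and the example identifier are assumptions; NoteId and parse_note_id are hypothetical names.

```python
from typing import NamedTuple


class NoteId(NamedTuple):
    subject_id: int
    domain: str    # abbreviated note domain
    sequence: int  # sequential integer


def parse_note_id(note_id: str, sep: str = "-") -> NoteId:
    """Split a note_id into its parts; the separator is an assumption."""
    subject, domain, sequence = note_id.split(sep)
    return NoteId(int(subject), domain, int(sequence))


# Illustrative identifier only; check it against the actual files.
parsed = parse_note_id("10000032-DS-21")
print(parsed)
```

Because the subject_id is embedded in the identifier, a parsed note_id can be checked against the subject_id column as a basic consistency test.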
The discharge table contains discharge summaries for hospitalizations. Discharge summaries are long-form narratives which describe the reason for a patient’s admission to the hospital, their hospital course, and any relevant discharge instructions. The discharge_detail table contains auxiliary information associated with discharge summaries. As of v2.0, it contains only deidentified author names for the summaries.
The radiology table contains free-text radiology reports associated with radiography imaging. Radiology reports cover a variety of imaging modalities: x-ray, computed tomography, magnetic resonance imaging, ultrasound, and so on. Free-text radiology reports are semi-structured and usually follow a consistent template for a given imaging protocol. For example, chest x-rays typically have four sections: indication, comparison, findings, and impression.
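Because reports tend to follow a template, a simple header-based splitter is often a reasonable first pass. The sketch below assumes each section starts on its own line with an upper-case header followed by a colon (for example "FINDINGS:"); that layout, the function name split_sections, and the sample report are assumptions, and real reports may deviate from the template.

```python
import re

SECTION_NAMES = ("INDICATION", "COMPARISON", "FINDINGS", "IMPRESSION")
# A header is a section name at the start of a line, followed by a colon.
HEADER = re.compile(
    r"^[ \t]*(%s):" % "|".join(SECTION_NAMES),
    re.MULTILINE | re.IGNORECASE,
)


def split_sections(report):
    """Return {section name: text} for each recognized header in the report."""
    sections = {}
    matches = list(HEADER.finditer(report))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        body = report[match.end():end]
        # Collapse line breaks and repeated spaces inside the section body.
        sections[match.group(1).lower()] = " ".join(body.split())
    return sections


# Invented report in the assumed layout.
report = """INDICATION: ___ year old with cough.

COMPARISON: None.

FINDINGS: The lungs are clear.
No pleural effusion.

IMPRESSION: No acute cardiopulmonary process.
"""
sections = split_sections(report)
print(sections["impression"])
```

Reports for other imaging protocols would need their own header list, and text before the first header is ignored by this sketch.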
The radiology_detail table provides information associated with the imaging study. Current Procedural Terminology (CPT) codes, exam names, and links between parent reports and addendums are available in the table.
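Entity-attribute-value tables store one attribute per row, so a common first step is to regroup rows by note. The sketch below assumes the attribute and value columns are named field_name and field_value and uses invented attribute names and values; none of these are stated in the description above, so check them against the actual file headers.

```python
from collections import defaultdict


def pivot_detail(rows):
    """Group EAV rows into {note_id: {field_name: [values, ...]}}."""
    wide = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # A note may carry several values for the same attribute,
        # so values are collected in a list.
        wide[row["note_id"]][row["field_name"]].append(row["field_value"])
    return {note_id: dict(fields) for note_id, fields in wide.items()}


# Invented rows; column names and attribute names are assumptions.
rows = [
    {"note_id": "1-RR-1", "field_name": "exam_name", "field_value": "CHEST (PA AND LAT)"},
    {"note_id": "1-RR-1", "field_name": "cpt_code", "field_value": "71020"},
    {"note_id": "1-RR-2", "field_name": "exam_name", "field_value": "CT HEAD W/O CONTRAST"},
]
details = pivot_detail(rows)
print(details["1-RR-1"]["exam_name"])
```

Keeping values in lists avoids silently dropping data when an attribute repeats for one note.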
The notes are distributed as comma separated value (CSV) files. Each row corresponds to a unique note and has been assigned a unique note_id.
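Note text typically spans many lines and contains commas, so the files should be read with a CSV parser rather than split line by line. The sketch below uses Python's standard csv module on an in-memory sample; the column names (note_id, subject_id, text), the quoting of multi-line fields, and the sample rows are assumptions for illustration.

```python
import csv
import io

# Invented sample: the first note contains a line break and a comma
# inside a quoted field.
sample = (
    "note_id,subject_id,text\n"
    '"1-DS-1",1,"Line one\nLine two, with a comma"\n'
    '"2-RR-1",2,"Single line"\n'
)

# newline="" lets the csv module handle line breaks inside quoted fields.
with io.StringIO(sample, newline="") as handle:
    notes = {row["note_id"]: row["text"] for row in csv.DictReader(handle)}

print(len(notes))
print(notes["1-DS-1"])
```

For a file on disk, the same pattern applies with open(path, newline="", encoding="utf-8") in place of io.StringIO; the encoding is likewise an assumption.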
We have created an open source repository for the sharing of code and discussion of the database, referred to as the MIMIC Code Repository [8, 9]. The code repository provides a mechanism for shared discussion and analysis of all versions of MIMIC, including MIMIC-IV-Note.
MIMIC-IV-Note v2.1 was released in November 2022. It was the first publicly available version of the database.
The collection of patient information and creation of the research resource were reviewed by the Institutional Review Board at the Beth Israel Deaconess Medical Center, which granted a waiver of informed consent and approved the data sharing initiative.
We would like to thank the Beth Israel Deaconess Medical Center for their continued support of the MIMIC project. In particular we would like to thank Carolyn Conti, Alvin Gayles, Larry Markson, Ayad Shammout, Lu Shen, and Manu Tandon for their assistance. We would also like to thank the NIH for their gracious support.
Conflicts of Interest
None to declare.
- Makam AN, Lanham HJ, Batchelor K, Samal L, Moran B, Howell-Stampley T, Kirk L, Cherukuri M, Santini N, Leykum LK, Halm EA. Use and satisfaction with key functions of a common commercial electronic health record: a survey of primary care providers. BMC medical informatics and decision making. 2013 Dec;13(1):1-7.
- Stockdale J, Cassell J, Ford E. “Giving something back”: A systematic review and ethical enquiry into public views on the use of patient data for research in the United Kingdom and the Republic of Ireland. Wellcome open research. 2018;3.
- Ford E, Stockdale J, Jackson R, Cassell J. For the greater good? Patient and public attitudes to use of medical free text data in research. International Journal of Population Data Science. 2017;1(1).
- Ford E, Oswald M, Hassan L, Bozentko K, Nenadic G, Cassell J. Should free-text data in electronic medical records be shared for research? A citizens’ jury study in the UK. Journal of medical ethics. 2020 Jun 1;46(6):367-77.
- Chapman WW, Nadkarni PM, Hirschman L, D'avolio LW, Savova GK, Uzuner O. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. Journal of the American Medical Informatics Association. 2011 Sep 1;18(5):540-3.
- Johnson AE, Pollard TJ, Berkowitz SJ, Greenbaum NR, Lungren MP, Deng CY, Mark RG, Horng S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data. 2019 Dec 12;6(1):1-8.
- Johnson AE, Bulgarelli L, Pollard TJ. Deidentification of free-text medical records using pre-trained bidirectional transformers. In Proceedings of the ACM Conference on Health, Inference, and Learning 2020 Apr 2 (pp. 214-221).
- Johnson AE, Stone DJ, Celi LA, Pollard TJ. The MIMIC Code Repository: enabling reproducibility in critical care research. Journal of the American Medical Informatics Association. 2018 Jan;25(1):32-9.
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
CITI Data or Specimens Only Research