
Deidentified Medical Text

Margaret Douglass, Bill Long, George Moody, Peter Szolovits, Li-wei Lehman, Roger Mark, Gari D. Clifford

Published: Dec. 18, 2007. Version: 1.0


When using this resource, please cite:
Douglass, M., Long, B., Moody, G., Szolovits, P., Lehman, L., Mark, R., & Clifford, G. D. (2007). Deidentified Medical Text (version 1.0). PhysioNet. https://doi.org/10.13026/jc2a-ca12.

Additionally, please cite the original publication:

Neamatullah I, Douglass M, Lehman LH, Reisner A, Villarroel M, Long WJ, Szolovits P, Moody GB, Mark RG, Clifford GD. Automated De-Identification of Free-Text Medical Records. BMC Medical Informatics and Decision Making, 2008, 8:32. doi:10.1186/1472-6947-8-32

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Available here is a gold standard corpus of 2,434 nursing notes that have been thoroughly deidentified by a multi-pass process that included meticulous reviews by three or more experts working independently, as well as by a variety of automated methods. All detected instances of PHI in these nursing notes have been replaced by realistic surrogate data. The gold standard corpus is currently available only to those who have been granted access to PhysioNet Clinical Databases.


Background

In the USA, the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule restricts exchange of medical data containing protected health information (PHI), defined as any information that might be used to identify the individual(s) from whom the data were collected [1]. Data known to contain PHI can be shared for research purposes only under tightly controlled circumstances, typically involving data use agreements under which the researchers involved must obtain IRB or equivalent approvals for use of the data.

By contrast, medical data that do not contain PHI are exempt from the restrictions of the HIPAA Privacy Rule and may be shared freely. The content of this dataset falls into this category.

Many research datasets in healthcare include PHI, and the process of removing this PHI ("de-identification" in the language of HIPAA, or "anonymization") may be tedious and error-prone. For many research projects, the cost of de-identification is a significant barrier to data sharing. The gold standard corpus aims to assist the development and evaluation of software for de-identification.


Methods

To evaluate the de-identification approach described in the original publication [2], a randomly selected subset of the nursing progress notes was extracted from the MIMIC-II database. The selected nursing notes were thoroughly deidentified by a multi-pass process that included meticulous reviews by three or more human experts working independently, as well as by a variety of automated methods. All detected instances of PHI in these nursing notes have been replaced by realistic surrogate data in the gold standard corpus. For more details, see Neamatullah et al. [2].


Data Description

The gold standard corpus consists of 2,434 fully de-identified nursing notes from the MIMIC-II database. The nursing progress notes are unstructured free text typed into a clinical information system by nurses at the end of each shift. The notes include observations about the patient's medical history, his or her current physical and psychological state, medications being administered, laboratory test results, and other information about the patient's course in the ICU. For other files related to the gold standard corpus (such as the locations of the detected PHI), please see the associated software package [3].
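Because the corpus is intended for evaluating de-identification software, a character-level comparison between gold standard PHI locations and a tool's output is one common way to score a system. The following is only a minimal sketch under stated assumptions: the span representation (half-open character offsets) and the names `Span`, `span_chars`, and `precision_recall` are hypothetical and are not taken from the dataset or the associated software package, whose own file formats and evaluation scripts should be consulted.

```python
from typing import Iterable, Set, Tuple

# ASSUMPTION: a PHI location is a half-open [start, end) pair of character
# offsets into a note. This layout is illustrative, not the dataset's format.
Span = Tuple[int, int]


def span_chars(spans: Iterable[Span]) -> Set[int]:
    """Expand spans into the set of character positions they cover."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def precision_recall(gold: Iterable[Span],
                     predicted: Iterable[Span]) -> Tuple[float, float]:
    """Character-level precision and recall of predicted PHI against gold."""
    gold_chars = span_chars(gold)
    pred_chars = span_chars(predicted)
    true_pos = len(gold_chars & pred_chars)
    # By convention here, an empty denominator yields a score of 1.0.
    precision = true_pos / len(pred_chars) if pred_chars else 1.0
    recall = true_pos / len(gold_chars) if gold_chars else 1.0
    return precision, recall


if __name__ == "__main__":
    gold = [(0, 4), (10, 15)]
    predicted = [(0, 4), (12, 20)]
    p, r = precision_recall(gold, predicted)
    print(f"precision={p:.3f} recall={r:.3f}")
```

For de-identification, recall is usually the more critical figure, since every missed character of PHI is a potential privacy leak, while low precision only costs readability.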


Usage Notes

The text file "id.text" contains the 2,434 fully de-identified nursing notes, in which PHI has been replaced by realistic surrogate data. The text file "id.res" contains the 2,434 fully de-identified nursing notes, in which PHI has been replaced by the corresponding tags.
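As a rough illustration of how the notes might be loaded for experiments, the sketch below splits a corpus file into individual notes. It rests on an assumption not stated on this page: that each note is wrapped as `START_OF_RECORD=<patient>||||<note>||||` ... `||||END_OF_RECORD`. Verify this against the actual files before relying on it. The names `Note`, `iter_notes`, and `load_notes` are hypothetical helpers introduced only for this example.

```python
import re
from typing import Iterator, List, NamedTuple


class Note(NamedTuple):
    patient_id: str
    note_id: str
    text: str


# ASSUMPTION: records look like
#   START_OF_RECORD=<patient>||||<note>||||
#   ...free text...
#   ||||END_OF_RECORD
# This delimiter layout is not documented here; check it against the files.
RECORD_RE = re.compile(
    r"START_OF_RECORD=(\d+)\|\|\|\|(\d+)\|\|\|\|(.*?)\|\|\|\|END_OF_RECORD",
    re.DOTALL,
)


def iter_notes(raw: str) -> Iterator[Note]:
    """Yield one Note per record found in the raw file contents."""
    for match in RECORD_RE.finditer(raw):
        patient_id, note_id, body = match.groups()
        yield Note(patient_id, note_id, body.strip())


def load_notes(path: str) -> List[Note]:
    """Read a corpus file (for example "id.text") and return its notes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return list(iter_notes(handle.read()))


if __name__ == "__main__":
    sample = (
        "START_OF_RECORD=1||||1||||\nPt resting comfortably.\n||||END_OF_RECORD\n"
        "START_OF_RECORD=1||||2||||\nVSS overnight.\n||||END_OF_RECORD\n"
    )
    for note in iter_notes(sample):
        print(note.patient_id, note.note_id, repr(note.text))
```

If the assumption holds, `len(load_notes("id.text"))` should equal 2,434, which gives a quick sanity check on the delimiter pattern.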


Conflicts of Interest

The authors have no conflicts of interest.


References

  1. Standards for privacy of individually identifiable health information; final rule. Federal Register. 2002, 67:53181-53273.
  2. Neamatullah I, Douglass M, Lehman LH, Reisner A, Villarroel M, Long WJ, Szolovits P, Moody GB, Mark RG, Clifford GD. Automated De-Identification of Free-Text Medical Records. BMC Medical Informatics and Decision Making, 2008, 8:32. doi:10.1186/1472-6947-8-32
  3. De-Identification Software Package. https://www.physionet.org/content/deid/

Parent Projects
Deidentified Medical Text was derived from: Please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
