Database Credentialed Access
Deidentified Medical Text
Margaret Douglass , Bill Long , George Moody , Peter Szolovits , Li-wei Lehman , Roger Mark , Gari D. Clifford
Published: Dec. 18, 2007. Version: 1.0
When using this resource, please cite:
Douglass, M., Long, B., Moody, G., Szolovits, P., Lehman, L., Mark, R., & Clifford, G. D. (2007). Deidentified Medical Text (version 1.0). PhysioNet. https://doi.org/10.13026/jc2a-ca12.
Neamatullah I, Douglass M, Lehman LH, Reisner A, Villarroel M, Long WJ, Szolovits P, Moody GB, Mark RG, Clifford GD. Automated De-Identification of Free-Text Medical Records. BMC Medical Informatics and Decision Making, 2008, 8:32. doi:10.1186/1472-6947-8-32
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Available here is a gold standard corpus of 2,434 nursing notes that have been thoroughly deidentified by a multi-pass process that included meticulous reviews by three or more experts working independently, as well as by a variety of automated methods. All detected instances of PHI in these nursing notes have been replaced by realistic surrogate data. The gold standard corpus is currently available only to those who have been granted access to PhysioNet Clinical Databases.
In the USA, the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule restricts the exchange of medical data containing protected health information (PHI), defined as any information that might be used to identify the individual(s) from whom the data were collected. Data known to contain PHI can be shared for research purposes only under tightly controlled circumstances, typically involving data use agreements under which the researchers involved must obtain IRB or equivalent approvals for use of the data.
By contrast, medical data that do not contain PHI are exempt from the restrictions of the HIPAA Privacy Rule and may be shared freely. The content of this dataset falls into this category.
Many research datasets in healthcare include PHI, and the process of removing this PHI ("de-identification" in the language of HIPAA, or "anonymization") may be tedious and error-prone. For many research projects, the cost of de-identification is a significant barrier to data sharing. The gold standard corpus aims to assist the development and evaluation of software for de-identification.
To evaluate the de-identification approach described by Neamatullah et al., a randomly selected subset of the nursing progress notes was extracted from the MIMIC-II database. The selected nursing notes were thoroughly deidentified by a multi-pass process that included meticulous reviews by three or more human experts working independently, as well as by a variety of automated methods. All detected instances of PHI in these nursing notes have been replaced by realistic surrogate data in the gold standard corpus. For more details, see Neamatullah et al.
The gold standard corpus consists of 2,434 fully de-identified nursing notes from the MIMIC-II database. The nursing progress notes are unstructured free text typed into a clinical information system by the nurses at the end of each shift. The notes include observations about the patient's medical history, his or her current physical and psychological state, medications being administered, laboratory test results, and other information about the patient's course in the ICU. For other files related to the gold standard corpus (such as the locations of detected PHI), please see the associated software package.
The text file "id.text" contains the 2,434 fully de-identified nursing notes, with each instance of PHI replaced by realistic surrogate data. The text file "id.res" contains the same 2,434 nursing notes, with each instance of PHI replaced by the corresponding tag.
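As an illustration of how a gold standard of this kind might be used to evaluate de-identification software, the sketch below scores a tool's predicted PHI locations against gold PHI locations. It is a minimal sketch under stated assumptions, not the evaluation procedure of Neamatullah et al.: it assumes PHI locations are already available as `(start, end)` character offsets within a single note, and the names `span_chars` and `phi_scores` are hypothetical. The actual format of the PHI location files in the software package is not described here, so parsing them is left out.

```python
from typing import Iterable, Set, Tuple

# Assumption: a PHI location is a (start, end) character offset pair, end exclusive.
Span = Tuple[int, int]


def span_chars(spans: Iterable[Span]) -> Set[int]:
    """Expand spans into the set of character offsets they cover."""
    covered: Set[int] = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def phi_scores(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float]:
    """Return character-level (recall, precision) of predicted PHI against gold PHI."""
    gold_chars = span_chars(gold)
    pred_chars = span_chars(predicted)
    true_pos = len(gold_chars & pred_chars)
    # With nothing to find (or nothing predicted), treat the score as perfect.
    recall = true_pos / len(gold_chars) if gold_chars else 1.0
    precision = true_pos / len(pred_chars) if pred_chars else 1.0
    return recall, precision


# Hypothetical note: two gold PHI spans; the tool finds the first one
# and also flags a span that is not PHI.
gold_spans = [(0, 4), (10, 16)]
predicted_spans = [(0, 4), (20, 22)]
recall, precision = phi_scores(gold_spans, predicted_spans)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Character-level scoring is only one possible choice. For de-identification, recall is usually the more critical figure, since a missed instance of PHI is a privacy risk whereas a false positive only removes non-identifying text.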
Conflicts of Interest
The authors have no conflicts of interest.
- Standards for Privacy of Individually Identifiable Health Information; Final Rule. Federal Register, 2002, 67:53181-53273.
- Neamatullah I, Douglass M, Lehman LH, Reisner A, Villarroel M, Long WJ, Szolovits P, Moody GB, Mark RG, Clifford GD. Automated De-Identification of Free-Text Medical Records. BMC Medical Informatics and Decision Making, 2008, 8:32. doi:10.1186/1472-6947-8-32
- De-Identification Software Package. https://www.physionet.org/content/deid/
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
- be a credentialed user
- complete required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project