De-Identification Version 1.1 README
=====================================

Name: Automated De-Identification of Free-Text Medical Records

Purpose: This software removes protected health information (PHI) from free-text medical records and outputs the de-identified text.

Authors:
  Margaret Douglass
  William J. Long
  Ishna Neamatullah (ishna AT alum DOT mit DOT edu)
  Li-wei Lehman (lilehman AT alum DOT mit DOT edu)

Last modified by Li-wei Lehman, April 2009

License: GNU GPL 2.0 (see the file called "COPYING")

Version: 1.1

Background literature:
1. Neamatullah I, Douglass M, Lehman LH, Reisner A, Villarroel M, Long WJ, Szolovits P, Moody GB, Mark RG, and Clifford GD. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak 2008;8(32). URL http://www.biomedcentral.com/1472-6947/8/32/.
2. Neamatullah I. Automated De-Identification of Free-Text Medical Records. MIT Press, 77 Mass. Ave., Cambridge, MA, 2006. MEng Thesis.
3. Douglass M. Computer-Assisted De-identification of Free-text Nursing Notes. MIT Press, 77 Mass. Ave., Cambridge, MA, USA, 2005. MEng Thesis.
4. Douglass M, Clifford GD, Reisner A, Long WJ, Moody GB, Mark RG. De-Identification Algorithm for Free-Text Nursing Notes. Computers In Cardiology, S6.2, 2005.
5. Douglass M, Clifford GD, Reisner A, Moody GB, Mark RG. Computer-Assisted Deidentification of Free Text in the MIMIC II Database. Computers In Cardiology, M6.2, 2004.

Platforms: Perl 5.8.8 and Perl 5.10, Fedora Core 10, Linux 2.6.27 (development and testing). The code is expected to run on Windows, but Windows is unsupported.

Code organization:
README.txt --- This file
Changes.log --- Documentation of changes since version 1.0
deid.pl --- Source code in Perl to de-identify medical notes
deid.config --- An example config file to run the Perl code in performance comparison mode
deid-output.config --- An example config file to run the Perl code in output mode
id.text --- Gold standard corpus with 2,434 re-identified nursing notes
id.deid --- List of PHI locations in the gold standard corpus (id.text)
id.types --- Categories of the PHI listed in id.deid
id-phi.phrase --- List of PHI locations and the PHI terms as they appear in the text
shift.txt --- The date shift file for patients in the gold standard corpus
lists/ --- Directory containing dictionaries/databases of potential PHI
dict/ --- Directory containing dictionaries of common words or UMLS terms
docs/DeidUserManual.doc --- More documentation on the deid software

The source code is contained in a single file (deid.pl). Each run can be configured using deid.config. The associated dictionaries and databases are in the lists/ and dict/ directories.

The shift.txt file contains a randomly assigned date shift (between 1000 and 3000 days) for each patient in the gold standard corpus. If the date shift filter is on, dates will be shifted by the specified number of days. Note: the date shifts in shift.txt were randomly generated for this public release and are different from those used internally to re-identify our medical notes. The per-patient date shifts used in re-identifying dates in our medical notes are generated to preserve the day-of-the-week or season information in the notes.

The id-phi.phrase file is not used by the deid code. It is provided so that users can see the text corresponding to each PHI location in the gold standard corpus. Its format is .

Installation: Use "gunzip" to unzip the gzipped file, then unpack the tar file with the "tar -xvf" command.

Testing: To allow testing of the algorithm's execution, we have provided a text file, id.text, with an associated gold standard, id.deid.
You can run the Perl code in two different modes:
(1) output mode without performance statistics, in which the program outputs the de-identified text;
(2) performance statistics mode, in which the program compares the PHI list generated by the code with the PHI list from the gold standard and outputs performance statistics.
Note: in either mode, a run takes approximately 10 minutes on a 3 GHz dual Pentium 4 processor.

Test code WITHOUT performance statistics (i.e., in output mode):
================================================================
1. Configure the run using deid-output.config.
   a) Comparison with Gold Standard: Set "Gold Standard Comparison" to '0' for output mode (without performance statistics).
   b) Date shifting: Enter the number of days of forward shift, or supply a shift.txt file and specify "y" (for yes) for "PID to date offset mapping". The default setting in deid-output.config for date shifting is 1000 days for all notes (without using shift.txt). Use the supplied shift.txt to assign a patient-specific date shift.
   c) PHI filters: Enter which filters should be turned on/off.
   d) Dictionary filters: Enter which dictionaries should be loaded.
2. Run the deid code on id.text: type "perl deid.pl id deid-output.config"
   a) The input filename should have the extension .text, but should be entered in the command without the extension.
   b) The code will output id.res, which is the scrubbed medical text with the PHI removed and replaced with appropriate tags.
3. Open id.res to examine the de-identified output.

Test results:
1. id.res = de-identified text with PHI removed.
2. id.phi = PHI locations in text.
3. id.info = information on PHI locations and the de-identification process, for debugging purposes.
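
The date-shifting choice in step 1b (one fixed forward shift for all notes, or a patient-specific shift looked up by PID) can be illustrated with a minimal Python sketch. This is not the code in deid.pl. The function name, the in-memory PID table, and its values are hypothetical, and the actual layout of shift.txt is not assumed here.

```python
from datetime import date, timedelta

# Hypothetical default, mirroring the 1000-day setting described for
# deid-output.config. It is not read from any config file in this sketch.
DEFAULT_SHIFT_DAYS = 1000


def shift_date(original, pid=None, pid_to_shift=None,
               default_shift=DEFAULT_SHIFT_DAYS):
    """Move `original` forward by the patient's shift, or by the default.

    If `pid` is present in `pid_to_shift`, that patient's offset is used;
    otherwise every date gets the same default offset.
    """
    days = default_shift
    if pid_to_shift is not None and pid in pid_to_shift:
        days = pid_to_shift[pid]
    return original + timedelta(days=days)


# Hypothetical PID-to-offset table standing in for the contents of shift.txt.
pid_to_shift = {"1": 1500, "2": 2750}

note_date = date(2009, 4, 1)
print(shift_date(note_date))                      # fixed shift for all notes
print(shift_date(note_date, "2", pid_to_shift))   # patient-specific shift
```

A shift that is a multiple of 7 days would also preserve the day of the week, which is one way to read the README's remark about the internally used shifts. That reading is an assumption and is not confirmed by the text.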

Since the gold standard corpus (id.text) does not supply a record date for each nursing note, the code uses a default record date when performing the date shift (see the .config file on how to specify the default date). If you would like the deid code to shift the dates within your medical records properly, you need to supply a record date for each note. Please see DeidUserManual.doc for more details.

Sample output files from this mode can be found in the directory ./GSoutput. To verify that the output files you generated by running the code are the same as the ones provided in the ./GSoutput directory, you can use the 'diff' command on Unix/Linux.

When running the code in output mode, you should see the following message on the screen:

*******************************************************************************************************************
De-Identification Algorithm: Identifies Protected Health Information (PHI) in Discharge Summaries and Nursing Notes
*******************************************************************************************************************
Starting de-identification...
Running deid in output mode. Output files will be:
id.phi: the PHI locations found by the code.
id.text: the scrubbed text.
id.info: debug info about the PHI locations.

Test code with performance statistics:
======================================
1. Configure the run using deid.config.
   a) Comparison with Gold Standard: Set 'Comparison with Gold Standard' to '1' for performance statistics. The gold standard PHI locations are in id.deid. Statistics will be printed on the screen at the end of the run.
   b) Turn off the following lists: Country names and Ethnicities. (Note: this should already be done for you in deid.config.)
   Make sure the file id.deid is in the same directory. Deid will evaluate the program output using the PHI locations in id.deid as the gold standard.
2. Run the deid code on id.text: type "perl deid.pl id deid.config"
   The filename should have the extension .text, but should be entered in the command without the extension.
3. Performance statistics will be printed on the screen.

Test results:
1. id.phi = PHI locations in text.
2. id.info = information on PHI locations and the de-identification process, for debugging purposes.
3. Performance statistics printed on the screen.
4. Note that no id.res is created. To create this file, the code has to be run with the 'Comparison with Gold Standard' option set to '0'.

Sample output files from this mode can be found in the directory ./GSstats.

The software reports the sensitivity (or recall) and the positive predictive value (PPV, or precision) of its output. Sensitivity/recall is the proportion of PHI instances identified by the software out of all instances of PHI in the text. PPV/precision is the proportion of true positives among all terms identified as PHI by the software.

When running the code in performance comparison mode, you should see the following output on your screen:

*******************************************************************************************************************
De-Identification Algorithm: Identifies Protected Health Information (PHI) in Discharge Summaries and Nursing Notes
*******************************************************************************************************************
Starting de-identification (version 1.1) ...
Running deid in performance comparison mode.
Using PHI locations in id.deid as comparison.
Output files will be:
id.phi: the PHI locations found by the code.
id.info: debug info about the PHI locations.
==========================
Num of true positives = 1720
Num of false positives = 546
Num of false negatives = 59
Sensitivity/Recall = 0.967
PPV/Specificity = 0.748
==========================

Customizing DeID to Work with Other Notes
==========================================
To make this de-identification software work with notes from other applications, you can replace our filter modules with your own application-specific filters. At a minimum, you will have to replace the following dictionary files:
* lists/pid_patientname.txt
* lists/stripped_hospitals.txt
* lists/local_places_ambig.txt
* lists/local_places_unambig.txt
* lists/doctor_first_names.txt
* lists/doctor_last_names.txt

Depending on your application, you may wish to re-classify names as ambiguous or unambiguous. For example, in most applications the word "Mae" is an unambiguous name, but in nursing and discharge notes it also means "moving all extremities" and is therefore an ambiguous term.

In case of problems contact:
Li-wei Lehman (lilehman AT alum DOT mit DOT edu)
Gari Clifford (gari AT mit DOT edu)
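
As an illustration of the name re-classification described under "Customizing DeID to Work with Other Notes", the Python sketch below splits a list of names into ambiguous and unambiguous ones by checking each against a set of domain terms. It is only a sketch under stated assumptions: the function name and the sample data are hypothetical, and it neither reads nor reproduces the format of the files in lists/.

```python
def classify_names(names, domain_terms):
    """Split names into (ambiguous, unambiguous).

    A name is treated as ambiguous when it also occurs, ignoring case,
    in the set of domain terms (for example, clinical abbreviations).
    """
    terms = {t.lower() for t in domain_terms}
    ambiguous = sorted(n for n in names if n.lower() in terms)
    unambiguous = sorted(n for n in names if n.lower() not in terms)
    return ambiguous, unambiguous


# Hypothetical sample data: "MAE" stands for "moving all extremities"
# in nursing notes, so the name "Mae" becomes ambiguous in that setting.
names = ["Mae", "Rose", "Ishikawa"]
nursing_terms = ["MAE", "rose"]

ambiguous, unambiguous = classify_names(names, nursing_terms)
print("ambiguous:", ambiguous)
print("unambiguous:", unambiguous)
```

The two resulting lists could then be used to prepare application-specific replacements for the ambiguous and unambiguous dictionary files, after checking the expected file format in DeidUserManual.doc.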