What's in a Note? Unpacking Predictive Value in Clinical Note Representations

Tristan Naumann, William Boag

Published: Jan. 7, 2018. Version: 0.1

When using this resource, please cite:
Naumann, T., & Boag, W. (2018). What's in a Note? Unpacking Predictive Value in Clinical Note Representations (version 0.1). PhysioNet.

Additionally, please cite the original publication:

Boag, Willie and Doss, Dustin and Naumann, Tristan and Szolovits, Peter. What’s in a note? Unpacking predictive value in clinical note representations. AMIA Summits on Translational Science Proceedings (2018). American Medical Informatics Association

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Abstract

Unlocking the full potential of clinical notes within electronic health records presents a challenge for large-scale computational analyses. Many studies that incorporate information from clinical notes do so with simple word-matching approaches and a focused goal of improving performance on a specific task, such as mortality prediction. Our project explores representation strategies for clinical text and ways of combining many words into larger document-level representations. The code and model shared here are supplements to the AMIA 2018 Informatics Summit paper of the same name.


Background

Electronic Health Records (EHRs) contain rich information about patient physiology, interventions and treatments, and diagnoses. Alongside highly structured data such as vital sign measurements, free text notes are often used to capture important observations about patient state and interventions, as well as caregiver insights on patient trajectory. There are different ways in which information can be extracted from these free text notes to support retrospective analysis, ranging from simple pattern-matching approaches that count the frequency of specified words to more complex methods that capture relationships between words.

The principal aim of our study, which is described in the AMIA 2018 Informatics Summit paper of the same name, was to better understand the information captured in various representations of clinical notes [1]. Here we share a word vector representation of clinical notes that was generated as part of that study, along with the associated code. The representation was created using notes from the publicly available MIMIC-III database, version 1.4. MIMIC-III v1.4 contains de-identified EHR data from over 58,000 hospital admissions for nearly 38,600 adult patients [2].

Model Description

The word model shared here was generated with word2vec and trained on clinical notes from MIMIC-III, as opposed to the classic Mikolov vectors, which were trained on Google News [3,4]. The model maps each word to a numerical representation in a multi-dimensional space. For example, Word A corresponds to a 300-dimensional vector, Word B corresponds to another 300-dimensional vector, and so forth. The position of a word in this space provides information about its relationship with other words. Unlike more recent contextual models such as BERT, the representation of a word does not change with its surrounding context.
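One common way to read "relationship" from positions in the space is cosine similarity between two word vectors. The sketch below is illustrative only: the three-dimensional vectors and the words chosen are hypothetical stand-ins, not values from mimic10.vec, and the helper name is our own.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors; the shared model uses 300 dimensions.
toy = {
    "fever":   [0.9, 0.1, 0.0],
    "febrile": [0.8, 0.2, 0.1],
    "suture":  [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy["fever"], toy["febrile"])
sim_unrelated = cosine_similarity(toy["fever"], toy["suture"])
print(sim_related > sim_unrelated)
```

The same function works on the NumPy vectors returned by the loader in the Installation and Requirements section, since it only iterates over the elements.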

The files are described below:

  • mimic10.vec: Plain text word2vec representation of MIMIC-III clinical notes.
  • Python script used to generate the mimic10.vec file.
  • Zip file containing a snapshot of the code used in the AMIA 2018 Informatics Summit paper.
  • requirements.txt: Python requirements file outlining packages used when generating the model.

Technical Implementation

All MIMIC-III clinical notes were used to generate the vector representation. The de-identified clinical notes were pre-processed with the following steps:

  • Tags indicating de-identified protected health information were removed. 
  • Phrases written entirely in capital characters, which typically represent structural elements of a document such as section headers, were replaced by a single token. For example, "ADMISSION MEDICATIONS" would be replaced with a single token.
  • Regular expressions for common age patterns were used to replace all ages with symbols binned by decade. 
  • All non-alphanumeric tokens were removed, and remaining numbers were normalized to a single token to represent a number.
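The steps above can be sketched roughly as follows. This is a minimal illustration and not the code used in the study: the placeholder token names, the age pattern, the decade symbol format, and the tokenizer are all assumptions. The `[** ... **]` pattern reflects the de-identification tags found in MIMIC-III notes. Note that the simple all-caps pattern here would also replace capitalized abbreviations.

```python
import re

HEADER_TOKEN = "SECTIONHEADER"  # hypothetical placeholder name
NUMBER_TOKEN = "NUMBERTOKEN"    # hypothetical placeholder name

PHI_TAG = re.compile(r"\[\*\*.*?\*\*\]", re.DOTALL)
# Runs of all-caps words, with an optional trailing colon (simplified).
ALL_CAPS = re.compile(r"\b[A-Z]{2,}(?: [A-Z]{2,})*\b:?")
# One assumed age pattern; real notes need more variants.
AGE = re.compile(r"\b(\d{1,3})[- ]?(?:years?[- ]old|yo|y/o)\b", re.IGNORECASE)
NUMBER = re.compile(r"\d+(?:\.\d+)?")
TOKEN = re.compile(r"\w+(?:\.\d+)?|[^\w\s]+")

def age_to_bin(match):
    """Map an age such as 68 to a decade symbol such as age_60s."""
    decade = (int(match.group(1)) // 10) * 10
    return f"age_{decade}s"

def preprocess(note):
    text = PHI_TAG.sub(" ", note)            # 1. drop de-identification tags
    text = ALL_CAPS.sub(HEADER_TOKEN, text)  # 2. collapse all-caps phrases
    text = AGE.sub(age_to_bin, text)         # 3. bin ages by decade
    tokens = []
    for tok in TOKEN.findall(text):
        if not any(ch.isalnum() for ch in tok):
            continue                         # 4a. drop non-alphanumeric tokens
        # 4b. normalize remaining numbers to a single token
        tokens.append(NUMBER_TOKEN if NUMBER.fullmatch(tok) else tok)
    return tokens

note = ("ADMISSION MEDICATIONS: aspirin 81 mg daily. Patient is a 68 year old "
        "man followed by [**Doctor Last Name 123**].")
print(preprocess(note))
```

Ages are binned before number normalization so that the age digits are not first collapsed into the generic number token.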

As described in the AMIA 2018 Informatics Summit paper, word vectors were trained with word2vec using hyperparameters suggested by Levy et al. [3]:

  • 300-dimensional SGNS with 10 negative samples.
  • min-count of 10.
  • subsampling rate of 1e-5.
  • a 10-word window.
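For readers who want to see how these settings map onto the original word2vec command-line tool [4], the sketch below assembles an argument list. The flag names are those of the reference C implementation. The binary location, the corpus file name, and the idea that the model was trained through this exact invocation are assumptions made for illustration.

```python
# Hyperparameters listed above, keyed by the corresponding word2vec flag.
HYPERPARAMS = {
    "-cbow": "0",        # 0 selects skip-gram rather than CBOW
    "-hs": "0",          # no hierarchical softmax; negative sampling only
    "-size": "300",      # vector dimensionality
    "-negative": "10",   # negative samples
    "-min-count": "10",  # discard words seen fewer than 10 times
    "-sample": "1e-5",   # subsampling rate for frequent words
    "-window": "10",     # context window size
}

def build_word2vec_command(corpus_path, output_path, binary="./word2vec"):
    """Return the argument list for a text-format (non-binary) training run."""
    cmd = [binary, "-train", corpus_path, "-output", output_path, "-binary", "0"]
    for flag, value in HYPERPARAMS.items():
        cmd += [flag, value]
    return cmd

# Hypothetical file names, for illustration only.
cmd = build_word2vec_command("notes_preprocessed.txt", "mimic10.vec")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` once the word2vec binary has been compiled and a preprocessed corpus is available.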

Installation and Requirements

The word embedding (mimic10.vec) is provided as a plain text file. It can be loaded into a Python dictionary using the following function:

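import numpy as np

def load_word2vec(filename):
    """Load a plain-text word2vec file into a dict mapping word -> vector."""
    W = {}
    with open(filename, 'r') as f:
        for i, line in enumerate(f):
            # The first line is a header (vocabulary size and dimension); skip it.
            if i == 0:
                continue
            toks = line.strip().split()
            w = toks[0]
            # list() is needed in Python 3, where map() returns an iterator.
            vec = np.array(list(map(float, toks[1:])))
            W[w] = vec
    return W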

For example, the following steps will assign the embedding to a variable called "model":

# import numpy
import numpy as np

# give path to the embedding
path = "mimic10.vec"

# load the embedding
model = load_word2vec(path) 

Usage Notes

The model and code shared here can be used to reproduce the study in our AMIA 2018 Informatics Summit paper. It may also be necessary to download the MIMIC-III dataset. To access MIMIC-III, users must complete a course in human research and sign the appropriate data use agreement. The experiments described in the study can be run after installation via the following commands:

python code/ all
python code/ all 
python code/ all 
python code/ all

Release Notes

This work was carried out before BERT became very popular in NLP. Essentially, we were looking to compare different representation strategies for clinical text (e.g. bag of words vs word embeddings) and how to combine many words into larger document-level representations (e.g. a simple point-wise min/max/average vs a more complicated blackbox LSTM neural net).
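To make the "simple point-wise min/max/average" idea concrete, the sketch below pools the word vectors of one document dimension by dimension. It is an illustration under our own assumptions rather than the study's code: the function name, the choice to skip out-of-vocabulary tokens, and the two-dimensional toy embeddings are hypothetical.

```python
def pool_document(tokens, embeddings):
    """Return the element-wise min, max, and mean of a document's word vectors."""
    # Assumption: tokens missing from the vocabulary are simply skipped.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no in-vocabulary tokens in document")
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    return {
        "min": [float(min(c)) for c in columns],
        "max": [float(max(c)) for c in columns],
        "mean": [float(sum(c)) / len(c) for c in columns],
    }

# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {"chest": [1.0, 4.0], "pain": [3.0, 2.0]}
pooled = pool_document(["chest", "pain", "unknownword"], toy_embeddings)
print(pooled)
```

Each pooled vector has the same dimensionality as a single word vector, so documents of any length map to fixed-size features. How the three vectors are then combined for a downstream classifier is left open here.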


Acknowledgements

This research was funded in part by the Intel Science and Technology Center for Big Data, the National Library of Medicine Biomedical Informatics Research Training grant 2T15 LM007092-22, the National Science Foundation Graduate Research Fellowship Program under Grant No. 1122374, NIH grants U54-HG007963 and R01-EB017205, and collaborative research agreements from Philips Corporation and Wistron Corporation. The authors would like to thank Jen Gong for her input and suggestions.

Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Boag W, Doss D, Naumann T, Szolovits P. What's in a Note? Unpacking Predictive Value in Clinical Note Representations. AMIA Joint Summits on Translational Science Proceedings. 2018;2017:26–34.
  2. Johnson AE, Pollard TJ, Shen L, Lehman LwH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016;3.
  3. Levy O, Goldberg Y, Dagan I. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics. 2015;3:211–225.
  4. Source code for word2vec on GitHub (accessed 18 May 2018).

Parent Projects
What's in a Note? Unpacking Predictive Value in Clinical Note Representations was derived from: Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
