Model Credentialed Access
Clinical-T5: Large Language Models Built Using MIMIC Clinical Text
Eric Lehman, Alistair Johnson
Published: Jan. 25, 2023. Version: 1.0.0
When using this resource, please cite:
Lehman, E., & Johnson, A. (2023). Clinical-T5: Large Language Models Built Using MIMIC Clinical Text (version 1.0.0). PhysioNet. https://doi.org/10.13026/rj8x-v335.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Abstract
Recent advances in scaling large language models (LLMs) have resulted in significant improvements on a number of natural language processing benchmarks. There has been some work on pretraining these language models on clinical text. These works demonstrate that training a language model using masked language modeling (MLM) on clinical notes is an effective technique for boosting performance on downstream tasks. All of these previous works use encoder-only architectures. We train four different clinical T5 models on the union of MIMIC-III and MIMIC-IV notes. Two of the models are initialized from previous T5 models (T5-Base and SciFive). We additionally train a T5-Base and a T5-Large model from scratch. These models should not be distributed to non-credentialed users. Research has shown that these language models have the potential to leak sensitive information. Due to this potential risk, we release the model weights under PhysioNet credentialed access.
Background
Pretraining large language models (LLMs) on large text corpora has generally led to performance gains on a wide variety of tasks. Domain-agnostic models, i.e., models trained on general text such as news and Wikipedia, often struggle to adapt to clinical text. This may be due to the unique structure of clinical notes. These notes, like the ones in MIMIC [1], contain domain-specific knowledge, unexplained abbreviations, and incomplete sentences. There has been some work showing that biomedically pretrained models, i.e., models trained on biomedical text such as PubMed, generally perform better than general-domain models on clinical tasks. For example, [2] showed that BioBERT [3] outperformed BERT on the i2b2 2010 and 2012 tasks [4] [5]. This is likely because there is an overlap in domain-specific knowledge between biomedical text and clinical text. However, similar to [2] [6] [7], we find that clinically pretrained models outperform their biomedical equivalents on a number of tasks. In this work, we use the T5 model [8], as its bidirectional encoder-decoder architecture supports a more diverse set of tasks.
There has been some concern, however, over the safety of releasing a model trained on MIMIC, as these models may leak training data [9] [10]. Notably, [9] found that GPT-2-XL (1.5B parameters) [11], a decoder-only model, memorized sensitive information seen during training. For GPT-2-Medium (~350M) and GPT-2-Small (110M), there was some and minimal leakage, respectively. For this reason, we release these models with credentialed access.
Model Description
T5 models are bidirectional encoder-decoder models, which means that there is a transformer that encodes the input text and a transformer that decodes the output text [8]. These models are pretrained using masked language modeling (MLM) on text. For example, given the sentence "the patient has a history of ankle sprains, hamstring tears, and shoulder dislocations", we randomly replace words with special tokens and ask the model to produce the words that were replaced:
Inputs = "the patient has a [MASK1] of ankle sprains, hamstring [MASK2], and [MASK3] dislocations"
Outputs = "[MASK1] history [MASK2] tears [MASK3] shoulder"
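The input/output pair above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, not the pretraining code used for these models: the helper name `mask_words` is hypothetical, the masked positions are passed in explicitly rather than sampled, and the `[MASKi]` notation follows the example above (the released tokenizers use sentinel tokens of the form `<extra_id_0>`).

```python
def mask_words(sentence, positions):
    """Replace the words at the given indices with [MASKi] sentinels.

    Hypothetical helper for illustration only. Returns the masked input
    string and the target string the model would be asked to produce.
    """
    inputs, targets = [], []
    sentinel = 0
    for index, word in enumerate(sentence.split()):
        if index in positions:
            sentinel += 1
            tag = f"[MASK{sentinel}]"
            # Keep trailing punctuation in the input, as in the example above.
            core = word.rstrip(",.;:")
            trailing = word[len(core):]
            inputs.append(tag + trailing)
            targets.extend([tag, core])
        else:
            inputs.append(word)
    return " ".join(inputs), " ".join(targets)


sentence = ("the patient has a history of ankle sprains, "
            "hamstring tears, and shoulder dislocations")
masked_input, target = mask_words(sentence, {4, 9, 11})
print(masked_input)
print(target)
```

In practice the positions would be drawn at random for every training example, so no labels are needed to build these pairs.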
This task can be done in an unsupervised manner. The choice of pretraining text, however, has been shown to affect performance on downstream tasks. The original T5 model uses a blend of various general sources (e.g., Wikipedia) as input to its MLM pretraining scheme. SciFive [12], in contrast, further pretrains a T5 model only on PubMed and shows that this boosts performance on biomedical tasks. We train and release four different models, all using the T5 architecture but with different sizes and weight initializations. With respect to clinical text, we attempt to determine (1) whether it is better to train from scratch (more expensive) or acceptable to start from an existing initialization point, and (2) if starting from an initialization point, whether it is important to use one trained on PubMed.
We describe each of the models below:
Clinical-T5-Base: This model was initialized from T5-Base [8]. As mentioned previously, T5-Base is trained on a variety of general text using the MLM training scheme shown above. Afterwards, T5-Base was trained on several downstream tasks, including SQuAD. We use this as our starting point for the MLM task. We use MIMIC-III and MIMIC-IV as the input text for our MLM training.
Clinical-T5-Sci: This model was initialized from SciFive [12]. SciFive uses T5-Base as its initialization point. [12] then trains the model further for 200K steps on PubMed abstracts and PubMed Central. In the Clinical-T5-Sci version of the model, we use the SciFive model as our starting point for the MLM task. We then use MIMIC-III and MIMIC-IV as the input text for our MLM training.
Clinical-T5-Scratch: We use the same architecture as T5-Base (220M), but randomly initialize the weights. Further, we construct a vocabulary for the model based on MIMIC notes. We then use the MLM task with chunks of text from MIMIC.
Clinical-T5-Large: We use the same architecture as T5-Large (770M), but randomly initialize the weights. Further, we construct a vocabulary for the model based on MIMIC notes. We then use the MLM task with chunks of text from MIMIC.
This repository comes with several files, which are standard across the transformers library [13]. Below, we describe what each of these files represents. Note, however, that the transformers library will handle all of this for you:
- config.json: A general configuration file that contains information about the model.
- pytorch_model.bin: The model weights.
- tokenizer.json: Contains all of the information the tokenizer needs to map words to indices.
- special_tokens_map.json: Any special tokens that are used during the training process.
- tokenizer_config.json: The configuration file for the tokenizer.
- added_tokens.json: Tokens that were added to the vocabulary. This includes our list of DEID tags.
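Before loading a model, it can be handy to confirm that a downloaded folder is complete. The sketch below uses only the file names listed above. The helper name `missing_model_files` is hypothetical, `FOLDER_PATH_TO_MODEL` is a placeholder, and we assume here that every model folder ships all six files.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "config.json",
    "pytorch_model.bin",
    "tokenizer.json",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "added_tokens.json",
]


def missing_model_files(folder):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


# Placeholder path: replace with the folder of the downloaded model.
print(missing_model_files("FOLDER_PATH_TO_MODEL"))
```

An empty list means that all of the listed files were found.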
Technical Implementation
We train for an additional 10 epochs when initializing from T5-Base and SciFive. We name these models Clinical-T5-Base and Clinical-T5-Sci, respectively. They are trained on 8x48GB GPUs with a batch size of 32 per GPU and a sequence length of 512, which amounts to training on roughly 15B additional tokens. We train using the same inverse-square-root Adafactor learning rate schedule as the original T5 paper [8]. However, due to concerns over domain shift, we increase the number of warm-up steps from 10,000 to 40,000. This was in an attempt to speed up training. For the from-scratch base model, we train for 28 epochs with a similar set-up. Since we are training from scratch, we use the exact same settings as the original T5 paper. For the T5-Large model, we train for 780,000 steps with a sequence length of 512 and a batch size of 96. We were attempting to train for 28 epochs; however, an issue with Google Cloud closed our TPU instance. We also replace all DEID tags with special tokens. These tokens are then added to the vocabulary of the models.
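To make these numbers concrete, the short sketch below computes the number of tokens processed per optimizer step for the 8-GPU set-up and evaluates the inverse-square-root schedule from the T5 paper [8], which sets the learning rate to 1/sqrt(max(n, k)) for step n and k warm-up steps. The function names are ours, and we assume no gradient accumulation, which the text does not specify.

```python
import math


def tokens_per_step(num_devices, per_device_batch, seq_len):
    """Tokens seen in one optimizer step (assumes no gradient accumulation)."""
    return num_devices * per_device_batch * seq_len


def inverse_sqrt_lr(step, warmup_steps):
    """Inverse-square-root schedule: constant during warm-up, then decaying."""
    return 1.0 / math.sqrt(max(step, warmup_steps))


per_step = tokens_per_step(num_devices=8, per_device_batch=32, seq_len=512)
print(per_step)                  # 131072 tokens per step
print(round(15e9 / per_step))    # approximate number of steps for 15B tokens

# The learning rate is flat until `warmup_steps`, then decays as 1/sqrt(step).
for warmup in (10_000, 40_000):
    print(warmup, inverse_sqrt_lr(1, warmup), inverse_sqrt_lr(160_000, warmup))
```

Under this schedule, a larger warm-up value lowers the initial learning rate and holds it constant for longer before the decay begins.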
Installation and Requirements
You should install the torch and transformers libraries to use this model. We recommend using the latest transformers version. To load the model and tokenizer:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("FOLDER_PATH_TO_MODEL")
model = AutoModelForSeq2SeqLM.from_pretrained("FOLDER_PATH_TO_MODEL")
To run the model, you may use the following sample code:
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
# The forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
loss.item()
Because we replace the DEID tags with special tokens during pretraining, you should apply the same replacement to your own notes at train/test time. A script, convert_deid_tags, is provided for this. Here is some sample code showing how to use the script:
from convert_deid_tags import replace_list_of_notes
replace_list_of_notes(["My name is [**FirstName 123**]"])
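For intuition only, the sketch below shows what this kind of replacement might look like for MIMIC-style tags of the form `[**...**]`. It is not the released script: the function name `replace_deid_tags` and the single `[DEID]` placeholder are assumptions made for illustration, whereas the released script maps tags to the special tokens listed in added_tokens.json. Use the released script for real inputs.

```python
import re

# Matches MIMIC-style de-identification tags such as [**FirstName 123**].
DEID_PATTERN = re.compile(r"\[\*\*(.*?)\*\*\]")


def replace_deid_tags(note, placeholder="[DEID]"):
    """Replace every [**...**] tag with one placeholder (illustrative only)."""
    return DEID_PATTERN.sub(placeholder, note)


print(replace_deid_tags("My name is [**FirstName 123**]"))
```

A check like `DEID_PATTERN.search(note)` can also be used to verify that no raw tags remain in a note before it is tokenized.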
Please refer to the transformers documentation [13] with questions about how to use T5 models.
Usage Notes
These models are not finetuned for any specific task. Thus, we recommend not using them in a zero-shot or few-shot setting. These models can be finetuned for a wide range of NLP tasks (e.g., named-entity recognition, sequence classification, question answering, etc.). We refer readers to [8] and [12] for implementation details and code.
We strongly encourage finetuning on a GPU. We found that the base-sized models comfortably fit on a 12GB GPU when using a batch size of 8 and a sequence length of 256. For the T5-Large variant, we recommend using a 24GB GPU; we found that it fit using a batch size of 4 and a sequence length of 256. One known limitation of the base model trained from scratch is that its vocabulary is lower-cased. The larger model does not have this limitation and was trained using a cased vocabulary.
Ethics
These models should not be distributed. Research on LLMs has shown that these models may leak training data. This is more likely for models that contain a decoder component. However, these sequence-to-sequence style models can be used for a more complex set of problems (e.g., summarization). These models were built using the MIMIC-III and MIMIC-IV databases and exist under the same IRB.
Acknowledgements
We thank Xyla Inc. for funding these models.
Conflicts of Interest
N/A
References
- Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
- Alsentzer, E., Murphy, J.R., Boag, W., Weng, W., Jin, D., Naumann, T., & McDermott, M.B. (2019). Publicly Available Clinical BERT Embeddings. ArXiv, abs/1904.03323.
- Lee, Jinhyuk et al. “BioBERT: a pre-trained biomedical language representation model for biomedical text mining.” Bioinformatics 36 (2019): 1234 - 1240.
- Uzuner, Özlem et al. “2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text.” Journal of the American Medical Informatics Association : JAMIA vol. 18,5 (2011): 552-6. doi:10.1136/amiajnl-2011-000203
- Sun, Weiyi et al. “Evaluating temporal relations in clinical text: 2012 i2b2 Challenge.” Journal of the American Medical Informatics Association : JAMIA vol. 20,5 (2013): 806-13. doi:10.1136/amiajnl-2013-001628
- Li, Y., Wehbe, R.M., Ahmad, F.S., Wang, H., & Luo, Y. (2022). Clinical-Longformer and Clinical-BigBird: Transformers for long clinical sequences. ArXiv, abs/2201.11838.
- Yang, X., Pournejatian, N.M., Shin, H., Smith, K.E., Parisien, C., Compas, C.B., Martin, C., Flores, M.G., Zhang, Y., Magoc, T., Harle, C.A., Lipori, G.P., Mitchell, D.A., Hogan, W.R., Shenkman, E.A., Bian, J., & Wu, Y. (2022). GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records. ArXiv, abs/2203.03540.
- Raffel, C., Shazeer, N.M., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P.J. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. ArXiv, abs/1910.10683.
- Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T.B., Song, D.X., Erlingsson, Ú., Oprea, A., & Raffel, C. (2020). Extracting Training Data from Large Language Models. USENIX Security Symposium.
- Lehman, E.P., Jain, S., Pichotta, K., Goldberg, Y., & Wallace, B.C. (2021). Does BERT Pretrained on Clinical Notes Reveal Sensitive Data? ArXiv, abs/2104.07762.
- Radford, Alec et al. “Language Models are Unsupervised Multitask Learners.” (2019).
- Phan, Long et al. “SciFive: a text-to-text transformer model for biomedical literature.” ArXiv abs/2106.03598 (2021): n. pag.
- https://github.com/huggingface/transformers
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
Required training:
CITI Data or Specimens Only Research
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/rj8x-v335
DOI (latest version):
https://doi.org/10.13026/aw2e-he88
Files
To access the files, you must:
- be a credentialed user
- complete required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project