Fine-tuning foundational models to code diagnoses from veterinary health records

Adam Kiehl Nadia Saklou G Joseph Strecker Mayla Boguslav David Kott Tracy Webb Terri Ward

Published: Jan. 25, 2026. Version: 1.0.0


When using this resource, please cite:
Kiehl, A., Saklou, N., Strecker, G. J., Boguslav, M., Kott, D., Webb, T., & Ward, T. (2026). Fine-tuning foundational models to code diagnoses from veterinary health records (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/akca-bb83

Additionally, please cite the original publication:

Boguslav, M. R., Kiehl, A., Kott, D., Strecker, G. J., Webb, T., Saklou, N., Ward, T., & Kirby, M. (2026). Fine-tuning foundational models to code diagnoses from veterinary health records. PLOS Digital Health.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.

Abstract

Veterinary medical records represent a large data resource for application to veterinary and One Health clinical research efforts. Use of the data is limited by interoperability challenges including inconsistent data formats and data siloing. Clinical coding using standardized medical terminologies enhances the quality of medical records and facilitates their interoperability with veterinary and human health records from other sites. Previous studies, such as DeepTag and VetTag, evaluated the application of Natural Language Processing (NLP) to automate veterinary diagnosis coding, employing long short-term memory (LSTM) and transformer models to infer a subset of Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) diagnosis codes from free-text clinical notes. This study expands on these efforts by incorporating all 7,739 distinct SNOMED-CT diagnosis codes recognized by the Colorado State University (CSU) Veterinary Teaching Hospital (VTH) and by leveraging the increasing availability of pre-trained language models (LMs). Twelve freely-available pre-trained LMs (GatorTron, MedicalAI ClinicalBERT, medAlpaca, VetBERT, PetBERT, BERT, BERT Large, RoBERTa, GPT-2, GPT-2 XL, DeBERTa V3, and ModernBERT) were fine-tuned on the free-text notes from 246,473 manually-coded veterinary patient visits included in the CSU VTH's electronic health records (EHRs), which resulted in superior performance relative to previous efforts. The most accurate results were obtained when expansive labeled data were used to fine-tune relatively large clinical LMs, but the study also showed that comparable results can be obtained using more limited resources and non-clinical LMs. The results of this study contribute to the improvement of the quality of veterinary EHRs by investigating accessible methods for automated coding and support both animal and human health research by paving the way for more integrated and comprehensive health databases that span species and institutions.


Background

The use of veterinary medical records for clinical research efforts is often limited by interoperability challenges such as inconsistent data formats and clinical definitions, and data quality issues [1-4]. Clinical coding is used to transform medical records, often in free text written by clinicians, into structured codes in a classification system like the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) [5, 6]. SNOMED-CT is a "comprehensive clinical terminology that provides clinical content and expressivity for clinical documentation and reporting." It is designated as a United States standard for electronic health information exchange and includes clinical findings, procedures, and observable entities for both human and non-human medicine [6]. SNOMED-CT constitutes a hierarchy of standardized codes beginning with a top-level code such as "Clinical Finding" (SNOMED: 404684003) and terminating with a leaf code such as "Poisoning Due to Rattlesnake Venom" (SNOMED: 217659000). This work aims to improve methods for automating the clinical coding of veterinary medical records using hand-coded records from the Veterinary Teaching Hospital (VTH) at Colorado State University (CSU) as a training set.

This study aimed to extend and improve upon prior work in the field of automated clinical coding for veterinary EHRs to facilitate standardization of records, veterinary patient health and research, and the creation of data linkages to support One Health approaches to problem solving. Available manually coded data from the CSU VTH was used to determine if fine-tuning existing foundational models could achieve state-of-the-art results for automated veterinary clinical coding to 7,739 SNOMED-CT diagnosis codes, the largest set of diagnosis codes yet used in a veterinary context. It was found that fine-tuning the foundational model GatorTron [7] performed the best with an average weighted F1 score of 76.9 and an exact match rate of 52.2%.


Model Description

The GatorTron model fine-tuned for the downstream multi-label classification task of veterinary diagnosis coding is based on the GatorTron-Medium clinical foundational language model developed through a partnership between the University of Florida (UF) and Nvidia [7]. It features 3.9B parameters, 48 encoder blocks, a hidden dimension of 2,560, and a maximum of 512 input tokens, and is formatted for classification to 7,739 disease concepts. The original model was pretrained on 91B words of text including text from 2.9M UF clinical notes, MIMIC-III [8], PubMed [9], and WikiText [10]. The fine-tuned version was trained on the diagnosis and assessment sections from a corpus of 199,914 veterinary clinical notes from the Colorado State University Veterinary Teaching Hospital that were manually labeled with SNOMED-CT diagnosis codes.

Expected Input: The model expects a PyTorch tensor containing input text tokenized using the pretrained GatorTron tokenizer. By default, this tensor will be of length 512. The input text should minimally contain a diagnosis list or problem list for a single veterinary visit. Additional text such as a brief assessment of the animal's state and prognosis can be included to provide additional context to the diagnosis list. Basic preprocessing should include the removal of special characters such as tabs, carriage returns, and non-breaking spaces as well as HTML tags and leading/trailing whitespace.
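The basic preprocessing described above could be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name clean_note is hypothetical, and the choices to decode HTML entities, to treat newlines like tabs, to replace removed characters with a space rather than delete them, and to collapse repeated spaces are assumptions made so that adjacent words do not run together.

```python
import html
import re


def clean_note(text: str) -> str:
    """Apply basic cleanup to one free-text note (hypothetical helper)."""
    # Drop HTML tags such as <br> or <p class="x">
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode HTML entities such as &nbsp; and &amp; (assumption: entities may occur)
    text = html.unescape(text)
    # Replace tabs, carriage returns, newlines, and non-breaking spaces with a space
    text = re.sub(r"[\t\r\n\u00a0]", " ", text)
    # Collapse runs of spaces, then trim leading/trailing whitespace
    return re.sub(r" {2,}", " ", text).strip()
```

The cleaned string can then be passed to the GatorTron tokenizer in place of the raw note text.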

Expected Output: The model will output a PyTorch tensor of length 7,739 containing raw logits for each diagnosis code. A sigmoid activation function should be applied to the tensor to convert logits to probabilities. These probabilities can be used to make code predictions by filtering the all_codes.pt tensor for only codes whose associated probabilities meet a chosen threshold (0.5 is recommended). An example is given in the Usage Notes section.


Technical Implementation

GatorTron was fine-tuned under the following framework:

  • Batch Size: 32
  • Loss Function: Binary Cross-Entropy
  • Optimizer: AdamW
  • Warmup Steps: 5,000
  • Initial Learning Rate: 5E-8
  • Plateau Learning Rate: 3E-5
  • Maximum Epochs: 50
  • Early Stopping Patience: 5 epochs
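The warmup settings listed above imply a learning rate that rises from 5E-8 to 3E-5 over the first 5,000 steps. The shape of that rise and the behavior afterward are not stated here, so the sketch below assumes a linear ramp followed by a constant plateau; the function name learning_rate is hypothetical.

```python
def learning_rate(step: int,
                  warmup_steps: int = 5_000,
                  initial_lr: float = 5e-8,
                  plateau_lr: float = 3e-5) -> float:
    """Learning rate at a given 0-indexed optimizer step (hypothetical helper)."""
    if step >= warmup_steps:
        # Assumption: the rate stays constant once the plateau is reached
        return plateau_lr
    # Assumption: linear interpolation from initial_lr to plateau_lr
    fraction = step / warmup_steps
    return initial_lr + fraction * (plateau_lr - initial_lr)
```

Under these assumptions the rate at the midpoint of warmup (step 2,500) is roughly 1.5E-5, halfway between the initial and plateau values.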

Three components were appended to the model: a custom pooler layer (a fully connected linear layer followed by a Tanh activation function), a dropout layer (rate of 0.25), and a linear classifier layer. During fine-tuning, only the final 36 of the 48 encoder blocks and the pooler and classifier layers were allowed to update. Fine-tuning was performed on a high-performance computing node featuring 4 Tesla A100 GPUs, each with 80 GB of VRAM.
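The partial-freezing scheme described above can be expressed as a rule over parameter names. The sketch below is illustrative only: the parameter naming patterns ("encoder.layer.<index>.", "pooler.", "classifier.") are assumptions and may not match the names used in model.py, and is_trainable is a hypothetical helper.

```python
import re

N_BLOCKS = 48      # encoder blocks in GatorTron-Medium
N_TRAINABLE = 36   # final encoder blocks updated during fine-tuning

# Assumption: encoder parameters are named like "encoder.layer.<index>.<...>"
_BLOCK_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(param_name: str) -> bool:
    """Return True if a parameter should receive gradient updates."""
    match = _BLOCK_RE.search(param_name)
    if match:
        # Blocks 12 through 47 are the final 36 of 48
        return int(match.group(1)) >= N_BLOCKS - N_TRAINABLE
    # Assumption: the appended head lives under "pooler." and "classifier."
    return param_name.startswith(("pooler.", "classifier."))
```

With a PyTorch module, such a rule would typically be applied by looping over model.named_parameters() and setting requires_grad to the returned value for each parameter.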


Installation and Requirements

Ensure you've downloaded this project's files to a local directory. Ensure Python (version 3.14.2) and the torch (version 2.9.1) and transformers (version 4.57.3) libraries are installed in your environment. Listed software versions were used to build and test the model, but other versions may be functional. All required packages can be installed using the provided requirements.txt.

The model requires at least 16 GB of RAM and can be run on either CPU or GPU. The example code provided uses the CPU; to use a GPU, move the model and the tokenized inputs to it using .to("cuda").
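As a rough sanity check on the memory requirement, the footprint of the weights alone can be estimated from the parameter count. The calculation below assumes 32-bit floating-point weights, which this page does not state, and it ignores the appended classification head, activations, and framework overhead; weight_memory_gb is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


# 3.9B parameters, as listed in the model description
fp32_gb = weight_memory_gb(3.9e9)      # assumption: 32-bit floats, 4 bytes each
fp16_gb = weight_memory_gb(3.9e9, 2)   # for comparison: 16-bit floats, 2 bytes each
print(f"fp32: {fp32_gb:.1f} GB, fp16: {fp16_gb:.1f} GB")
```

Under the 32-bit assumption the weights occupy about 15.6 GB, which is consistent with the stated 16 GB minimum.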


Usage Notes

This project contains several downloadable files:

  • sharded_model: A directory containing the sharded, trained model weights. These shards were generated using a custom script and can be read using the load_sharded_model function in load.py.
  • model.py: A custom PyTorch model class into which the fine-tuned GatorTron weights must be loaded. This class should be imported into any scripts that use the model.
  • load.py: A function used to load trained model weights. This function should be imported into any scripts that use the model.
  • all_codes.pt: A PyTorch tensor containing an ordered list of 7,739 SNOMED-CT diagnosis codes which must be used in the interpretation of model outputs.
  • example.ipynb: A notebook containing a minimal example of how the model can be initialized and used.
  • requirements.txt: A list of required Python packages.

To instantiate the fine-tuned GatorTron model, initialize the model from the custom PyTorch class and then load the saved model state:

# Import packages
import torch

# Import model class
from model import GatorTronClassifier

# Import model loading function
from load import load_sharded_model

# Initialize GatorTron model
model = GatorTronClassifier()

# Load trained model state
model = load_sharded_model(model, "./sharded_model")

To generate diagnostic probabilities from input text, use the downloaded GatorTron tokenizer:

# Import packages
import torch.nn as nn
from transformers import AutoTokenizer

# Initialize GatorTron tokenizer
tokenizer = AutoTokenizer.from_pretrained("UFNLP/gatortron-medium")

# Tokenize input text
text = "this is a test record for example purposes. atopic dermatitis and hypertension."
tokenized = tokenizer(text, padding="max_length", max_length=512, truncation=True, return_tensors="pt")

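# Switch to evaluation mode so the dropout layer is disabled
model.eval()

# Get raw model outputs (logits) without tracking gradients
with torch.no_grad():
    outputs = model(tokenized["input_ids"], tokenized["attention_mask"])

# Define activation function
activation_fn = nn.Sigmoid()

# Get probabilities from raw model outputs
probs = activation_fn(outputs)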

To retrieve predicted codes from diagnostic probabilities, use the list of diagnosis codes:

# Import list of diagnosis codes
all_codes = torch.load("all_codes.pt")

# Print diagnosis codes with probabilities of at least 0.5
print(all_codes[(probs >= 0.5)[0]])
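When the probability attached to each predicted code is also of interest, the thresholding step can return code and probability pairs ranked by confidence. The sketch below is a hypothetical helper (decode_predictions is not part of the project files) that works on plain Python sequences, so tensors would first need converting, for example with .tolist().

```python
import math
from typing import List, Sequence, Tuple


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_predictions(logits: Sequence[float],
                       codes: Sequence[int],
                       threshold: float = 0.5) -> List[Tuple[int, float]]:
    """Return (code, probability) pairs at or above threshold, most probable first."""
    if len(logits) != len(codes):
        raise ValueError("logits and codes must have the same length")
    pairs = []
    for code, logit in zip(codes, logits):
        prob = _sigmoid(logit)
        if prob >= threshold:
            pairs.append((code, prob))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Assuming outputs has shape (1, 7739) as in the example above, a call might look like decode_predictions(outputs[0].tolist(), all_codes.tolist()). Raising the threshold trades recall for precision, so the value is worth tuning on held-out data.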

Release Notes

Version 1.0.0 is the first and current release.


Ethics

Although industry-standard efforts are used to de-identify and evaluate the security of data prior to use, continued advances in technology mean that large language models (LLMs) have the potential to inadvertently expose sensitive data. Purposeful attempts to use LLMs to discover sensitive data are outside the bounds of intended and ethical use of this model and undermine progress toward improved patient care. This model was developed using data sources that include MIMIC-III (under the same IRB oversight and regulatory constraints), as well as publicly available datasets from PubMed and Wikipedia and private clinical veterinary medical record data provided by Colorado State University.

This model is intended solely for research purposes in the fields of computational linguistics and medical informatics. It is not a substitute for professional medical judgment and should not be used for clinical diagnosis or decision-making without rigorous validation and appropriate regulatory clearance. The developers and contributors disclaim any liability for damages or claims arising from the use of this software, whether in contract, tort, or otherwise.

The shared files in this project contain only derived artifacts and include no identifiable clinical records. All data processing and model training were conducted within secure computing environments, consistent with institutional and regulatory requirements. A Not Human Research Determination was made for 'Fine-tuning foundational models to code diagnoses from veterinary health records' (CSU IRB Protocol #7654). The IRB determined that the proposed activity is not research involving human subjects as defined by DHHS and FDA regulations. IRB review and approval by CSU IRB is not required. The appropriate administrative and legal parties at Colorado State University have granted permission for the above data to be used for model development and for the derived model to be shared publicly. However, the private medical record data cannot be similarly shared.

This model is a derivative of the GatorTron architecture, originally released on HuggingFace on June 4, 2023, and is distributed under the Apache 2.0 license, in accordance with its terms.


Conflicts of Interest

The authors have no conflicts of interest to declare.


References

  1. Dong H, Falis M, Whiteley W, Alex B, Matterson J, Ji S, et al. Automated clinical coding: what, why, and where we are. NPJ Digit Med. 2022;5(1):159.
  2. Campbell S, Giadresco K. Computer-assisted clinical coding: a narrative review of the literature on its benefits, limitations, implementation and impact on clinical coding professionals. Health Inf Manag J. 2020;49(1):5–18.
  3. Paynter AN, Dunbar MD, Creevy KE, Ruple A. Veterinary big data: when data goes to the dogs. Animals (Basel). 2021;11(7):1872.
  4. Ouyang Z, Sargeant J, Thomas A, Wycherley K, Ma R, Esmaeilbeigi R, et al. A scoping review of ‘big data’, ‘informatics’, and ‘bioinformatics’ in the animal health and veterinary medical literature. Anim Health Res Rev. 2019;20(1):1–18.
  5. Chang E, Mostafa J. The use of SNOMED CT, 2013–2020: a literature review. J Am Med Inform Assoc. 2021;28(9):2017–2026.
  6. National Library of Medicine. SNOMED CT United States Edition [Internet]. Bethesda (MD): National Library of Medicine; c2024 [cited 2024 Oct 19]. Available from: https://www.nlm.nih.gov/healthit/snomedct/us_edition.html
  7. Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, et al. A large language model for electronic health records. NPJ Digit Med. 2022;5(1):194.
  8. Johnson AEW, Pollard TJ, Shen L, Lehman LWH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035.
  9. National Library of Medicine. PMC Open Access Subset [Internet]. Bethesda (MD): National Library of Medicine; 2003–2022 [cited 2024 Oct 19]. Available from: https://pmc.ncbi.nlm.nih.gov/tools/openftlist/
  10. Wikimedia Foundation. English Wikipedia data dump: enwiki-latest-pages-articles.xml.bz2 [Internet]. San Francisco (CA): Wikimedia Foundation; 2025 [cited 2024 Oct 19]. Available from: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
