Database Credentialed Access

EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems

Konstantin Kotschenreuther

Published: Jan. 11, 2024. Version: 1.0.0


When using this resource, please cite:
Kotschenreuther, K. (2024). EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems (version 1.0.0). PhysioNet. https://doi.org/10.13026/25fx-f706.

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

This dataset was designed and created to enable advancements in healthcare-focused large language models, particularly in the context of retrieval-augmented clinical question-answering capabilities. Developed using a self-constructed pipeline based on the 13-billion parameter Meta Llama 2 model, the dataset encompasses 21,466 medical discharge summaries extracted from the MIMIC-IV-Note dataset and 156,599 synthetically generated question-and-answer pairs, a subset of which was verified for accuracy by a physician. These pairs were generated by providing the model with a discharge summary and instructing it to generate question-and-answer pairs based on the contextual information present in the summary. This work aims to generate data in support of the development of compact large language models capable of efficiently extracting information from medical notes and discharge summaries, thus enabling potential improvements for real-time decision-making processes in clinical settings. The dataset is accompanied by code that facilitates question-and-answer pair generation from any medical or non-medical text. Despite the robustness of the presented dataset, it has certain limitations. The generation process was confined to a maximum context length of 6000 input tokens owing to hardware constraints, and because the question-and-answer pairs were generated by a large language model, they may carry underlying biases or lack diversity and complexity. Future iterations should focus on rectifying these issues, possibly through diversified training, expanded verification procedures, and the employment of more powerful large language models.


Background

In recent years, the field of healthcare has increasingly leaned on technology to enhance various aspects of medical practice, including the efficient retrieval of information from medical records. Against this backdrop, this project was conceived to enable advancements in healthcare-focused large language models, particularly with a focus on improving the performance of retrieval-augmented clinical question-answering systems. Historically, resources in this domain have included datasets such as "DrugEHRQA: A Question Answering Dataset on Structured and Unstructured Electronic Health Records For Medicine Related Queries" [1,2] and "Annotated Question-Answer Pairs for Clinical Notes in the MIMIC-III Database" [2-4]. The former uses SQL queries against electronic health records in the MIMIC-III dataset to retrieve pertinent information from the EHR in response to a natural language query. The latter, while noteworthy, is intended for model testing rather than model training and therefore contains far fewer question-and-answer pairs; it also focuses on pinpointing exact answer locations within the context—a detail deemed less critical with the emergence of advanced transformer models.

The value of the approach used in this project is demonstrated by the InstructGPT fine-tune of GPT-3, which aimed to improve GPT-3's performance on a variety of tasks, including question answering from a given context. A dataset containing, among other elements, question-and-answer pairs together with a relevant context passage was used as training data. By fine-tuning GPT-3 with this dataset, the rate at which responses to question-answering or summarization tasks included information not present in the provided context passage (the hallucination rate) was reduced by 48.7 percent [5,6]. The resource presented in this project is grounded in discharge summaries from the MIMIC-IV 2.2 dataset [2,7]. The MIMIC-IV database consists of records from over 40,000 patients at the Beth Israel Deaconess Medical Center (BIDMC). The included discharge notes are published as the "MIMIC-IV-Note: Deidentified free-text clinical notes" dataset [2,8].

The goal of this project was to leverage the power of transformer-based large language models to synthesize a substantial dataset of question-and-answer pairs derived from clinical discharge summaries. The adoption of the Meta Llama 2 model [9]—recognized for its natural language capabilities while remaining feasible for local deployment due to its comparatively small parameter count—enabled the realization of this project. While the Llama 2 model can be run in a local environment, this is only possible with performant graphics cards that would not be found in a typical hospital or private-practice computer. The hypothesis driving this work is that small large language models with approximately 1 billion trainable parameters, capable of running on everyday computers, can be specifically fine-tuned to function as retrieval-augmented question-answering agents for medical records given the right training data and generation pipelines. This project made use of two distinct pre-trained, instruction-finetuned Llama 2 models for generating question-and-answer pairs from a given context. In addition to a standard instruction finetune designed for a context length of 4096 tokens, a second instruction finetune specifically trained to 8192 tokens was employed. The original 4096-token limit was thereby expanded to 8192 tokens through the extended-context fine-tuning together with NTK RoPE scaling [10], a common method used to improve large language model performance as context sizes increase. This dataset is primarily intended for data scientists and machine learning engineers developing natural language processing applications within the healthcare sector. It is made available to support a shift towards more time-efficient, technology-powered practices in healthcare, thus potentially liberating physicians and healthcare workers from administrative constraints and allowing for a sharper focus on patient care [11].


Methods

Dataset Generation

The dataset consists of a JSON file containing a list of dictionaries, as well as a CSV file that mirrors the information contained in the JSON. Each dictionary in the JSON represents a discharge summary from the MIMIC-IV-Note dataset and the associated generated question-and-answer pairs. These question-and-answer pairs were generated from the clinical note contained in the dictionary using two open-source large language models: "Open-Orca/OpenOrca-Platypus2-13B" [11,12] for discharge summaries shorter than 3000 tokens and "OpenAssistant/llama2-13b-orca-8k-3319" [13,14] for notes between 3000 and 6000 tokens in length. Additionally, each dictionary contains an "extended_context" boolean attribute indicating whether the extended-context Llama model was used to generate the question-and-answer pairs, i.e., whether the discharge summary exceeded 3000 tokens. The notes were selected sequentially, and notes exceeding 6000 tokens were excluded due to hardware constraints. Discharge summaries under 3000 tokens in length were selected in ascending and descending ID order simultaneously, whereas clinical notes above 3000 tokens were processed in ascending ID order only.
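The following is a minimal sketch of this routing-by-token-length logic. The helper name and tokenizer handling are illustrative assumptions rather than the exact pipeline code; the code actually used is available in the project repository [23].

    from transformers import AutoTokenizer

    # Illustrative sketch only: route a note to a model by its token length.
    tokenizer = AutoTokenizer.from_pretrained("Open-Orca/OpenOrca-Platypus2-13B")

    def route_note(note_text):
        """Return which model should process a note, or None to skip it."""
        n_tokens = len(tokenizer.encode(note_text))
        if n_tokens > 6000:
            return None  # excluded due to hardware constraints
        if n_tokens > 3000:
            return "OpenAssistant/llama2-13b-orca-8k-3319"  # extended context
        return "Open-Orca/OpenOrca-Platypus2-13B"           # standard context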

Data Preprocessing

During the preprocessing phase, the initial subsection of each clinical note, which typically contains mainly placeholders for personally identifiable information, was removed. If the note contained allergy information, all preceding text was excluded, as the allergy section was usually the first section of the discharge summary containing information usable for question-and-answer generation. This method excluded unnecessary data while retaining the substantive content needed for the generation process. The omission of personally identifiable information leads to a skew in the training data; however, as this dataset was generated to be included in the training of models used by physicians, referring to a patient as "the patient" is much more common than referring to the patient by first or last name.
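A minimal sketch of this header-trimming step is shown below, assuming the allergy section is introduced by the string "Allergies:"; the actual pipeline may match the section marker differently.

    def trim_note_header(note_text):
        """Drop everything preceding the allergy section, if one is present."""
        idx = note_text.find("Allergies:")
        return note_text[idx:] if idx != -1 else note_text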

Model Implementation

Two models were obtained in a finetuned state from Hugging Face and used to generate question-and-answer pairs from the discharge summaries: "Open-Orca/OpenOrca-Platypus2-13B" and "OpenAssistant/llama2-13b-orca-8k-3319". Both are instruction-finetuned versions of the open-source Llama 2 base model, which was released by Meta for both research and commercial use. These instruction-tuned models were used as-is, without additional fine-tuning, to generate question-and-answer pairs from the provided context and were implemented in Python using the "ExLlama" library [15] for the standard context length and the Hugging Face Transformers library [16] for the extended context length.

The "Open-Orca/OpenOrca-Platypus2-13B" model is a merge of two individual Llama 2 13-B finetunes, specifically the "Open-Orca/OpenOrcaxOpenChat-Preview2-13B" [17] model and the "garage-bAInd/Platypus2-13B" [18] model. The "Open-Orca/OpenOrcaxOpenChat-Preview2-13B" model is a full finetune of the 13-B Llama 2 model trained using 8x NVIDIA A100-80GB GPUs whereas the "garage-bAInd/Platypus2-13B" is a low rank adaptation (LoRA) adapter [19] finetune employing 1x NVIDIA A100-80GB GPU. The two models were subsequently merged resulting in the final model as released and used in this project. While training hyperparameters have not been published for the "Orca/OpenOrcaxOpenChat-Preview2-13B" model, the hyperparameter values used for training the "garage-bAInd/Platypus2-13B" model are as follows:

  • Learning Rate: 4e-4
  • Batch Size: 16
  • Warmup Steps: 100
  • Epochs: 1
  • Weight Decay: 0
  • Max Length: 4096

The "OpenAssistant/llama2-13b-orca-8k-3319" model is an instruction based finetune of the 13-billion parameter Llama 2 model leveraging RoPE Scaling. This extended context finetune of the 13-billion parameter Llama 2 model was completed using 8x NVIDIA A100-SXM4-80GB GPUs. This finetune employed the "togethercomputer/RedPajama-Data-1T" [20], "ehartford/dolphin" [21] and "shahules786/orca-chat" [22] datasets and the following hyperparameters for training:

  • Learning Rate: 1e-5
  • Batch Size: 2
  • Warmup Steps: 100
  • Epochs: 1
  • Weight Decay: 0
  • Max Length: 8192

The models were subjected to 4-bit quantization to reduce computational demands. The Open-Orca model was quantized with Act Order enabled and a group size of 64, whereas the OpenAssistant model was quantized with Act Order disabled and a group size of 128 to reduce the computational requirements arising from the extended context. The 8k model was loaded with NTK RoPE scaling at an alpha value of 4 to complement the context-length extension achieved through OpenAssistant's fine-tuning.
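For orientation, a minimal loading sketch for the extended-context model with the Hugging Face Transformers library is shown below. The rope_scaling argument uses the library's "dynamic" NTK-aware implementation with a factor of 4 as a stand-in for the alpha value of 4 described above, and the 4-bit quantization setup is omitted; this is illustrative rather than the exact configuration used.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative loading sketch; quantization and exact scaling settings omitted.
    model_name = "OpenAssistant/llama2-13b-orca-8k-3319"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
        rope_scaling={"type": "dynamic", "factor": 4.0},  # stand-in for alpha = 4
    )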

Generation Procedure

The generation procedure was guided by predefined prompt templates available in the GitHub repository for this project [23]. These templates instruct the models to generate relevant, physician-level question-and-answer pairs based on the provided clinical note, strictly adhering to a specific output format delineated in the template. The prompt instructs the model to focus strictly on the given context and to provide complete and detailed responses to the generated questions. Subsequently, a subset of these generated pairs was evaluated by a physician to verify the accuracy and completeness of the answers, serving as a quality control measure. A question-and-answer pair was deemed valid if the question was completely and correctly answered based on the given context. If the question was not answerable from the context, the pair was still deemed valid provided the model indicated that the question could not be answered from the provided context. Partially correct answers—those containing hallucinations alongside correct elements, or responses that did not cover all information relevant to the query—were classified as incorrect.
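The following sketch, continuing from the loading sketch above, illustrates the generation step. The prompt wording below is a placeholder, not the original template, which is available in the project repository [23]; note_text is assumed to hold one preprocessed discharge summary.

    # Illustrative generation step; placeholder prompt wording.
    prompt = (
        "Below is a clinical discharge summary. Generate relevant, physician-level "
        "question-and-answer pairs based strictly on the information contained in "
        "the summary, using the format 'Question: ...' / 'Answer: ...'.\n\n"
        f"Discharge summary:\n{note_text}\n\nQuestion-and-answer pairs:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
    generated_text = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )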

Data Postprocessing

While the personally identifiable information placeholders at the start of each discharge summary were largely excluded through data preprocessing, certain notes still contained deidentified references to the patient's name, such as "Mr." and "Mrs." followed by underscores, which the model used to generate and answer certain questions. To rectify this, instances of "Mr." and "Mrs." followed by underscores were replaced with "the patient", and "Dr." followed by underscores was replaced with "the doctor". After these replacements, all question-and-answer pairs containing remaining deidentification references were removed. Even though the model received specific instructions on how to format its output, it did not always adhere to them, making question-and-answer parsing code necessary. The vast majority of cases were parsed correctly, but in rare instances the question-and-answer pair was stored with the original unparsed string as the answer; the data was therefore postprocessed in a second step to find and re-parse these instances. In specific instances, the model generated incomplete responses containing special tokens normally used as stop sequences, such as "<s>" and "###", which were detected and replaced in postprocessing. Furthermore, incomplete responses were also identifiable by a missing "." as the last character, leading to the elimination of any question-and-answer pair whose answer did not end with a period.
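A minimal sketch of these postprocessing rules is shown below; the exact patterns used in the pipeline may differ.

    import re

    def postprocess_answer(answer):
        """Return a cleaned answer string, or None if the pair should be dropped."""
        # Replace deidentified name references with neutral wording.
        answer = re.sub(r"(?:Mr\.|Mrs\.)\s*_+", "the patient", answer)
        answer = re.sub(r"Dr\.\s*_+", "the doctor", answer)
        # Remove special tokens that indicate an incomplete generation.
        answer = answer.replace("<s>", "").replace("###", "").strip()
        # Drop pairs that still contain deidentification underscores or do not
        # end with a period.
        if "___" in answer or not answer.endswith("."):
            return None
        return answer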


Data Description

The dataset is provided in JSON and CSV format ("mimic_note_iv_qa.json" and "mimic_note_iv_qa.csv") and contains machine-generated question-and-answer pairs aligned with patient discharge summaries sourced from the MIMIC-IV-Note dataset. Please note that the UTF-8 encoded JSON was converted into a CSV in accordance with the RFC 4180 specification. However, as the "qa_pairs" column contains a list of dictionaries, it is recommended to parse this column as a JSON string when working with the CSV. A second UTF-8 encoded JSON and CSV file ("mimic_note_iv_qa_verified.json" and "mimic_note_iv_qa_verified.csv") are included, containing only the entries that were verified by a physician.
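The sketch below shows one way to load the CSV and parse the "qa_pairs" column as JSON, as recommended above; file and column names follow the field descriptions in the next section.

    import json
    import pandas as pd

    # Load the CSV and parse the "qa_pairs" column, which stores a JSON string.
    df = pd.read_csv("mimic_note_iv_qa.csv", encoding="utf-8")
    df["qa_pairs"] = df["qa_pairs"].apply(json.loads)

    # Print the question-and-answer pairs for the first record.
    for pair in df.loc[0, "qa_pairs"]:
        print(pair["question"], "->", pair["answer"])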

File Structure and Formats

The JSON dataset is organized as a list of dictionaries, with each dictionary corresponding to a distinct record that pertains to an individual patient's discharge summary. Each dictionary in the dataset comprises several keys that correspond to relevant information.

Detailed Description of Fields

The following fields are present within each verified entry/dictionary. For non-verified entries, the "human_verified" key and the "correct" key for each qa_pairs entry are omitted; an illustrative example entry is shown after the list:

  1. original_id: A unique, project internal identifier used during the creation of this dataset.
  2. qa_pairs: A list of dictionaries where each entry represents a machine-generated question-and-answer pair pertaining to the discharge summary in the 'note' field. The “correct” field is optional, and only present on questions that have been verified:
    • question: A string containing a specific question focused on the patient's case as generated by the model.
    • answer: A string that holds the answer to the corresponding question, generated by the model.
    • correct: A boolean value indicating the verified accuracy of the generated question and answer as determined by a physician.
  3. extended_context: A boolean field which denotes whether the total number of tokens encompassed by the note surpasses 3000. This was used when deciding whether to pass the question-and-answer generation task to the 4096 or 8192 context size model.
  4. note_id: A string representation of the unique identifier assigned to each note copied from the MIMIC-IV-Note dataset. This data is directly extracted from the MIMIC-IV-Note dataset.
  5. subject_id: A string representation of the identifier linking the note to a given patient from the MIMIC-IV dataset. This data is directly extracted from the MIMIC-IV-Note dataset.
  6. hadm_id: A string representation of the identifier linking the note to the specific hospital admission for a given patient from the MIMIC-IV dataset. This data is directly extracted from the MIMIC-IV-Note dataset.
  7. human_verified: A boolean field that indicates whether all of the question-and-answer pairs associated with the note have been human verified.
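For illustration, a hypothetical verified entry might look as follows (all values are placeholders, not real data):

    {
      "original_id": 1,
      "note": "Allergies: ... (preprocessed discharge summary text)",
      "qa_pairs": [
        {
          "question": "What was the patient's primary reason for admission?",
          "answer": "The patient was admitted for management of acute pancreatitis.",
          "correct": true
        }
      ],
      "extended_context": false,
      "note_id": "00000000-DS-00",
      "subject_id": "00000000",
      "hadm_id": "00000000",
      "human_verified": true
    }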

Summary Statistics

Summary statistics for the final dataset are provided below. These statistics pertain to the dataset as made available after postprocessing; word counts were determined programmatically using the NLTK library.

  • # of generated QA pairs: 156,599
  • # of standard context QA pairs generated (<3000 tokens): 133,562
  • # of extended context QA pairs generated: 23,037
  • # of discharge notes processed: 21,466
  • # of standard context notes processed (<3000 tokens): 18,203
  • # of extended context notes processed: 3,263
  • # of unique patients from which notes were processed: 14,967
  • Average discharge summary word count: 1,250
  • Median discharge summary word count: 1,209
  • Range of discharge summary word count: (3, 5908)
  • # of physician-verified QA pairs: 506
  • % of total QA pairs that are verified: 0.323%
  • # of correct verified QA pairs: 478
  • # of incorrect verified QA pairs: 28
  • % of correct verified QA pairs: 94.466%
  • # of verified standard context length (<3000 tokens) QA pairs: 441
  • # of correct standard context length (<3000 tokens) QA pairs: 418
  • % of correct standard context length (<3000 tokens) QA pairs: 94.78%
  • # of verified extended context length (>3000 tokens) QA pairs: 65
  • # of correct extended context length (>3000 tokens) QA pairs: 60
  • % of correct extended context length (>3000 tokens) QA pairs: 92.30%
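A minimal sketch of how the NLTK-based word counts above could be reproduced is shown below; the exact tokenization settings used for the published figures are not specified, so this is an assumption.

    import json
    from nltk.tokenize import word_tokenize  # requires the NLTK "punkt" data package

    # Compute discharge summary word counts over the published JSON file.
    with open("mimic_note_iv_qa.json", encoding="utf-8") as f:
        records = json.load(f)

    word_counts = [len(word_tokenize(record["note"])) for record in records]
    print(sum(word_counts) / len(word_counts))  # average discharge summary word count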


Usage Notes

Known Limitations

Despite a substantial number of correct question-and-answer pairs, certain accuracy limitations persist within the dataset. Most errors manifest as omissions when enumerating lists, notably when delineating individual medications or diagnoses in response to questions such as "What medications were prescribed upon discharge?". Note that list-style responses were marked as incorrect if any item from the list was missing, even if all items that were listed were correct. Extended-context discharge summaries are more prone to invalid answers due to the compounded complexity arising from the increased note length. Furthermore, instances exist where the model fails to identify answers within the provided context despite the required information being available, opting to indicate the absence of data rather than generating a factually incorrect, hallucinated response. Only rarely does the model provide factually incorrect, hallucinated responses containing information not originally included in the context.

The current study was limited by the available hardware, specifically 24 gigabytes of video random access memory (VRAM), restricting the process to quantized 13-billion parameter models and a maximum token length of 6000 for the discharge summaries. Future iterations of this project aim to recreate the process with larger context sizes and higher-performance large language models as compute capability permits. While the results obtained from the quantized 13-billion parameter models are satisfactory, more nuanced, complex, and diverse question-and-answer pairs can be expected as model parameter size and performance increase.

Reuse Potential

This dataset can be used for developing, validating, and testing context-enhanced question-answering systems in healthcare scenarios.

GitHub Repository for this Project

The code used in the creation of this dataset is available on GitHub under "kkotsche1/EHR-DS-QA-Code-for-QA-Generation-over-EHR-Discharge-Summaries" [23].


Release Notes

Version: 1.0.0


Ethics

The dataset can serve as a resource for refining machine-learning models aimed at enhancing the analysis of medical discharge summaries and notes, thus possibly assisting in informed healthcare decision-making. However, the dataset is not without limitations. The few incompletely or incorrectly generated responses might introduce misinformation risks if not critically evaluated and accounted for when employing this dataset for training or evaluation of large language models. Moreover, despite being de-identified, the use of real patient data necessitates careful management to uphold privacy standards.


Acknowledgements

The project was completed without any external funding or external computing resources. 


Conflicts of Interest

The author has no conflicts of interest to declare.


References

  1. Bardhan J, Colas A, Roberts K, Wang D Z. DrugEHRQA: A Question Answering Dataset on Structured and Unstructured Electronic Health Records For Medicine Related Queries (version 1.0.0). PhysioNet. 2022. Available from: https://doi.org/10.13026/a849-cd06 [Accessed 18th September 2023]
  2. Goldberger A, Amaral L, Glass L, Hausdorff J, Ivanov PC, Mark R et al. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220 [Accessed 18th September 2023]
  3. Yue X, Zhang X F, Sun H. Annotated Question-Answer Pairs for Clinical Notes in the MIMIC-III Database (version 1.0.0). PhysioNet. 2021. Available from: https://doi.org/10.13026/j0y6-bw05 [Accessed 18th September 2023]
  4. Lehman E, Lialin V, Legaspi KY, Sy AJ, Pile PT, Alberto NR, Ragasa RR, Puyat CV, Alberto IR, Alfonso PG, Taliño M. Learning to ask like a physician. arXiv preprint arXiv:2206.02696. 2022 Jun 6. Available from: https://doi.org/10.48550/arXiv.2206.02696 [Accessed 18th September 2023]
  5. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, et al. Training language models to follow instructions with human feedback [Internet]. arXiv; 2022. Available from: https://arxiv.org/abs/2203.02155 [Accessed 18th September 2023]
  6. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al. Language Models are Few-Shot Learners [Internet]. arXiv; 2020. Available from: https://arxiv.org/abs/2005.14165 [Accessed 18th September 2023]
  7. Johnson A, Bulgarelli L, Pollard T, Horng S, Celi L A, Mark R. MIMIC-IV (version 2.2). PhysioNet. 2023. Available from: https://doi.org/10.13026/6mm1-ek67 [Accessed 18th September 2023]
  8. Johnson A, Pollard T, Horng S, Celi L A, Mark R. MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet. 2023. Available from: https://doi.org/10.13026/1n74-ne17 [Accessed 18th September 2023]
  9. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv; 2023. Available from: https://arxiv.org/abs/2307.09288 [Accessed 18th September 2023]
  10. Chen S, Wong S, Chen L, Tian Y. Extending Context Window of Large Language Models via Positional Interpolation. arXiv; 2023. Available from: https://arxiv.org/abs/2306.15595 [Accessed 18th September 2023]
  11. Alberto IR, Alberto NR, Ghosh AK, Jain B, Jayakumar S, Martinez-Martin N, McCague N, Moukheiber D, Moukheiber L, Moukheiber M, Moukheiber S. The impact of commercial health datasets on medical research and health-care algorithms. The Lancet Digital Health. 2023 May 1;5(5):e288-94. Available from: https://doi.org/10.1016/S2589-7500(23)00025-0 [Accessed 18th September 2023]
  12. Open-Orca/OpenOrca-Platypus2-13B · Hugging Face [Internet]. New York City, New York: Hugging Face, Inc.; 2023 [updated 2023 August 21st; cited 2023 September 18th]. Available from: https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B [Accessed 18th September 2023]
  13. Mukherjee S, Mitra A, Jawahar G, Agarwal S, Palangi H, Awadallah A. Orca: Progressive Learning from Complex Explanation Traces of GPT-4 [Internet]. arXiv; 2023. Available from: https://arxiv.org/abs/2306.02707 [Accessed 18th September 2023]
  14. OpenAssistant/llama2-13b-orca-8k-3319 · Hugging Face [Internet]. New York City, New York: Hugging Face, Inc.; 2023 [updated 2023 July 27th; cited 2023 September 18th]. Available from: https://huggingface.co/OpenAssistant/llama2-13b-orca-8k-3319 [Accessed 18th September 2023]
  15. turboderp/exllama: A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights. [Internet]. San Francisco, California: GitHub, Inc.; 2023 [updated 2023 September 12th; cited 2023 September 18th]. Available from: https://github.com/turboderp/exllama [Accessed 18th September 2023]
  16. huggingface/transformers: Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX [Internet]. San Francisco, California: GitHub, Inc.; 2016 [updated 2023 September 15th; cited 2023 September 18th]. Available from: https://github.com/huggingface/transformers [Accessed 18th September 2023]
  17. Open-Orca/OpenOrcaxOpenChat-Preview2-13B · Hugging Face [Internet]. New York City, New York: Hugging Face, Inc.; 2023 [updated 2023 August 21st; cited 2023 September 18th]. Available from: https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B [Accessed 18th September 2023]
  18. garage-bAInd/Platypus2-13B · Hugging Face [Internet]. New York City, New York: Hugging Face, Inc.; 2023 [updated 2023 August 15th; cited 2023 September 18th]. Available from: https://huggingface.co/garage-bAInd/Platypus2-13B [Accessed 18th September 2023]
  19. Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W. LoRA: Low-Rank Adaptation of Large Language Models [Internet]. arXiv; 2021. Available from: https://arxiv.org/abs/2106.09685 [Accessed 18th September 2023]
  20. togethercomputer/RedPajama-Data: The RedPajama-Data repository contains code for preparing large datasets for training large language models [Internet]. San Francisco, California: GitHub, Inc.; 2023 [updated 2023 June 14th; cited 2023 September 18th]. Available from: https://github.com/togethercomputer/RedPajama-Data [Accessed 18th September 2023]
  21. ehartford/dolphin · Datasets at Hugging Face [Internet]. New York City, New York: Hugging Face, Inc.; 2023 [updated 2023 July 31st; cited 2023 September 18th]. Available from: https://huggingface.co/datasets/ehartford/dolphin [Accessed 18th September 2023]
  22. shahules786/orca-chat · Datasets at Hugging Face [Internet]. New York City, New York: Hugging Face, Inc.; 2023 [updated 2023 July 25th; cited 2023 September 18th]. Available from: https://huggingface.co/datasets/shahules786/orca-chat [Accessed 18th September 2023]
  23. kkotsche1/EHR-DS-QA-Code-for-QA-Generation-over-EHR-Discharge-Summaries: Code for EHR-DS-QA: A synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems [Internet]. San Francisco, California: GitHub, Inc.; 2023 [updated 2023 September 20th; cited 2023 September 20th]. Available from: https://github.com/kkotsche1/EHR-DS-QA-Code-for-QA-Generation-over-EHR-Discharge-Summaries [Accessed 20th September 2023]

Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
