Database Credentialed Access

CORAL: expert-Curated medical Oncology Reports to Advance Language model inference

Madhumita Sushil Vanessa Kennedy Divneet Mandair Brenda Miao Travis Zack Atul Butte

Published: Feb. 7, 2024. Version: 1.0

When using this resource, please cite:
Sushil, M., Kennedy, V., Mandair, D., Miao, B., Zack, T., & Butte, A. (2024). CORAL: expert-Curated medical Oncology Reports to Advance Language model inference (version 1.0). PhysioNet.

Additionally, please cite the original publication:

Sushil, Madhumita, et al. "Extracting detailed oncologic history and treatment plan from medical oncology notes with large language models." arXiv preprint arXiv:2308.03853 (2023).

Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

Abstract

Both medical care and observational studies in oncology require a thorough understanding of a patient's disease progression and treatment history, often elaborately documented within clinical notes. As large language models (LLMs) are becoming more popular, it becomes important to evaluate their potential in oncology. However, no current information representation schema fully encapsulates the diversity of oncology information within clinical notes, and no comprehensively annotated oncology notes exist publicly, thereby limiting a thorough evaluation. We curated a new fine-grained, expert-labeled dataset of 40 de-identified breast and pancreatic cancer progress notes at University of California, San Francisco, and assessed three recent LLMs (GPT-4, GPT-3.5-turbo, and FLAN-UL2) in zero-shot extraction of detailed oncological information from two narrative sections of clinical progress notes. Model performance was quantified with BLEU-4, ROUGE-1, and exact match (EM) F1-score evaluation metrics. Our team of oncology fellows and medical students manually annotated 9028 entities, 9986 modifiers, and 5312 relationships. The GPT-4 model exhibited overall best performance, with an average BLEU score of 0.73, an average ROUGE score of 0.72, an average EM-F1-score of 0.51, and an average accuracy of 68% (expert manual evaluation on 20 notes). GPT-4 was proficient in tumor characteristics and medication extraction, and demonstrated superior performance in inferring symptoms due to cancer and considerations of future medications. Common errors included partial responses with missing information and hallucinations with note-specific information. LLMs are promising for performing reliable information extraction for clinical research, complex population management, and documenting quality patient care, but there is a need for further improvements.

Background

Cancer care is complex, often involving multiple treatments across different institutions, with most of this complexity captured only in the free text of an oncologist’s clinical note. Optimal clinical decision-making as well as research studies based on real-world data require a nuanced and detailed understanding of this complexity, naturally leading to widespread interest in oncology information extraction research[1]. Recently, large language models (LLMs) like the GPT-4 model[2] have shown impressive performance on several natural language processing (NLP) tasks in medicine, including obtaining high scores on United States Medical Licensing Examination (USMLE) questions[3,4], medical question answering[5], promising performance for medical consultation, diagnosis, and education[6], identifying key findings from synthetic radiology reports[7], biomedical evidence and medication extraction[8], and breast cancer recommendations[9]. However, due to the lack of publicly available and comprehensively annotated oncology datasets, the analysis of these LLMs for information extraction and reasoning in real-world oncology data remains fragmented and understudied.

To date, prior studies on oncology information extraction have either focused on elements represented within ICD-O3 codes or cancer registries[10,11], or on a subset of cancer- or problem-specific information[12-16]. No existing information representation and annotation schema is adept enough to encompass comprehensive textual oncology information in a problem-, note type-, and disease-agnostic manner. Although similar frameworks are being created for tabular oncology data[17], efforts for textual data sources have been limited to pilot studies[18], surveys of oncology elements studied across different research contributions[19,20], and domain-specific schemas[21]. In this study, we aim to develop an expert-labeled oncology note dataset to enable the evaluation of LLMs in extracting clinically meaningful, complex concepts and relations by: (a) developing a schema and guidelines for comprehensively representing and annotating textual oncology information, (b) creating a dataset of 40 oncology progress notes labeled according to this schema, and (c) benchmarking the baseline performance of the recent LLMs for zero-shot extraction of oncology information.

Methods

To holistically represent oncology information within clinical notes, we developed a detailed schema comprising the following broad categories: patient characteristics, temporal information, location-related information, test-related information, test results-related information, tumor-related information, treatment-related information, procedure-related information, clinical trial, and disease state. Broad concepts further encompassed several fine-grained subcategories with subcategory-specific details; for example, the subcategories radiology test, genomic test, and diagnostic lab test were included within the concept “tumor test”. Each concept was further elaborated to incorporate nuances such as negation, the experiencer, and the intent behind a test, and could be related to other concepts to capture information such as the temporality of an event or causal information, for example the cause of an adverse event or the reason for prescribing a treatment. The resulting schema included nuanced care-related concepts determined by clinical experience and was agnostic to the cancer and note types under consideration. The schema was implemented through three annotation modalities: a) entities or phrases of a specific type, b) attributes or modifiers of entities, and c) relations between entity pairs. These relations could be (i) descriptive, for example relating a biomarker name to its results, (ii) temporal, for example indicating when a test was conducted, or (iii) advanced, for example relating a treatment to the adverse events it caused. Together, the schema comprised 59 unique entities, 23 attributes, and 26 relations.

We further collected a diverse set of 20 breast cancer and 20 pancreatic cancer patients from the University of California, San Francisco (UCSF) Information Commons, which contained patient data from 2012 to 2022, de-identified with Philter[22]. All dates within notes were shifted by a random, patient-level offset to maintain anonymity. Only patients with corresponding tabular staging data, documented disease progression, and an associated medical oncology note were considered for document sampling. Some specific gene symbols, clinical trial names, and cancer stages were inappropriately redacted by the automated de-identification, and these were manually added back to the clinical notes under UCSF IRB #18-25163 and #21-35084. These two diseases were chosen for their dissimilarity: breast cancer is frequently curable and relies heavily on biomarker and genetic testing and on treatment plans that integrate radiation, surgical, and medical oncology, whereas pancreatic cancer has high mortality rates and is treated with highly toxic traditional chemotherapy regimens. All narrative sections except direct copy-forwards of radiology and pathology reports were annotated using the schema described above by one of two oncology fellows and/or a medical student. The skipped sections were marked as such within the text, together with the reason for skipping. This resulted in a final corpus of 40 expert-annotated medical oncology notes.
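The patient-level date shift mentioned above can be illustrated with a minimal sketch. This is not the Philter implementation: the offset range (`max_days`), the seeding, and the function names are assumptions chosen for illustration only. The point it shows is that applying one offset per patient hides the true dates while preserving the intervals between events for that patient.

```python
import random
from datetime import date, timedelta


def make_offsets(patient_ids, max_days=365, seed=0):
    """Draw one random, non-zero day offset per patient (hypothetical range)."""
    rng = random.Random(seed)
    return {
        pid: timedelta(days=rng.choice([-1, 1]) * rng.randint(1, max_days))
        for pid in patient_ids
    }


def shift_date(original, patient_id, offsets):
    """Shift a date by the offset assigned to its patient."""
    return original + offsets[patient_id]


offsets = make_offsets(["patient_a", "patient_b"])

diagnosis = date(2015, 3, 1)
surgery = date(2015, 4, 15)
shifted_diagnosis = shift_date(diagnosis, "patient_a", offsets)
shifted_surgery = shift_date(surgery, "patient_a", offsets)

# The same offset is used for every date of a patient, so the interval
# between two events is unchanged even though both dates moved.
interval_days = (shifted_surgery - shifted_diagnosis).days
```

A consequence for users of the dataset is that absolute dates in the notes should not be interpreted as real calendar dates, while durations within one patient's notes remain meaningful.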

Additionally, from the same de-identified UCSF Information Commons dataset, a further 100 breast cancer notes and 100 pancreatic cancer notes were sampled while stratifying for a diverse race/ethnicity distribution. Each race/ethnicity group was sampled up to either the case count of a uniform distribution or the maximum count available in the UCSF dataset, whichever was smaller. Redaction errors were not manually corrected in this subset. The GPT-4 model (version 0314, temperature 0) was used to automatically label information related to patient symptoms, tumor characteristics, radiology tests, procedures, genetic tests, and medications, using the same prompts as those described in the manuscript and used to benchmark the models on the manually labeled set. The prompts and the source code are available in the accompanying GitHub repository.

To establish the baseline capability of LLMs in extracting detailed oncological history, we evaluated three recent LLMs without any task-specific training (i.e. “zero-shot” extraction): the GPT-4 model[2], the GPT-3.5-turbo model (base model for the ChatGPT interface[23]), and the openly-available foundation model FLAN-UL2[24] on the following information extraction tasks: 

1) Identify all symptoms experienced by the patient, symptoms present at the time of cancer diagnosis, and symptoms experienced due to the diagnosed cancer, all further related to the datetime of their occurrence.

2) List radiology tests conducted for the patient paired with their datetime, site of the test, medical indication for the test, and the test result.

3) List genetic and genomic tests conducted for the patient paired with the corresponding datetime and the test result.

4) Infer the datetime for the first diagnosis of cancer for the patient.

5) Extract tumor characteristics in the following groups: biomarkers, histology, stage (TNM and numeric), grade, and metastasis (along with the site of metastasis and the procedure that diagnosed metastasis), all paired with their datetime. 

6) Identify all interventional procedures conducted for the patient paired with their datetime, site, medical indication, and outcome.

7) List medications prescribed to the patient, linked to the beginning datetime, end datetime, reason for prescription, continuity status (continuing, finished, or discontinued early), and any hypothetical or confirmed adverse events attributed to the medication.

8) Infer medications that are either planned for administration or discussed as a potential option, paired with their consideration (planned or hypothetical) and potential adverse events discussed in the note.

We used the GPT models via the HIPAA-compliant Microsoft Azure OpenAI studio and application programming interface (API), so that no data was permanently transferred to or stored by Microsoft or OpenAI. Separately, we ran the openly-available FLAN-UL2 model in our internal computing environment. Model inputs were provided in the format {system role description} {note section text, prompt}. Model temperature was set to 0. Version 0613 of the GPT-3.5-turbo model and version 0314 of the GPT-4 model were used via the Microsoft Azure OpenAI studio platform for all experiments. The API version was 2023-05-15. Task-specific prompts are provided in the corresponding manuscript. Since LLMs generate free-text outputs to represent entity mentions and their relations, model performance was quantified using two evaluation metrics for comparing pairs of text: BLEU-4 with smoothing[25] and ROUGE-1[26]. Furthermore, the exact match F1-score between model outputs and annotated phrases was quantified to evaluate the model’s ability to generate lexically-identical outputs. Additionally, the accuracy of the best-performing model was quantified for 11 entity extraction tasks (all excluding grade, TNM stage, and summary stage extraction) and 20 relation extraction tasks on a random subset of half of the notes (10 breast cancer notes and 10 pancreatic cancer notes) through review by an independent oncologist.
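To make the text-overlap metrics concrete, the following is a minimal sketch of ROUGE-1 and an exact match F1-score. It is not the evaluation code used for the benchmark (that is in the accompanying repository): the lowercase whitespace tokenization, the use of the F-measure variant of ROUGE-1, and the set-based definition of exact match over phrases are all assumptions made for illustration. BLEU-4 with smoothing is omitted here for brevity.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption, not the paper's tokenizer)."""
    return text.lower().split()


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference string and a candidate string."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def exact_match_f1(gold_phrases, predicted_phrases):
    """F1 over phrases that match exactly after trimming and lowercasing."""
    gold = {p.strip().lower() for p in gold_phrases}
    pred = {p.strip().lower() for p in predicted_phrases}
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# A partially correct extraction scores well on ROUGE-1 but not on exact match.
partial = rouge1_f1("invasive ductal carcinoma", "ductal carcinoma")
strict = exact_match_f1(["invasive ductal carcinoma"], ["ductal carcinoma"])
symptoms = exact_match_f1(["nausea", "fatigue"], ["Nausea", "weight loss"])
```

The example illustrates why both kinds of metric are reported: a response that drops one word of a three-word phrase still shares two of three unigrams with the annotation, yet counts as a complete miss under exact matching.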

Data Description

The annotation schema and the final annotated dataset are available in the BRAT[27] standoff format in the folder 'coral'. The folder contains two further sub-folders, one for manually-labeled data ('annotated') and one for data that is not manually labeled ('unannotated').

The 'annotated' folder further contains two folders: one for breast cancer notes (breastca), and one for pancreatic cancer notes (pdac). Each folder consists of pairs of '.txt' and '.ann' files for each document, where each '.txt' file contains a note text, and the corresponding '.ann' file contains annotations in the standoff format for that note text, with one annotation per line, consisting of the annotation ID, the concept type, the text offsets of the annotation, and the annotation string. Further details of the format can be found in the description of the BRAT software[28]. The corresponding demographic information for the annotated notes is provided in a CSV file (subject-info.csv) in the folder 'coral/annotated/', where coral_idx corresponds to the file name of each annotated note without the extension. The configuration files for BRAT are also available under four file names: annotation.conf (containing the newly-developed annotation schema in the format for BRAT), tools.conf (containing tool descriptions to assist manual annotation, for example the ability to query Google), kb_shortcuts.conf (containing keyboard shortcuts for annotating), and visual.conf (containing color coding for different entity types). While these are essentially free-text files, they can be read in the relevant format by the BRAT software in conjunction with the annotations. In total, this annotated section consists of 9028 entities, 9986 modifiers, and 5312 relationships that were manually annotated by either an oncology fellow or a medical student.
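As an illustration of the standoff layout described above, here is a minimal sketch of a reader for '.ann' content. It assumes the general BRAT standoff conventions (IDs starting with T for text-bound entities, A or M for attributes, R for relations, tab-separated fields); the entity, attribute, and relation type names in the sample are hypothetical and are not taken from annotation.conf. The authors' own reader is in the accompanying repository and may differ.

```python
def parse_ann(ann_text):
    """Parse BRAT standoff lines into entity, attribute, and relation dicts."""
    entities, attributes, relations = {}, {}, {}
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        ann_id, _, rest = line.partition("\t")
        if ann_id.startswith("T"):
            # "T1<TAB>Type start end[;start end ...]<TAB>covered text"
            header, _, covered = rest.partition("\t")
            etype, _, span_str = header.partition(" ")
            spans = [tuple(int(x) for x in s.split()) for s in span_str.split(";")]
            entities[ann_id] = {"type": etype, "spans": spans, "text": covered}
        elif ann_id.startswith(("A", "M")):
            # "A1<TAB>AttributeName T1 [value]"; binary attributes have no value.
            parts = rest.split()
            value = parts[2] if len(parts) > 2 else True
            attributes[ann_id] = {"name": parts[0], "target": parts[1], "value": value}
        elif ann_id.startswith("R"):
            # "R1<TAB>RelationType Arg1:T1 Arg2:T2"
            parts = rest.split()
            args = dict(p.split(":", 1) for p in parts[1:])
            relations[ann_id] = {"type": parts[0], "args": args}
    return entities, attributes, relations


# Hypothetical note text and annotations, for illustration only.
sample_txt = "ER positive on biopsy."
sample_ann = "\n".join([
    "T1\tBiomarkerName 0 2\tER",
    "T2\tBiomarkerResult 3 11\tpositive",
    "A1\tExperiencer T1 Patient",
    "R1\tBiomarkerRel Arg1:T1 Arg2:T2",
])

entities, attributes, relations = parse_ann(sample_ann)

# Offsets index into the paired '.txt' file, so spans can be checked directly.
start, end = entities["T2"]["spans"][0]
recovered = sample_txt[start:end]
```

For real files, the same function can be applied to the contents of each '.ann' file, with the paired '.txt' file read alongside it so that offsets can be verified against the note text.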

Finally, the folder 'coral' contains one more folder, 'unannotated', for clinical notes that were not annotated manually by domain experts, but were rather processed with the GPT-4 model[2] with the same prompts that were used to benchmark the model against manual annotations. The folder contains a subfolder called 'data' that includes 100 breast cancer and 100 pancreatic cancer notes in two CSV files, one for each disease ('breastca_unannotated.csv' and 'pdac_unannotated.csv'), consisting of the demographic information and the corresponding medical oncology note for patients, and a subfolder 'gpt4_out' that provides the GPT-4 model outputs for these notes in two separate CSV files ('breastca_gpt4' and 'pdac_gpt4'), each linked with the same coral_idx as the corresponding unannotated files. These 200 notes are separate from the 40 notes that were annotated manually, and can potentially be used to expand upon the provided manual annotations.
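The link between the unannotated notes and the GPT-4 outputs through coral_idx can be sketched as follows. Only the coral_idx column name comes from the description above; the other column names (`note_text`, `model_output`) and the in-memory sample rows are hypothetical stand-ins, since the actual column layout should be checked in the released files.

```python
import csv
import io


def index_by_coral_idx(csv_file):
    """Read a CSV file object into a dict keyed by its coral_idx column."""
    return {row["coral_idx"]: row for row in csv.DictReader(csv_file)}


def join_notes_with_outputs(notes_file, outputs_file):
    """Merge note rows with model-output rows that share the same coral_idx."""
    notes = index_by_coral_idx(notes_file)
    outputs = index_by_coral_idx(outputs_file)
    joined = []
    for idx, note_row in notes.items():
        output_row = outputs.get(idx)
        if output_row is None:
            continue  # no model output for this note
        merged = dict(note_row)
        merged.update({k: v for k, v in output_row.items() if k != "coral_idx"})
        joined.append(merged)
    return joined


# Hypothetical stand-ins for one notes file and its matching output file.
notes_csv = io.StringIO(
    "coral_idx,note_text\n0,Example note A\n1,Example note B\n"
)
outputs_csv = io.StringIO(
    "coral_idx,model_output\n1,Example output B\n0,Example output A\n"
)

rows = join_notes_with_outputs(notes_csv, outputs_csv)
```

With the real data, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`, one from the 'data' subfolder and one from 'gpt4_out' for the same disease.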

Usage Notes

Please refer to the corresponding GitHub repository for source code to read the BRAT files and process the data[29]. Please refer to the BRAT manual for details about the BRAT software, how to install it, and the annotated data format[28].

To view the manually annotated data in a visual interface, users will need to install BRAT as described in its manual[28] and copy the downloaded dataset contents into the BRAT data directory. This data directory is configured when installing the BRAT software and, by default, is the 'data' directory in the main BRAT software directory after installation. Users can then run BRAT through 'python' and load the dataset files for a visual overview.

Ethics

The dataset was constructed from de-identified patient data recorded during clinical visits, where clinical notes were de-identified with the Philter software and dates within notes were consistently shifted by a random offset. This initial dataset qualified as non-human subjects research data by virtue of the certified de-identification. Some specific de-identification errors in gene names, clinical trial names, and lymph node fractions were manually corrected in the de-identified notes under UCSF Institutional Review Board (IRB) protocols #18-25163 and #21-35084. The resulting final dataset was validated as de-identified by an external expert reviewer before release for further research.

Acknowledgements

This research would not have been possible without immense support from several people. The authors thank the UCSF AI Tiger Team, Academic Research Services, Research Information Technology, and the Chancellor’s Task Force for Generative AI for their software development, analytical and technical support related to the use of Versa API gateway (the UCSF secure implementation of large language models and generative AI via API gateway), Versa chat (the chat user interface), and related data asset and services. We thank Boris Oskotsky, and the Wynton high-performance computing platform team for supporting high-performance computing platforms that enable the use of large language models with de-identified patient data, and Binh Cao for data annotations. We further thank Jennifer Creasman, Alysa Gonzales, Dalia Martinez, and Lakshmi Radhakrishnan for help with correcting clinical trial and gene name redactions. We thank Prof. Kirk Roberts for helpful discussions regarding frame semantics-based annotations of cancer notes, Prof. Artuur Leeuwenberg for discussions about temporal relation annotation, and all members of the Butte lab for useful discussions in the internal presentations. Partial funding for this work is through the FDA grant U01FD005978 to the UCSF–Stanford Center of Excellence in Regulatory Sciences and Innovation (CERSI), through the NIH UL1 TR001872 grant to UCSF CTSI, through the National Cancer Institute of the National Institutes of Health under Award Number P30CA082103, and from a philanthropic gift from Priscilla Chan and Mark Zuckerberg. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Conflicts of Interest

MS, VEK, DM, and TZ report no financial associations or conflicts of interest. BYM is a paid consultant for drug development at SandboxAQ with no conflicts of interest for this research. AJB is a co-founder and consultant to Personalis and NuMedii; consultant to Mango Tree Corporation, and in the recent past, Samsung, 10x Genomics, Helix, Pathway Genomics, and Verinata (Illumina); has served on paid advisory panels or boards for Geisinger Health, Regenstrief Institute, Gerson Lehman Group, AlphaSights, Covance, Novartis, Genentech, Merck, and Roche; is a shareholder in Personalis and NuMedii; is a minor shareholder in Apple, Meta (Facebook), Alphabet (Google), Microsoft, Amazon, Snap, 10x Genomics, Illumina, Regeneron, Sanofi, Pfizer, Royalty Pharma, Moderna, Sutro, Doximity, BioNtech, Invitae, Pacific Biosciences, Editas Medicine, Nuna Health, Assay Depot, and Vet24seven, and several other non-health related companies and mutual funds; and has received honoraria and travel reimbursement for invited talks from Johnson and Johnson, Roche, Genentech, Pfizer, Merck, Lilly, Takeda, Varian, Mars, Siemens, Optum, Abbott, Celgene, AstraZeneca, AbbVie, Westat, and many academic institutions, medical or disease specific foundations and associations, and health systems. AJB receives royalty payments through Stanford University for several patents and other disclosures licensed to NuMedii and Personalis. AJB’s research has been funded by NIH, Peraton (as the prime on an NIH contract), Genentech, Johnson and Johnson, FDA, Robert Wood Johnson Foundation, Leon Lowenstein Foundation, Intervalien Foundation, Priscilla Chan and Mark Zuckerberg, the Barbara and Gerson Bakar Foundation, and in the recent past, the March of Dimes, Juvenile Diabetes Research Foundation, California Governor’s Office of Planning and Research, California Institute for Regenerative Medicine, L’Oreal, and Progenity.
None of these entities had any role in the design, execution, evaluation, or writing of this manuscript. None of the authors have any conflicts of interest with this research.

References

  1. Savova GK, Danciu I, Alamudun F, et al. Use of Natural Language Processing to Extract Clinical Cancer Phenotypes from Electronic Medical Records. Cancer Research 2019;79(21):5463–70.
  2. OpenAI. GPT-4 Technical Report. 2023. Available from:
  3. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on Medical Challenge Problems.
  4. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health 2023;2(2):e0000198.
  5. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature 2023;620(7972):172–80.
  6. Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. New England Journal of Medicine 2023;388(13):1233–9.
  7. Leveraging GPT-4 for Post Hoc Transformation of Free-Text Radiology Reports into Structured Reporting: A Multilingual Feasibility Study. Available from:
  8. Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D. Large language models are few-shot clinical information extractors. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics; 2022. p. 1998–2022. Available from:
  9. Haver HL, Ambinder EB, Bahl M, Oluyemi ET, Jeudy J, Yi PH. Appropriateness of Breast Cancer Prevention and Screening Recommendations Provided by ChatGPT. Radiology 2023;230424.
  10. Alawad M, Yoon H-J, Tourassi GD. Coarse-to-fine multi-task training of convolutional neural networks for automated information extraction from cancer pathology reports. In: 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI). 2018. p. 218–21.
  11. Breitenstein MK, Liu H, Maxwell KN, Pathak J, Zhang R. Electronic Health Record Phenotypes for Precision Medicine: Perspectives and Caveats From Treatment of Breast Cancer at a Single Institution. Clinical and Translational Science 2018;11(1):85–92.
  12. Yala A, Barzilay R, Salama L, et al. Using machine learning to parse breast pathology reports. Breast Cancer Res Treat 2017;161(2):203–11.
  13. Odisho AY, Park B, Altieri N, et al. PD58-09 Extracting structured information from pathology reports using natural language processing and machine learning. Journal of Urology 2019;201(Supplement 4):e1031–2.
  14. Li Y, Luo Y-H, Wampfler JA, et al. Efficient and Accurate Extracting of Unstructured EHRs on Cancer Therapy Responses for the Development of RECIST Natural Language Processing Tools: Part I, the Corpus. JCO Clinical Cancer Informatics 2020;(4):383–91.
  15. Altieri N, Park B, Olson M, DeNero J, Odisho AY, Yu B. Supervised line attention for tumor attribute classification from pathology reports: Higher performance with less data. Journal of Biomedical Informatics 2021;122:103872.
  16. Zhou S, Wang N, Wang L, Liu H, Zhang R. CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records. Journal of the American Medical Informatics Association 2022;29(7):1208–16.
  17. Belenkaya R, Gurley MJ, Golozar A, et al. Extending the OMOP Common Data Model and Standardized Vocabularies to Support Observational Cancer Research. JCO Clinical Cancer Informatics 2021;(5):12–20.
  18. Roberts K, Si Y, Gandhi A, Bernstam E. A FrameNet for Cancer Information in Clinical Narratives: Schema and Annotation [Internet]. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA); 2018. Available from:
  19. Datta S, Bernstam EV, Roberts K. A frame semantic overview of NLP-based information extraction for cancer-related EHR notes. Journal of Biomedical Informatics 2019;100:103301.
  20. Mirbagheri E, Ahmadi M, Salmanian S. Common data elements of breast cancer for research databases: A systematic review. J Family Med Prim Care 2020;9(3):1296–301.
  21. Datta S, Ulinski M, Godfrey-Stovall J, Khanpara S, Riascos-Castaneda RF, Roberts K. Rad-SpatialNet: A Frame-based Resource for Fine-Grained Spatial Relations in Radiology Reports. In: Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association; 2020. p. 2251–60. Available from:
  22. Radhakrishnan L, Schenk G, Muenzen K, et al. A certified de-identification system for all clinical text documents for information extraction at scale. JAMIA Open 2023;6(3):ooad045.
  23. ChatGPT [Internet]. Available from: [Accessed 2/05/2024]
  24. A New Open Source Flan 20B with UL2. Yi Tay [Internet]. Available from: [Accessed 2/05/2024]
  25. Papineni K, Roukos S, Ward T, Zhu W-J. Bleu: a Method for Automatic Evaluation of Machine Translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics; 2002. p. 311–8. Available from:
  26. Lin C-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In: Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics; 2004. p. 74–81. Available from:
  27. Stenetorp P, Pyysalo S, Topić G, Ohta T, Ananiadou S, Tsujii J. brat: a Web-based Tool for NLP-Assisted Text Annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, France: Association for Computational Linguistics; 2012. p. 102–7.
  28. BRAT rapid annotation tool manual [Internet]. Available from: [Accessed 2/05/2024]
  29. Zero-shot oncology information extraction with LLMs. [Internet]. Available from: [Accessed 2/05/2024]


Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research
