Database Credentialed Access

Annotation dataset of social determinants of health from MIMIC-III Clinical Care Database

Marco Guevara Shan Chen Spencer Thomas Danielle Bitterman

Published: Jan. 24, 2024. Version: 1.0.1

When using this resource, please cite: (show more options)
Guevara, M., Chen, S., Thomas, S., & Bitterman, D. (2024). Annotation dataset of social determinants of health from MIMIC-III Clinical Care Database (version 1.0.1). PhysioNet.

Additionally, please cite the original publication:

Guevara, Marco, Shan Chen, Spencer Thomas, Tafadzwa L. Chaunzwa, Idalid Franco, Benjamin H. Kann, Shalini Moningi, et al. 2024. “Large Language Models to Identify Social Determinants of Health in Electronic Health Records.” NPJ Digital Medicine 7 (1): 6.

Please include the standard citation for PhysioNet: (show more options)
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.


Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely collected from the electronic health records (EHR). This study researched the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented, and explored the role of synthetic clinical text for improving the extraction of these scarcely documented, yet extremely valuable, clinical data. We developed annotation guidelines for sentence-level annotation of SDoH that are not reliably available as structured data in the EHR: employment, housing, transportation, parental status, relationship, and social support. Sentences were labeled for both the presence of an SDoH mention and the presence of an adverse SDoH mention. After finalizing the annotation guidelines, two annotators manually annotated a separate corpus, which cannot be released due to PHI. A total of 300/800 (37.5%) of these notes underwent dual annotation. Before adjudication, dually-annotated notes had a Krippendorf’s alpha agreement of 0.86 and Cohen’s Kappa of 0.86 for any SDoH mention categories. For adverse SDoH mentions, notes had a Krippendorf’s alpha agreement of 0.76 and Cohen’s Kappa of 0.76. As an external validation, 200 notes from MIMIC-III written by physicians, social workers, and nurses were manually annotated by a single annotator. Here, we release this manually annotated corpus of 200 MIMC-III notes.


Health disparities have been extensively documented across medical specialties [1-3]. However, our ability to address these disparities remains limited by an insufficient understanding of their contributing factors. Social determinants of health (SDoH), are defined by the World Health Organization as “the conditions in which people are born, grow, live, work, and age...shaped by the distribution of money, power, and resources at global, national, and local levels” [4].  SDoH may be adverse or protective, impacting health outcomes at multiple levels as they likely play a major role in disparities by determining access to and quality of medical care. For example, a patient cannot benefit from an effective treatment if they don’t have transportation to make it to the clinic. There is also emerging evidence that exposure to adverse SDoH may directly affect physical and mental health via inflammatory and neuro-endocrine changes [5-8]. In fact, SDoH are estimated to account for 80-90% of modifiable factors impacting health outcomes [9].

SDoH are rarely documented comprehensively in structured data in the electronic health records (EHRs) [10-12], creating an obstacle to research and clinical care. Instead, issues related to SDoH are most frequently described in the free text of clinic notes, which creates a bottleneck for incorporating these critical factors into databases to research the full impact and drivers of SDoH, and for proactively identifying patients who may benefit from additional social work and resource support.

Natural language processing (NLP) could address these challenges by automating the abstraction of these data from clinical texts. Prior studies have demonstrated the feasibility of NLP for extracting a range of SDoH [13-23]. However, most publicly available datasets only include Social History sections. Clinically-impactful SDoH information is often scattered throughout other note sections and many note types, such as many inpatient progress notes and notes written by nurses and social workers, do not consistently contain Social History sections. We aimed to develop a dataset for developing methods to extract SDoH that have not been as commonly targeted by prior efforts: employment, housing, transportation, parental status, relationship, and social support. This dataset includes sentence-level annotations of full clinical notes. The annotations are linked to the categories and attributes from our annotation guidelines. Researchers can choose to use any or all of the available annotations. Full details of this study are available in the original publication listed above.


Data and Annotations

The dataset used to develop the annotation guidelines and measure inter-annotator agreement was a corpus of 800 clinic notes from 770 patients with cancer who received radiotherapy (RT) at the Department of Radiation Oncology at Brigham and Women’s Hospital/Dana-Farber Cancer Institute in Boston, Massachusetts from 2015-2022. We also created two validation datasets. First, we collected 200 clinic notes from 170 patients with cancer treated with immunotherapy at Dana-Farber Cancer, and not present in the RT dataset. Second, we collected 200 notes from 183 patients in the MIMIC (Medical Information Mart for Intensive Care)-III database [24-25], which includes data associated with patients admitted to the critical care units at Beth Israel Deaconess Medical Center in Boston, Massachusetts from 2001-2008. Only the MIIMC-III notes are made available here, as the other datasets contain PHI.

Only notes written by physicians, physician assistants, nurse practitioners, registered nurses, and social workers were included. To maintain a minimum threshold of information, we excluded notes with fewer than 150 tokens across all provider types. This helped ensure that the selected notes contained sufficient textual content. For notes written by all providers save social workers, we excluded notes containing any section longer than 500 tokens to avoid excessively lengthy sections that might have included less relevant or redundant information. For physician, physician assistant, and nurse practitioner notes, we used a customized medSpacy sectionizer to include only notes that contained at least one of the following sections: Assessment and Plan, Social History, and History/Subjective. For the MIMIC-III dataset released here, :only notes written by physicians, social workers, and nurses were included for analysis from the MIMIC III corpus. We focused on patients who had at least one social work note, without any specific date range criteria. Prior to annotation, all notes were segmented into sentences using the syntok[26] sentence segmenter as well as split on bullet points “•”. This method was used for all notes in the radiotherapy, immunotherapy, and MIMIC datasets for sentence-level annotation and subsequent classification.

Task definition and data labeling

We defined our label schema and classification tasks by first carrying out interviews with subject matter experts, including social workers, resource specialists, and oncologists, to determine SDoH that are clinically relevant but not already readily available as structured data in the EHR, especially as dynamic features over time. After initial interviews, a set of exploratory pilot annotations was conducted on a subset of clinical notes and preliminary annotation guidelines were developed. The guidelines were then iteratively refined and finalized based on the pilot annotations and additional input from subject matter experts. The following SDoH categories and their attributes were selected for inclusion in the project: Employment status (employed, unemployed, underemployed, retired, disability, student), Housing issue (financial status, undomiciled, other), Transportation issue (distance, resource, other), Parental status (if the patient has a child under 18 years old), Relationship (married, partnered, widowed, divorced, single), and Social support (presence or absence of social support). 

We defined two multilabel sentence-level classification tasks:

  1. Any SDoH mentions: The presence of language describing an SDoH category as defined above, regardless of the attribute.
  2. Adverse SDoH mentions: The presence or absence of language describing an SDoH category with an attribute that could create an additional social work or resource support need for patients:
  • Employment status: unemployed, underemployed, disability
  • Housing issue: financial status, undomiciled, other
  • Transportation issue: distance, resources, other
  • Parental status: having a child under 18 years old 
  • Relationship: widowed, divorced, single
  • Social support: absence of social support

After finalizing the annotation guidelines, two annotators manually annotated the RT corpus. A total of 300/800 (37.5%) of the notes underwent dual annotation. Before adjudication, dually-annotated notes had a Krippendorf’s alpha agreement of 0.86 and Cohen’s Kappa of 0.86 for any SDoH mention categories. For adverse SDoH mentions, notes had a Krippendorf’s alpha agreement of 0.76 and Cohen’s Kappa of 0.76. A single annotator then annotated the remaining radiotherapy notes, the immunotherapy dataset, and the MIMIC-III dataset. 

Data augmentation

We employed synthetic data generation methods to assess the impact of data augmentation for the positive class, and also to enable an exploratory evaluation of proprietary large LMs that could not be used with protected health information. In round 1, GPT-turbo-0301(ChatGPT) version of GPT3.5 via the OpenAI API[27] was prompted to generate new sentences for each SDoH category, using sentences from the annotation guidelines as references. In round 2, in order to generate more linguistic diversity, the sample synthetic sentences output from round 1 were taken as references to again generate another set of synthetic sentences. One hundred sentences per category were generated in each round. Please refer to our project github [28], which provides full details and associated code for our prompting methods and inference settings for GPT-turbo-0301. In brief, GPT-turbo-0301 settings were as follows: temperature=0.3, frequency penalty=1.9, presence penalty=0.9, max tokens=4000.

Manually-validated synthetic data

We manually validated 480 of the synthetic sentences so that models that cannot be used with protected health information could be evaluated. For all synthetic data generation methods, no real patient data were used in prompt development or fine-tuning. 

Data Description

MIMIC-III dataset: This dataset consists of sentences from MIMIC-III notes, labeled for the presence of SDoH mentions. The dataset is provided in a single .csv file, SDOH_MIMICIII_physio_release.csv, that includes 5,329 rows of annotated sentences from 200 MIMIC-III notes. As described above, notes were pre-processed by segmenting into sentences using the syntok[26] sentence segmenter as well as split on bullet points “•”. The file contains the following variables (columns):

  • provider_type: The author provider type of the note the sentence was taken from.
  • patient_id: The MIMIC-III "SUBJECT_ID"
  • note_id: The MIMIC-III "ROW_ID"
  • sentence_index: Index for that sentence
  • text: The plain text sentence.
  • TRANSPORTATION_distance through PARENT: The annotation for each category/attribute in the dataset. There is a variable for each unique combination of category and attribute that appeared in the dataset, and each variable is named using the following convention: CATEGORY_attribute. Of note, PARENT is the only category with no attributes. Please refer to the annotation guidelines for definitions of each category and attribute. Values:
    • 0 = negative for that category/attribute
    • 1 = positive for that category/attribute

Synthetic data for dataset augmentation (not manually verified): The datasets of synthetic sentences labeled with SDoH generated during round 1 and round 2 of synthetic data generation are provided in two .csv files: SyntheticSentences_Round1.csv and SyntheticSentences_Round1.csv, respectively. Each file consists of 901 synthetic sentences and their label. These sentences and labels have not been manually verified to be correct and were used directly for data augmentation during model development. The files contain the following variables (columns):

  • text: The plain text synthetic sentence.
  • label: The SDoH label for that sentence. Values:
    • housing
    • employment
    • parent
    • support
    • relationship
    • transportation
  • adverse: Whether the label is an adverse SDoH. Values:
    • adverse
    • nonadverse

Manually-verified synthetic data: 480 synthetic sentences whose SDoH labels have been manually reviewed and verified as correct are provided in a single .csv file, ManuallyAnnotatedSyntheticSentences.csv. The files contain the following variables (columns):

  • text: The plain text synthetic sentence.
  • label: The SDoH label for that sentence. Values:
    • housing
    • employment
    • parent
    • support
    • relationship
    • transportation
  • adverse: Whether the label is an adverse SDoH. Values:
    • adverse
    • nonadverse

Annotation Guidelines: In addition, we include our annotation guidelines in a .pdf file, SDOH_annotation_guidelines.pdf.

Usage Notes

This dataset was developed to investigate large language models to extract SDoH, and the paper reports our methods and results. It provides detailed information on the annotation guideline development, annotation process, patient and sentence-level details, and limitations of the dataset. Importantly, this dataset comes from a predominantly white population treated in Boston, Massachusetts in the United States of America. This limits the generalizability of methods and findings generated from this dataset.

Please carefully review the annotation guidelines before using this dataset. Many SDoH concepts are necessarily vague, and so we had to constrain our definitions. Therefore, annotations align strictly to our guidelines and may not align with broad definitions of each concept. We acknowledge that there are different interpretations of SDoH and their attributes.

Please remember that the labels in SyntheticSentences_Round1.csv and SyntheticSentences_Round1.csv were not manually validated and were used for data augmentation during language model fine-tuning. These labels should not be used for methods evaluation.The project github, include the code used to obtain our results, is available [28].

Release Notes

Initial release version 1.0.0


This study was approved by the Mass General Brigham institutional review board, and consent was waived as this was deemed exempt human subjects research.


We thank Susan Harper and Madeleine Goldstein for contributing their expertise in social work and resource support needs, which were foundational to the development of the annotation guidelines.

Conflicts of Interest



  1. Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016).
  2. GitHub: Accessed November 6, 2023.
  3. OpenAI: Accessed November 6, 2023.
  4. Lavizzo-Mourey, R. J., Besser, R. E. & Williams, D. R. Understanding and Mitigating Health Inequities - Past, Current, and Future Directions. N. Engl. J. Med. 384, 1681–1684 (2021).
  5. Chetty, R. et al. The Association Between Income and Life Expectancy in the United States, 2001-2014. JAMA 315, 1750–1766 (2016).
  6. Holmes Fee C, Hicklen RS, Jean S, Abu Hussein N, Moukheiber L, de Lota MF, Moukheiber M, Moukheiber D, Anthony Celi L, Dankwa-Mullan IStrategies and solutions to address Digital Determinants of Health (DDOH) across underinvested communities. PLOS digital health. 2023 Oct 12;2(10):e0000314.
  7. Social determinants of health.
  8. Franke, H. A. Toxic Stress: Effects, Prevention and Treatment. Children 1, 390–402 (2014).
  9. Nelson, C. A. et al. Adversity in childhood is linked to mental and physical health throughout life. BMJ 371, m3048 (2020).
  10. Shonkoff, J. P., Garner, A. S., Committee on Psychosocial Aspects of Child and Family Health, Committee on Early Childhood, Adoption, and Dependent Care & Section on Developmental and Behavioral Pediatrics. The lifelong effects of early childhood adversity and toxic stress. Pediatrics 129, e232–46 (2012).
  11. Turner-Cobb, J. M., Sephton, S. E., Koopman, C., Blake-Mortimer, J. & Spiegel, D. Social support and salivary cortisol in women with metastatic breast cancer. Psychosom. Med. 62, 337–345 (2000).
  12. Hood, C. M., Gennuso, K. P., Swain, G. R. & Catlin, B. B. County Health Rankings: Relationships Between Determinant Factors and Health Outcomes. Am. J. Prev. Med. 50, 129–135 (2016).
  13. Truong, H. P. et al. Utilization of Social Determinants of Health ICD-10 Z-Codes Among Hospitalized Patients in the United States, 2016-2017. Med. Care 58, 1037–1043 (2020).
  14. Heidari, E., Zalmai, R., Richards, K., Sakthisivabalan, L. & Brown, C. Z-code documentation to identify social determinants of health among Medicaid beneficiaries. Res. Social Adm. Pharm. 19, 180–183 (2023).
  15. Wang, M., Pantell, M. S., Gottlieb, L. M. & Adler-Milstein, J. Documentation and review of social determinants of health data in the EHR: measures and associated insights. J. Am. Med. Inform. Assoc. 28, 2608–2616 (2021).
  16. Conway, M. et al. Moonstone: a novel natural language processing system for inferring social risk from clinical narratives. J. Biomed. Semantics 10, 1–10 (2019).
  17. Bejan, C. A. et al. Mining 100 million notes to find homelessness and adverse childhood experiences: 2 case studies of rare and severe social determinants of health in electronic health records. J. Am. Med. Inform. Assoc. 25, 61–71 (2017).
  18. Topaz, M., Murga, L., Bar-Bachar, O., Cato, K. & Collins, S. Extracting Alcohol and Substance Abuse Status from Clinical Notes: The Added Value of Nursing Data. Stud. Health Technol. Inform. 264, 1056–1060 (2019).
  19. Gundlapalli, A. V. et al. Using natural language processing on the free text of clinical documents to screen for evidence of homelessness among US veterans. AMIA Annu. Symp. Proc. 2013, 537–546 (2013).
  20. Hammond, K. W., Ben-Ari, A. Y., Laundry, R. J., Boyko, E. J. & Samore, M. H. The Feasibility of Using Large-Scale Text Mining to Detect Adverse Childhood Experiences in a VA-Treated Population. J. Trauma. Stress 28, 505–514 (2015).
  21. Han, S. et al. Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing. J. Biomed. Inform. 127, 103984 (2022).
  22. Rouillard, C. J., Nasser, M. A., Hu, H. & Roblin, D. W. Evaluation of a Natural Language Processing Approach to Identify Social Determinants of Health in Electronic Health Records in a Diverse Community Cohort. Med. Care 60, 248–255 (2022).
  23. Feller, D. J. et al. Detecting Social and Behavioral Determinants of Health with Structured and Free-Text Clinical Data. Appl. Clin. Inform. 11, 172–181 (2020).
  24. Yu, Z. et al. A Study of Social and Behavioral Determinants of Health in Lung Cancer Patients Using Transformers-based Natural Language Processing Models. AMIA Annu. Symp. Proc. 2021, 1225–1233 (2021).
  25. Lybarger, K. et al. Leveraging natural language processing to augment structured social determinants of health data in the electronic health record. J. Am. Med. Inform. Assoc. (2023) doi:10.1093/jamia/ocad073.
  26. Patra, B. G. et al. Extracting social determinants of health from electronic health records using natural language processing: a systematic review. J. Am. Med. Inform. Assoc. 28, 2716–2727 (2021).
  27. Johnson, A., Pollard, T. & Mark, R. MIMIC-III clinical database. (2023) doi:10.13026/C2XW26.
  28. GitHub: Accessed November 6, 2023.

Parent Projects
Annotation dataset of social determinants of health from MIMIC-III Clinical Care Database was derived from: Please cite them when using this project.

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
PhysioNet Credentialed Health Data License 1.5.0

Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0

Required training:
CITI Data or Specimens Only Research

Corresponding Author
You must be logged in to view the contact information.