Database Credentialed Access
NCH Sleep DataBank: A Large Collection of Real-world Pediatric Sleep Studies with Longitudinal Clinical Data
Harlin Lee, Boyue Li, Yungui Huang, Yuejie Chi, Simon Lin
Published: Oct. 27, 2021. Version: 3.1.0
When using this resource, please cite:
Lee, H., Li, B., Huang, Y., Chi, Y., & Lin, S. (2021). NCH Sleep DataBank: A Large Collection of Real-world Pediatric Sleep Studies with Longitudinal Clinical Data (version 3.1.0). PhysioNet. https://doi.org/10.13026/p2rp-sg37.
Lee, H., Li, B., DeForte, S. et al. A large collection of real-world pediatric sleep studies. Sci Data 9, 421 (2022). https://doi.org/10.1038/s41597-022-01545-6
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
To accelerate research on pediatric sleep and its connection to health, Nationwide Children's Hospital (NCH) and Carnegie Mellon University (CMU) introduce the NCH Sleep DataBank. This dataset has 3,984 pediatric sleep studies on 3,673 unique patients conducted at NCH in Columbus, Ohio, USA between 2017 and 2019, along with the patients' longitudinal clinical data. The published Polysomnography (PSG) data contain the patients' physiological signals as well as the technician's assessment of the sleep stages and descriptions of additional irregularities.
The novelties of this dataset include: (1) Size: Its large size is suitable for discovering new scientific insights via data mining, (2) Patient population: It explicitly focuses on pediatric patients, (3) Clinical setting: The sleep studies were gathered in the real-world clinical setting at NCH as opposed to, for example, in a controlled clinical trial, and (4) Rich set of clinical data: The accompanying 5.6 million records of clinical data are extracted from the Electronic Health Record (EHR), and are separated into encounters, medications, measurements (e.g. body mass index), diagnoses, and procedures.
The NCH Sleep DataBank is a valuable resource for advancing automatic sleep scoring and real-time sleep disorder prediction, among many other potential scientific discoveries. Accompanying code in Python to assist users in interacting with the dataset is published on GitHub.
Sleep is an active process associated with physiological changes that involve multiple organ systems, and is vital for the maturation and daily functioning of infants, children, and adolescents. Some infants and children require an analysis while actually sleeping, called an overnight sleep study or Polysomnography (PSG), to accurately diagnose their sleep-related condition. The physiological data collected during a PSG provide a picture of clinically useful information about different sleep stages, sleep disruption, respiratory status during different sleep stages, leg movements, and changes in cardiac rate and rhythm during sleep.
Computational algorithms that learn from large amounts of data have seen remarkable success in healthcare, particularly with the proliferation of Electronic Health Records (EHR) and improved sensors. Regrettably, without a curated and comprehensive dataset of substantial size and accessibility, pediatric sleep has not been able to fully benefit from such opportunities yet. As a first step, we introduce the Nationwide Children's Hospital (NCH) Sleep DataBank, which has 3,984 pediatric sleep studies on 3,673 unique patients conducted at NCH between 2017 and 2019, along with the patients' longitudinal clinical data.
The NCH Sleep DataBank contains sleep studies acquired under standard care at NCH between Dec. 16, 2017 and Dec. 31, 2019 using Natus Sleepworks versions 8 and 9. We then used each patient's last name, date of birth, and Medical Record Numbers (MRNs) extracted from Natus to retrieve patient records from the EHR. When matches could not be confidently made to the EHR, the sleep studies were removed from the dataset. A random date shift of +/- 180 days was used to adjust all identified patient data pulled from the EHR, as well as the sleep study dates recorded in Natus. Finally, as an extra precaution against re-identification, rare diagnosis codes were redacted from the diagnosis table. We defined rare diagnoses as final diagnoses given to fewer than 10 unique patients from the entire NCH patient population (not limited to the sleep study patients) during a given time period. This process affected a total of 6,460 rows and 834 unique patients in our diagnosis table.
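The two de-identification steps described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the tuple layout, and the "REDACTED" placeholder are hypothetical, and the sketch assumes that one offset is drawn per patient so that intervals between a patient's events are preserved (the text only states that a random shift of +/- 180 days was applied). It also counts patients within the rows it is given, whereas the actual threshold was computed over the entire NCH patient population.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def shift_dates(records, max_shift_days=180, seed=0):
    """Shift each (patient_id, date) pair by a random offset in
    [-max_shift_days, +max_shift_days].

    Assumption: the same offset is reused for every record of a patient.
    """
    rng = random.Random(seed)
    offsets = {}  # patient_id -> offset in days
    shifted = []
    for patient_id, event_date in records:
        if patient_id not in offsets:
            offsets[patient_id] = rng.randint(-max_shift_days, max_shift_days)
        shifted.append(
            (patient_id, event_date + timedelta(days=offsets[patient_id]))
        )
    return shifted


def redact_rare_codes(rows, min_patients=10, placeholder="REDACTED"):
    """Replace diagnosis codes seen in fewer than min_patients unique patients.

    rows is a list of (patient_id, diagnosis_code) pairs; the placeholder
    value is hypothetical.
    """
    patients_per_code = defaultdict(set)
    for patient_id, code in rows:
        patients_per_code[code].add(patient_id)
    return [
        (pid, code if len(patients_per_code[code]) >= min_patients else placeholder)
        for pid, code in rows
    ]
```

Drawing a single offset per patient keeps the time between, say, a sleep study and a later encounter unchanged, which matters for any longitudinal analysis of the shifted data.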
As this project concerns analysis on de-identified data, the project did not fit the definition of Human Subjects Research as defined by the United States Department of Health and Human Services and Food and Drug Administration. Therefore, this study received NCH Institutional Review Board (IRB) exemption with HIPAA waiver. The protocol that concerns the de-identification and processing of the data, which requires handling identified data, and the collection and publication of data and summary statistics, was approved under "STUDY00000505: Preparation of sleep study data" on September 22, 2019.
The NCH Sleep DataBank consists of two folders:
- Sleep_Data contains annotated PSG recordings.
- Health_Data contains patient demographic and clinical data extracted from the EHR.
In Sleep_Data, PSG sleep studies are provided in the EDF format, and annotations are provided in separate WFDB-format and Tab-Separated Value (TSV) files. Sleep studies and their matched annotations share the same file name (SLEEP_STUDY_ID) but different extensions (.edf for the waveforms and .atr/.tsv for the annotations).
Clinical data in Health_Data are in Comma-Separated Value (CSV) files, and they are linked to the files in Sleep_Data through the same STUDY_PAT_ID. Variables follow EHR conventions, and descriptions can be found in the file Sleep_Study_Data_File_Format.pdf. The clinical data include patient demographics and longitudinal clinical data such as encounters, medications, measurements, diagnoses, and procedures.
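Because each recording and its annotations share a file name and differ only by extension, the files in Sleep_Data can be grouped by file stem. The following is a minimal sketch using only the Python standard library; the function name `pair_recordings` and the example file names are hypothetical and not part of the published code.

```python
from collections import defaultdict
from pathlib import PurePath


def pair_recordings(filenames):
    """Group file names by stem and keep only stems that have an .edf file.

    Returns {stem: {extension: original_name}}, e.g. the .edf waveform
    together with any .atr / .tsv annotation files of the same name.
    """
    by_stem = defaultdict(dict)
    for name in filenames:
        path = PurePath(name)
        by_stem[path.stem][path.suffix.lower()] = name
    # An annotation file without a waveform is dropped from the result.
    return {stem: files for stem, files in by_stem.items() if ".edf" in files}


# Hypothetical file names, for illustration only.
example = [
    "Sleep_Data/study_a.edf",
    "Sleep_Data/study_a.tsv",
    "Sleep_Data/study_a.atr",
    "Sleep_Data/study_b.tsv",
]
pairs = pair_recordings(example)
```

Here `pairs` has a single key, `study_a`, whose value maps `.edf`, `.tsv`, and `.atr` to the corresponding paths; `study_b` is omitted because it has no waveform file. The resulting paths can then be passed to whichever EDF or annotation reader is in use.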
The Python code that was used to analyze patient data, read EDF files, and run a baseline sleep stage classifier is published on GitHub.
We expect the NCH Sleep DataBank will be used to study many problems related to pediatric sleep, including but not limited to:
- Automatic sleep stage classification, especially algorithms that combine modalities beyond Electroencephalogram (EEG) or Electrocardiogram (ECG)
- Automatic real-time sleep disorder (e.g. apnea) detection
- Diagnosis prediction
- Identifying patient subgroups that could affect their symptoms or best courses of treatment
- Treatment (e.g. medications and procedures) efficacy analysis
Replaced DIAGONSIS.csv with the redacted table.
Updated access policy.
Updated access policy.
This is the release of the full dataset.
The initial release of this dataset (Version 0.1.0) contains only 10 records. These initial records will be used to gather feedback from the community. Later releases will contain the full set of record files.
Research reported in this work was supported by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under Award Number R01EB025018. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
The team thanks Tim Held for data identification, Melody Kitzmiller for data query, Dan Digby for data pipelines, Rajesh Ganta for data validation, Rahul Ragesh, Ramachandra Mannava, and Jacob Hoffman for help with sleep stage classifier development, Daniel Mobley and Michael Rueschman for uploading the data to NSRR, and Tom Pollard and Lucas McCullum for uploading the data to PhysioNet.
Conflicts of Interest
The authors declare no competing interests.
- Sleep study data analysis code on GitHub. https://github.com/liboyue/sleep_study. [Accessed on 1 May 2021]
- Kemp, B., Värri, A., Rosa, A. C., Nielsen, K. D., & Gade, J. (1992). A simple format for exchange of digitized polygraphic recordings. Electroencephalography and clinical neurophysiology, 82(5), 391–393. https://doi.org/10.1016/0013-4694(92)90009-7.
- Lee, H., Li, B., DeForte, S. et al. A large collection of real-world pediatric sleep studies. Sci Data 9, 421 (2022). https://doi.org/10.1038/s41597-022-01545-6
Only credentialed users who sign the DUA can access the files.
License (for files):
PhysioNet Credentialed Health Data License 1.5.0
Data Use Agreement:
PhysioNet Credentialed Health Data Use Agreement 1.5.0
CITI Data or Specimens Only Research
- be a credentialed user
- complete required training:
- CITI Data or Specimens Only Research
- sign the data use agreement for the project