Database Restricted Access
KURIAS-ECG: a 12-lead electrocardiogram database with standardized diagnosis ontology
Published: Nov. 8, 2021. Version: 1.0
When using this resource, please cite:
Yoo, H., Yum, Y., Park, S., Lee, J. M., Jang, M., Kim, Y., Kim, J., Park, H., Han, K. S., Park, J. H., & Joo, H. J. (2021). KURIAS-ECG: a 12-lead electrocardiogram database with standardized diagnosis ontology (version 1.0). PhysioNet. https://doi.org/10.13026/kga0-0270.
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
The 12-lead electrocardiogram (ECG) is the fundamental test used to evaluate the electrophysiological state of the heart. Most previous ECG databases rely on diagnoses confirmed by experts, so the diagnoses are not standardized and their quality is not uniform. In addition, diagnosis statements differ depending on the ECG machine. These issues limit the use of such databases for further work, such as artificial intelligence and clinical research. Of note, modern ECG machines provide computerized rule-based ECG diagnoses comparable to physicians’ interpretations.
Here, we developed a high-quality 12-lead ECG database with standard vocabularies (SNOMED-CT and OMOP-CDM), transformed from the computerized diagnoses of ECG machines. A total of 147 ECG diagnoses were grouped into 10 categories of the Minnesota code classification. To improve the quality of the database, inappropriate ECG cases, such as those with poor quality, reversed arm leads, or missing data, were removed. In addition, to minimize skew in the database, 2000 ECG cases were extracted for each Minnesota classification category. The resulting database consists of 20000 records across the 10 categories of the Minnesota code classification, from 13862 patients. The standardized database can be utilized for comprehensive research on the diagnosis of cardiac disease and the development of robust artificial intelligence technology.
Electrocardiography (ECG) is the most basic test for diagnosing or screening cardiac diseases. Recently, many studies have been conducted to advance the pre-processing and diagnostic algorithms for ECG signals using artificial intelligence and deep learning technologies. The most important part of clinical and artificial intelligence research using ECG data is the structure and quality of the underlying database. ECG data consist of various types of data, such as diagnosis statements, waveform data, and parameters analyzed by the ECG machine.
ECG diagnoses differ from machine to machine. In addition, the machine's diagnosis statement is generated in a rule-based manner; it provides high accuracy but ultimately requires a physician's decision. These factors make it difficult to apply a standardized diagnostic system when constructing an ECG database. Low-quality ECG data caused by patient movement and device errors during recording are also a major cause of deterioration in database quality. Therefore, in this study, a standardized 12-lead ECG database was constructed from 12-lead ECGs obtained from 2017 to 2020, using the Minnesota classification categories and removing low-quality data.
The KURIAS-ECG database was created primarily from 12-lead ECG data acquired over 10-second periods at 500 Hz between 2017 and 2020. The database is composed of four sections: general metadata, analyzed parameters, and waveform data obtained directly from the ECG recording, and diagnosis statements, both as obtained directly from the ECG recordings and as standardized through SNOMED-CT and OMOP-CDM.
Four main steps were taken to construct a high-quality ECG database. First, we examined 12-lead ECGs collected between 2017 and 2020 and removed data classified as poor quality or where the ECG report noted suspected reversal of the arm leads (n=286,542). Second, from a total of 402,774 12-lead ECG waveforms, we removed records with a sampling rate of less than 500 Hz or with missing data (n=7,644), and selected only the waveforms recorded on the first visit to a hospital (n=157,594). In the third step, 2000 waveforms for each of the 10 Minnesota classification categories were extracted chronologically from the records of the 157,594 patients. In this process, every ECG diagnosis with fewer than 100 cases had all of its cases included. In the last step, noise was removed using signal processing to improve the quality of the waveforms.
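The first three selection steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the record keys (`patient_id`, `acquired`, `sampling_rate`, `poor_quality`, `lead_reversal`, `missing_data`, `category`) are hypothetical, each record is assumed to carry a single Minnesota category, and the rule for diagnoses with fewer than 100 cases and the noise-removal step are not modeled.

```python
from collections import defaultdict


def select_records(records, per_category=2000):
    """Apply the first three selection steps to a list of record dicts.

    Each dict uses hypothetical keys: patient_id, acquired (ISO date string),
    sampling_rate, poor_quality, lead_reversal, missing_data, category.
    """
    # Step 1: drop poor-quality recordings and suspected arm-lead reversals.
    kept = [r for r in records
            if not r["poor_quality"] and not r["lead_reversal"]]

    # Step 2a: drop recordings below 500 Hz or with missing data.
    kept = [r for r in kept
            if r["sampling_rate"] >= 500 and not r["missing_data"]]

    # Step 2b: keep only the earliest remaining recording of each patient.
    first_visit = {}
    for r in sorted(kept, key=lambda r: r["acquired"]):
        first_visit.setdefault(r["patient_id"], r)

    # Step 3: walk the records chronologically and cap each category.
    selected = defaultdict(list)
    for r in sorted(first_visit.values(), key=lambda r: r["acquired"]):
        if len(selected[r["category"]]) < per_category:
            selected[r["category"]].append(r)
    return dict(selected)
```

Capping each category at the same size is what keeps the category distribution from being skewed toward common findings such as sinus rhythm.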
Diagnoses include statements generated by the data post-processing system and the results of mapping those statements to the OMOP-CDM vocabulary, SNOMED-CT, and the Minnesota classification. The concept-id corresponding to the machine's diagnosis statement is used as the base information for the Minnesota classification. The concept-id and Minnesota codes corresponding to each diagnosis statement were verified by a physician. In addition, the quality of the waveforms was verified through baseline variability analysis and the accuracy of a diagnosis classification model.
The study protocol was approved by the Institutional Review Board of Korea University Anam Hospital (IRB NO. 2021AN0261). Written informed consent was waived because of the retrospective study design with minimal risk to participants. The study also complied with the Declaration of Helsinki.
Data preprocessing was carried out in two steps. First, the XML data were de-identified on the hospital's internal server, where the source data are stored. Second, the data were loaded into a database management system for efficient data extraction. In the de-identification step, information such as name, date of birth, and other personal information was deleted, and the patient ID was replaced with the ID of the hospital's common data model so that the data can be combined with electronic health records in the future. For the de-identified XML data, the tree structure was converted into a table structure using Python, and the resulting tables were migrated to MS-SQL.
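A minimal sketch of the de-identification and tree-to-table conversion, using only the Python standard library, might look as follows. The tag names (`PatientID`, `PatientName`, `DateOfBirth`) and the `id_map` lookup are hypothetical; the actual export schema of the ECG machines is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names; the real export schema is not given in the text.
IDENTIFYING_TAGS = {"PatientName", "DateOfBirth"}


def deidentify_and_flatten(xml_text, id_map):
    """Return one flat row (dict) for a single ECG report."""
    root = ET.fromstring(xml_text)
    row = {}
    for elem in root.iter():
        # Skip container elements and empty leaves: they carry no value.
        if len(elem) > 0 or elem.text is None or not elem.text.strip():
            continue
        if elem.tag in IDENTIFYING_TAGS:
            continue  # delete direct identifiers
        value = elem.text.strip()
        if elem.tag == "PatientID":
            value = id_map[value]  # swap in the common-data-model ID
        row[elem.tag] = value  # a repeated leaf tag would overwrite here
    return row
```

Each returned row maps leaf tags to values, so a list of rows can be written to a table with one column per tag.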
The KURIAS-ECG database consists of a CSV file and 20,000 waveform database files. It comprises 20,000 ECG records from 13,862 patients. The average age of the patients is 58 years (±20); 56% are male and 44% are female. The ECG data are divided into 10 classifications based on the Minnesota system, and each classification can be subdivided into statements provided by the ECG device. Table 1 shows the 10 Minnesota classification categories and the top three detailed statements for each.
Table 1. Minnesota classification category and ECG statement
|Minnesota classification category|Statement|
|Unclassified|Sinus rhythm, Sinus arrhythmia, QT interval (prolonged)|
|QRS axis deviation|Left axis deviation, Right axis deviation, Indeterminate axis|
|High amplitude R wave|LVH, RVH, Ventricular hypertrophy|
|Arrhythmia|Sinus rhythm (bradycardia), Atrial fibrillation, Sinus rhythm (tachycardia)|
|AV conduction defect|AV block, AV block (1st degree), PR interval (short)|
|Ventricular conduction defect|RBBB, RBBB (incomplete), rSr pattern in V1 and V2|
|Q and QS pattern|Myocardial infarction (inferior), Myocardial infarction (septal), Myocardial infarction (anterior)|
|ST junction and segment depression|Myocardial ischemia (lateral), ST-T abnormality (non-specific), Myocardial ischemia (anterior)|
|T wave item|T wave (abnormal), T wave (inverted), T wave (flattened)|
|Miscellaneous|ST segment elevation, P wave (abnormal), Voltage (decreased)|
The ECG diagnosis names above were abbreviated and modified slightly to save space. AV, atrioventricular; LVH, left ventricular hypertrophy; RBBB, right bundle branch block; RVH, right ventricular hypertrophy; STEMI, ST-segment elevation myocardial infarction.
The KURIAS-ECG database includes two file types: CSV and WFDB. Files provided in CSV format include general metadata, derived parameters, diagnosis statements, and waveforms. Waveforms are also provided in the WFDB format [3,4]. Data contained in CSV files are as follows:
- General metadata: Information about ECG data, such as person ID, gender, age, and acquisition date. Person ID is defined as a de-personalized identifier and contains information replaced with the ID of the common data model of the hospital.
- Analyzed parameters: Parameters automatically analyzed by the ECG machine, such as heart rate, PR interval, and QRS Duration.
- Diagnosis Statements and Standards: Diagnosis statements automatically analyzed by the post-processing system of the ECG machine and the results of mapping the OMOP-CDM vocabulary, SNOMED-CT, and Minnesota classification corresponding to the diagnosis statements.
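As a rough illustration of working with the CSV data, the sketch below counts records per Minnesota category using the standard `csv` module. The column names (`person_id`, `age`, `heart_rate`, `minnesota_category`) and the in-memory sample are assumptions made for this example; the real column names should be taken from the file itself.

```python
import csv
import io
from collections import Counter


def count_by_category(csv_file, column="minnesota_category"):
    """Count rows per Minnesota category; `column` is a hypothetical name."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Tiny in-memory stand-in for the real CSV file.
sample = io.StringIO(
    "person_id,age,heart_rate,minnesota_category\n"
    "1,63,58,Arrhythmia\n"
    "2,47,71,Unclassified\n"
    "3,70,49,Arrhythmia\n"
)
counts = count_by_category(sample)
```

With the real file, one would pass `open(path, newline="")` instead of the sample. Given how the database was constructed, each of the 10 categories would be expected to count 2000 records.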
Data contained in WFDB format are:
- Waveform data: 12-lead ECG signals recorded at 500 Hz for 10 s. The header file contains general information about the signals, such as the sampling rate, the units, and the name of each signal, and the data file contains the 12 signals stored as 16-bit values.
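In practice the waveforms would normally be read with the WFDB software [3,4], for example `wfdb.rdrecord` in the Python package. To show what the 16-bit storage means, the sketch below decodes a raw signal buffer by hand. It assumes WFDB format 16 (interleaved 16-bit little-endian two's complement samples), which the text does not state explicitly, and the `gain` and `baseline` defaults are placeholder values; the real ones are listed per signal in the header file.

```python
import struct

N_LEADS = 12
FS = 500          # sampling rate in Hz, from the description above
DURATION_S = 10   # recording length in seconds
SAMPLES_PER_LEAD = FS * DURATION_S  # 5000 samples per lead


def decode_format16(raw, n_sig=N_LEADS, gain=200.0, baseline=0):
    """Decode interleaved 16-bit little-endian samples into per-lead lists.

    `gain` (ADC units per physical unit) and `baseline` are assumed values;
    the header file of each record gives the real ones.
    """
    n_values = len(raw) // 2
    digital = struct.unpack("<%dh" % n_values, raw[: n_values * 2])
    leads = [[] for _ in range(n_sig)]
    for i, value in enumerate(digital):
        # Samples are stored frame by frame: lead 0, lead 1, ..., lead 11.
        leads[i % n_sig].append((value - baseline) / gain)
    return leads
```

Under these assumptions, one record holds 12 × 5000 samples, so its data file would be 120,000 bytes.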
In this study, the ECG2SQL code was used to facilitate the use of the database. The ECG2SQL code converts waveform data provided in WFDB format back into the original data, merges them with the data provided in CSV, and transfers the result to a database management system. The KURIAS-ECG database is intended to support a range of ECG studies, in particular those exploring the relationship between ECG conditions and high-resolution waveforms. Python code for working with the KURIAS-ECG database is available on GitHub.
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2021R1I1A1A01059747).
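The merge-and-load idea can be sketched as below. This is not the ECG2SQL code itself: it uses SQLite from the Python standard library as a stand-in for a database management system, and the table layout and the keys `record_id` and `person_id` are hypothetical.

```python
import json
import sqlite3


def load_records(conn, metadata_rows, waveforms):
    """Join metadata with waveforms by record ID and insert into one table.

    metadata_rows: iterable of dicts with hypothetical keys record_id, person_id.
    waveforms: dict mapping record_id -> list of samples for that record.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ecg ("
        "record_id TEXT PRIMARY KEY, person_id TEXT, waveform TEXT)"
    )
    rows = [
        (m["record_id"], m["person_id"], json.dumps(waveforms[m["record_id"]]))
        for m in metadata_rows
        if m["record_id"] in waveforms  # skip metadata without a waveform
    ]
    conn.executemany("INSERT INTO ecg VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
n_loaded = load_records(
    conn,
    [{"record_id": "r1", "person_id": "p1"},
     {"record_id": "r2", "person_id": "p2"}],
    {"r1": [0.01, 0.02, 0.03]},
)
```

Storing the waveform as JSON text keeps the sketch short; a real schema would more likely use a binary column or a separate samples table.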
Conflicts of Interest
The authors declare that they have no conflicts of interest.
1. Prineas, R. J., Crow, R. S., & Zhang, Z. M. (2009). The Minnesota code manual of electrocardiographic findings. Springer Science & Business Media.
2. Donnelly, K. (2006). SNOMED-CT: The advanced terminology and coding system for eHealth. Studies in health technology and informatics, 121, 279.
3. Moody, G., Pollard, T., & Moody, B. (2021). WFDB Software Package (version 10.6.2). PhysioNet. https://doi.org/10.13026/zzpx-h016.
4. Xie, C., McCullum, L., Johnson, A., Pollard, T., Gow, B., & Moody, B. (2021). Waveform Database Software Package (WFDB) for Python (version 3.3.0). PhysioNet. https://doi.org/10.13026/g35g-c061.
5. KU-RIAS code on GitHub: https://github.com/KU-RIAS [Accessed: 1 October 2021].
Only logged in users who sign the specified data use agreement can access the files.
License (for files):
PhysioNet Restricted Health Data License 1.5.0