23 March: Registration is now open. The software provided for challenge participants has been updated.
27 March: Register and submit a preliminary entry by 25 April if you wish to participate in the Challenge.
26 April: Phase 1 has ended and no further entries may be submitted until Phase 2 begins on 1 June.
8 May: The data sets have been revised for Phase 2 (see details).
30 August: Final scores for both events have been posted.
The development of methods for prediction of mortality rates in Intensive Care Unit (ICU) populations has been motivated primarily by the need to compare the efficacy of medications, care guidelines, surgery, and other interventions when, as is common, it is necessary to control for differences in severity of illness or trauma, age, and other factors. For example, comparing overall mortality rates between trauma units in a community hospital, a teaching hospital, and a military field hospital is likely to reflect the differences in the patient populations more than any differences in standards of care. Acuity scores such as APACHE and SAPS-II are widely used to account for these differences in the context of such studies.
By contrast, the focus of the PhysioNet/CinC Challenge 2012 is to develop methods for patient-specific prediction of in-hospital mortality. Participants will use information collected during the first two days of an ICU stay to predict which patients survive their hospitalizations, and which patients do not.
See the Quick Links at the top of this page to download the Challenge data!
The data used for the challenge consist of records from 12,000 ICU stays. All patients were adults who were admitted for a wide variety of reasons to cardiac, medical, surgical, and trauma ICUs. ICU stays of less than 48 hours have been excluded. Patients with DNR (do not resuscitate) or CMO (comfort measures only) directives were not excluded.
Four thousand records comprise training set A, and the remaining records form test sets B and C. Outcomes are provided for the training set records, and are withheld for the test set records.
Up to 42 variables were recorded at least once during the first 48 hours after admission to the ICU. Not all variables are available in all cases, however. Six of these variables are general descriptors (collected on admission), and the remainder are time series, for which multiple observations may be available.
Each observation has an associated time-stamp indicating the elapsed time of the observation since ICU admission in each case, in hours and minutes. Thus, for example, a time stamp of 35:19 means that the associated observation was made 35 hours and 19 minutes after the patient was admitted to the ICU.
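The time-stamp convention above can be turned into a single number for arithmetic. The following is a minimal sketch in C, assuming stamps always have the form HH:MM as described; the function name timestamp_to_minutes is hypothetical and is not part of the Challenge software.

```c
#include <assert.h>
#include <stdio.h>

/* Convert an "HH:MM" elapsed-time stamp to minutes since ICU admission.
   Returns -1 if the text is not a well-formed stamp.
   (Hypothetical helper; not taken from the sample entries.) */
int timestamp_to_minutes(const char *stamp)
{
    int hours, minutes;
    char extra;

    assert(stamp != NULL);
    /* The trailing %c must NOT match: it detects junk after the minutes. */
    if (sscanf(stamp, "%d:%d%c", &hours, &minutes, &extra) != 2)
        return -1;
    if (hours < 0 || minutes < 0 || minutes > 59)
        return -1;
    return hours * 60 + minutes;
}
```

With this helper, the stamp 35:19 from the example above maps to 35 × 60 + 19 = 2119 minutes.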
Each record is stored as a comma-separated value (CSV) text file. To simplify downloading, participants may download a zip file or tarball containing all of training set A or test set B. Test set C will be used for validation only and will not be made available to participants.
Update (8 May 2012): The extraneous ages that were present in the previous versions of some data files have been removed, and a new general descriptor (ICUType, see below) has been added in each data file.
Five additional outcome-related descriptors, described below, are known for each record. These are stored in separate CSV text files for each of sets A, B, and C, but only those for set A are available to challenge participants.
All valid values for general descriptors, time series variables, and outcome-related descriptors are non-negative (≥ 0). A value of -1 indicates missing or unknown data (for example, if a patient's height was not recorded).
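Because -1 is a sentinel rather than a measurement, it should be filtered out before any statistic is computed. The sketch below shows one way this might be done in C; the names is_missing and mean_of_valid are hypothetical, and treating every negative value (not only -1) as missing is an assumption based on the statement that all valid values are non-negative.

```c
#include <assert.h>
#include <stddef.h>

#define MISSING_VALUE (-1.0)

/* Valid values are non-negative, so any negative value is treated as missing.
   (Assumption: only -1 is documented as the missing-data marker.) */
int is_missing(double value)
{
    return value < 0.0;
}

/* Mean of the non-missing values; returns MISSING_VALUE if none are valid. */
double mean_of_valid(const double *values, size_t count)
{
    double sum = 0.0;
    size_t valid = 0;

    assert(values != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (!is_missing(values[i])) {
            sum += values[i];
            valid++;
        }
    }
    return valid > 0 ? sum / (double)valid : MISSING_VALUE;
}
```

Averaging without this filter would pull the result toward -1 whenever a variable was not recorded.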
As noted, these six descriptors are collected at the time the patient is admitted to the ICU. Their associated time-stamps are set to 00:00 (thus they appear at the beginning of each patient's record).
The ICUType was added for use in Phase 2; it specifies the type of ICU to which the patient has been admitted.
These 37 variables may be observed once, more than once, or not at all in some cases:
The time series measurements are recorded in chronological order within each record, and the associated time stamps indicate the elapsed time since admission to the ICU. Measurements may be recorded at regular intervals ranging from hourly to daily, or at irregular intervals as required. Not all time series are available in all cases.
In a few cases, such as blood pressure, different measurements made using two or more methods or sensors may be recorded with the same or only slightly different time-stamps. Occasional outliers should be expected as well.
*Note that Weight is both a general descriptor (recorded on admission) and a time series variable (often measured hourly, for estimating fluid balance).
The outcome-related descriptors are kept in a separate CSV text file for each of the three record sets; as noted, only the file associated with training set A is available to participants. Each line of the outcomes file contains these descriptors:
The Length of stay is the number of days between the patient's admission to the ICU and the end of hospitalization (including any time spent in the hospital after discharge from the ICU). If the patient's death was recorded (in or out of hospital), then Survival is the number of days between ICU admission and death; otherwise, Survival is assigned the value -1. Since patients who spent less than 48 hours in the ICU have been excluded, Length of stay and Survival never have the values 0 or 1 in the challenge data sets. Given these definitions and constraints,
Survival > Length of stay ⇒ Survivor
Survival = -1 ⇒ Survivor
2 ≤ Survival ≤ Length of stay ⇒ In-hospital death
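The three rules above can be expressed as a small function. This is a sketch only; the function name in_hospital_death is hypothetical, and the precondition simply restates the constraint that Survival is either -1 or at least 2.

```c
#include <assert.h>

/* Derive the in-hospital death label from the two outcome descriptors.
   Returns 1 for in-hospital death, 0 for survivor.
   (Hypothetical helper illustrating the rules stated in the text.) */
int in_hospital_death(int survival_days, int length_of_stay_days)
{
    /* Survival is -1 (no recorded death) or >= 2; it is never 0 or 1. */
    assert(survival_days == -1 || survival_days >= 2);

    if (survival_days == -1)
        return 0;                      /* no death recorded: survivor */
    if (survival_days > length_of_stay_days)
        return 0;                      /* died after discharge: survivor */
    return 1;                          /* 2 <= Survival <= Length of stay */
}
```

For example, a patient with Survival = 5 and Length of stay = 10 is labelled an in-hospital death, whereas Survival = 30 with the same length of stay is labelled a survivor.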
To begin, we recommend studying the training set as preparation for the Challenge itself. In particular, note that the SAPS-I score can be calculated readily from the time series, as the sample entries below do. To succeed in the Challenge, you should aim to outperform the sample entries (see Software below).
All entries in the Challenge must be in the form of source code that analyses a single Challenge record, producing a prediction (0: survival, or 1: in-hospital death) and an estimate of the risk of death (as a number between 0 and 1, where 0 is certain survival and 1 is certain death).
Your entry may be written in portable (ANSI/ISO) C or MATLAB/Octave m-code; other languages, such as Java, Perl, and R, may be acceptable (see special requirements for entries in other languages below), but please ask us first, and do so no later than 7 April 2012. Entries must accept properly-formatted input and produce properly-formatted output, either as physionet2012.m does (if written in m-code), or as physionet2012.c does (otherwise).
Acceptable entries are evaluated and scored by PhysioNet using an automated test framework, two versions of which are also available to participants for testing their entries unofficially prior to submitting them. The framework starts execution of an entry, supplies data from a single Challenge record, and collects the entry's analysis for that record; this process is a "run". The framework performs a separate run for each of the 4000 records in set B or set C.
Entries will be restarted for each run (each test record); they may not store information for use in later runs (for example, by writing files to be read later, or, in MATLAB entries, by setting global variables). Entries may include files that may be read but not modified during the test.
Awards will be presented to the most successful eligible participants during Computing in Cardiology (CinC) 2012. To be eligible for an award, you must:
An important goal of this Challenge, and of others in the annual series of PhysioNet/CinC Challenges, is to accelerate progress on the Challenge questions, not only during the limited period of the Challenge, but also afterward. In pursuit of this goal, we strongly encourage participants to submit open-source entries that will be made freely available after the conclusion of the Challenge via PhysioNet. If your entry is not intended as an open-source entry, please state this clearly within its first few lines.
Eligible authors of the entries that receive the best set C scores in each Challenge event will receive award certificates during the closing plenary session of CinC on 12 September 2012. In recognition of their contributions to further work on the Challenge problem, eligible authors of the open-source entries that receive the best set C scores will also receive monetary awards. No team or individual will receive more than one such monetary award.
We have provided sample entries written in MATLAB m-code and in C, test frameworks that can be used for batch-processing a set of Challenge data using a sample entry or your own entry, code for calculating unofficial scores, as well as the outputs of the sample entries for set A. Use this software to test your entry before submitting it, to verify that it can accept properly-formatted input and produce properly-formatted output. If you wish, you may incorporate code from the sample entries within your own entry, but you will have to add something of your own creation in order to succeed in the Challenge!
For participants developing entries using MATLAB:
A valid entry written in m-code must be a function named physionet2012, with this signature:
[risk,prediction]=physionet2012(time,param,value)
The function must be able to run this way within the test framework, genresults.m (below), on a 64-bit GNU/Linux platform running MATLAB R2010b (or a later version). See genresults.m for definitions of the input and output variables. With prior approval, your entry may use most MATLAB toolboxes.
Scores calculated by lemeshow.m may differ slightly from the official scores (calculated using score.c, below) due to differences in rounding. Scores calculated by score.c will be used to determine the final rankings.
For participants developing entries using C or (with prior approval) another language:
A valid entry written in any language other than m-code must be provided in source form with instructions (a commented Makefile would be ideal) for producing an executable program named physionet2012 from the source file(s). The executable program must be able to run in this way within the test framework on a 64-bit GNU/Linux platform:
physionet2012 <input-file >output-file
i.e., reading the contents of input-file (a Challenge data file such as set-b/142675.txt) from its standard input, and writing its analysis of the input to its standard output, as a single newline-terminated line in this format:
142675,0,0.123
where the three fields are the RecordID, the binary prediction, and the risk estimate, as described below.
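As an illustration of the output format, the sketch below builds such a line in C. The function name format_result is hypothetical and this is not the code of physionet2012.c; printing the risk with three decimal places is an assumption taken from the example line, not a stated requirement.

```c
#include <assert.h>
#include <stdio.h>

/* Write "RecordID,prediction,risk\n" into buf.
   Returns the number of characters written (excluding the terminator),
   or -1 if buf is too small. (Hypothetical helper; three decimal places
   are assumed from the example line in the text.) */
int format_result(char *buf, size_t size, long record_id,
                  int prediction, double risk)
{
    int written;

    assert(buf != NULL && size > 0);
    assert(prediction == 0 || prediction == 1);
    assert(risk >= 0.0 && risk <= 1.0);

    written = snprintf(buf, size, "%ld,%d,%.3f\n",
                       record_id, prediction, risk);
    return (written < 0 || (size_t)written >= size) ? -1 : written;
}
```

An entry would then write the buffer to its standard output once per run, for instance with fputs(buf, stdout).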
For participants developing entries in R:
Rscript physionet2012.R <132539.txt >output.txt
This creates an output file output.txt, containing one line:
The sample R entry doesn't analyze the input; it simply reads it and produces a correctly-formatted output line. Use physionet2012.R as a model for your R-code. You can test your entry on set A using genresults.sh if you replace this line in it:
./physionet2012 <$R >>$OUT
with this one:
Rscript physionet2012.R <$R >>$OUT
As in previous challenges, participants may compete in multiple events:
Entries must output both predictions and risk estimates, but if you do not wish to compete in one of the two events your entry may output any acceptable values for that event.
Scoring for Event 1 is based on two metrics: sensitivity (Se) and positive predictivity (+P). We define the numbers of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) as below:
Using these definitions, the 2 metrics and the scoring for Event 1 are given by:
Se = TP / (TP + FN)   [the fraction of in-hospital deaths that are predicted]
+P = TP / (TP + FP)   [the fraction of predicted in-hospital deaths that are correct]
Score1 = min(Se, +P)   [the minimum of sensitivity and positive predictivity]
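The Event 1 calculation can be sketched in C as follows. This is not the official score.c; the function name event1_score is hypothetical, a "positive" is taken to mean a predicted in-hospital death (prediction 1), and returning 0 when either ratio is undefined is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Event 1 score: min(Se, +P).
   predicted[i] and observed[i] are 0 (survival) or 1 (in-hospital death).
   Returns 0 if Se or +P is undefined (no deaths observed, or none
   predicted); that convention is an assumption of this sketch. */
double event1_score(const int *predicted, const int *observed, size_t count)
{
    size_t tp = 0, fp = 0, fn = 0;
    double se, ppv;

    assert(predicted != NULL && observed != NULL);
    for (size_t i = 0; i < count; i++) {
        if (predicted[i] == 1 && observed[i] == 1)
            tp++;                       /* death correctly predicted */
        else if (predicted[i] == 1 && observed[i] == 0)
            fp++;                       /* death predicted, patient survived */
        else if (predicted[i] == 0 && observed[i] == 1)
            fn++;                       /* death missed */
    }
    if (tp + fn == 0 || tp + fp == 0)
        return 0.0;

    se = (double)tp / (double)(tp + fn);
    ppv = (double)tp / (double)(tp + fp);
    return se < ppv ? se : ppv;
}
```

Taking the minimum means that an entry cannot raise its score by predicting death for everyone (high Se, low +P) or for almost no one (high +P, low Se).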
The sample MATLAB entry based on SAPS-I earns an unofficial Event 1 score of 0.296 on set A, whereas random guessing yields a score of 0.139. A perfect (and almost certainly unattainable) Event 1 score is 1, so there is much room for improvement on the sample entry.
Scoring for Event 2 is based on the Hosmer-Lemeshow H statistic (a common measure of model calibration), and on the spread, D, of risk estimates.
To calculate the H statistic for a given entry, the in-hospital mortality risks predicted by that entry are first sorted and the corresponding records are binned into deciles designated by g = 1, 2, 3, ..., 10. Thus the first decile (g = 1) of the 4000-record set B contains the 400 records with the lowest predicted risk, the second decile contains the next 400 records, etc. The H statistic is then calculated as:
H = Σ (g = 1 to 10) (Og − Eg)² / (Ng πg (1 − πg) + 0.001)
where for each decile g, Og is the observed number of in-hospital deaths, Eg is the predicted number of deaths, Ng is the number of records (400), and πg is the mean estimated risk for records in the decile. This definition differs from the standard definition of H by the addition of 0.001 in the denominator, which avoids division by zero if πg is zero or one. That can occur if an entry estimates a risk of zero (or one) for 400 or more records in a set. Doing so, regardless of accuracy, substantially increases H (lower values are better), so we strongly recommend not allowing your entry to output risk values below about 0.01 or above about 0.99.
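The recommendation about extreme risk values can be enforced with a simple clamp applied just before output. The sketch below is one way to do it in C; the name clamp_risk and the exact limits 0.01 and 0.99 are illustrative choices based on the "about 0.01" and "about 0.99" guidance above.

```c
#include <assert.h>

#define RISK_FLOOR   0.01   /* illustrative lower limit */
#define RISK_CEILING 0.99   /* illustrative upper limit */

/* Keep a risk estimate away from exactly 0 or 1.
   (Hypothetical helper; the limits follow the guidance in the text.) */
double clamp_risk(double risk)
{
    assert(risk == risk);   /* NaN is the only value not equal to itself */
    if (risk < RISK_FLOOR)
        return RISK_FLOOR;
    if (risk > RISK_CEILING)
        return RISK_CEILING;
    return risk;
}
```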
To be useful as input to medical decisions for individual patients, risk estimates should accurately reflect individual patient risks, rather than simply the risk for the entire population of patients. For this reason, the Event 2 score is also based on the range of risks, specifically on the difference, D, between the mean risk estimates in the top and bottom deciles from the H calculation (π10 − π1). The Event 2 score is thus the range-normalized H statistic, defined as H/D.
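The Event 2 calculation described above can be sketched in C as follows. This is not the official score.c, and several details are assumptions: the names scored_record and event2_score are hypothetical, Eg is computed as the sum of the risk estimates in the decile (equivalently Ng × πg), the record count is assumed to be a multiple of 10 (as 4000 is), and records with tied risks are ordered arbitrarily by qsort.

```c
#include <assert.h>
#include <stdlib.h>

#define DECILES 10

struct scored_record {
    double risk;   /* estimated risk of in-hospital death, 0..1 */
    int died;      /* observed outcome: 1 = in-hospital death, 0 = survivor */
};

/* qsort comparator: ascending order of estimated risk. */
static int by_risk(const void *a, const void *b)
{
    double ra = ((const struct scored_record *)a)->risk;
    double rb = ((const struct scored_record *)b)->risk;
    return (ra > rb) - (ra < rb);
}

/* Range-normalized Hosmer-Lemeshow statistic, H/D (lower is better).
   Sorts the records in place. If every decile has the same mean risk,
   D is zero and the result is not finite; callers must handle that case. */
double event2_score(struct scored_record *records, size_t count)
{
    size_t per_decile = count / DECILES;
    double h = 0.0, pi_first = 0.0, pi_last = 0.0;

    assert(records != NULL && count >= DECILES && count % DECILES == 0);
    qsort(records, count, sizeof records[0], by_risk);

    for (size_t g = 0; g < DECILES; g++) {
        double expected = 0.0;   /* Eg: assumed to be the sum of risks */
        double observed = 0.0;   /* Og: observed deaths in the decile */
        for (size_t i = g * per_decile; i < (g + 1) * per_decile; i++) {
            expected += records[i].risk;
            observed += records[i].died;
        }
        double n = (double)per_decile;          /* Ng */
        double pi = expected / n;               /* mean risk in decile */
        double diff = observed - expected;
        h += diff * diff / (n * pi * (1.0 - pi) + 0.001);
        if (g == 0)
            pi_first = pi;
        if (g == DECILES - 1)
            pi_last = pi;
    }
    return h / (pi_last - pi_first);            /* H / D */
}
```

Dividing by D penalizes entries that assign nearly the same risk to every patient: such entries can achieve a small H, but their narrow range of risks inflates H/D.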
The sample MATLAB entry based on SAPS-I achieves an unofficial Event 2 score of 68 on set A, and random guessing scores 9666. As for Event 1, an ideal score (0 in this case) is almost certainly unattainable, but it should be possible to improve on the sample entry. The figure below shows the observed and predicted numbers of deaths for each decile of risk assigned by the sample MATLAB entry for set A.
We would really like to say yes! What will determine our answer is whether it is practical and reasonably efficient for us to evaluate your entry, and you can help us by testing and documenting the procedure needed to do so thoroughly.
If you can show us how to run your entry in an unmodified copy of our test framework for entries written in C, using only compilers, interpreters, libraries, and other standard components that are freely available for GNU/Linux, we encourage you to submit code written in the language of your choice. We will accept entries written in Java, Perl, or R under these conditions, and (if you ask nicely) we will consider adding other languages to this short list.
Please give us as much extra time as possible to review entries not written in C or m-code.
Sorry, no. We do not require participants to submit open-source entries, but we will not test executables that we have not compiled ourselves from source code that we have inspected.
No. We are trying to encourage both experimentation with multiple approaches and sustained effort. In past Challenges some participants have used their entire allowance of entries before the first deadline, and others have saved their entries until hours before the final deadline. The most successful participants have usually reflected on each set of results, refining their ideas (and not merely their decision thresholds) before submitting the next entry. This approach yields better results, and it also allows us to review your entries and provide scores more rapidly than if we receive a large fraction of them just before the deadlines.
Yes, certainly. SAPS-I uses fewer than half of the variables provided in the Challenge data sets, and only the first half (24 of 48 hours) of those. There is plenty of room for improvement in both events.
The scores are different because the two sample entries are not exactly the same algorithm. The SAPS calculations are very similar, but the C version makes use of data with a time stamp of '24:00' (i.e., exactly 24 hours from the start) and the m-code version stops at '23:59'. This accounts for the very slight difference in the Event 1 scores even though both entries use the same decision threshold. The risk estimates (which determine the Event 2 scores) are completely different; the m-code entry uses a function that was fitted to the observed distribution of risk in set A, and the C entry uses a lookup table that is not optimized at all.
Fund another award and we will consider adding another event!
No single measure can summarize all of the important aspects of performance on the Challenge problem in a single number; that's why we have multiple events. Since scoring metrics inevitably influence how participants design and refine their entries, we have chosen metrics that provide incentives to make useful predictions. Many alternatives would reward trivial and clinically irrelevant strategies; for example, one can obtain a high overall accuracy by predicting that all patients will survive, but such a predictor is of no value whatsoever.
On the off-chance that someone on Baker or Howard Islands in the mid-Pacific, or on a ship at sea between longitudes 172.5° W and 180°, submits an entry just before midnight local time on 25 April, and in order to be fair to everyone else, the deadline for phase 1 was noon GMT on 26 April; at that time, the software that collects entries stopped doing so.
The deadline is one of several that are important if you would like to be eligible for the Challenge awards (see above). Anyone who wishes to participate without eligibility for awards is welcome to join in unofficially during Phase 2 (1 June through 25 August).