For this task, a total of 69 cases were collected and provided to an expert chest radiologist, who reviewed each case and marked the PEs. The cases were randomly divided into training and test sets. The training set includes 38 positive cases and 8 negative cases, while the test set contains the remaining 23 cases. The test set is sequestered and will only be used to evaluate the performance of the final system.
Note: Patient numbers 3111 and 3126 in the test data turned out to duplicate patients 3103 and 3115, respectively, in the training data. Therefore, 3111 and 3126 were dropped from the final evaluation for the competition. These patients should also be discarded from testing in any future comparison to the KDD Cup results. The file drop_duplicate_patients_mask.dat, in the KDD Cup archive file, can be used to identify the excluded rows in the KDDPETest.txt file.
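As an illustration of the exclusion described in the note, the following is a minimal sketch that filters the two duplicated patients out of the test rows by patient ID. It does not read drop_duplicate_patients_mask.dat, whose layout is not described here. It assumes that the first whitespace-separated field of each row is the patient ID, and the function name drop_duplicate_patients is hypothetical.

```python
from typing import FrozenSet, Iterable, List

# Test patients that duplicate training patients 3103 and 3115.
DUPLICATE_TEST_PATIENTS: FrozenSet[int] = frozenset({3111, 3126})


def drop_duplicate_patients(
    lines: Iterable[str],
    excluded: FrozenSet[int] = DUPLICATE_TEST_PATIENTS,
) -> List[str]:
    """Return the rows whose patient ID (assumed to be Field 0) is not excluded."""
    kept = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Assumption: the ID may be written as "3111" or "3111.0".
        patient_id = int(float(fields[0]))
        if patient_id in excluded:
            continue
        kept.append(line)
    return kept
```

Typical use would be to pass the lines read from KDDPETest.txt through this function before scoring, so that results stay comparable to the competition evaluation.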
More data descriptions are provided in this PDF file.
The training data will consist of a single data file plus a file containing field (feature) names. Each line of the file represents a single candidate and comprises a number of whitespace-separated fields:
Field 0: Patient ID (unique integer per patient)
Field 1: Label. 0 for "negative" (this candidate is not a PE); >0 for "positive" (this candidate is a PE)
Field 2+: Additional features
The testing data will be in a data file of the same format, except that Field 1 (label) will be "-1", denoting "unknown".
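The field layout above can be read with a few lines of code. The following is a minimal sketch, not an official loader: the names Candidate and parse_candidates are hypothetical, and it assumes every field parses as a number and that the patient ID and label may be written either as integers or as floats with a zero fractional part.

```python
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    patient_id: int        # Field 0
    label: int             # Field 1: 0 negative, >0 positive, -1 unknown (test data)
    features: List[float]  # Field 2 onward


def parse_candidates(lines: Iterable[str]) -> List[Candidate]:
    """Parse whitespace-separated candidate rows into Candidate records."""
    candidates = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"expected at least 2 fields, got {len(fields)}")
        # Assumption: ID and label may appear as "3103" or "3103.0".
        patient_id = int(float(fields[0]))
        label = int(float(fields[1]))
        features = [float(x) for x in fields[2:]]
        candidates.append(Candidate(patient_id, label, features))
    return candidates


def is_positive(candidate: Candidate) -> bool:
    """A candidate is a PE when its label is greater than zero."""
    return candidate.label > 0
```

For example, passing an open file object for the training file to parse_candidates yields one record per candidate. In the testing file every record would have a label of -1, so is_positive returns False for all of them and should not be used there.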