A breast cancer screen typically consists of 4 X-ray images: 2 images of each breast, taken from different directions (these views are called MLO and CC). Thus, most (but not all) patients have MLO and CC images of both breasts, giving a total of 4 images per patient. For the purposes of the KDD Cup, each image is represented by several candidates (see stage 1 above). For each candidate, we provide the image ID, the patient ID, the (x,y) location, several features, and a class label indicating whether or not it is malignant. We provide features computed by several standard image processing algorithms - 117 in all - but for confidentiality reasons we are unable to provide some additional proprietary features. The labels indicate whether a candidate is malignant or benign (based on a radiologist's interpretation, a biopsy, or both). Note that several candidates can correspond to the same lesion. Thus, we also provide a unique lesion-ID for the malignant lesions in the training data. However, lesion-ID information will not be included in the test data.
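As an illustration of how the per-candidate information described above fits together, the following Python sketch models one candidate and groups malignant training candidates by lesion. The class name, field names, and the helper `group_by_lesion` are hypothetical and do not reflect the layout of the distributed files; grouping on the pair (patient ID, lesion-ID) is a conservative assumption that works whether or not lesion-IDs are unique across patients.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Candidate:
    """One candidate region; field names are illustrative assumptions."""
    patient_id: int
    image_id: int
    x: float
    y: float
    features: List[float]            # 117 values per candidate in the real data
    malignant: bool                  # class label (training data only)
    lesion_id: Optional[int] = None  # given only for malignant training candidates


def group_by_lesion(
    candidates: List[Candidate],
) -> Dict[Tuple[int, int], List[Candidate]]:
    """Collect malignant candidates that correspond to the same lesion."""
    groups: Dict[Tuple[int, int], List[Candidate]] = defaultdict(list)
    for c in candidates:
        if c.malignant and c.lesion_id is not None:
            groups[(c.patient_id, c.lesion_id)].append(c)
    return dict(groups)
```

Because one lesion can appear in both views of a breast, a group may contain candidates from more than one image of the same patient.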
To support this KDD Cup challenge, training information is provided for a set of 118 malignant patients (patients with at least one malignant mass lesion). We also include data from 1594 normal patients - where all candidates are presumed to be benign. The training set consists of a total of 102,294 candidate ROIs, each described by 117 features, but only an extremely small fraction of these 102,294 candidates is actually malignant.
For testing, we provide data from over 1000 patients in the same format, except that no class labels or lesion-IDs are provided.
We also provide a software function written in Matlab for plotting Free-Response Receiver Operating Characteristic (FROC) curves. This function plots the sensitivity with which malignant lesions are detected (on the y-axis) versus the average number of false alarms (on the x-axis). A malignant lesion is considered correctly identified if at least one of the candidates corresponding to that lesion is labeled as malignant by the classification algorithm.
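The lesion-level scoring rule described above can be sketched in Python as follows. This is not the provided Matlab function, only a minimal illustration under stated assumptions: each candidate has a real-valued score, a candidate is called malignant when its score is at least a threshold, and false alarms are averaged per image (the averaging unit is an assumption; the text above does not specify it). The names `froc_point` and `froc_curve` are hypothetical.

```python
from typing import Hashable, List, Optional, Sequence, Tuple


def froc_point(
    scores: Sequence[float],
    lesion_keys: Sequence[Optional[Hashable]],
    threshold: float,
    n_images: int,
) -> Tuple[float, float]:
    """Return (average false alarms per image, lesion-level sensitivity).

    lesion_keys[i] is None for a benign candidate, otherwise an identifier
    of the malignant lesion that candidate i belongs to.
    """
    all_lesions = {k for k in lesion_keys if k is not None}
    detected = set()
    false_alarms = 0
    for score, key in zip(scores, lesion_keys):
        if score >= threshold:
            if key is None:
                false_alarms += 1   # benign candidate flagged as malignant
            else:
                detected.add(key)   # one hit is enough to detect the lesion
    sensitivity = len(detected) / len(all_lesions) if all_lesions else 0.0
    return false_alarms / n_images, sensitivity


def froc_curve(
    scores: Sequence[float],
    lesion_keys: Sequence[Optional[Hashable]],
    n_images: int,
) -> List[Tuple[float, float]]:
    """Sweep the threshold over every distinct score, highest first."""
    thresholds = sorted(set(scores), reverse=True)
    return [froc_point(scores, lesion_keys, t, n_images) for t in thresholds]
```

For example, with scores `[0.9, 0.8, 0.7, 0.4, 0.2]`, lesion keys `["A", None, "A", "B", None]`, and 2 images, a threshold of 0.75 flags one candidate of lesion A and one benign candidate, giving a sensitivity of 0.5 at 0.5 false alarms per image. The second candidate of lesion A adds nothing once the lesion is already detected.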
NOTE: In order to better distinguish between the participants' entries, the training and testing data have been enriched with some difficult cases; further, proprietary features are not included in the dataset. The accuracy of the participants' entries to KDD Cup 2008 should therefore not be considered representative of the performance of the underlying Siemens CAD software that generated the features.