KDD Cup 2001 involves three tasks, based on two datasets. The two training datasets are available from the links below, as zip files. The first dataset is a little over half a gigabyte when uncompressed and comes as a single text file, with one row per record and fields separated by commas. The second is a little over 7 megabytes uncompressed. It includes a single text file with all the data; again, the format is one row per record with comma-separated fields. This second dataset, however, is quite relational in nature, so improved accuracy may be possible by constructing more complex features or using a relational data mining technique (see the README file that comes with it). Nevertheless, we've tried to pre-compute some of the interesting relations as added fields, so that standard feature-vector algorithms can compete well. For both datasets, "names" files are also included that give the names of the fields; the names are "meaningful" only for the second dataset. Each dataset also comes with a README file that describes the nature of the task. The README files are repeated at the bottom of this page for those who wish to read about the data/task before choosing to download the data.
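Because the first data file is large, it may be convenient to read it one record at a time rather than loading it whole. The following is a minimal sketch in Python, assuming only what is stated above (one row per record, comma-separated fields). The function name `iter_records` and the file name in the usage note are hypothetical, and the layout of the "names" files is not described here, so the sketch does not attempt to parse them.

```python
import csv


def iter_records(path):
    """Yield one list of field values per record, streaming from disk.

    Assumes one row per record with comma-separated fields; blank lines
    are skipped. Nothing is assumed about the number or type of fields.
    """
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if row:
                yield row


def count_records(path):
    """Count the records in a file without holding them all in memory."""
    return sum(1 for _ in iter_records(path))
```

For example, `count_records("train.data")` would report the number of rows, where `train.data` stands in for whatever the unzipped file is actually called.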
The following are the keys that were used for scoring. Several points are worth noting regarding the Function and Localization keys. First, submissions varied widely in their use of punctuation, case, and spelling for function and localization names. Because of this variation, we decided to have our code remove punctuation and look at only a long enough prefix of each name to distinguish it from all others; the name was then converted into a shorter standard form. These shorter forms are the ones given in the keys below. We also hand-checked the entries and the converted forms. Second, one gene in the test set had two localizations (contradicting our earlier statement that each gene had only one localization). For this gene, the predicted localization was counted correct if it matched *either* of the correct localizations. Third, one function appeared in a test set gene but in no training set gene. This of course made it impossible to get 100% accuracy, but everyone was subject to this same constraint, and we think it just goes with the territory of a real-world task.
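The name matching described above can be sketched roughly as follows. This is not the scoring code that was actually used; it is a minimal Python illustration that assumes names are lowercased, stripped of everything but letters and digits, and then reduced to the shortest prefix that no other known name shares. The function names and the example names in the usage note are hypothetical.

```python
def normalize(name):
    """Lowercase a name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def shortest_unique_prefixes(names):
    """Map each normalized name to its shortest distinguishing prefix.

    If a name is itself a prefix of another name, no prefix can
    distinguish it, so the full normalized name is used instead.
    """
    normalized = sorted({normalize(n) for n in names})
    prefixes = {}
    for name in normalized:
        others = [o for o in normalized if o != name]
        chosen = name  # fallback: the whole name
        for length in range(1, len(name) + 1):
            candidate = name[:length]
            if not any(o.startswith(candidate) for o in others):
                chosen = candidate
                break
        prefixes[name] = chosen
    return prefixes


def canonical(raw, prefixes):
    """Convert a submitted name to its standard form, or None if unknown."""
    cleaned = normalize(raw)
    matches = [p for p in prefixes.values() if cleaned.startswith(p)]
    # Prefer the longest match so the fallback case stays unambiguous.
    return max(matches, key=len) if matches else None


def is_correct(predicted_form, correct_forms):
    """A prediction counts if it matches any of the gene's correct forms."""
    return predicted_form in correct_forms
```

With the illustrative names `["Nucleus", "Cytoplasm", "Cytoskeleton", "Mitochondria"]`, the standard form for "cytoplasm" is `cytop`, and a submission written as `Cyto-plasm.` is converted to that same form. Passing a set of two forms to `is_correct` covers the gene with two localizations: a prediction matching either one is counted correct.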