- Dataset 1: Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin
- Dataset 2: Prediction of Gene/Protein Function and Localization
Description of Dataset 1: Prediction of Molecular Bioactivity for Drug Design -- Binding to Thrombin
Drugs are typically small organic molecules that achieve their desired activity by binding to a target site on a receptor. The first step in the discovery of a new drug is usually to identify and isolate the receptor to which it should bind, followed by testing many small molecules for their ability to bind to the target site. This leaves researchers with the task of determining what separates the active (binding) compounds from the inactive (non-binding) ones. Such a determination can then be used in the design of new compounds that not only bind, but also have all the other properties required for a drug (solubility, oral absorption, lack of side effects, appropriate duration of action, acceptable toxicity, etc.).
The present training data set consists of 1909 compounds tested for their ability to bind to a target site on thrombin, a key receptor in blood clotting. The chemical structures of these compounds are not necessary for our analysis and are not included. Of these compounds, 42 are active (bind well) and the others are inactive. Each compound is described by a single feature vector consisting of a class value (A for active, I for inactive) and 139,351 binary features, which describe three-dimensional properties of the molecule. The definitions of the individual bits are not included: we don't know what each individual bit means, only that they are generated in an internally consistent manner for all 1909 compounds. Biological activity in general, and receptor binding affinity in particular, correlate with various structural and physical properties of small organic molecules. The task is to determine which of these properties are critical in this case and to learn to accurately predict the class value. To simulate the real-world drug design environment, the test set contains 636 additional compounds that were in fact generated based on the assay results recorded for the training set. In evaluating the accuracy, a differential cost model will be used, so that the sum of the costs of the actives will be equal to the sum of the costs of the inactives. In other words, it is just as important to minimize your error rate on the actives as it is to minimize your error rate on the inactives, even though the training set contains more inactives than actives (and the test set might also).
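One way to read the differential cost model is as a class-balanced accuracy, in which each class contributes half of the total score regardless of its size. The sketch below is an illustration under that assumption, not the official scoring code; the function name `balanced_accuracy` and the list-based label format are hypothetical.

```python
def balanced_accuracy(true_labels, predicted_labels, classes=("A", "I")):
    """Average of the per-class accuracies, so each class weighs the same.

    Assumption: this mirrors the statement that the summed cost of the
    actives equals the summed cost of the inactives. The organizers'
    exact scoring formula is not given in this description.
    """
    per_class = []
    for cls in classes:
        # Collect the predictions made for compounds whose true class is cls.
        preds_for_cls = [p for t, p in zip(true_labels, predicted_labels) if t == cls]
        if not preds_for_cls:
            continue  # class absent from this sample, so skip it
        correct = sum(1 for p in preds_for_cls if p == cls)
        per_class.append(correct / len(preds_for_cls))
    return sum(per_class) / len(per_class)


# Training-set class sizes from the description: 42 actives, 1867 inactives.
truth = ["A"] * 42 + ["I"] * 1867

# A trivial predictor that calls every compound inactive.
all_inactive = ["I"] * 1909

raw_accuracy = sum(t == p for t, p in zip(truth, all_inactive)) / len(truth)
score = balanced_accuracy(truth, all_inactive)
print(f"raw accuracy: {raw_accuracy:.3f}, balanced accuracy: {score:.3f}")
```

Under this reading, always predicting "inactive" is correct on nearly all training compounds yet scores no better than chance, which is why the imbalance between the classes cannot be ignored.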
We thank DuPont Pharmaceuticals for graciously providing this data set for the KDD Cup 2001 competition. All publications referring to analysis of this data set should acknowledge DuPont Pharmaceuticals Research Laboratories and KDD Cup 2001.
Description of Dataset 2: Prediction of Gene/Protein Function and Localization
The genomes of several organisms have now been completely sequenced, including the human genome -- depending on one's definition of "completely". Interest within bioinformatics is therefore shifting somewhat away from sequencing, to learning about the genes encoded in the sequence. Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another to perform crucial functions. The present data set consists of a variety of details about the various genes of one particular type of organism. Gene names have been anonymized and a subset of the genes has been withheld for testing. The two tasks are to predict the functions and localizations of the proteins encoded by the genes. A gene/protein can have more than one function, but (at least in this data set) only one localization. The other information from which function and localization can be predicted includes the class of the gene/protein, the phenotype (observable characteristics) of individuals with a mutation in the gene (and hence in the protein), and the other proteins with which each protein is known to interact.
The full data set is in Full_File.data. But please notice that the task is quite "relational." For example, one might wish to learn a rule that says a gene G has function F if G interacts with another gene G1 that has function F. We have made an effort to build such features into Full_File.data. (For example, for each gene we give the number of interacting genes with a given function -- these features are probably useful for predicting at least one or two of the functions). But participants may wish to construct their own additional features or to use a relational data mining algorithm. While this certainly can be done from Full_File.data, it may be easier to do this from the relational tables that we used to build Full_File.data. These are in Genes_relation.data and Interactions_relation.data. Each of the data files has a corresponding names file as well.
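The relational feature mentioned above (for each gene, the number of interacting genes with a given function) can be sketched as follows. This is a minimal illustration and not the procedure used to build Full_File.data: the in-memory structures (a list of gene pairs and a dictionary from gene to its set of functions) and all gene and function names are hypothetical, and parsing of the relational tables is left out because their exact layout is defined in the names files.

```python
from collections import Counter, defaultdict


def neighbor_function_counts(interactions, functions):
    """For each gene, count its interaction partners per function.

    interactions: iterable of (gene_a, gene_b) pairs, treated as undirected.
    functions: dict mapping a gene to the set of its known functions;
               genes with unknown functions (e.g. test genes) may be absent.
    """
    neighbors = defaultdict(set)
    for gene_a, gene_b in interactions:
        if gene_a == gene_b:
            continue  # assumption: ignore self-interactions
        neighbors[gene_a].add(gene_b)
        neighbors[gene_b].add(gene_a)

    counts = {}
    for gene, partners in neighbors.items():
        counter = Counter()
        for partner in partners:
            # Each partner adds one to every function it is known to have.
            counter.update(functions.get(partner, ()))
        counts[gene] = counter
    return counts


# Hypothetical toy data; the real gene names are anonymized identifiers.
toy_interactions = [("G1", "G2"), ("G1", "G3"), ("G2", "G3")]
toy_functions = {"G2": {"F1"}, "G3": {"F1", "F2"}}

toy_counts = neighbor_function_counts(toy_interactions, toy_functions)
print(dict(toy_counts["G1"]))
```

A rule such as "G has function F if it interacts with a gene that has function F" then corresponds to checking whether the count for F is greater than zero.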
Detailed knowledge of the biology should not be necessary for this application. So much so that we nearly anonymized all the other fields along with the gene field. But in the end we decided instead to leave the other fields alone, since this might make the data set more interesting. One word of caution: your predictor for function should not use localization, and your predictor for localization should not use function, since *both* fields will be withheld from the test genes when they are provided. Also note that, because a gene may have more than one function, we will test for correct prediction of every (gene, function) pair. By the time we provide the test data, we will provide a full specification of the format for submission of your predictions.
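To make the idea of scoring every (gene, function) pair concrete, the sketch below treats each pair as a separate yes/no prediction and reports the fraction answered correctly. The exact scoring formula and submission format are not specified here, so this is only one plausible reading; the function name `pair_accuracy` and the data structures are hypothetical.

```python
def pair_accuracy(true_functions, predicted_functions, all_functions):
    """Fraction of (gene, function) pairs whose yes/no label is predicted correctly.

    true_functions, predicted_functions: dicts mapping gene -> set of functions.
    all_functions: every function label under consideration.
    Assumption: pairs are scored independently and weighted equally.
    """
    correct = 0
    total = 0
    for gene, truth in true_functions.items():
        predicted = predicted_functions.get(gene, set())
        for function in all_functions:
            total += 1
            # A pair is correct when prediction and truth agree on membership.
            if (function in truth) == (function in predicted):
                correct += 1
    return correct / total if total else 0.0


# Hypothetical toy example with two genes and three functions.
toy_truth = {"G1": {"F1", "F2"}, "G2": {"F3"}}
toy_predictions = {"G1": {"F1"}, "G2": {"F2", "F3"}}
toy_labels = ["F1", "F2", "F3"]

print(pair_accuracy(toy_truth, toy_predictions, toy_labels))
```

In the toy example, four of the six pairs agree: one function is missed for G1 and one is wrongly added for G2.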