This page lists questions that were asked together with answers. Only questions of general interest appear here, without duplication; numerous other questions are omitted, as well as numerous other phrasings of questions that already appear here.
I would like to ask whether one can focus on the analysis of just one of the two datasets in order to participate in the KDD Cup.
Certainly. You may submit predictions for one, two, or all three tasks. Three winners will be identified, one for each of the three tasks: (1) prediction of active compounds for Dataset 1, (2) prediction of function for Dataset 2, (3) prediction of localization for Dataset 2.
Regarding Dataset 2, if the test data contains a table similar to Interactions_relation, will it contain data on interactions (test gene A - test gene B) in addition to interactions (test gene - learning gene), or only the latter kind of interactions?
The test data will contain interactions of both types, test genes with training genes and test genes with other test genes.
Regarding Dataset 1, can we expect the ratio of the number of active compounds to the number of inactive compounds in the final test set to be roughly the same as the ratio in the given training set?
No, and to keep this similar to the real-world scenario, we don't want to say what the ratio is. But being generous people :-), we will give away 2 items of information. (1) The compounds in the test set were made after chemists saw the activity results for the training set, so as you might expect the test set has a higher fraction of actives than did the training set. (2) Nevertheless, the test set still has more inactives than actives. We realize our method of testing makes the task tougher than it would be with the standard assumption that test data are drawn according to the same distribution as the training data. But this is a common scenario for those working in the pharmaceutical industry.
How will the entries be judged? Accuracy, speed of computation, conciseness of rule?
Entirely by test set accuracy, or 1 - error. But please note that for Dataset 1, error will be the sum of error on actual actives and error on actual inactives. Thus, for example, if there are 10 actives and 100 inactives in the test set, then each active will effectively count 10 times as much as each inactive.
How do we submit our entries? Do we ftp our algorithm to you or just simply the test results? What are the constraints? Processing window? Hardware?
You will send us simply the test results in a text file. We will specify the format of this file when we provide the test data. You may arrive at your predictions for the test data in any way you like (we impose no constraints).
What is an essential gene and what is a complex?
An essential gene is one without which the organism dies. Some proteins are complexes of several peptides (each encoded by a single gene). So if several genes have the same complex, it means they code for different parts of the same protein. Your data mining system should be able to make good use of these fields without your having to give the fields any kind of special treatment based on domain knowledge. You may just treat them as nominal (discrete-valued) fields with the possible values listed in the Genes_relation.names file. The same is true for phenotype, class, motif, etc.
For Dataset 2, what is the meaning of the Corr (real-valued) field for two interacting genes?
This is the correlation between gene expression patterns for the two genes. A correlation far from 0 implies that these genes are likely to influence one another strongly.
For Dataset 1, assume that the test set contains NA active substances and NI inactive ones. My procedure correctly classifies NAcor of the NA active substances and NIcor of the NI inactive substances. Is it correct that the measure of error of my procedure would be Err = (NA - NAcor)/NA + (NI - NIcor)/NI? Is it Err I should minimize?
Yes. The person/group that minimizes this on the test set wins. This is equivalent to minimizing the ordinary error with differential costs. For example, if the test set contains 10 actives and 100 inactives, we want to maximize accuracy (minimize ordinary error) where each active counts 10 times but each inactive only once. This is the standard accuracy with differential misclassification costs as is used throughout data mining and machine learning.
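The error measure confirmed above can be sketched in a few lines of Python. This is only an illustration, not the organizers' scoring code; the function name `balanced_error` and the class labels "A" and "I" are assumptions made for the example.

```python
def balanced_error(y_true, y_pred, active="A", inactive="I"):
    """Err = (NA - NAcor)/NA + (NI - NIcor)/NI.

    y_true and y_pred are equal-length sequences of class labels.
    The labels "A" and "I" are illustrative assumptions.
    """
    na = sum(1 for t in y_true if t == active)
    ni = sum(1 for t in y_true if t == inactive)
    if na == 0 or ni == 0:
        raise ValueError("need at least one active and one inactive example")
    # Count correctly classified examples within each true class.
    na_cor = sum(1 for t, p in zip(y_true, y_pred) if t == active and p == active)
    ni_cor = sum(1 for t, p in zip(y_true, y_pred) if t == inactive and p == inactive)
    return (na - na_cor) / na + (ni - ni_cor) / ni


# 10 actives and 100 inactives, as in the example above.
truth = ["A"] * 10 + ["I"] * 100
all_inactive = ["I"] * 110            # misses every active
print(balanced_error(truth, all_inactive))
```

Predicting "inactive" for everything gets 100 of 110 examples right under ordinary accuracy, yet it scores an error of 1.0 here, the same as predicting "active" for everything.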
I was looking at the Thrombin data set, and found that 593 of the 1909 samples are all zero (i.e., none of the bits are 1). Included in this set of 593 are two Active compounds. Is this correct, or could it be a mistake?
This is correct. Clearly one cannot get 100% training set accuracy because of this, but one probably can remove the two all-zero actives from the training set without incurring any disadvantage. I would not suggest removing any of the all-zero inactives, since their contributions to the frequencies of the different attributes are important (at least for frequency-based approaches such as decision/classification trees).
A small question w.r.t. dataset 2. Some attributes have missing values (now denoted by a '?'), but there are also 'Unknown' values. What is the difference between these?
There is no difference between 'unknown' and missing values. In some cases, the experimentalist assigned the particular gene to the class 'unknown'. In other cases where the class was not known, it was left blank. In the data cleansing step, both should be treated as missing values.
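As a minimal sketch of that cleansing step, the helper below maps '?', 'Unknown' (in any capitalization), and blank fields to a single missing marker. The function name and the choice of `None` as the marker are assumptions for illustration.

```python
# Tokens treated as missing; compared case-insensitively after trimming.
MISSING_TOKENS = {"?", "unknown", ""}


def normalize_missing(value):
    """Return None for missing/unknown fields, otherwise the trimmed value."""
    if value is None:
        return None
    cleaned = value.strip()
    return None if cleaned.lower() in MISSING_TOKENS else cleaned


row = ["G234064", "?", "Unknown", " nucleus "]
print([normalize_missing(field) for field in row])
```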
Is there an error concerning gene G239017? I found two localizations given for this gene.
According to the classification scheme used, a gene can have multiple localizations. Actually the localizations refer to the gene product (protein) rather than the gene itself (e.g. a cytoplasmic transcription factor that moves into the nucleus given a specific signal). Nevertheless, almost all of the proteins have only one localization, and we will ensure that each protein in the test set has only one localization.
Just for completeness: will all functions and localizations of test examples come from the set of functions and localizations of training examples? There could be at least one different case, indicated to me by the occurrence of "TRANSPOSABLE ELEMENTS VIRAL AND PLASMID PROTEINS " in the file Genes_relation.names as one possible value for the function attribute, while I did not find this value for any training example.
It is possible that some cases are not represented in the training set. This reflects the real situation where genes of a given class have not been found or confirmed yet.
I am a bit puzzled by some issues of reflexive and symmetrical interaction relations contained in Interactions_relation.data. As far as I can see, there are 44 cases of genes interacting with themselves. (Why not all?) Moreover, I found 14 gene pairs, where gene#1 interacts with gene#2, and also gene#2 with gene#1. (Again, why are these cases just sporadic?) Could you maybe give some background information about these matters?
Interactions are not necessarily reflexive. The self-interactions that are listed arise because certain protein molecules bind to copies of themselves to form homo-dimers. All interactions are symmetrical, however: if gene1 interacts with gene2, then gene2 interacts with gene1. We have tried to list each interaction only once, but in some cases both orderings of a pair made it into the final table. These should be considered duplicate records.
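One way to drop such duplicate records is to compare pairs in an order-independent form. The sketch below assumes the interactions have already been read into a list of (gene1, gene2) tuples; the function name is hypothetical.

```python
def dedupe_interactions(pairs):
    """Keep the first occurrence of each unordered gene pair.

    (g1, g2) and (g2, g1) are treated as the same interaction;
    self-interactions such as (g, g) are kept once.
    """
    seen = set()
    unique = []
    for a, b in pairs:
        key = (a, b) if a <= b else (b, a)   # order-independent key
        if key not in seen:
            seen.add(key)
            unique.append((a, b))
    return unique


pairs = [("G1", "G2"), ("G2", "G1"), ("G3", "G3"), ("G1", "G2")]
print(dedupe_interactions(pairs))
```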
On the thrombin data set, is there any particular order in the way the 139351 binary features were generated?
For function prediction, the README file gives as an example of a function '"Auxotrophies, carbon and"'. But this is not a function, it is a phenotype. What's up with this?
This was my (David's) mistake. I wanted to give an example that involved the double-quotes, but I mistakenly copied out the wrong string. I should have used as the example '"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS "'. So don't let this confuse you. But don't worry if you omitted the double-quotes or comma (see next answer).
The double-quotes in some of the function names look odd. Are we supposed to include them? And what about the blanks at the end of some of the strings, such as in '"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS "'? And what about commas? In the Genes_relation.data file, there was a comma in this function after CELL GROWTH. What should we do?
Use the function names as they appear in the .data file, e.g., use '"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS "'. But actually we have written our scoring code so it can handle the case where you omit the double quotes, and so that it only looks at enough of the string to distinguish it from the other functions, so that no one should be penalized for differences in punctuation or decisions about trailing blanks, etc.
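The actual scoring code is not shown here. Purely as an illustration of the kind of tolerant matching described, the sketch below ignores double quotes, commas, case, and extra blanks, and accepts any prefix that identifies exactly one function. All names in it are hypothetical.

```python
import re


def normalize_name(name):
    """Drop double quotes and commas, collapse blanks, and upper-case."""
    without_punct = re.sub(r'[",]', " ", name)
    return re.sub(r"\s+", " ", without_punct).strip().upper()


def match_function(submitted, known_functions):
    """Return the single known function the submitted string identifies, else None."""
    target = normalize_name(submitted)
    if not target:
        return None
    normalized = [(f, normalize_name(f)) for f in known_functions]
    exact = [f for f, n in normalized if n == target]
    if len(exact) == 1:
        return exact[0]
    by_prefix = [f for f, n in normalized if n.startswith(target)]
    return by_prefix[0] if len(by_prefix) == 1 else None


known = [
    '"CELL GROWTH CELL DIVISION AND DNA SYNTHESIS "',
    '"TRANSPOSABLE ELEMENTS VIRAL AND PLASMID PROTEINS "',
]
print(match_function("CELL GROWTH, CELL DIVISION AND DNA SYNTHESIS", known))
```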
Regarding Dataset 2, you suggested using features which depend on functions/localizations of the genes that gene G interacts with. While I certainly can do this for the training set, it will be impossible to do this for test set, since I will not know the function/localization... Is there something I do not understand?
For any gene G in the test set, you will know function and localization for the training set genes with which G interacts. So if a test set gene G interacts with a training set gene G1 that has function F, then you might infer the test set gene G has function F. You can also "pull yourself up by your bootstraps" as follows. If G interacts with another test set gene G2, and you have a high-confidence prediction that G2 has function F, you might infer that G has function F.
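A rough sketch of this bootstrapping idea follows. It is a deliberate simplification, not a recommended method: each gene is given a single label, "high confidence" is approximated by a minimum number of agreeing neighbours (`min_votes`), and all names are hypothetical.

```python
from collections import Counter


def propagate_labels(interactions, known, test_genes, min_votes=1, max_rounds=10):
    """Iteratively label test genes from the labels of their interaction partners.

    interactions: iterable of (gene, gene) pairs, treated as symmetric.
    known: dict mapping training genes to a label (e.g. a function).
    Returns a dict with the test genes that received a label.
    """
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    labels = dict(known)
    for _ in range(max_rounds):
        newly_labeled = {}
        for gene in test_genes:
            if gene in labels:
                continue
            votes = Counter(
                labels[n] for n in neighbors.get(gene, ()) if n != gene and n in labels
            )
            if votes:
                label, count = votes.most_common(1)[0]
                if count >= min_votes:
                    newly_labeled[gene] = label
        if not newly_labeled:        # nothing changed, so stop early
            break
        labels.update(newly_labeled)  # predictions feed the next round
    return {g: labels[g] for g in test_genes if g in labels}


interactions = [("T1", "G1"), ("T1", "T2"), ("T2", "T3")]
print(propagate_labels(interactions, {"G1": "F"}, ["T1", "T2", "T3"]))
```

In the example, T1 is labeled from the training gene G1 in the first round, and T2 and T3 are then labeled in later rounds from predictions made for other test genes.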
Will you weight the accuracy score for Task 2 and Task 3 in the same way you do for Task 1?
No. Let's go through these in detail, starting with Task 3 because it is easier. Every gene (actually, protein) in the test set has exactly one localization. For each gene, your prediction is either correct or incorrect. Accuracy is simply the fraction of localizations that are correctly predicted. (A non-prediction for a gene counts as an incorrect prediction.) The highest accuracy wins. Now let's go to Task 2. Because most proteins have multiple functions, we consider how many of the possible (protein,function) pairs are correctly predicted. If you include a (protein,function) pair that is known to be correct, i.e., that appears in our key, this is a true positive. If you include a pair that is not in our key, this is a false positive. If you fail to include a pair that appears in our key, this is a false negative. And you get credit for each pair that you do not predict that also does not appear in our key -- this is a true negative. The accuracy of your predictions is just the standard (true positive + true negative)/(true positive + true negative + false positive + false negative). It is worth noting that we very seriously considered using a second scoring function, a weighted accuracy, for this task. But we decided not to use a second function because we saw no compelling reason to assume that errors of omission are any more or less costly than errors of commission for this task.
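The two scoring rules just described can be sketched as follows. This is an illustration under assumed data structures (dicts and sets of tuples), not the organizers' scoring program, and the function names are hypothetical.

```python
def localization_accuracy(key, predictions):
    """Task 3: fraction of test genes whose single localization is predicted correctly.

    key and predictions map gene -> localization; a gene missing from
    predictions counts as an incorrect prediction.
    """
    correct = sum(1 for gene, loc in key.items() if predictions.get(gene) == loc)
    return correct / len(key)


def function_pair_accuracy(key_pairs, predicted_pairs, genes, functions):
    """Task 2: accuracy over all possible (protein, function) pairs."""
    all_pairs = {(g, f) for g in genes for f in functions}
    key_set = set(key_pairs) & all_pairs
    pred_set = set(predicted_pairs) & all_pairs
    tp = len(key_set & pred_set)           # predicted and in the key
    fp = len(pred_set - key_set)           # predicted but not in the key
    fn = len(key_set - pred_set)           # in the key but not predicted
    tn = len(all_pairs) - tp - fp - fn     # neither predicted nor in the key
    return (tp + tn) / (tp + tn + fp + fn)


genes = ["G1", "G2"]
functions = ["F1", "F2", "F3"]
key_pairs = {("G1", "F1"), ("G2", "F2"), ("G2", "F3")}
predicted = {("G1", "F1"), ("G1", "F2"), ("G2", "F2")}
print(function_pair_accuracy(key_pairs, predicted, genes, functions))
```

In the small example there are 6 possible pairs: 2 true positives, 1 false positive, 1 false negative, and 2 true negatives, giving an accuracy of 4/6.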
Regarding Task 3, if my model fails to predict localization for some gene, how should I specify this?
Just don't include any entry for that gene in your results file. But because of the scoring function, you might as well just guess a localization for that gene (e.g., the localization that appears most often in the training set).
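For instance, a fallback along these lines would fill in the most frequent training localization for any gene the model skipped. The function and argument names are assumptions for the sketch.

```python
from collections import Counter


def fill_missing_localizations(predictions, test_genes, training_localizations):
    """Give every test gene a prediction, defaulting to the most common training value."""
    default = Counter(training_localizations).most_common(1)[0][0]
    return {gene: predictions.get(gene, default) for gene in test_genes}


training = ["nucleus", "nucleus", "cytoplasm"]
print(fill_missing_localizations({"T1": "cytoplasm"}, ["T1", "T2"], training))
```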
Regarding the test set for Dataset 2, if we use the composite variables with the number of interactions... On the training set, I assume that these variables only take a count of the number of interactions with training genes. What about on the test set? It seems they should count only the interactions with the training set genes, in order for the numbers to be comparable. If this is not the case, could you please create a version of the test set in which this is the case.
These variables (even when appearing in the test set) do indeed count only interactions with training set genes, in order to maintain consistency.
Regarding Task 1, in the first question period, you made reference to the fact that there would be more inactives than actives in the test set. I just want to make sure you're sticking by that statement, and I can count on that fact.
We're sticking by this statement -- we're using exactly the test set that we had in mind all along. There are more inactives than actives. But because the test set molecules were synthesized after the chemists looked at the activity levels of the training set molecules, you can expect there's a higher fraction of actives in the test set than in the training set. As we mentioned before, this makes matters tougher than under the fairly standard assumption that the test set is drawn according to the same distribution as the training set (or that it's a held-out set drawn randomly, uniformly, without replacement). But we're using it because, as stated before, it models the real world setting where this type of task arises. Can the data mining systems do better than the chemists alone (can they make a contribution that will be useful to the chemists)? If your predictions are strong on this test set, it indicates that your model would have been useful to the chemists in choosing the next round of compounds to make.