This year's competition focuses on various performance measures for supervised classification. Real world applications of data-mining typically require optimizing for non-standard performance measures that depend on the application. For example, in direct marketing, error rate is a bad indicator of performance, since there is a strong imbalance in cost between missing a customer and making the advertisement effort slightly too broad. Even for the same dataset, we often want to have different classification rules that optimize different criteria.
In this year's competition, we ask you to design classification rules that optimize a particular performance measure (e.g. accuracy, ROC area, lift). For each of the two datasets described below, we provide a supervised training set and a test set where we held back the correct labels. For each of the two test sets, you will submit multiple sets of predictions. Each set is supposed to maximize the performance according to a particular measure. We will provide software for computing the performance measures in the software area. The same software will be used to determine the winners of the competition. The particular performance measures are described in the documentation of the software.
Particle Physics Task
The goal in this task is to learn a classification rule that differentiates between two types of particles generated in high energy collider experiments. It is a binary classification problem with 78 attributes. The training set has 50,000 examples, and the test set is of size 100,000. Your task is to provide 4 sets of predictions for the test set that optimize
- accuracy (maximize)
- ROC area (maximize)
- cross entropy (minimize)
- q-score: a domain-specific performance measure sometimes used in particle physics to measure the effectiveness of prediction models. (maximize: larger q-scores indicate better performance.)
We thank the author of the dataset, who wishes to remain anonymous until after the KDD-Cup.
Software that calculates q-score (and the other seven performance measures) is available from the Software download web page. We'll add a description of q-score to the FAQ web page in the next few days. Until then, just think of it as a black-box performance metric for which you are supposed to train models that maximize this score.
Protein Homology Prediction Task
Additional information about the Protein Homology problem is now available from Ron Elber.
For this task, the goal is to predict which proteins are homologous to a native sequence. The data is grouped into blocks around each native sequence. We provide 153 native sequences as the training set, and 150 native sequences as the test set. For each native sequence, there is a set of approximately 1000 protein sequences for which homology prediction is needed. Homologous sequences are marked as positive examples; non-homologous sequences (also called "decoys") are negative examples.
If you are not familiar with protein matching and structure prediction, it might be helpful to think of this as a WWW search engine problem. There are 153 queries in the train set, and 150 queries in the test set. For each query there are about 1000 returned documents; only a few of them are correct matches for the query. The goal is to predict which of the 1000 documents best match the query, based on 74 attributes that describe the quality of the match. A good set of match predictions will rank the "homologous" documents near the top.
Evaluation measures are applied to each block corresponding to a native sequence, and then averaged over all blocks. Most of the measures are rank based and assume that your predictions provide a ranking within each block. Your task is to provide 4 sets of predictions for the test set that optimize
- fraction of blocks with a homologous sequence ranked top 1 (maximize)
- average rank of the lowest ranked homologous sequence (minimize)
- root mean squared error (minimize)
- average precision (average of the average precision in each block) (maximize)
Note that three of the measures (TOP 1, Average Last Rank, and average precision) depend only on the relative ordering of the matches within each block, not on the predicted values themselves.
We thank the author of the dataset, who wishes to remain anonymous until after the KDD-Cup.
Metrics for the Particle Physics Problem
*** Accuracy (maximize):
We use the usual definition of accuracy -- the number of cases predicted correctly, divided by the total number of cases. The predictions submitted for accuracy can be boolean or continuous. When submitting predictions, you will be asked for a prediction threshold: predictions above or equal to this threshold will be treated as class 1, predictions below threshold are treated as class 0. An accuracy of 1.00 is perfect prediction. Accuracy near 0.00 is poor.
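As a rough illustration (not the actual perf implementation -- the function name and interface here are our own), thresholded accuracy could be computed like this:

```python
def accuracy(targets, preds, threshold=0.5):
    """Fraction of cases predicted correctly.

    Continuous predictions >= threshold count as class 1,
    predictions below threshold count as class 0
    (matching the submission rules above).
    """
    correct = sum(
        (1 if p >= threshold else 0) == t
        for t, p in zip(targets, preds)
    )
    return correct / len(targets)
```

The official scores will of course come from the provided perf software, not from sketches like this one.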
*** Area Under the ROC Curve (maximize):
We use the usual definition for ROC Area (often abbreviated AUC). An ROC plot is a plot of true positive rate vs. false positive rate as the prediction threshold sweeps through all the possible values. If you are familiar with sensitivity and specificity, this is the same as plotting sensitivity vs. 1-specificity as you sweep the threshold. AUC is the area under this curve. AUC of 1 is perfect prediction -- all positive cases sorted above all negative cases. AUC of 0.5 is random prediction -- there is no relationship between the predicted values and truth. AUC below 0.5 indicates there is a relationship between predicted values and truth, but the model is backwards, i.e., predicts smaller values for positive cases! Another way to think of ROC Area is to imagine sorting the data by predicted values. Suppose this sort is not perfect, i.e., some positive cases sort below some negative cases. AUC effectively measures how many times you would have to swap cases with their neighbors to repair the sort.
AUC = 1.0 - (# of swaps to repair sort)/((# of positives) x (# of negatives))
There's a little more info on AUC in the comments at the top of the perf source code, but you probably don't need to look at it.
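The swap formula above can be turned into a small sketch (again, illustrative only -- perf's actual code is different and much faster). Here a "swap" is a positive/negative pair sorted in the wrong order, with a positive/negative tie counted as half a swap:

```python
def auc(targets, preds):
    """ROC area via the pairwise-swap formula:
    AUC = 1 - (# misordered pos/neg pairs) / (P * N).
    A tie between a positive and a negative counts as half a swap.
    """
    pos = [p for t, p in zip(targets, preds) if t == 1]
    neg = [p for t, p in zip(targets, preds) if t == 0]
    swaps = 0.0
    for pp in pos:
        for pn in neg:
            if pp < pn:          # negative sorted above positive
                swaps += 1.0
            elif pp == pn:       # tie: count as half a swap
                swaps += 0.5
    return 1.0 - swaps / (len(pos) * len(neg))
```

This O(P*N) version is fine for small checks but too slow for 100,000 test cases; a sort-based implementation runs in O(n log n).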
*** Cross-Entropy (minimize):
We use the usual definition for cross-entropy. Cross-entropy, like squared error, measures how close predicted values are to target values. Unlike squared error, cross-entropy assumes the predicted values are probabilities on the interval [0,1] that indicate the probability that the case is class 1.
Cross-Entropy = - SUM [(Targ)*log(Pred) + (1-Targ)*log(1-Pred)]
where Targ is the Target Class (0 or 1) and Pred is the predicted probability that the case is class 1. Note that the terms (Targ) and (1-Targ) are alternately 0 or 1, so log(Pred) enters the sum for positive cases and log(1-Pred) for negative cases. Note that cross entropy is infinite if Targ is 0 and Pred is 1, or if Targ is 1 and Pred is 0. In the code we protect against this by just returning a very, very large number (that may be platform dependent). If you get cross-entropies that are very large, you probably have at least one prediction like this.
To make cross-entropy independent of data set size, we use the mean cross-entropy, i.e., the sum of the cross-entropy for each case divided by the total number of cases.
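A minimal sketch of the mean cross-entropy (our own illustration, not perf's code -- in particular we clip predictions away from 0 and 1 instead of returning a very large number):

```python
import math

def mean_cross_entropy(targets, preds, eps=1e-12):
    """Mean cross-entropy (lower is better).

    Predictions are clipped to [eps, 1-eps] to avoid the infinite
    penalty when Pred is exactly 0 or 1 on the wrong class.
    """
    total = 0.0
    for t, p in zip(targets, preds):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(targets)
```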
*** SLAC Q-Score (maximize):
SLQ is a domain-specific performance metric devised by researchers at the Stanford Linear Accelerator Center (SLAC) to measure the performance of predictions made for certain kinds of particle physics problems. SLQ works with models that make continuous predictions on the interval [0-1]. It breaks this interval into a series of bins. For the KDD-CUP we are using 100 equally sized bins. The 100 bins are: 0.00-0.01, 0.01-0.02, ..., 0.98-0.99, 0.99-1.00.
PERF's SLQ option allows you to specify the number of bins. You should use 100 bins for the KDD-CUP competition. For example under unix:
perf -slq 100 < testdata
SLQ places predictions into the bins based on the predicted values. If your model predicts a value of 0.025 for a case, that case will be placed in the third bin 0.02-0.03. In each bin SLQ keeps track of the number of true positives and true negatives. SLQ is maximized if the bins have high purity, e.g., if every bin contains only 0's or only 1's. This is unlikely, so SLQ computes a score based on how pure the bins are.
The score is computed as follows: suppose a bin has 350 positives and 150 negatives. The error of this bin if we predict positives (the predominant class) for all cases is 150/(350+150) = 0.30 = err. The contribution of this bin to the total SLQ score is (1-2*err)^2, times the total number of points in this bin divided by the total number of points in the data set. Dividing by the size of the total data set normalizes the score so that it is more independent of the size of the data set. Multiplying by the number of points in the bin gives more weight to bins that have more points, and less weight to bins with fewer points. The term (1-2*err)^2 is maximized (= 1) when all points in that bin are the same class, i.e., when the error due to predicting the predominant class is minimized. This is somewhat similar to measures of node purity in decision trees. The sum over the 100 bins is maximized when the contribution of each bin (weighted by the number of points in each bin) is maximized.
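The calculation above can be sketched as follows (an illustrative reimplementation under our reading of the description, not SLAC's or perf's code):

```python
def slq(targets, preds, n_bins=100):
    """SLAC Q-score sketch: bin predictions into n_bins equal-width
    bins on [0,1]; each non-empty bin contributes (1 - 2*err)^2,
    weighted by its share of the data, where err is the error of
    predicting the bin's majority class for every case in the bin.
    """
    bins = [[0, 0] for _ in range(n_bins)]  # [negatives, positives]
    for t, p in zip(targets, preds):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in last bin
        bins[i][t] += 1
    n = len(targets)
    score = 0.0
    for neg, pos in bins:
        size = neg + pos
        if size == 0:
            continue
        err = min(neg, pos) / size          # error of majority-class rule
        score += (1 - 2 * err) ** 2 * size / n
    return score
```

On the worked example above (a single bin holding 350 positives and 150 negatives), this returns (1 - 2*0.30)^2 = 0.16.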
Collaborators at SLAC say that this is an important quantity for their work. They want to find models that maximize it. Other than that, we do not know exactly why the SLQ metric is defined this way. SLQ is a domain-specific performance metric custom designed for this particular class of problems.
Note that SLQ only cares about the purity of each bin. If you switched the positive and negative labels after training, the model would have poor accuracy and an AUC below 0.5, but SLQ would be unaffected.
Metrics for the Biology/Protein Problem
Unlike the particle physics problem, the biology problem comes in blocks. Performance is measured on each block independently, and then the average performance across the blocks is used as the final performance metric.
*** TOP1 (maximize):
The fraction of blocks with a true homolog (class 1) predicted as the best match. In each block the predictions are sorted by predicted value. If the case that sorts to the top (highest predicted value) is class 1, you score a 1 for that block. If the top case is class 0, you score 0. (If there are ties for the top case, you score 0 unless all of the tied cases are class 1. This means ties do not help.) The best TOP1 that can be achieved is to predict the highest value for a true homolog in each block. This would yield TOP1 = 1.0, perfect TOP1 prediction. TOP1 = 0.0 is poor.
*** Average Rank of Last Homolog (minimize):
Like TOP1, cases are sorted by predicted value. This metric measures how far down the sorted cases the last true homolog falls. A robust model would predict high values for all true homologs, so the last homolog would still sort high in the ordered cases. A rank of 1 means that the last homolog sorted to the top position. This is excellent, but can only happen if there is only 1 true homolog in the block. As with TOP1, ties do not help -- if cases are tied perf assumes the true homologs sort to the low end of the ties. It is not possible to achieve an average rank for the last homolog of 1 on this data because most blocks have more than one homolog. Average ranks near 1000 indicate poor performance.
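A sketch of the pessimistic tie handling described above (again illustrative, using (target, prediction) blocks): each homolog is ranked as if every case tied with it sorted above it.

```python
def avg_last_rank(blocks):
    """Average, over blocks, of the rank of the lowest-ranked homolog.

    Ranks start at 1 for the highest prediction. Pessimistic tie
    handling: a homolog is placed below every case it is tied with.
    """
    total = 0
    for block in blocks:
        worst = 0
        for t, p in block:
            if t != 1:
                continue
            rank = (sum(1 for _, q in block if q > p)
                    + sum(1 for _, q in block if q == p))
            worst = max(worst, rank)
        total += worst
    return total / len(blocks)
```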
*** Root-Mean-Squared-Error (minimize):
We use the usual definition for squared error, divide by the number of cases to make it independent of data set size, and take the square root to convert it back to the natural scale of the targets and predictions.
RMSE = SQRT( 1/N * SUM (Targ-Pred)^2 )
For the KDD-CUP, the targets should be 0 and 1 and the predictions should be on the interval [0,1].
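The formula above translates directly into code (illustrative sketch):

```python
import math

def rmse(targets, preds):
    """Root mean squared error: SQRT( 1/N * SUM (Targ - Pred)^2 )."""
    n = len(targets)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, preds)) / n)
```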
*** Mean Average Precision (maximize):
This is the metric that we got wrong in the first release of PERF. Hopefully you are using the newer release of PERF. One problem with average precision is that there are a variety of definitions in the literature. To further the confusion, we are using a relatively non-standard definition sometimes called "expected precision". If you are familiar with precision/recall plots, precision is the percentage of points predicted to be positive (above some threshold) that actually are positive. High precision is good. We sweep the threshold so that each case becomes predicted positive one at a time, calculate the precision at each of these thresholds, and then take the average of all of these precisions. This is kind of like the area under a precision-recall plot, but not exactly. One problem with average precision is how to handle ties between positive and negative cases. For the KDD-CUP competition we developed a new method for handling ties that calculates the average precision over all possible orderings of the tied cases. Fortunately, we are able to do this without enumerating all possible orderings! Unfortunately, it is still expensive when there are many ties (order n^2 in the number of ties), so the code backs off to a simpler pessimistic calculation when there are thousands of ties. (You shouldn't run into this in the KDD-CUP since each block has only about 1000 cases.)
If you are not familiar with precision, there is a description of it in the comments at the top of the source code for PERF. But you probably don't need to look at it.
Average precision is measured on each block, and then the mean of each block's average precision is used as the final metric. A mean average precision of 1 is perfect prediction. The lowest mean average precision possible is 0.0.
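Ignoring the tie machinery described above, the "expected precision" definition can be sketched as follows (our own simplified illustration: precision is computed at every cutoff, each case becoming predicted positive one at a time from highest prediction down, and those precisions are averaged -- this does NOT reproduce PERF's tie handling):

```python
def mean_avg_precision(blocks):
    """Mean over blocks of the 'expected precision' sketched above.

    Each block is a list of (target, prediction) pairs. Simplified:
    ties are broken arbitrarily by the sort, with no special handling.
    """
    total = 0.0
    for block in blocks:
        ranked = sorted(block, key=lambda tp: tp[1], reverse=True)
        true_pos = 0
        precisions = []
        for k, (t, _) in enumerate(ranked, start=1):
            true_pos += t
            precisions.append(true_pos / k)  # precision at cutoff k
        total += sum(precisions) / len(precisions)
    return total / len(blocks)
```

Use PERF for any numbers you report; this sketch will disagree with it whenever positive and negative cases are tied.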