Conditions of participation: Anybody who complies with the rules of the challenge (KDDcup 2009) is welcome to participate. Only the organizers are excluded from participating. The KDDcup 2009 is part of the competition program of the Knowledge Discovery in Databases conference (KDD 2009), held in Paris from June 28 to July 1, 2009. Participants are not required to attend the KDDcup 2009 workshop, which will be held at the conference, and the workshop is open to anyone who registers. The proceedings of the competition will be published in the Journal of Machine Learning Research Workshop and Conference Proceedings (JMLR WC&P).
Anonymity: All entrants must identify themselves by registering on the KDDcup 2009 website. However, they may elect to remain anonymous by choosing a nickname and checking the box "Make my profile anonymous". If this box is checked, only the nickname will appear in the result tables instead of the real name. Participant emails will not appear anywhere on the website and will be used only by the organizers to communicate with the participants. To be eligible for prizes the participants will have to publicly reveal their identity and uncheck the box "Make my profile anonymous".
Data: The dataset is available for download from the Data page to registered participants. The data are split into several archives to facilitate downloading, and two versions are provided: "small" with 230 variables and "large" with 15,000 variables. Participants may enter results on either or both versions. Both versions correspond to the same data entries; the 230 variables of the small version are just a subset of the 15,000 variables of the large version. Both training and test data are distributed without the true target labels. For practice purposes, "toy" training labels are available together with the training data from the onset of the challenge in the fast track. Results on the toy targets (T) will not count for the final evaluation. The real training labels for the tasks "churn" (C), "appetency" (A), and "up-selling" (U) will be made available for download separately halfway through the challenge.
Challenge duration and tracks: The challenge starts March 10, 2009 and ends May 11, 2009. There are two challenge tracks:
- FAST (large) challenge: Results submitted on the LARGE dataset within five days of the release of the real training labels will count towards the fast challenge.
- SLOW challenge: Results on the small dataset and results on the large dataset not qualifying for the fast challenge, submitted before the KDDcup 2009 deadline May 11, 2009, will count toward the SLOW challenge.
If more than one submission is made in either track and with either dataset, the last submission before the track deadline will be taken into account to determine the ranking of participants and attribute the prizes. You may compete in both tracks. There are prizes in both tracks.
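The track rules above can be sketched in a few lines of Python. This is only an illustration, not the organizers' scoring code: the function names are hypothetical, the label release date is a parameter because the rules only say "half-way through the challenge", and the sketch assumes that the May 11 deadline runs to the end of that day and that a large-dataset submission made before the labels are released also falls in the fast track.

```python
from datetime import datetime, timedelta

# Assumption: the challenge deadline runs to the end of May 11, 2009.
CHALLENGE_END = datetime(2009, 5, 11, 23, 59, 59)
FAST_WINDOW = timedelta(days=5)


def assign_track(dataset, submitted_at, labels_released_at):
    """Return 'fast', 'slow', or None for a submission past the deadline."""
    if submitted_at > CHALLENGE_END:
        return None
    fast_deadline = labels_released_at + FAST_WINDOW
    # Assumption: anything on the large dataset up to the fast deadline is "fast".
    if dataset == "large" and submitted_at <= fast_deadline:
        return "fast"
    # Small-dataset results and late large-dataset results go to the slow track.
    return "slow"


def last_entry_per_track(submissions, labels_released_at):
    """Keep only the latest (dataset, submitted_at) pair in each track."""
    latest = {}
    for dataset, submitted_at in submissions:
        track = assign_track(dataset, submitted_at, labels_released_at)
        if track is None:
            continue
        if track not in latest or submitted_at > latest[track][1]:
            latest[track] = (dataset, submitted_at)
    return latest
```

For example, with a hypothetical release date of April 10, a large-dataset submission on April 12 would land in the fast track, while the same submission on April 20 would count toward the slow track.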
On-line feedback: During the challenge, the training set performances will be available on the Results page, along with partial information on test set performances: the test set performances on the toy task (T) and the performances on a fixed 10% subset of the test examples for the real tasks (C, A, U). After the challenge is over, the performances on the whole test set will be calculated and substituted in the result tables.
Submission method: The method of submission is via the form on the Submission page. To be ranked, submissions must comply with the Instructions. A submission should include results on both the training and test sets for at least one of the tasks (T, C, A, U), but it may include results on several tasks. A submission will be considered "complete" and eligible for prizes if it contains 6 files corresponding to training and test data predictions for the tasks C, A, and U, either for the small or for the large dataset (or for both). Results on the practice task T will not count as part of the competition. If you encounter problems with the submission process, please contact the Challenge Webmaster. Multiple submissions are allowed, but please limit yourself to a maximum of 5 submissions per day. For your final entry in the slow track, you may submit results on either or both of the small and large datasets in the same archive (hence you get 2 chances of winning).
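The "complete" criterion above can be checked mechanically. The following is a minimal sketch under stated assumptions: the actual file naming is defined in the Instructions and is not reproduced here, so each result file is represented by a hypothetical (dataset, task, split) tuple, and the function name `is_complete` is made up for illustration.

```python
REAL_TASKS = ("C", "A", "U")      # churn, appetency, up-selling
SPLITS = ("train", "test")
DATASETS = ("small", "large")


def is_complete(files):
    """Return True if the files cover train and test predictions for
    C, A, and U on at least one dataset (6 files for that dataset).

    `files` is an iterable of hypothetical (dataset, task, split) tuples.
    """
    present = set(files)
    return any(
        all((dataset, task, split) in present
            for task in REAL_TASKS
            for split in SPLITS)
        for dataset in DATASETS
    )
```

Under this reading, results on the toy task T are simply ignored, and a submission that mixes datasets (for instance C and A on the small dataset but U on the large one) would not be complete.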
Evaluation and ranking: For each entrant, only the last valid entry will count towards determining the winner in each track (fast and slow). We limit each participating person to a single final entry in each track (see the FAQs page for the conditions under which you can work in teams). Valid entries must include results on all three real tasks. The method of scoring is posted on the Tasks page. Prizes will be attributed only to entries performing better than the baseline method (Naive Bayes). The results of the baseline method are provided on the Results page. These are not the best results obtained by the organization team at Orange; they are easy to outperform but difficult to attain by chance.
Reproducibility: Participation is not conditioned on delivering code or publishing methods. However, we will ask the top-ranking participants to voluntarily fill out a fact sheet about their methods, contribute papers to the proceedings, and help reproduce their results.