The data set for this year's Cup has been generously provided by the Paralyzed Veterans of America (PVA). PVA is a not-for-profit organization that provides programs and services for US veterans with spinal cord injuries or disease. With an in-house database of over 13 million donors, PVA is also one of the largest direct mail fund raisers in the country.
Participants in the CUP will demonstrate the performance of their tool by analyzing the results of one of PVA's recent fund raising appeals. This mailing was dropped in June 1997 to a total of 3.5 million PVA donors. It included a gift "premium" of personalized name & address labels plus an assortment of 10 note cards and envelopes. All of the donors who received this mailing were acquired by PVA through premium-oriented appeals like this. The analysis data set will include:
- A subset of the 3.5 million donors sent this appeal
- A flag to indicate respondents to the appeal and the dollar amount of their donation
- PVA promotion and giving history
- Overlay demographics, including a mix of household and area level data.
Unlike last year, all available information about the fields will be provided in the project documentation.
The objective of the analysis will be to identify which donors responded to this mailing, which is a classification (or discrimination) problem.
The CUP is aimed at recognizing the most accurate, innovative, efficient and methodologically advanced data mining tools in the marketplace.
The participants will again be evaluated based on the performance of their algorithm on the validation or hold-out data set. The KDD-CUP program committee will consider the following metrics in their evaluations:
- Lift curve or gains table analysis listing the cumulative percent of targets recovered in the top quantiles of the file
- Receiver operating characteristic (ROC) curve analysis and the area under the ROC curve
- Several statistical tests to ensure the robustness of the results.
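As an illustration of the first two metrics above, here is a minimal Python sketch, not the committee's actual scoring code. The function names `gains_table` and `roc_auc`, the toy scores and the toy response flags are all assumptions made for this example. The gains table ranks the file by model score and reports the cumulative share of respondents recovered in each top quantile. The area under the ROC curve is computed by direct pairwise comparison, which is only practical for small files.

```python
from typing import List, Sequence, Tuple


def gains_table(scores: Sequence[float], labels: Sequence[int],
                n_bins: int = 10) -> List[Tuple[float, float]]:
    """Return (fraction of file, cumulative fraction of targets recovered)."""
    # Rank records from highest to lowest model score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    total = sum(ranked)
    n = len(ranked)
    table = []
    for b in range(1, n_bins + 1):
        cutoff = (b * n) // n_bins          # number of records in the top b quantiles
        captured = sum(ranked[:cutoff])     # respondents found among them
        table.append((b / n_bins, captured / total if total else 0.0))
    return table


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve as the probability that a random respondent
    outscores a random non-respondent (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one respondent and one non-respondent")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example data: ten scored records, three of which responded.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

for fraction, recovered in gains_table(toy_scores, toy_labels, n_bins=5):
    print(f"top {fraction:.0%} of file: {recovered:.1%} of respondents")
print(f"AUC: {roc_auc(toy_scores, toy_labels):.4f}")
```

A model that ranks no better than chance would recover roughly the same share of respondents as the share of the file examined, and would have an area under the ROC curve near 0.5.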
Last year, the performance in the top 10 percent of the file was considered as a measure of precision while the performance in the top 40 percent of the file was considered as a measure of stability and marketing coverage. The average performance up to the 40th percentile was also looked at as a measure of overall performance.
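The three summary figures described above can be sketched in Python as follows. This is a hedged illustration only: the helper names `capture_at` and `summarize`, the toy data, and in particular the reading of "average performance up to the 40th percentile" as the mean of the cumulative capture rates at the 10th, 20th, 30th and 40th percentiles are assumptions, since the text does not give the exact formula.

```python
from typing import Dict, Sequence


def capture_at(scores: Sequence[float], labels: Sequence[int],
               percent: int) -> float:
    """Share of all respondents found in the top `percent` of the ranked file."""
    # Rank records from highest to lowest model score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    cutoff = (len(order) * percent) // 100   # integer arithmetic avoids float rounding
    total = sum(labels)
    hits = sum(labels[i] for i in order[:cutoff])
    return hits / total if total else 0.0


def summarize(scores: Sequence[float], labels: Sequence[int]) -> Dict[str, float]:
    """Top-10%, top-40% and average-to-40% figures (averaging rule is assumed)."""
    deciles = [capture_at(scores, labels, p) for p in (10, 20, 30, 40)]
    return {
        "top_10_percent": deciles[0],                  # read as precision
        "top_40_percent": deciles[3],                  # read as stability / coverage
        "avg_to_40_percent": sum(deciles) / len(deciles),
    }


# Made-up example data: ten scored records, three of which responded.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

for name, value in summarize(toy_scores, toy_labels).items():
    print(f"{name}: {value:.3f}")
```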