This year's competition focuses on problems motivated by network mining and the analysis of usage logs. Complex networks have emerged as a central theme in data mining applications, appearing in domains that range from communication networks and the Web, to biological interaction networks, to social networks and homeland security. At the same time, the difficulty in obtaining complete and accurate representations of large networks has been an obstacle to research in this area.
This KDD Cup is based on a very large archive of research papers that provides an unusually comprehensive snapshot of a particular social network in action; in addition to the full text of research papers, it includes both explicit citation structure and (partial) data on the downloading of papers by users. It provides a framework for testing general network and usage mining techniques, which will be explored via four varied and interesting tasks. Each task is a separate competition with its own specific goals.
The first task involves predicting the future: contestants predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference. For the second task, contestants must build a citation graph of a large subset of the archive from only the LaTeX sources. In the third task, each paper's popularity will be estimated based on partial download logs. And the last task is open! Given the large amount of data, contestants can devise their own questions, and the most interesting result is the winner.
About the Data
The e-print arXiv, initiated in Aug 1991, has become the primary mode of research communication in multiple fields of physics and some related disciplines. It currently contains over 225,000 full-text articles and is growing at a rate of 40,000 new submissions per year. It provides nearly comprehensive coverage of large areas of physics, and serves as an on-line seminar system for those areas. It serves 10 million requests per month, including tens of thousands of search queries per day. Its collections are a unique resource for algorithmic experiments and model building. Usage data has been collected since 1991, including Web usage logs beginning in 1993. On average, the full text of each paper has been downloaded over 300 times since 1996, and some have been downloaded tens of thousands of times.
The Stanford Linear Accelerator Center SPIRES-HEP database has been comprehensively cataloguing the High Energy Particle Physics (HEP) literature online since 1974, and indexes more than 500,000 articles related to high-energy physics, including their full citation tree.