## Tasks

#### KDD Cup 2003: Network mining and usage log analysis

**I. Citation Prediction**

**Goal**

The goal of this task is to predict changes in the number of citations to individual papers over time.

**Input**

Contestants will be given:

- The LaTeX source of all papers in the hep-th portion of the arXiv through March 1, 2003. For each paper, this includes the main .tex file but not separate include files or figures. It also includes the hep-th arxiv number as a unique ID.
- The abstracts for all of the hep-th papers in the arXiv. For each paper, the abstract file contains:
  - arXiv submission date
  - revised date(s)
  - title
  - authors
  - abstract
- The SLAC/SPIRES dates for all hep-th papers. Some older papers were uploaded years after their initial publication, so the arXiv submission date from the abstracts may not correspond to the publication date. An alternative date from SLAC/SPIRES has been provided that may be a better estimate of the initial publication date of these older papers.
- The complete citation graph for the hep-th papers, obtained from SLAC/SPIRES. Each node is labeled with its unique ID from (1). Note that revised papers may have updated citations, so a citation may point to a "future" paper, i.e. a paper may cite another paper that was published after it.
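The citation graph can be held as a simple adjacency map. A minimal sketch, assuming the graph file is a whitespace-separated edge list of `citing cited` id pairs (the actual file layout may differ):

```python
from collections import defaultdict

def load_citation_graph(lines):
    # Map each citing paper's id to the set of papers it cites.
    cites = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        src, dst = line.split()
        cites[src].add(dst)
    return cites

# A paper may legitimately cite a "future" paper: revised versions
# can add citations to papers submitted later.
edges = ["9901001 9812002", "9901001 9905003", "9812002 9711004"]
graph = load_citation_graph(edges)
```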

**Output**

For each paper P in the collection, contestants should report the predicted difference between

- the number of citations P will receive from hep-th papers submitted during the period May 1, 2003 - July 31, 2003, and
- the number of citations P will receive from hep-th papers submitted during the period February 1, 2003 - April 30, 2003. (So if there were more citations during the period May 1, 2003 - July 31, 2003, then the prediction should be a positive number.)

The format for the submission is a simple two-column file, [arxiv id] [difference], sorted by arxiv id.

Update May 6, 2003: This difference does not need to be an integer; floating point numbers are valid predictions.
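A minimal writer for this format might look as follows; the single-space separator is an assumption, since the task only specifies two columns sorted by arxiv id:

```python
import io

def write_predictions(predictions, out):
    # predictions: arxiv id -> predicted citation difference (floating
    # point is valid per the May 6 update). Rows are sorted by arxiv id.
    for paper_id in sorted(predictions):
        out.write(f"{paper_id} {predictions[paper_id]}\n")

buf = io.StringIO()
write_predictions({"9905003": -1.5, "9812002": 3.0}, buf)
# buf now holds the two rows in sorted order.
```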

**Evaluation**

The target result is a vector V with one coordinate for each paper in the initial collection (1) that receives at least 6 citations during the period February 1, 2003 - April 30, 2003. The P-th coordinate of V will consist of the true difference in number of citations for paper P.

Based on a contestant's predictions, a vector W will be constructed over the same set of papers; the P-th coordinate of W will consist of the predicted difference in number of citations for paper P.

The score of a prediction vector W will be equal to the L_1 difference between the vectors V and W.
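The L_1 score can be computed directly. In this sketch, a paper missing from the submission is scored as a prediction of 0, which is an assumption about the official scoring rather than something the rules state:

```python
def l1_score(true_diffs, predicted_diffs):
    # Sum of absolute coordinate-wise errors over the scored papers
    # (those with at least 6 citations in Feb-Apr 2003). Lower is better.
    return sum(abs(v - predicted_diffs.get(pid, 0.0))
               for pid, v in true_diffs.items())

score = l1_score({"a": 2.0, "b": -1.0}, {"a": 1.5, "b": 0.0})  # -> 1.5
```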

**II. Data Cleaning**

**Goal**

It is often estimated that data cleaning is one of the most expensive and arduous tasks of the knowledge discovery process. Whether for industry or government databases, linking records and identifying identical objects in dirty data is a hard and costly problem.

The goal of this task is to clean a very large set of real-life data: We would like to re-create the citation graph of about 35,000 papers in the hep-ph portion of the arXiv.

**Input**

Contestants will be given the LaTeX sources of all papers in the hep-ph portion of the arXiv on April 1, 2003. For each paper, this includes the main .tex file, but not separate include files or figures. The references in each paper have been "sanitized" through a script by removing all unique identifiers such as arXiv codes or other code numbers. No attempts have been made to repair any damages from the sanitization process. Each paper has been assigned a unique number.

**Output**

For each paper P in the collection, a list of other papers {P1, ..., Pk} in the collection such that P cites P1, ..., Pk. Note that P might also cite papers that are not in the collection.

The format for submission is a plain ASCII file with two columns: the left column is the id of the citing paper and the right column is the id of the cited paper. The file should be sorted.

**Evaluation**

The target is a graph G=(V,E) with each paper P a node in the graph, and each citation a directed edge. Assuming that a contestant submits a graph G'=(V,E'), the score is the size of the symmetric difference between E and E': |E-E'| + |E'-E|.
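A sketch of the scorer, assuming edges are represented as (citing, cited) id pairs:

```python
def symmetric_difference_score(true_edges, submitted_edges):
    # Missed citations plus spurious citations; lower is better.
    E, E_sub = set(true_edges), set(submitted_edges)
    return len(E - E_sub) + len(E_sub - E)

score = symmetric_difference_score(
    {(1, 2), (1, 3), (2, 3)},   # true graph E
    {(1, 2), (2, 4)},           # submitted graph E'
)  # two missed edges plus one spurious edge -> 3
```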

**III. Download Estimation**

**Goal**

The goal of this task is to estimate the number of downloads that a paper receives in its first two months in the arXiv.

**Input**

Contestants will be given:

- all of the datasets available for Task I: Citation Prediction;
- for papers published in the following months, the downloads received from the main site in each of their first 60 days in the arXiv:
  - February and March of 2000
  - February and April of 2001
  - March and April of 2002

**Output**

For each paper P submitted during the periods:

- April 2000
- March 2001
- February 2002

contestants should report the estimated total number of downloads of P during its first 60 days in the arXiv. Note that this is a single number for each paper P, whereas the given download data provides a per-day log over the sixty days.
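Collapsing the daily logs into the required single totals is straightforward; representing the log as a mapping from paper id to a list of daily counts is an assumption about how the provided data is parsed:

```python
def total_downloads(daily_log):
    # daily_log: paper id -> list of daily download counts (up to 60 days).
    # Returns the single per-paper total the task asks for.
    return {pid: sum(days) for pid, days in daily_log.items()}

totals = total_downloads({"0004001": [3, 1, 0, 5]})  # -> {"0004001": 9}
```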

**Evaluation**

For each of the output periods (April 2000, March 2001, February 2002), the target result is a vector X with one coordinate for each of the top 50 papers with the greatest number of downloads in their first 60 days. For each of these papers P, the value of the P-th coordinate is the number of downloads of P during its first 60 days.

Based on a contestant's download estimations, a vector Y will be constructed, over the same set of 150 papers (50 from each period); the P-th coordinate of Y will consist of the estimated number of downloads of P during its first 60 days.

The score of an estimation vector Y will be equal to the L_1 difference between the vectors X and Y.
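One output period can then be scored as follows; restricting to the 50 most-downloaded papers follows the rules above, while counting a missing estimate as 0 is an assumption:

```python
def period_score(true_totals, estimates, k=50):
    # Keep only the k papers with the most true downloads, then take the
    # L_1 distance between true and estimated totals. Lower is better.
    top = sorted(true_totals, key=true_totals.get, reverse=True)[:k]
    return sum(abs(true_totals[p] - estimates.get(p, 0.0)) for p in top)

# Paper "c" falls outside the top 2, so it is not scored here.
score = period_score({"a": 100, "b": 90, "c": 10},
                     {"a": 95, "b": 90}, k=2)  # -> 5
```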

**IV. Open Task**

**Goal**

Contestants will be given the LaTeX sources of all papers in the hep-th portion of the arXiv on April 6, and the citation graph of the hep-th portion of the arXiv on that date.

For this "open task", the goal is to define as interesting a question as possible to ask on the data, and then to show the result of mining the data for the answer. The question addressed could be based on identifying an interesting structure, trend, or relationship in the data; posing further predictive tasks for the data; evaluating the performance of a novel algorithm on the data; or any of a number of other activities.

The results should be written up in the KDD submission format, using at most 10 pages. The write-up should cite and discuss relevant prior work. A committee of judges will select the winning entry, based on novelty, soundness of methods and evaluation, and relevance to the arXiv dataset.