KDD Cup 2005

News

August 30, 2005: KDD-Cup presentation slides from the KDD conference

August 30, 2005: Winning Teams

August 30, 2005: Labeled Query Data

August 10, 2005: Solution Evaluation Result

Introduction

The KDD-Cup 2005 Knowledge Discovery and Data Mining competition will be held in conjunction with the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. The task is selected to be interesting to participants from both academia and industry. In particular, we encourage the participation of students. This year's competition is about classifying internet user search queries. We are looking forward to an interesting competition and encourage your participation.

Contest Rules

Agreement

By sending the registration email, you indicate your full and unconditional agreement and acceptance of these contest rules.

Eligibility

The contest is open to any party planning to attend KDD 2005. A person can participate in only one group. Multiple submissions per group are allowed, since we will not provide feedback at the time of submission. Only the last submission before the deadline will be evaluated and all other submissions will be discarded.

Integrity

The contestant takes responsibility for obtaining any permissions needed to use algorithms/tools/data that are the intellectual property of a third party.

Winner Selection

There will be three prizes: the "Query Categorization Precision Award", the "Query Categorization Performance Award", and the "Query Categorization Creativity Award". One winner will be selected for each award.

The winners will be determined as follows. All participants are ranked by their overall performance and average precision on the test set; participants are also ranked by the creativity of their methodologies.

The winner of the "Query Categorization Performance Award" is the participant with the best average rank in terms of F1, defined below.

The winner of the "Query Categorization Precision Award" is the participant with the best average precision among the participants with the top 10 F1 scores.

The winner of the "Query Categorization Creativity Award" is the participant whose model has a top-20 average rank in terms of F1, defined below, and whose ideas are judged most creative by the KDD Cup co-chairs and a group of search experts. The scalability and degree of automation of the model will also be considered in the judgment.

An honorable mention will be awarded for each prize.

Tasks

This year's competition focuses on query categorization.

Your task is to categorize 800,000 queries into 67 predefined categories. The meaning and intent of a search query is subjective: the query "Saturn" might mean the Saturn car to some people and the planet Saturn to others. We will use multiple human editors to classify a subset of queries selected from the total set given to you. The collective knowledge of the human editors about the internet is assumed to be more complete than that of any individual end user. A portion of the editor-labeled queries is given to you (CategorizedQuerySample.txt in the zip file for downloading) and the rest will be held back for evaluation. You will not know which queries will be used for evaluation, so you are asked to categorize all of the queries given.

You should tag each query with up to 5 categories. If the submission does not contain all search queries, those not included will be treated as having no category tags.

Please follow the instructions under the "Format" section when you submit your results.

The evaluation will run on the held-back queries and rank your results by how closely they match the labels from the human editors. Submissions will be evaluated against each human labeler and the scores averaged. The measures are:

    Precision = (number of query tags correctly assigned) / (total number of query tags assigned by the system)

    Recall = (number of query tags correctly assigned) / (total number of query tags assigned by the labeler)

    F1 = 2 * Precision * Recall / (Precision + Recall)

You will be asked to submit your algorithms (please see "Submission of Categorization" for details). The interestingness, scalability, and efficiency of the algorithms will also be judged. New ideas in handling search queries and internet content will be valued, and the most innovative ideas will be selected by the KDD Cup co-chairs.
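As a rough sketch of how precision, recall, and F1 might be computed for this task (this is illustrative, not the organizers' official scoring code; it compares a system's labels per query against one labeler's sets, and category strings are taken from the examples in this document):

```python
def evaluate(predicted, truth):
    """Precision, recall, and F1 for query categorization.

    predicted, truth: dicts mapping query -> set of category labels.
    A predicted tag counts as correct when it appears in the
    labeler's set for that query.
    """
    correct = sum(len(predicted.get(q, set()) & labels)
                  for q, labels in truth.items())
    n_pred = sum(len(labels) for labels in predicted.values())
    n_true = sum(len(labels) for labels in truth.values())
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_true if n_true else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example with two queries and one labeler:
truth = {"saturn": {"Living\\Car & Garage",
                    "Information\\Companies & Industries"},
         "auto price": {"Shopping\\Stores & Products"}}
predicted = {"saturn": {"Living\\Car & Garage"},
             "auto price": {"Shopping\\Stores & Products",
                            "Living\\Car & Garage"}}
p, r, f1 = evaluate(predicted, truth)
```

With three labelers, this evaluation would be run once per labeler and the resulting scores averaged.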

Schedule

May 2, 2005 Tasks and datasets available online
July 12, 2005 Submissions of query categorization results due (by midnight PST)
July 15, 2005 Submissions of detailed algorithm due (by midnight PST)
August 21-24, 2005 KDD 2005 Conference

Datasets

The data set consists of 800,000 search queries drawn from end users' internet search activity. The data is in a text file, one query per line.

Registration

Before downloading the datasets, you should register. This will give us a way to contact you in case it is necessary. We will keep your data private, and registering does not indicate any commitment to participation.

To register, please send us an email with:
Name of contact person,
Email of contact person,
Organization

Download

Download data set in zip format (7.5MB)

Format

The file you downloaded is an archive compressed in zip format. Most decompression programs (e.g. WinZip, WinRAR) can decompress this format. If you run into problems, send us an email. The archive contains three files:
  • Queries.txt:
    800K search queries. Each line is one query.

  • CategorizedQuerySample.txt:
This is a sample file containing 111 queries and their manual categorizations. Each line starts with one query, followed by its top 5 categories as labeled by human experts, separated by tabs. Some queries have fewer than 5 categories.

  • Categories.txt:
Contains the 67 predefined categories. Each line contains one category name.

To give an example, the first line in CategorizedQuerySample.txt looks like this:

auto price	Shopping\Stores & Products	Living\Car & Garage	Shopping\Buying Guides & Researching	Shopping\Bargains & Discounts	Information\Companies & Industries

"auto price" is the query.

"Shopping\Stores & Products", "Living\Car & Garage", "Shopping\Buying Guides & Researching", "Shopping\Bargains & Discounts", and "Information\Companies & Industries" are the category labels for this query.

Elements in each line are separated by tab "\t".
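A minimal sketch of parsing one line of this tab-separated format (the function name is illustrative, not part of any provided code):

```python
def parse_line(line):
    """Split one tab-separated line into (query, list_of_categories)."""
    fields = line.rstrip("\r\n").split("\t")
    # The first field is the query; the rest are category labels.
    return fields[0], [f for f in fields[1:] if f]

# The example line from CategorizedQuerySample.txt:
line = ("auto price\tShopping\\Stores & Products\t"
        "Living\\Car & Garage\t"
        "Shopping\\Buying Guides & Researching\t"
        "Shopping\\Bargains & Discounts\t"
        "Information\\Companies & Industries\n")
query, categories = parse_line(line)
```

Splitting on "\t" rather than on arbitrary whitespace matters here, because category names themselves contain spaces.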

Submission of Categorization

The FTP server for uploading your submissions is open. The address of the FTP server is ftp://kddcup.kdd2005.com (for web browsers) or kddcup.kdd2005.com (for FTP clients; SmartFTP is a good FTP client if you need one).

The submission files should follow the filename scheme below:

For categorization results: lastname-firstname-result-year-month-day.zip

Example: Li-Ying-result-2005-07-01.zip

For algorithm description: lastname-firstname-algorithm-year-month-day.zip

Example: Li-Ying-algorithm-2005-07-01.zip

You should use a commonly accepted compressed format (zip, rar, gz, tar.gz, or arj).

The categorization results file should be an ANSI plain-text file with a ".txt" filename suffix. The format is the same as CategorizedQuerySample.txt:

<Query> <Category_1> <Category_2> <Category_3> <Category_4> <Category_5>

Please use CategorizedQuerySample.txt as an example for your submission of categorization results. You may have fewer than 5 category labels for some of the queries. If you submit more than 5 category labels on one line, we will only consider the first 5 labels for that query. Elements in each line are separated by tabs ("\t"). Each line ends with a line feed ("\n") or a carriage return immediately followed by a line feed ("\r\n").

Please strictly follow the file format specified above. Results submitted with incorrect format risk being wrongly evaluated.
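The format rules above (tab separators, at most 5 labels per query, "\n" line endings) can be sketched as a small writer function. This is an illustrative helper, not provided tooling; the category strings in the second row are placeholders used only to show that labels beyond the fifth are dropped:

```python
import os
import tempfile

def write_submission(path, results):
    """Write categorization results in the required submission format:
    one query per line, query and labels separated by tabs, each line
    ending with '\n'. Only the first 5 labels per query are kept."""
    with open(path, "w", newline="\n") as f:
        for query, categories in results:
            f.write("\t".join([query] + list(categories)[:5]) + "\n")

# Illustrative rows; the second query is deliberately over-tagged
# with placeholder labels to show that extras beyond 5 are dropped.
rows = [
    ("auto price", ["Shopping\\Stores & Products",
                    "Living\\Car & Garage"]),
    ("saturn", ["Cat1", "Cat2", "Cat3", "Cat4", "Cat5", "Cat6"]),
]
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "Li-Ying-result-2005-07-01.txt")
    write_submission(path, rows)
    with open(path) as f:
        lines = f.read().splitlines()
```

Passing newline="\n" to open() prevents Python from translating line endings on Windows, so the file uses "\n" regardless of platform.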

You should also submit another file describing your algorithm. The description should state the methodology, the logic, and the reasoning behind your algorithm. If you do not want to share the details of your techniques, you can give a high-level outline of your approach; please indicate "a brief summary" at the beginning of your description. In this case, you will not be eligible for the "Query Categorization Creativity Award".

The description file's base name should be "readme"; the file extension can be txt, pdf, doc, or ps. The description should be no more than 5 pages, with a font size no smaller than 10 points, single-spaced and single-column.

After a file has been uploaded, it cannot be overwritten, read, or edited. You can submit multiple versions; we will take the last submission from each participant. If you need to change your submission within the same day, add a version number after the date in the file name, e.g. Li-Ying-result-2005-07-01-01.zip

Please also remember to submit early to avoid last-minute congestion on the FTP server.

Frequently Asked Questions and News

News

Solution Evaluation Result

The following table contains the evaluation results for the submitted solutions we received. The solutions are listed in random order. The organizers will send a "Submission ID" to each individual participant. Once you receive the "Submission ID" for your solution, you can use it to look up your evaluation result in the following table.

Submission ID   Precision   F1
1               0.145099    0.146839
2               0.116583    0.139732
3               0.339435    0.309754
4               0.110885    0.124228
5               0.31068     0.085639
6               0.254815    0.246264
7               0.263953    0.306359
8               0.454068    0.405453
9               0.264312    0.306612
10              0.334048    0.342248
11              0.107045    0.116521
12              0.196117    0.207787
13              0.326408    0.357127
14              0.317308    0.312812
15              0.271791    0.26545
16              0.050918    0.060285
17              0.264009    0.218436
18              0.206167    0.247854
19              0.136541    0.127008
20              0.127784    0.126848
21              0.340883    0.34009
22              0.414067    0.444395
23              0.237661    0.250293
24              0.244565    0.258035
25              0.753659    0.205391
26              0.255726    0.274579
27              0.206919    0.205302
28              0.148503    0.17614
29              0.171081    0.1985
30              0.145467    0.173173
31              0.108305    0.108174
32              0.16962     0.232654
33              0.469353    0.255096
34              0.198284    0.191618
35              0.32075     0.384136
36              0.211284    0.129937
37              0.423741    0.426123

KDD-Cup presentation:

The KDD-Cup presentation slides from the organizers and the three winning teams.

Winning Teams:

Winners

  • Query Categorization Precision Award
    Hong Kong University of Science and Technology team
    Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang
  • Query Categorization Performance Award
    Hong Kong University of Science and Technology team
    Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang
  • Query Categorization Creativity Award
    Hong Kong University of Science and Technology team
    Dou Shen, Rong Pan, Jiantao Sun, Junfeng Pan, Kangheng Wu, Jie Yin and Qiang Yang

Runner-ups

  • Query Categorization Precision Award
    Budapest University of Technology team
Zsolt T. Kardkovács, Domonkos Tikk, Zoltán Bánsághi
  • Query Categorization Performance Award
    MEDai/AI Insight/ Humboldt University team
    David S. Vogel, Steve Bridges, Steffen Bickel, Peter Haider, Rolf Schimpfky, Peter Siemen, Tobias Scheffer
  • Query Categorization Creativity Award
    Budapest University of Technology team
Zsolt T. Kardkovács, Domonkos Tikk, Zoltán Bánsághi

Labeled Query Data

The 800 queries with labels from the three human labelers are available to download.


Questions

Question: Can we submit a separate solution for each evaluation criterion (award)?

Answer:
Yes. When submitting, you can include multiple files in the compressed file, one for each solution. In this case, you need to clearly specify which file is for which award. However, your solution dedicated to precision must be among the top 10 F1 scores in order to be a candidate for the "Query Categorization Precision Award". Also, in the algorithm description, you need to clearly specify which algorithm was used for which award. Note that only one solution per participant team is allowed for each of the "Query Categorization Precision Award" and the "Query Categorization Performance Award". The "Query Categorization Creativity Award" will be based on the description of the algorithm(s) used for the other two awards.

Question: Are we allowed to use external sources (e.g. documents from directories) to increase the knowledge of the classifier?

Answer: Yes, you will decide what methodology or resource to use to classify the queries. There is no restriction on what data you can/can't use to build your models.

Question: How to label trash queries or non-English queries?

Answer: The evaluation set contains only valid English queries. Participants may have their systems return no labels for non-English or trash queries.

Question: Do I have to submit an algorithm description?

Answer: You need to submit an algorithm description. If you do not want to share the details of your techniques, you can give a high-level description of your approach; please indicate "a brief summary" at the beginning of your description. In this case, you will not be eligible for the "Query Categorization Creativity Award".

Question: How will the evaluation query set be selected?

Answer:
1. The queries for evaluation will be selected randomly.
2. Foreign language queries / trash queries / improper content queries will be dropped from the evaluation set during the selection process.

New Categories and Samples:

Based on feedback from participants, we have modified the categories to better reflect the most widely shared views people have of search queries. We believe this is a good change and trust that it will help you better classify the search queries.

The category taxonomy given here is the best view we could come up with to cover internet search queries. When designing a categorization system, you should consider making it work well even when this taxonomy is replaced with another reasonable one. Because the taxonomy may not be exhaustive, your system should return no labels for a query that does not fit any of the categories; we will take this into account in the evaluation.

We are also providing at least one sample query per category. The modified category list and the new sample file can be downloaded from http://www.acm.org/sigs/sigkdd/kdd2005/kddcup/KDDCUPData.zip.

Contact

Questions to Organizers

Email questions to KDD Cup 2005 Organizers.

Co-chairs

Ying Li
Microsoft Corp.
Phone: 425-703-8739

Zijian Zheng
Amazon.com

Webmaster: Michal Sabala