KDD Cup 2005: Internet user search query categorization

Task Description

This year's competition focuses on query categorization.

Your task is to categorize 800,000 queries into 67 predefined categories. The meaning and intent of a search query are subjective: the query "Saturn" might mean the Saturn car to some people and the planet Saturn to others. We will use multiple human editors to classify a subset of queries selected from the total set given to you. Collectively, the human editors are assumed to have more complete knowledge of the internet than any individual end user. A portion of the editor-labeled queries is given to you (CategorizedQuerySample.txt in the zip file for downloading), and the rest will be held back for evaluation. You will not know which queries will be used for evaluation, so you are asked to categorize all queries given.

You should tag each query with up to 5 categories. If the submission does not contain all search queries, those not included will be treated as having no category tags.
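The two rules above (at most 5 tags per query, and omitted queries counting as untagged) can be checked before submitting. The following is only a minimal sketch in Python: the function names, the in-memory dictionary layout, and the category labels are hypothetical, since the submission file format is not described here.

```python
from typing import Dict, Iterable, List

MAX_TAGS = 5  # the task allows up to 5 categories per query


def complete_submission(
    all_queries: Iterable[str],
    predictions: Dict[str, List[str]],
    max_tags: int = MAX_TAGS,
) -> Dict[str, List[str]]:
    """Return one entry per query, each with at most max_tags distinct tags.

    Hypothetical helper: queries missing from `predictions` get an empty
    list, mirroring the rule that omitted queries have no category tags.
    """
    completed: Dict[str, List[str]] = {}
    for query in all_queries:
        tags = predictions.get(query, [])
        # dict.fromkeys drops duplicate tags while keeping their order
        unique_tags = list(dict.fromkeys(tags))
        completed[query] = unique_tags[:max_tags]
    return completed


def untagged_queries(completed: Dict[str, List[str]]) -> List[str]:
    """List the queries that would be scored as having no category tags."""
    return [query for query, tags in completed.items() if not tags]


if __name__ == "__main__":
    # Category labels below are placeholders, not the real 67 categories.
    predictions = {"saturn": ["cat_autos", "cat_space", "cat_autos"]}
    completed = complete_submission(["saturn", "some other query"], predictions)
    print(completed)
    print(untagged_queries(completed))
```

Listing the untagged queries before submitting makes it easy to see which ones would count against the results simply because they were left out.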


The evaluation will run on the held-back queries and rank your results by how closely they match the results from the human editors. Here is the set of measures we will use to evaluate the results submitted by contestants:

You will be asked to submit your algorithms. The interestingness, scalability, and efficiency of the algorithms will also be judged. New ideas for handling search queries and internet content will be valued, and the most innovative ideas will be selected by the KDD Cup co-chairs.

