KDD Cup 2005

News

August 30, 2005: KDD-Cup presentation slides from the KDD conference
August 30, 2005: Winning Teams
August 30, 2005: Labeled Query Data
August 10, 2005: Solution Evaluation Result

Introduction

The KDD-Cup 2005 Knowledge Discovery and Data Mining competition will be held in conjunction with the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. The task was selected to be interesting to participants from both academia and industry. In particular, we encourage the participation of students. This year's competition is about classifying internet user search queries. We look forward to an interesting competition and encourage your participation.

Contest Rules

Agreement

By sending the registration email, you indicate your full and unconditional agreement to and acceptance of these contest rules.

Eligibility

The contest is open to any party planning to attend KDD 2005. A person can participate in only one group. Multiple submissions per group are allowed, since we will not provide feedback at the time of submission. Only the last submission before the deadline will be evaluated; all other submissions will be discarded.

Integrity

The contestant is responsible for obtaining any permission needed to use algorithms, tools, or data that are the intellectual property of a third party.

Winner Selection

There will be three prizes: the "Query Categorization Precision Award", the "Query Categorization Performance Award", and the "Query Categorization Creativity Award". One winner will be selected for each award.

The winners will be determined as follows. All participants are ranked according to their overall performance and average precision on the test set. Participants will also be ranked on the creativity of their methodologies. The winner of the "Query Categorization Performance Award" is the participant who has the best average performance rank in terms of F1 defined below.
Among the participants with the top 10 F1 scores, the winner of the "Query Categorization Precision Award" will be the one with the best average precision. The winner of the "Query Categorization Creativity Award" is the participant whose model has a top 20 average rank in terms of F1 defined below and whose creative ideas are judged highly outstanding by the KDD Cup co-chairs and a group of search experts. The scalability and level of automation of the model will also be considered in the judgment. An honorable mention will be awarded for each prize.

Tasks

This year's competition focuses on query categorization. Your task is to categorize 800,000 queries into 67 predefined categories. The meaning and intention of search queries is subjective. A search query "Saturn" might mean the Saturn car to some people and Saturn the planet to others. We will use multiple human editors to classify a subset of queries selected from the total set given to you. The human editors as a group are assumed to have more complete knowledge of the internet than any individual end user. A portion of the editor-labeled queries is given to you (CategorizedQuerySample.txt in the zip file for downloading) and the rest will be held back for evaluation. You will not know which queries will be used for evaluation and are asked to categorize all queries given. You should tag each query with up to 5 categories. If the submission does not contain all search queries, those not included will be treated as having no category tags. Please follow the instructions in the "Format" section when you submit your results.

The evaluation will run on the held-back queries and rank your results by how closely they match the results from the human editors. Here is the set of measures we will use to evaluate the results submitted by contestants:
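The measure definitions themselves are not reproduced in this copy of the page. As a rough sketch only, assuming the conventional precision, recall, and F1 pooled over all categories (an assumption, not the organizers' stated formulas), a submission could be scored against one editor's labels as follows. The function and variable names are hypothetical.

```python
from typing import Dict, Set, Tuple

Labels = Dict[str, Set[str]]  # query -> set of category labels


def precision_recall_f1(submitted: Labels, editor: Labels) -> Tuple[float, float, float]:
    """Precision, recall and F1 against one editor's labels (assumed definitions).

    Only queries present in `editor` (the held-back evaluation set) are scored.
    A query missing from `submitted` is treated as having no category tags.
    """
    correct = tagged = relevant = 0
    for query, truth in editor.items():
        predicted = submitted.get(query, set())
        correct += len(predicted & truth)   # labels that match the editor
        tagged += len(predicted)            # labels the submission assigned
        relevant += len(truth)              # labels the editor assigned
    precision = correct / tagged if tagged else 0.0
    recall = correct / relevant if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With several editors, one plausible option is to compute these scores per editor and average them; this copy of the page does not spell out how the editors' labels are combined.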
You will be asked to submit your algorithms (please see "Submission of Categorization" for details). The interestingness, scalability, and efficiency of the algorithms will also be judged. New ideas in handling search queries and internet content will be valued, and the most innovative ideas will be selected by the KDD Cup co-chairs.
Datasets

The data set consists of 800,000 search queries from end-user internet search activity. The data is in a text file, one query per line.

Registration

Before downloading the datasets, you should register. This gives us a way to contact you if necessary. We will keep your data private, and registering does not commit you to participating.

To register, please send us an email with:
Name of contact person,
Email of contact person,
Organization

Download

Download data set in zip format (7.5MB)

Format

The file you downloaded is an archive compressed in WinZip format. Most decompression programs (e.g. WinZip, RAR) can decompress this format. If you run into problems, send us email. The archive should contain three files:

"auto price Shopping\Stores & Products Living\Car & Garage Shopping\Buying Guides & Researching Shopping\Bargains & Discounts Information\Companies & Industries"

"auto price" is the query. "Shopping\Stores & Products", "Living\Car & Garage", "Shopping\Buying Guides & Researching", "Shopping\Bargains & Discounts", and "Information\Companies & Industries" are the category labels for this query. Elements in each line are separated by a tab ("\t").

Submission of Categorization

The FTP server for uploading your submissions is open. The address of the FTP server is ftp://kddcup.kdd2005.com (for use with web browsers) or kddcup.kdd2005.com (for use with FTP clients; SmartFTP is a good FTP client if you need one). Submission files should follow the filename scheme below:
For categorization results: lastname-firstname-result-year-month-day.zip Example: Li-Ying-result-2005-07-01.zip
For algorithm description: lastname-firstname-algorithm-year-month-day.zip Example: Li-Ying-algorithm-2005-07-01.zip
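As a small illustration, the filename scheme can be generated rather than typed by hand. The sketch below is not an official tool; the helper name and its parameters are hypothetical. It also covers the optional same-day version suffix described later in this section, and the two-digit zero padding is an assumption based on the examples.

```python
from datetime import date
from typing import Optional


def submission_filename(last: str, first: str, kind: str, day: date,
                        version: Optional[int] = None) -> str:
    """Build lastname-firstname-kind-year-month-day[-version].zip."""
    if kind not in ("result", "algorithm"):
        raise ValueError("kind must be 'result' or 'algorithm'")
    name = f"{last}-{first}-{kind}-{day.isoformat()}"  # isoformat gives YYYY-MM-DD
    if version is not None:
        name += f"-{version:02d}"  # assumed padding, as in "...-2005-07-01-01.zip"
    return name + ".zip"
```

For example, `submission_filename("Li", "Ying", "result", date(2005, 7, 1))` produces the example name given above for categorization results.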
You should use a commonly accepted compression format (zip, rar, gz, tar.gz, or arj). The categorization results file should be an ANSI plain text file with ".txt" as the file name suffix. The format is the same as CategorizedQuerySample.txt:
<Query> <Category_1> <Category_2> <Category_3> <Category_4> <Category_5>
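A line in this layout could be produced and read back as sketched below, using the tab separator, the five-label limit, and the line endings stated in this section. The function names are hypothetical, and rejecting fields that contain tabs or line breaks is an added safety assumption rather than a contest rule.

```python
from typing import List, Tuple

MAX_LABELS = 5  # labels beyond the fifth are not considered


def format_line(query: str, categories: List[str]) -> str:
    """Build one tab-separated result line, keeping at most five labels."""
    fields = [query] + list(categories[:MAX_LABELS])
    if any("\t" in f or "\n" in f or "\r" in f for f in fields):
        raise ValueError("fields must not contain tabs or line breaks")
    return "\t".join(fields) + "\n"


def parse_line(line: str) -> Tuple[str, List[str]]:
    """Split a line into (query, labels), accepting LF or CRLF line endings."""
    query, *labels = line.rstrip("\r\n").split("\t")
    return query, labels[:MAX_LABELS]
```

A query with no labels becomes a line containing only the query. When writing the file, "ANSI" is taken here to mean a single-byte Windows code page such as cp1252, which is an assumption; opening the file with `newline=""` keeps Python from altering the line endings.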
Please use CategorizedQuerySample.txt as an example for your categorization result submission. You may have fewer than 5 category labels for some queries. If you submit more than 5 category labels on one line, we will consider only the first 5 labels for that query. Elements in each line are separated by a tab ("\t"). Each line ends with a line feed ("\n") or a carriage return immediately followed by a line feed ("\r\n"). Please strictly follow the file format specified above. Results submitted in an incorrect format risk being wrongly evaluated.

You should also include another file describing your algorithm. The description should state the methodology, the logic, and the reasoning behind your algorithm. If you do not want to share the details of your techniques, you can give just a high-level outline of your approach; please indicate "a brief summary" at the beginning of your description. In this case, you will not be considered for the "Query Categorization Creativity Award". The description file stem should be "readme". The file extension can be txt, pdf, doc, or ps. The description should be no more than 5 pages, with font size no smaller than 10, single-spaced and single-column.

After a file has been uploaded, it cannot be overwritten, read, or edited. You can submit multiple versions, and we will take the last submission from each participant. If you need to change your submission within the same day, you can add a version number after the date in the file name, such as: Li-Ying-result-2005-07-01-01.zip. Please also remember to submit early to avoid last-minute congestion on the FTP server.

Frequently Asked Questions and News

News

Solution Evaluation Result

The following table contains the evaluation results for the submitted solutions we received. The solutions are listed in random order. The organizers will send a "Submission ID" to each participant. Once you receive the "Submission ID" for your solution, you can use it to look up your evaluation result in the following table.
KDD-Cup presentation:
The KDD-Cup presentation slides from the organizers and the three winning teams.

Winning Teams:
Winners
Runners-up

Labeled Query Data

The 800 queries with labels from the three human labelers are available to download.

Questions

Question: Can we submit a separate solution for each evaluation criterion (award)?
Answer:
Question: Are we allowed to use external sources (e.g. documents from directories) to increase the knowledge of the classifier?
Answer: Yes, you decide what methodology or resources to use to classify the queries. There is no restriction on what data you can or cannot use to build your models.

Question: How should trash queries or non-English queries be labeled?
Answer: The evaluation set contains only valid English queries. Participants may have their system return no labels for non-English or trash queries.

Question: Do I have to submit an algorithm description?
Answer: You need to submit an algorithm description. If you do not want to share the details of your techniques, you can give just a high-level description of your approach; please indicate "a brief summary" at the beginning of your description. In this case, you will not be considered for the "Query Categorization Creativity Award".

Question: How will the evaluation query set be selected?
Answer:
New Categories and Samples:
Based on feedback from participants, we have modified the categories to better reflect the views people are most likely to agree on for search queries. We believe this is a good change and trust that it will help you classify the search queries better. The category taxonomy given here is the best view we could come up with to cover internet search queries. When designing a categorization system, you should consider making it work well when this category taxonomy is replaced with another reasonable one. Because this category taxonomy may not be exhaustive, your system should return no labels for a query that does not fit any of the categories, and we will take that into account in the result evaluation. We are also providing at least one sample query per category. The modified category list and the new sample file can be downloaded from http://www.acm.org/sigs/sigkdd/kdd2005/kddcup/KDDCUPData.zip.
Contact

Questions to Organizers

Email questions to KDD Cup 2005 Organizers.

Co-chairs
Ying Li
Microsoft Corp. Phone: 425-703-8739
Zijian Zheng
Amazon.com