KDD 2007 PANEL
Mining at the Crossroads: Successes, Failures and Learning From Them

Moderator: Srinivasan Parthasarathy, Ohio State University

Abstract

Since the 1989 workshop on knowledge discovery in databases, the field has seen sustained growth and interest and has attained significant maturity. The main objectives of this panel will be to reflect on the successes and failures in the field of data mining over the last eighteen years and to examine what insights we can take with us as we move forward.

Introduction

Over the last eighteen years, the field of knowledge discovery and data mining has matured considerably. Although the field has evolved as a result of synergistic co-operation among researchers in databases, artificial intelligence, statistics and systems, it has maintained its own identity. From a single workshop in 1989, the field can now lay claim to at least five major conferences and numerous symposia devoted to its central theme.

At an abstract level, the field is concerned with extracting actionable and interpretable knowledge from data in as efficient a manner as possible. The primary purpose of this panel, in the context of this underlying theme, is to consider the following questions. What have been the major successes and breakthroughs that we as a field can point to with pride? What have been the critical mistakes or missteps that have been taken along the way? And finally, what can we hope to learn from both our successes and mistakes, and how can this knowledge be used to determine how to focus our efforts in the future?

The panelists will be asked to examine and reflect on the above questions along multiple dimensions.

  • In the context of progress made in various KDD sub-fields such as data preprocessing, classification, clustering, frequent pattern mining, outlier detection, visualization and interpretation.
  • In terms of impact on real end applications such as biomedical informatics, science and engineering informatics, and security. This issue is particularly important since the field itself is, in some senses, driven by applications.
  • In relation to emerging technology trends, both software and hardware, with an eye towards the future.

Related issues, such as the importance of reproducibility of results, the availability of suitable benchmarks, and the education of data mining graduate students and their readiness for an industrial setting, are also fair game for discussion.

Successes

Success is the result of perfection, hard work, learning from failure, loyalty and persistence. -- Colin Powell

This aspect of the panel discussion will seek to examine the significant foundational, practical, and organizational success stories in the field over the last eighteen years. A basic question here is: what have been the significant foundational successes? The goal is to reflect on algorithmic success stories that are ably backed by theoretical and empirical evidence. For example, the entire sub-field of association rule mining came to the fore about 15 years ago and has since spawned a prolific following that has diversified into more generic frequent pattern (e.g. trees, graphs) mining algorithms. Important advances have been made in areas such as support vector machines, ensemble learning, scalable clustering, mining on data streams, time-series analysis and mining with tensors. Algorithms like PageRank have revolutionized the way we think about web search. All of these, and this is by no means a comprehensive list, correspond to significant foundational success stories.

A related question is: where have data mining and knowledge discovery algorithms played an important role in an applied setting? Similarly, what approaches have resulted in important improvements for business and financial operations, breakthrough scientific discoveries and general benefits to mankind? Examples abound and include the use of data mining technologies to detect fraudulent and money laundering activities, the ability to process terabytes of image data from the Sky Survey database to detect and characterize fuzzy elements in the sky, the ability to detect the early onset of diseases and predict their progression to enable better disease-control measures, and the ability to predict the efficacy and potential toxicity of drugs to enable more informed drug design.

Another relevant question in this context is how successful we have been in educating our graduate students and preparing them for data mining careers in industry, government and academia. Have we been more successful in preparing them for academia than for industry and, moreover, is this really a trend specific to data mining? Similarly, what elements in particular contributed to the success of conferences in the field, and how can we ensure that the interest is sustained?

Mistakes and Failures

Our greatest glory is not in never failing, but in rising every time we fall -- Confucius

Over the last eighteen years, while there have clearly been successful deployments of knowledge discovery and data mining solutions, there have also undoubtedly been mistakes and failures. This aspect of the panel discussion will examine the low-lights -- important mistakes and failures -- with the end goal of trying to learn from them.

A topic worth reflecting on, in this context, is whether there has been a failure to progress in line with expectations in various sub-fields of KDD. For example, in a recent article, Hand suggests that recent progress in classification technology is perhaps at best an illusion. As a counterpoint, Friedman suggests that initial progress in any new field is usually substantial (picking the low-hanging fruit) whereas subsequent progress, while very important, inevitably happens at a much slower rate.

A related theme would be to examine the above question in the context of end application drivers. For example, in business applications, the profit margin is often the bottom line. While a more accurate data mining model may yield better profits, the cost of training personnel to adapt and use the model may be quite high, suggesting a preference for the simpler model in terms of overall profit margin. A similar set of problems holds in the biomedical context -- a practicing clinician often prefers an easily interpretable model for disease diagnosis over a slightly more accurate black box modeling tool. Are these instances examples of successes or failures? Similarly, a tool that outputs 100,000 patterns (potential hypotheses) that bear further experimental validation is not very useful by itself to an experimental biologist, since the experimental validation of even 10 of those patterns may consume the better part of their career!

Another topic of interest here is to highlight some of the classic mistakes made in the field. These could range from the use of non-representative training data to ignoring population drift when modeling time-varying data, from not accounting for errors in data or labels in the model to an over-reliance on a single technique for the task at hand, and from asking the wrong question in the context of the application driver to sampling without care. A related topic here might be to think about the role of benchmark datasets and algorithms, and to reflect on the general importance of and requirement for repeatable and reproducible results.

Outlook

Each success only buys an admission ticket to a more difficult problem. -- Henry Kissinger.

Early success in any new field typically points to the plucking of low-hanging fruit. Following up on success, early or late, is inevitably harder. How then can we build on our successes of the last eighteen years as we tackle new and exciting problems over the next eighteen? What are the important problems brought to light as a result of our past successes?

Failure is success if we learn from it -- Malcolm Forbes.

This section of the panel will also include a discussion of insights gleaned from failures of the past. In other words, what can we learn from our mistakes and failures? Moving forward, what errors and mishaps can we avoid, and how can we ensure that such failures do not recur?

As part of this section of the discussion, panelists will be asked to identify and suggest topics or areas of research that in their opinion are worth pursuing. Similarly, they may also reflect on topics they consider solved or best left alone, i.e., not that important in their opinion. Panelists will also be asked to identify emerging technologies in hardware (e.g. multicore processors) and software, as well as emerging application areas, that they feel are particularly important and are likely to have an impact on the field as we move forward.

Panel Format

The terms "success" and "failure" often convey fuzzy semantics that are open to interpretation. As part of the discussion, it is expected that panelists will offer their thoughts on the aforementioned questions while defining their interpretation of these terms in the context of particular domains. After an initial round of discussions by the panelists, the floor will be opened to an interactive session with the audience. Finally, panelists will be asked to conclude their presentations with their outlook on how one can learn from the successes and failures of the past eighteen years and what in their opinion are the critical opportunities for the field in the future.

Panelist Biographies

In this section we include brief biographies of the distinguished panelists.

Dr. Pavel Berkhin earned his Ph.D. in Mathematics from the Institute of Mathematics, Novosibirsk, USSR, under the supervision of Professor Sergey Sobolev. He has worked on theoretical challenges facing search technologies including link-based spam detection, personalization, and new ways of computing PageRank. He is currently a vice president of Data Mining and Research with Yahoo!. His group is involved in diversified efforts to utilize Yahoo! data, from the development of the Yahoo! data mining platform to anomaly detection, from behavioral targeting to modeling for search advertisement, and from studies of user adoption patterns to keyword set expansions. Prior to Yahoo!, Dr. Berkhin was Chief Scientist with Accrue Software, Inc., a web analysis company, and Chief Scientist of Neo Vista, Inc., a provider of industrial data mining software.

Dr. John Elder obtained a BS and ME in Electrical Engineering from Rice University, and a PhD in Systems Engineering from the University of Virginia, where he has recently been an adjunct professor, teaching Optimization. He currently heads Elder Research Inc., a company that focuses on scientific and commercial applications of pattern discovery and optimization, including stock selection, image recognition, medical text mining, biometric identification, drug efficacy, credit scoring, cross-selling, investment timing, and fraud detection. He has authored innovative data mining tools and is active in Statistics, Engineering, and Finance conferences and boards. He is also a frequent keynote conference speaker, and was a Program co-chair of the 2004 SIGKDD conference. He was honored by being selected to serve for 5 years on a panel appointed by the President to guide the National Security Agency on technology.

Dr. Christos Faloutsos received his BS from the National Technical University of Athens in Electrical Engineering and MS and PhD degrees in Computer Science from the University of Toronto. He is currently a Professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award from the National Science Foundation (1989), the Research Contributions Award at ICDM 2006, ten "best paper" awards, and several teaching awards. He has served as a member of the executive committee of SIGKDD; he has published over 160 refereed articles, 11 book chapters and one monograph. He holds five patents and has given over 20 tutorials and 10 invited distinguished lectures. His research interests include data mining for streams and graphs, fractals, database performance, and indexing for multimedia and bio-informatics data.

Dr. Jiawei Han received his PhD from the University of Wisconsin in Computer Science in 1985. He is currently a professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He has been working on research into data mining, data warehousing, database systems, data mining from spatiotemporal data, multimedia data, stream and RFID data, Web data, social network data, and biological data, with over 300 journal and conference publications. He has chaired or served on over 100 program committees of international conferences and workshops, including serving as PC co-chair of the 2005 (IEEE) International Conference on Data Mining (ICDM) and Americas Coordinator of the 2006 International Conference on Very Large Data Bases (VLDB). He is also serving as the founding Editor-In-Chief of ACM Transactions on Knowledge Discovery from Data. He is an ACM Fellow and has received the 2004 ACM SIGKDD Innovations Award and the 2005 IEEE Computer Society Technical Achievement Award. His book "Data Mining: Concepts and Techniques" (2nd ed., Morgan Kaufmann, 2006) has been widely used as a textbook worldwide.

Dr. Haym Hirsh received his BS degree from the Mathematics and Computer Science Departments at UCLA and his MS and PhD from the Computer Science Department at Stanford University. He is a Professor of Computer Science at Rutgers University, and has also held visiting positions at Bar-Ilan University, CMU, MIT, NYU, and the University of Zurich. He is currently Director of the Division of Information and Intelligent Systems at the U.S. National Science Foundation's Directorate for Computer and Information Science and Engineering. Haym's research is on foundations and applications of machine learning, data mining, and information retrieval.

Dr. Srinivasan Parthasarathy earned his PhD from the University of Rochester in Computer Science in 2000. He is currently an Associate Professor in the Computer Science and Engineering Department at the Ohio State University (OSU). His research interests are in data mining, bioinformatics and high performance computing. He is a recipient of an NSF CAREER award, a DOE Early Career Award, and an Ameritech Faculty Fellowship. His papers have received several awards from leading conferences in the field, including ones at the SIAM International Conference on Data Mining (SDM), the IEEE International Conference on Data Mining (ICDM), the Very Large Databases Conference (VLDB) and most recently at ACM SIGKDD. He is a member of the ACM and the IEEE and serves on the editorial boards of IEEE Intelligent Systems and Data Mining and Knowledge Discovery: An International Journal. He also served as one of the program chairs of SIAM Data Mining in 2007.


KDD-2007 CALL FOR PANEL PROPOSALS

The KDD-2007 organizing committee invites proposals for panels to be held at the conference. Panels provide a forum for discussing emerging topics and controversial issues. Panel proposals are expected to address new, exciting, and controversial issues. They should be provocative and informative. Of special interest are proposals that address emerging themes of research in KDD that are likely to have long-term relevance and impact. A mix of industry, government, and academic panel members is encouraged.

IMPORTANT DATES:

Panel proposals due: Feb 28, 2007
Notification of acceptance/rejection: April 16, 2007

PROPOSAL DETAILS:

Panel proposals should be no more than four pages in length and must include the following:

  • Title of the panel
  • The topic and issues to be discussed in the panel
  • Name, affiliation, and contact information for the panel chair
  • Names and affiliations of up to four panelists (in addition to the panel chair) who have made a commitment to participate
  • Brief biography of each participant

Panel proposals should be sent by e-mail in PDF or ASCII format to:

Panel Chair: Vipin Kumar (kumar[at]cs[dot]umn[dot]edu)
URL: http://www.cs.umn.edu/~kumar
