KDD 2007 PANEL
Mining at the Crossroads: Successes, Failures and Learning From Them
Moderator: Srinivasan Parthasarathy, Ohio State University
Abstract
Since the 1989 workshop on knowledge discovery in databases, the field has seen
sustained growth and interest and has attained significant maturity. The main objectives of this panel
will be to reflect on the successes and failures in the field of data mining over the last
eighteen years and to examine what insights we can take with us as we move forward.
Introduction
Over the last eighteen years, the
field of knowledge discovery and data mining has
matured considerably. Although the field has evolved as a result of
synergistic co-operation
among researchers in databases, artificial intelligence, statistics and
systems, it has maintained its own identity. From a single workshop in
1989, the field
can now lay claim to at least 5 major conferences and numerous symposia
devoted to its central theme.
At an abstract level, the theme of the field is concerned
with extracting actionable and interpretable
knowledge from data in as efficient a manner as possible.
The primary purpose of this panel, in the context of this underlying
theme, is to consider the following questions.
What have been the major successes and breakthroughs that we as a field
can point to with pride?
What have been the critical mistakes or mis-steps that have been taken
along the way?
And finally, what can we hope to learn from both our successes and
mistakes and how can this
knowledge be used to determine how to focus our efforts in the future?
The panelists will be asked to examine and reflect on
the above questions along multiple dimensions.
- In the context of progress made in
various KDD sub-fields such as data preprocessing,
classification, clustering, frequent pattern mining, outlier
detection, visualization and interpretation.
- In terms of impact on real end applications
such as biomedical informatics,
science and engineering informatics, and security. This
issue is particularly important since the field itself
is, in some senses, driven by applications.
- In relation to emerging technology trends, both
software and hardware, with an eye towards the future.
Related issues such as the importance of reproducibility of results and
availability of suitable benchmarks, education of data mining graduate
students and their readiness for an industrial setting are also fair game for
discussion.
Successes
Success is the result of perfection, hard work, learning
from failure, loyalty and persistence.
This aspect of the panel discussion will seek to examine the significant
foundational, practical, and organizational
success stories in the field over the last eighteen years.
A basic question here is: what have been
the significant foundational successes? The goal is to
reflect on algorithmic success stories,
which are ably backed by theoretical and empirical evidence.
For example, the entire sub-field of association rule mining came
to the fore about 15 years ago and has since spawned a prolific following
that has diversified into more generic frequent pattern (e.g. trees,
graphs)
mining algorithms.
Important advances have been made in areas such as
support vector machines, ensemble learning, scalable clustering, mining on
data streams, time-series analysis and mining with tensors.
Algorithms like PageRank
have simply revolutionized the way we think about web search.
All of these, and this is by no means a comprehensive list, correspond to
significant
foundational success stories.
A related question is:
where have data mining and knowledge discovery
algorithms played an important role in an applied setting?
Similarly, what approaches have resulted in important improvements
for business and financial operations, breakthrough scientific discoveries
and
general benefits to mankind? Examples abound and include the use of
data mining technologies to detect
fraudulent and money laundering activities, the
ability to process terabytes of image data from the Sky Survey database to
detect and characterize fuzzy elements in the sky, the ability to
detect early onset of diseases and predict their progression to enable
better disease-control measures, and
the ability to predict the efficacy and potential toxicity of
drugs to enable more informed drug design.
Another relevant question in this context is: how successful have we
been
in educating our graduate students and preparing them
for data mining careers in industry, government and academia?
Have we been more successful in preparing them for academia than for
industry,
and moreover is this really a trend specific to data mining? Similarly,
what
are the elements, in particular, that contributed to the success of
conferences in the field and
how can we ensure that the interest is sustained?
Mistakes and Failures
Our greatest glory is not in never failing, but in rising every time
we
fall -- Confucius
Over the last eighteen years, while there have clearly been successful
deployments
of knowledge discovery and data mining solutions, there have also
undoubtedly been mistakes
and failures. This aspect of the panel discussion will examine
the low-lights -- important mistakes and failures --
with the end goal of trying to learn from them.
A topic worth reflecting on, in this context, is whether there has been a
failure to progress in line with
expectations in various sub-fields of KDD.
For example, in a recent article,
Hand
suggests that recent progress in
classification technology is perhaps at best an illusion.
As a counter-point, Friedman suggests that
initial progress in any new field
is usually substantial (picking on the low hanging fruit) whereas
subsequent
progress, while very important,
inevitably happens at a much slower rate.
A related theme would be to examine the above question in the context
of end application drivers. For example, in business applications, the
profit
margin is often the bottom line. While a more accurate data mining
model may yield better profits, the cost of training personnel to adapt
and use the model may be quite high, suggesting a preference for
the simpler model in terms of overall profit margin. A similar set of
problems
holds in the biomedical context -- a practicing clinician often prefers
an easily interpretable model for disease diagnosis
over a slightly more accurate black box modeling tool. Are these
instances
examples of successes or failures? Similarly, a tool that outputs 100,000
patterns (potential hypotheses) that bear further experimental validation
is not very useful by itself to an experimental biologist, since the
experimental validation of
even 10 of those patterns may consume the better part of their career!
Another topic of interest here is to highlight
some of the classic mistakes
made in the field. Topics of interest here could range
from the use of non-representative training data to the
ignorance of population drift when modeling time-varying data,
from not accounting for errors in data or labels in the model
to an over-reliance on a single technique for the task at hand and
from asking the wrong question in the context of the application driver
to sampling without care. A related topic here might be to think about
the role of benchmark datasets and algorithms, and reflect on
the general importance and requirement for repeatable and reproducible
results.
Outlook
Each success only buys an admission ticket to a more difficult
problem. -- Henry Kissinger.
Early success in any new field typically points to the plucking of
low-hanging fruit. Following up on success, early or late,
is inevitably harder.
How then can we build on our successes of the last eighteen years as we
tackle new and exciting problems over the next eighteen?
What are the important problems brought to light as a result of our past
successes?
Failure is success if we learn from it -- Malcolm Forbes.
This section of the panel will also include a discussion on
insights gleaned from failures of the past.
In other words, what can we learn from our mistakes and failures?
Moving forward, what errors and mishaps can we avoid and how can
we ensure that such failures do not recur?
As part of this section of the discussion,
panelists will be asked to identify and
suggest topics or areas of research that in their
opinion are worth pursuing.
Similarly, they may also reflect
on topics they consider solved or best left alone, i.e., not that
important in their opinion. Panelists
will also be asked
to identify emerging technologies in hardware
(e.g. multicore processors) and software,
as well as emerging application areas
that they feel are particularly important and are likely to have an impact
on the field as we move forward.
Panel Format
The terms "success" and "failure" often convey fuzzy semantics
that are open to interpretation.
As part of the discussion, it is expected that panelists will offer
their thoughts on the aforementioned
questions while defining their interpretation
of these terms in the context of particular domains.
After an initial round of discussions by the panelists, the floor will
then be
opened to an interactive session with the audience.
Finally, panelists will be asked to conclude their
presentations with their outlook on how one can learn from the successes
and failures of the past eighteen years and what, in their opinion, are
the critical opportunities for the field in the future.
Panelist Biographies
In this section we include brief biographies of the distinguished
panelists.
Dr. Pavel Berkhin earned his Ph.D. in Mathematics from the Institute
of Mathematics, Novosibirsk, USSR, under the supervision of Professor
Sergey
Sobolev. He has worked on theoretical challenges facing search
technologies including link-based spam detection, personalization, and new
ways of computing PageRank.
He is currently a vice president of Data Mining and Research
with Yahoo!.
His group is involved in diversified efforts to utilize Yahoo! data, from
the development of the Yahoo! data mining platform to anomaly detection, from
behavioral
targeting to modeling for search advertisement, and from studies of user
adoption patterns to keyword set expansions.
Prior to Yahoo!, Dr. Berkhin was Chief Scientist with Accrue Software,
Inc., a web analysis company, and a Chief Scientist of Neo Vista, Inc., a
provider of
industrial data mining software.
Dr. John Elder obtained a BS and ME in Electrical Engineering from
Rice University, and a PhD in Systems Engineering from the University of
Virginia,
where he has recently been an adjunct professor, teaching Optimization. He
currently heads Elder Research Inc., a company that focuses on scientific
and
commercial applications of pattern discovery and optimization, including
stock selection, image recognition, medical text mining, biometric
identification,
drug efficacy, credit scoring, cross-selling, investment timing, and fraud
detection.
He has authored innovative data mining tools and is active in Statistics,
Engineering, and Finance conferences and boards. He is also a frequent
keynote
conference
speaker, and was a Program co-chair of the 2004 SIGKDD conference. He was
honored by being selected to serve for 5 years on a panel appointed by the
President
to guide the National Security Agency on technology.
Dr. Christos Faloutsos
received his BS from the National Technical University of Athens in
Electrical
Engineering and MS and PhD degrees in Computer Science from the University
of Toronto.
He is currently a Professor at Carnegie Mellon University.
He has received the Presidential Young Investigator Award from
the National Science Foundation (1989),
the Research Contributions Award at ICDM 2006,
ten "best paper" awards, and several teaching awards.
He has served as a member of the executive committee of SIGKDD;
he has published over 160 refereed articles, 11 book chapters
and one monograph. He holds five patents and
has given over 20 tutorials and 10 invited distinguished lectures.
His research interests include data mining
for streams and graphs, fractals, database performance,
and indexing for multimedia and bio-informatics data.
Dr. Jiawei Han
received his PhD from the University of Wisconsin in Computer
Science in 1985.
He is currently a professor in the
Department of Computer Science at the University of
Illinois at Urbana-Champaign. He has been working on research into data
mining, data warehousing, database systems, data mining from spatiotemporal
data, multimedia data, stream and RFID data, Web data, social network data,
and biological data, with over 300 journal and conference publications.
He has chaired or served on over 100 program committees of international
conferences and workshops, including as PC co-chair of the 2005 (IEEE) International
Conference on Data Mining (ICDM) and Americas Coordinator of the 2006 International
Conference on Very Large Data Bases (VLDB). He is also serving as the
founding Editor-In-Chief of ACM Transactions on Knowledge Discovery from
Data. He is an ACM Fellow and has received the 2004 ACM SIGKDD Innovations
Award and the 2005 IEEE Computer Society Technical Achievement Award. His book
"Data Mining: Concepts and Techniques" (2nd ed., Morgan Kaufmann, 2006) is
widely used as a textbook worldwide.
Dr. Haym Hirsh received his BS degree from the Mathematics and
Computer
Science Departments at UCLA and his MS and PhD from the Computer Science
Department at Stanford University. He is a Professor of Computer Science
at Rutgers University, and has also held visiting positions at Bar-Ilan
University, CMU, MIT, NYU, and the University of Zurich. He is currently
Director of the Division of Information and Intelligent Systems at the
U.S. National Science Foundation's Directorate for Computer and
Information Science and Engineering. Haym's research is on foundations
and applications of machine learning, data mining, and information
retrieval.
Dr. Srinivasan Parthasarathy earned his PhD from the University of Rochester in
Computer Science in 2000. He is currently an Associate Professor in the
Computer Science and Engineering Department at the Ohio State University (OSU). His research interests are in data mining, bioinformatics and high performance computing.
He is a recipient of an NSF CAREER award, a DOE Early Career Award, and an Ameritech Faculty Fellowship.
His papers have received several awards from leading conferences in the field,
including ones at the SIAM international conference on data mining (SDM), the IEEE
international conference on data mining (ICDM),
the Very Large Databases Conference (VLDB) and most recently
at ACM SIGKDD.
He is a member of the ACM and the IEEE and serves on the editorial boards
of IEEE Intelligent Systems and Data Mining and Knowledge Discovery: An International Journal. He also served as one
of the program chairs of SIAM Data Mining in 2007.
KDD-2007 CALL FOR PANEL PROPOSALS
The KDD-2007 organizing committee invites proposals for panels to be held at the conference. Panels provide a forum for discussing emerging topics and controversial issues. Panel proposals are expected to address new, exciting, and controversial issues. They should be provocative and informative. Of special interest are proposals that address emerging themes of research in KDD that are likely to have long-term relevance and impact. A mix of industry, government, and academic panel members is encouraged.
IMPORTANT DATES:
Panel proposals due: Feb 28, 2007
Notification of acceptance/rejection: April 16, 2007
PROPOSAL DETAILS:
Panel proposals should be no more than four pages in length and must include the following:
- Title of the panel
- The topic and issues to be discussed in the panel
- Name, affiliation, and contact information for the panel chair
- Names and affiliations of up to four panelists (in addition to the panel chair) who have made a commitment to participate
- Brief biography of each participant
Panel proposals should be sent by e-mail in PDF or ASCII format to the Panel Chair: Vipin Kumar (kumar[at]cs[dot]umn[dot]edu)
URL:
http://www.cs.umn.edu/~kumar