The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

KDD 2002 - Tutorials

Multivariate Density Estimation and Visual Clustering
David W. Scott, Rice University
Text Mining for Bioinformatics
Hinrich Schuetze, Novation Biosciences
Russ Altman, Stanford University Medical Center
Link Analysis : Current State of the Art
Ronen Feldman, ClearForest Corporation NY
Common Reasons Data Mining Projects Fail
Monte F. Hancock, CSI Corporation
Querying and Mining Data Streams: you only get one look
Rajeev Rastogi, Minos Garofalakis, Bell Labs
Johannes Gehrke, Cornell University
Visual Data Mining: Background, Techniques, and Drug Discovery Applications
Georges Grinstein, University of Massachusetts Lowell
Mihael Ankerst, Boeing
Daniel A. Keim, AT&T Research & University of Konstanz

Multivariate Density Estimation and Visual Clustering

David W. Scott, Rice University

Tutorial Abstract
Density estimation has long been recognized as an important tool when used with univariate and bivariate data. But the computer revolution of recent years has provided access to data of unprecedented complexity in ever-growing volume. New tools are required to detect and summarize the multivariate structure of these difficult data. This tutorial is derived from the tutorial leader's 1992 text "Multivariate Density Estimation: Theory, Practice, and Visualization." We demonstrate that density estimation retains its explicative power even when applied to trivariate and quadrivariate data and beyond.

By presenting the major ideas in the context of the classical histogram, we quickly gain an understanding of advanced estimators. We develop links between the intuitive histogram and other methods that are more statistically efficient. The theoretical results outlined are those particularly relevant to application and understanding. The focus is on methodology, new ideas, and practical advice. Detailed discussions of nonparametric dimension reduction, nonparametric regression, and classification are also included. Because visualization is a key element in effective multivariate nonparametric analysis, we begin with that topic. Density estimation is both an exploratory tool and a confirmatory methodology. One of the most important and difficult tasks in data mining is clustering. We describe how density estimation can help.
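As a rough illustration of the classical histogram viewed as a density estimator, the following Python sketch bins one-dimensional data and rescales the counts so that the bars integrate to one. It is not taken from the tutorial materials. The function name `histogram_density` is hypothetical, and the default bin width uses Scott's normal reference rule, h = 3.49 * s * n^(-1/3), which is only one reasonable choice.

```python
import math
import random
import statistics


def histogram_density(data, bin_width=None):
    """Return (origin, bin_width, densities) for a 1-D histogram density estimate."""
    n = len(data)
    if bin_width is None:
        # Scott's normal reference rule (an assumption, not the only option):
        # h = 3.49 * sample standard deviation * n^(-1/3)
        bin_width = 3.49 * statistics.stdev(data) * n ** (-1 / 3)
    if bin_width <= 0:
        raise ValueError("bin width must be positive")
    origin = min(data)
    n_bins = max(1, math.ceil((max(data) - origin) / bin_width))
    counts = [0] * n_bins
    for x in data:
        # Clamp so that the maximum value falls in the last bin.
        k = min(int((x - origin) / bin_width), n_bins - 1)
        counts[k] += 1
    # Dividing by n * h turns counts into a density that integrates to 1.
    densities = [c / (n * bin_width) for c in counts]
    return origin, bin_width, densities


if __name__ == "__main__":
    random.seed(0)
    sample = [random.gauss(0.0, 1.0) for _ in range(500)]
    origin, h, dens = histogram_density(sample)
    print(f"bin width {h:.3f}, {len(dens)} bins, total mass {sum(dens) * h:.3f}")
```

The same rescaling idea carries over to more statistically efficient estimators, which replace the fixed bins with smoother local weights.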

Intended Audience
The intended audience includes anyone with an interest in data understanding. Any technical background will suffice. No advanced statistical training will be assumed or required. Individuals with no statistical training have enjoyed this course.

Lecturer's Biography
David Scott is the Noah Harding Professor of Statistics at Rice University in Houston. His research interests include multivariate data analysis, nonparametric function estimation, clustering, data mining, robust estimation, outlier detection, and statistical computing. He is author of the 1992 Wiley book "Multivariate Density Estimation: Theory, Practice, and Visualization" as well as numerous scientific papers.

He is currently editor of the Journal of Computational and Graphical Statistics. He is a member of the National Research Council's Committee on Applied and Theoretical Statistics, which is organizing a workshop on massive data sets and real-time data mining. He is also a member of the John Wiley Editorial Board on Probability and Statistics.

He is a Fellow of the American Statistical Association, the Institute of Mathematical Statistics, and the American Association for the Advancement of Science. He is also an elected member of the International Statistical Institute, and was named the Texas Statistician of the Year in 1993.

Text Mining for Bioinformatics

Hinrich Schuetze, Novation Biosciences
Russ Altman, Stanford University Medical Center

Tutorial Abstract
Our goal is to make this tutorial a practical guide to using text mining in bioinformatics while at the same time highlighting some of the interesting research issues that arise when mining techniques are applied in bioinformatics. Participants who work in bioinformatics (drug discovery, pharmaceutical companies, etc.) will be able to broaden the set of tools they are comfortable with. Data miners currently working on non-biological problems will learn about one of the most exciting areas of application of data discovery and analysis techniques. Previous exposure to biology will be helpful, but the tutorial will be accessible to those who have no biology background. We will assume familiarity with basic statistical and probabilistic concepts. This tutorial is joint work with Russ B. Altman, MD, Associate Professor in the Medical Informatics Group at the Stanford University Medical Center.

Lecturer's Biography
After receiving a Ph.D. in Natural Language Processing from Stanford University in 1995, Hinrich Schütze joined the Xerox Palo Alto Research Center, where he developed a scalable approach to semantic analysis of natural language based on mining of association data. He then co-founded Outride, a search personalization company, and led the development of personalization software that learns user preferences from surfing behavior. He is author of the best-selling textbook on data-driven natural language processing (with Chris Manning, MIT Press) and of a dozen issued and pending patents. Dr. Schütze is currently CTO of Novation Biosciences, a bioinformatics company focussed on text and data mining of biological data. He is also Consulting Faculty at Stanford.

Link Analysis : Current State of the Art

Ronen Feldman, ClearForest Corporation NY

The information age has made it easy to store large amounts of data. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, while the amount of data available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few keystrokes.

Link Analysis is a new and exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, information extraction, text categorization, visualization, and knowledge management. Link Analysis is the process of building up networks of interconnected objects through various relationships in order to discover patterns and trends. The main tasks of link analysis are to extract, discover, and link together sparse evidence from a vast number of data sources, to represent and evaluate the significance of the related evidence, and to learn patterns to guide the extraction, discovery, and linkage of entities. The relationships could be transactional, geographical, social, or temporal.

Link Analysis involves the preprocessing of document collections (text categorization, term extraction, and information extraction), integration with structured information sources, the storage of the intermediate representations, the techniques to analyze these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and visualization of the results.

In this tutorial we will present the general theory of Link Analysis and will demonstrate several systems that use these principles to enable interactive exploration of a combination of structured and unstructured collections. We will present a general architecture of link analysis systems and will outline the algorithms and data structures behind the systems. The tutorial will cover the state of the art in this rapidly growing area of research. Several real-world applications of link analysis will be presented.
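
To make the idea of "building up networks of interconnected objects" concrete, here is a minimal Python sketch, not drawn from the tutorial or from any ClearForest system. It assumes that entity extraction has already been done, so each document is represented simply as a list of entity names, and it links two entities whenever they appear in the same document. The function name `cooccurrence_links` and the sample entities are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_links(docs_entities):
    """Count, for each unordered entity pair, the documents mentioning both."""
    links = Counter()
    for entities in docs_entities:
        # De-duplicate within a document and sort so each pair has one key.
        for a, b in combinations(sorted(set(entities)), 2):
            links[(a, b)] += 1
    return links


if __name__ == "__main__":
    docs = [
        ["Acme Corp", "J. Smith"],
        ["J. Smith", "Acme Corp", "Globex"],
        ["Globex"],
    ]
    for (a, b), weight in cooccurrence_links(docs).most_common():
        print(f"{a} -- {b}: {weight}")
```

The resulting weighted pairs form the edges of a simple link network. A real system would add typed relationships (transactional, geographical, social, temporal) and some measure of significance for each link.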

Intended Audience
The tutorial should be of interest to practitioners from Data Mining, Bioinformatics, NLP, IR, Knowledge Management and the general AI audience interested in this fast-growing research area.

Lecturer's Biography
Ronen Feldman is a senior lecturer at the Mathematics and Computer Science Department of Bar-Ilan University in Israel, and the Director of the Data Mining Laboratory. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University, his M.Sc. in Computer Science from Bar-Ilan University, and his Ph.D. in Computer Science from Cornell University in NY. He is the founder and president of ClearForest Corporation, a NY-based company specializing in the development of text mining tools and applications. He is also an Adjunct Professor at NYU Stern Business School.

Common Reasons Data Mining Projects Fail

Monte F. Hancock, CSI Corporation

OUTLINE and AUDIENCE:
I intend to present the material by walking through the steps of a data mining project (e.g., CRISP-DM), describing the errors that are commonly made at each step and how to avoid them. Practical and interesting examples will be drawn from the presenter's extensive real-world data mining experience in government and industry.

The 11th chapter of my book, Data Mining Explained, generated more interest than any other. It is titled "Common Reasons Data Mining Projects Fail". More and more people are gaining practical exposure to data mining, but much of this exposure is negative because of errors that a little knowledge can readily overcome. In this tutorial, I will be expanding the list of error sources given in the book.

This tutorial is intended for both technical and non-technical audiences. It will be of particular interest to those who have some practical data mining experience, but have enjoyed little success.

Attendees with some practical data mining experience (but not yet "experts") will benefit most. Enterprise decision makers should also find the presentation accessible and interesting.

Lecturer's Biography
Chief Scientist, CSI Corporation
1235 Evans Road, Melbourne, Florida 32904

Monte F. Hancock is a member of the Program Committee for the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, to be held July 23-26th, 2002, in Edmonton, Alberta, Canada.

Education:
1976: B.A., Pure Mathematics, Rice University
1977: M.S., Pure Mathematics, Syracuse University

Academic Career:
1985-present: Rollins College, Winter Park, FL (Brevard Campus)
Adjunct Faculty, Mathematics and Computer Science
(1999 recipient of the Christa McAuliffe Teaching Award)
2001-present: Webster University, St. Louis, MO (Merritt Island Campus)
Graduate Faculty, Computer Science
1997-2001: University of Florida, Gainesville, FL
Distance Learning Faculty, Mathematics
1981-82: Pennsylvania State University, University Park, PA
Adjunct Faculty, Computer Science

Professional:
1987-present: Chief Scientist, CSI Inc.

Primary research interest is in artificial intelligence applications for data mining (pattern recognition, predictive modeling). Principal Investigator / Technical Lead on many data mining efforts for both government and industry (target detection, signal identification, medical image processing, CRM [fraud detection, attrition prediction], insurance liability estimation, resource management, process optimization, etc.)

Querying and Mining Data Streams: you only get one look

Rajeev Rastogi, Minos Garofalakis, Bell Labs
Johannes Gehrke, Cornell University

OUTLINE and AUDIENCE:
Continuous data streams arise naturally, for example, in the network installations of large Telecom and Internet service providers where detailed usage information from different parts of the network needs to be continuously collected and analyzed for interesting trends. This tutorial will provide a comprehensive and clear overview of the key research results surrounding data stream processing at this point in time. Our discussion will be structured as follows.

Introduction: Basic stream-processing models and architectures; motivating applications.
Basic Stream Summarization Algorithms: Samples, quantiles/histograms, sketches, wavelets over streaming data.
Processing Queries on Streams: Using sketches for self-joins, binary joins, and complex joins over data streams; estimating correlated aggregates; using histogram and wavelet synopses for approximate-query processing.
Mining High-speed Data Streams: Single-pass algorithms for rule discovery, clustering, and decision-tree construction over streams.
Advanced Topics and Future Research Directions: Hot-list maintenance; distinct-value estimation; multi-dimensional synopses; content-based filtering of streaming XML documents.
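
As one small illustration of the "you only get one look" constraint behind the sample-based summaries in the outline above, the following Python sketch keeps a uniform random sample of fixed size from a stream of unknown length in a single pass. The technique shown is classic reservoir sampling, chosen here as an example and not taken from the tutorial slides. The function name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from a one-pass stream."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Each element is inspected once and never stored unless it is sampled.
    sample = reservoir_sample(iter(range(1_000_000)), 10, random.Random(42))
    print(sample)
```

Memory use depends only on k, not on the stream length, which is the property the other synopsis structures in the outline (histograms, sketches, wavelets) also aim for.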

This tutorial is targeted at researchers and practitioners who want to obtain a solid understanding of the state-of-the-art in stream query processing and analysis.

Lecturers' Biography

Minos Garofalakis (PhD 1998, UW-Madison) is a Member of Technical Staff at Bell Labs. His research interests include data reduction and mining, data streaming, approximate queries, and XML.
Johannes Gehrke (PhD 1999, UW-Madison) is an Assistant Professor at Cornell University. His research interests include data mining, database systems, and ubiquitous computing.
Rajeev Rastogi (PhD 1993, UT-Austin) is a Department Director at Bell Labs. His research interests include network management, database systems, and knowledge discovery.

Visual Data Mining: Background, Techniques, and Drug Discovery Applications

Georges Grinstein, University of Massachusetts Lowell
Mihael Ankerst, Boeing
Daniel A. Keim, AT&T Research & University of Konstanz

Tutorial Abstract
The areas of data mining and information visualization offer various techniques which effectively complement one another supporting the discovery of patterns in data. Whereas traditional (algorithmic) techniques attempt to analyze data automatically, information visualization techniques leverage data mining from an orthogonal direction by providing a platform for acquiring insight not only into the data but into the algorithms as well. The visualizations help generate hypotheses harnessing human capabilities including domain knowledge, perception, and creativity. To successfully apply data mining algorithms, visualization and interaction are crucial since they enable the user to steer the data mining process, incorporate domain knowledge, and understand the results.

The tutorial presents the state-of-the-art in visual data mining, covering research prototypes as well as commercial systems. The first part provides the necessary background on information visualization techniques. The second part focuses on approaches specifically designed for visual data mining. The final part of the tutorial covers visual data mining in the drug discovery domain.

Lecturers' Biography

Georges Grinstein is a full-time Professor of Computer Science at the University of Massachusetts Lowell, Director of its Institute for Visualization and Perception Research and of its Center for Bioinformatics and Computational Biology, and a co-founder and Director of Research at AnVil, Inc. He received his B.S. from City College of NY, his M.S. from New York University, and his Ph.D. in Mathematics from the University of Rochester. Georges Grinstein is a member of IEEE, ACM, AAAI, and Eurographics, and has served on the editorial boards of Computers and Graphics, Computer Graphics Forum, and Knowledge Discovery in Databases and Data Mining. He was co-chair of numerous conferences including several of the IEEE Visualization Conferences and SPIE Visual Data Exploration and Analysis Conferences. He was co-chair for the 1993 and 1995 Workshops on Database Issues for Data Visualization, and for the 1997 AAAI and IEEE Workshops on the Integration of Visualization and Data Mining. He has participated in numerous panels, presentations, and seminars in the area of data visualization and exploration and is co-author of the new book on Information Visualization in Data Mining and Knowledge Discovery.
Mihael Ankerst is an Advanced Computing Technologist at The Boeing Company. His research interests include data mining, information visualization and database systems. He leads the design and development of the PBC system, which tightly integrates data mining algorithms with visualization capabilities. He has published research papers at several conferences, including KDD 2000, KDD '99, SIGMOD '99, InfoVis '98, Visualization '96, and Visualization '95, and in journals, including TKDE '98 and Informatica '99. He has served on program committees or as external referee for KDD '99, 2001 and 2002, InfoVis '99, The Computer Journal 2000 and VLDB 2001. He gave invited talks on human involvement in the KDD process at the University of Alberta 2002, LANL 2001, SolEuNet Workshop 2000, AWAMIDA '99, Simon Fraser University '99, and AT&T '99, and co-presented the tutorial "Visual Data Mining and Exploration of Large Databases" at PKDD '01. He received his Ph.D. in 2000 from the University of Munich, Germany, and wrote his Ph.D. thesis on Visual Data Mining.
Daniel A. Keim is working in the area of information visualization and data mining. In the field of information visualization, he developed several novel techniques which use visualization technology for the purpose of exploring large databases. He has published extensively on information visualization and data mining; he has given tutorials on related issues at several large conferences including Visualization, SIGMOD, VLDB, and KDD; he has been program co-chair of the IEEE Information Visualization Symposia in 1999 and 2000; he is program co-chair of the ACM SIGKDD conference in 2002; and he is an editor of TVCG and the Information Visualization Journal. Daniel Keim received his Ph.D. in Computer Science from the University of Munich in 1994. He has been assistant professor at the CS department of the University of Munich, associate professor at the CS department of the Martin-Luther-University Halle, and full professor at the CS department of the University of Constance. Currently, he is working at AT&T Shannon Research Labs, Florham Park, NJ, USA.

Last modified: April 12th, 2002 by the KDD-2002 Webmaster (zaiane@cs.ualberta.ca)