Tutorials

KDD-2007 ACCEPTED TUTORIALS

All tutorials will be held on Sunday, August 12th.

Mining Large Time-evolving Data Using Matrix and Tensor Tools

http://www.cs.cmu.edu/~christos/TALKS/KDD-07-tutorial/

INSTRUCTORS:

Christos Faloutsos, Carnegie Mellon University
Tamara G. Kolda, Sandia National Labs
Jimeng Sun, Carnegie Mellon University

SUMMARY:

How can we find patterns in sensor streams (e.g., a sequence of temperatures, water-pollutant measurements, or machine room measurements)? How can we mine Internet traffic graphs over time? Further, how can we make the process incremental? We review the state of the art in four related fields: (a) numerical analysis and linear algebra, (b) multi-linear/tensor analysis, (c) graph mining, and (d) stream mining. We will present theoretical results and algorithms as well as case studies on several real applications. Our emphasis is on the intuition behind each method and on guidelines for the practitioner.
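As a toy illustration of the matrix side of these tools (a sketch of our own, not material from the tutorial), the snippet below applies the SVD to a sensors-by-time matrix of synthetic, correlated streams; the singular-value spectrum reveals how many latent patterns drive the sensors:

```python
import numpy as np

# Hypothetical sensor streams: 4 sensors observed over 200 time ticks,
# all driven by one shared latent pattern plus noise.
rng = np.random.default_rng(0)
t = np.arange(200)
base = np.sin(2 * np.pi * t / 50)           # the shared latent pattern
X = np.outer([1.0, 0.8, -0.5, 0.3], base)   # each sensor scales the pattern
X += 0.05 * rng.standard_normal(X.shape)    # measurement noise

# SVD of the sensors-x-time matrix: the energy of each singular value
# shows how many components are needed to explain the streams.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
energy = s**2 / np.sum(s**2)
print(np.round(energy, 3))  # the first component dominates
```

Here a single component captures nearly all the variance, which is exactly the kind of low-rank structure the matrix and tensor tools in the tutorial exploit, and which incremental variants maintain as new time ticks arrive.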


A Statistical Framework for Mining Data Streams

INSTRUCTORS:

Tamraparni Dasu, AT&T Research
Simon Urbanek, AT&T Research

SUMMARY:

Data streams are a predominant form of information today, arising in areas and applications ranging from telecommunications, meteorology and rocketry to the monitoring and support of e-commerce sites. Data streams are characterized by large volumes, high rates of accumulation, and unknown, frequently changing distributions. They pose unique analytical and computing challenges that are just beginning to be addressed. In this tutorial, we give an introduction and overview of the analysis and monitoring of data streams in the context of a statistical framework. There is a wide variety of problems: data reduction, characterizing constantly changing distributions, detecting changes in these distributions, computing and updating models for evolving data streams, identifying outliers, tracking rare events, "correlating" multiple data streams, and others. We discuss techniques ranging from EDA (exploratory data analysis) and information visualization to density estimation and parametric and nonparametric approaches for the statistical estimation, inference and modeling of data streams.

We give an overview of existing literature and applications, highlighting opportunities for cross-pollination with statistics where appropriate. We make extensive use of examples and real-life applications to elucidate the material, using state-of-the-art information visualization techniques. In particular, we discuss major applications that we have worked on as running examples throughout the tutorial. We conclude with a discussion of open research problems in this dynamic area.
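As a concrete taste of the streaming flavor of these problems (our own sketch, not the instructors' code), the snippet below maintains a running mean and variance in a single pass with Welford's algorithm and flags a value that deviates sharply from the distribution seen so far; the stream values are made up:

```python
import math

class StreamingStats:
    """Welford's online algorithm: running mean and variance in one pass,
    suitable for data that arrives as a stream and cannot be stored."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Flag a change when a new value is many standard deviations from the
# running mean (a crude change/outlier detector for illustration only).
stats = StreamingStats()
stream = [10.1, 9.9, 10.0, 10.2, 9.8, 25.0]  # last value jumps
flags = []
for x in stream:
    if stats.n > 3:
        z = (x - stats.mean) / math.sqrt(stats.variance or 1.0)
        flags.append(abs(z) > 4)
    else:
        flags.append(False)  # too few points to judge
    stats.update(x)
print(flags)  # only the final jump is flagged
```

Real stream monitors must also handle drifting (not just jumping) distributions, which is where the density-estimation and windowing techniques surveyed in the tutorial come in.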


Statistical Modeling of Relational Data

INSTRUCTOR:

Pedro Domingos, University of Washington

SUMMARY:

KDD has traditionally been concerned with mining data from a single relation. However, most applications involve multiple interacting relations, either explicitly (in relational databases) or implicitly (in semi-structured and multimodal data). Examples include link analysis, social networks, bioinformatics, information extraction, security, ubiquitous computing, etc. Mining such data has become a topic of keen interest in the KDD community in recent years. The key difficulty is that data in relational domains is no longer i.i.d. (independent and identically distributed), greatly complicating statistical modeling. However, research has now advanced to the point where robust, easy-to-use, general-purpose techniques and languages for mining non-i.i.d. data are available. The goal of this tutorial is to add a sufficient subset of these concepts and techniques to the toolkits of both researchers and practitioners.

The tutorial will be composed of three parts:

  1. FOUNDATIONAL AREAS. The first part will consist of a brief introduction to each of the four foundational areas of relational data mining: logical inference, inductive logic programming, probabilistic inference, and statistical learning. Obviously, in the short time available no attempt will be made to comprehensively survey these areas; rather, the focus will be on providing the key concepts and techniques required for the subsequent parts. For example, the logical inference part will focus on the basics of satisfiability testing, and the probabilistic/statistical parts on Markov networks.
  2. PUTTING THE PIECES TOGETHER. The second part will introduce the key ideas in the statistical modeling of relational data and survey major approaches, using Markov logic as the unifying framework. It will present state-of-the-art algorithms for learning and inference in non-i.i.d. data, and give an overview of the Alchemy open-source software.
  3. APPLICATIONS. The third and final part will describe how to efficiently develop state-of-the-art non-i.i.d. applications in various areas, including: hypertext classification, link-based information retrieval, information extraction and integration, social network modeling, computational biology, ubiquitous computing, etc. This part will also include practical tips on using statistical relational techniques, Markov logic and Alchemy---the kind of information that is seldom found in research papers, but is key to developing successful applications.
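To make the Markov logic framing of part 2 concrete, here is a deliberately tiny sketch (not Alchemy itself; the predicates and weight are invented) that scores each possible world over two ground atoms in proportion to exp(weight × number of satisfied ground formulas), which is the core idea behind Markov logic networks:

```python
import itertools
import math

# One weighted rule over two ground atoms, Smokes(A) and Cancer(A):
#   Smokes(A) => Cancer(A), with weight 1.5 (all values made up).
w = 1.5

def rule_satisfied(smokes, cancer):
    # Material implication: violated only when smokes and not cancer.
    return (not smokes) or cancer

# P(world) is proportional to exp(w * #satisfied ground formulas).
worlds = list(itertools.product([False, True], repeat=2))
scores = {wld: math.exp(w * rule_satisfied(*wld)) for wld in worlds}
z = sum(scores.values())                      # partition function
probs = {wld: s / z for wld, s in scores.items()}

# The single world violating the rule is the least likely, but still
# has nonzero probability -- soft constraints, not hard logic.
print(min(probs, key=probs.get))  # → (True, False)
```

With thousands of ground atoms this brute-force enumeration is hopeless, which is why the tutorial's learning and inference algorithms (and Alchemy's implementations of them) matter.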


Text Mining and Link Analysis for Web and Semantic Web

INSTRUCTORS:

Marko Grobelnik, J. Stefan Institute
Dunja Mladenic, J. Stefan Institute

SUMMARY:

The tutorial on Text Mining and Link Analysis for Web Data will focus on two main approaches to analyzing web data: text mining and link analysis, for the purpose of analyzing web documents and their linkage. First, the tutorial will cover basic steps and problems in dealing with textual and network (graph) data, showing what can be achieved without very sophisticated technology. The idea of this first part is to present the nature of unstructured and semi-structured data. The second part will show more sophisticated methods for solving more difficult and challenging problems. The last part will present some current open research issues and provide practical pointers to available tools for solving the aforementioned problems.
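On the link-analysis side, a classic example of what document linkage alone can reveal is PageRank. The sketch below is a minimal power-iteration implementation on a made-up four-page graph (our illustration, not the instructors' code):

```python
# Toy web graph: page -> pages it links to (all names made up).
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],  # "d" links out but nothing links to it
}

def pagerank(links, damping=0.85, iters=50):
    """Power iteration for PageRank on a dict-of-lists graph."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Each page keeps a (1 - damping) teleport share...
        new = {p: (1 - damping) / n for p in pages}
        # ...and distributes the rest of its rank over its out-links.
        for p, outs in links.items():
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank

rank = pagerank(links)
# "c" collects links from a, b, and d, so it ranks highest.
print(max(rank, key=rank.get))  # → c
```

Combining such link scores with text-mining features of the documents themselves is exactly the interplay this tutorial explores.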


Learning Bayesian Networks

INSTRUCTOR:

Richard E. Neapolitan, Northeastern Illinois University

SUMMARY:

Bayesian networks are graphical structures for representing the probabilistic relationships among a large number of variables and doing probabilistic inference with those variables. The 1990s saw the emergence of excellent algorithms for learning Bayesian networks from passive data. In 2004 I unified this research with my text: Learning Bayesian Networks, Prentice Hall, Upper Saddle River, NJ, 2004.

This tutorial is based on that text and my paper: Neapolitan, R.E., and X. Jiang, "A Tutorial on Learning Causal Influences," in Holmes, D. and L. Jain (Eds.): Innovations in Machine Learning, Springer-Verlag, New York, 2005.

I will discuss the constraint-based learning method using an intuitive approach that concentrates on causal learning. Then I will discuss the Bayesian approach with some simple examples. I will show how, using the Bayesian approach, we can even learn something about causal influences from passive data on two variables. Finally, I will show some applications to finance and marketing. These applications are from my new book (with Xia Jiang): Probabilistic Methods for Financial and Marketing Informatics, Morgan Kaufmann, San Mateo, CA, 2007.
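As a minimal taste of probabilistic inference in a Bayesian network (our illustration with made-up probabilities, not an example from the text), the two-node network below answers a diagnostic query with Bayes' rule:

```python
# Tiny two-variable Bayesian network: Rain -> WetGrass.
# All conditional probabilities here are invented for illustration.
p_rain = 0.2
p_wet_given_rain = 0.9
p_wet_given_dry = 0.1

# Diagnostic query: P(Rain | WetGrass) via Bayes' rule.
# First marginalize to get the evidence probability P(WetGrass)...
p_wet = p_wet_given_rain * p_rain + p_wet_given_dry * (1 - p_rain)
# ...then invert the conditional.
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(round(p_rain_given_wet, 3))  # 0.18 / 0.26 ≈ 0.692
```

Learning a network means inferring both these conditional probabilities and the arc structure itself from data, which is what the constraint-based and Bayesian methods in the tutorial address.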


From Trees to Forests and Rule Sets -- A Unified Overview of Ensemble Methods

INSTRUCTORS:

Giovanni Seni, Santa Clara University, PDF Solutions
John Elder, Elder Research, Inc.

SUMMARY:

Ensemble methods are one of the most influential developments in Machine Learning over the past decade. They perform extremely well in a variety of problem domains, have desirable statistical properties, and scale well computationally. By combining competing models into a committee, they can strengthen "weak" learning procedures.

This tutorial explains two recent developments with ensemble methods: Importance Sampling and Rule Ensembles. Importance Sampling reveals "classic" ensemble methods -- bagging, random forests, and boosting -- to be special cases of a single algorithm. This unified view clarifies the properties of these methods and suggests ways to improve their accuracy and speed. Rule Ensembles are linear rule models derived from decision tree ensembles. While maintaining (and often improving) the accuracy of the tree ensemble, the rule-based model is much more interpretable.
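For readers new to ensembles, the sketch below illustrates the committee idea with the simplest of the "classic" methods, bagging: decision stumps trained on bootstrap resamples of a toy 1-D dataset vote on the label (our illustration; the data and stump learner are made up):

```python
import random

# Toy 1-D classification data: the true label is 1 exactly when x > 5.
random.seed(0)
data = [(x, 1 if x > 5 else 0) for x in range(11)]

def train_stump(sample):
    """Pick the integer threshold that best separates the sample."""
    best_t, best_err = 0, len(sample) + 1
    for t in range(11):
        err = sum((1 if x > t else 0) != y for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Bagging: train each stump on a bootstrap resample of the data.
stumps = []
for _ in range(25):
    sample = [random.choice(data) for _ in data]
    stumps.append(train_stump(sample))

def predict(x):
    # The committee predicts by majority vote over the stumps.
    votes = sum(1 if x > t else 0 for t in stumps)
    return 1 if votes > len(stumps) / 2 else 0

print([predict(x) for x in [2, 8]])
```

The Importance Sampling view presented in the tutorial recasts this resample-then-vote recipe, along with random forests and boosting, as instances of one sampling scheme.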

This tutorial is aimed at both novice and advanced data mining researchers and practitioners -- especially in Engineering, Statistics, and Computer Science. Users with little exposure to ensemble methods will gain a clear overview of each method. Advanced practitioners already employing ensembles will gain insight into this breakthrough way to create next-generation models.


Mining Shape and Time Series Databases with Symbolic Representations

INSTRUCTOR:

Eamonn Keogh, University of California, Riverside

SUMMARY:

Time series, shape and, more generally, multimedia data are ubiquitous; large volumes of such data are routinely created in scientific, industrial, entertainment, medical and biological domains. Examples include anthropological imagery, gene expression data, medical images, electrocardiograms, gait analysis, stock market quotes, space telemetry, military intelligence, zoology, etc. To efficiently and accurately mine such data we must carefully choose algorithms and data representations. While most representations used in the past have been real-valued (e.g., wavelets and Fourier methods), in this tutorial I advocate using discrete (symbolic) representations of the data. Symbolic representations allow us to take advantage of very useful algorithms and data structures which are not available for real-valued data, for example suffix trees, hashing and Markov models.

The tutorial will be illustrated with numerous real world examples created just for this tutorial, including examples from archeology (petroglyphs and projectile points), microscopy (nematodes and blood cells), historical manuscripts, zoology, motion capture and biometrics. The data mining tasks considered include indexing, classification, clustering, novelty discovery, motif discovery and visualization.
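As a flavor of the symbolic approach (a minimal sketch in the spirit of SAX-style discretization, not code from the tutorial), the snippet below z-normalizes a series, averages it over equal-length segments, and maps each average to a letter using breakpoints chosen so the letters are roughly equiprobable under a standard normal:

```python
import numpy as np

def symbolize(series, segments=4, alphabet="abcd"):
    """Discretize a numeric series into a short symbolic word:
    z-normalize, piecewise-aggregate into equal segments, then
    map each segment mean to a letter."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()
    paa = x.reshape(segments, -1).mean(axis=1)
    # Approximate breakpoints splitting N(0,1) into 4 equal regions.
    breakpoints = [-0.67, 0.0, 0.67]
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

# A rising ramp maps to the lowest letter first and the highest last.
word = symbolize(list(range(16)))
print(word)  # → "abcd"
```

Once the data is a string, string machinery such as suffix trees, hashing and Markov models -- the structures named above -- applies directly.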

Dr. Keogh's research interests are in Data Mining, Machine Learning and Information Retrieval. He has published more than 90 papers, including ten papers in SIGKDD, ten papers in IEEE ICDM, and papers in SIGIR, SIGMOD, SIGGRAPH, VLDB and ICML. Several of his papers have won "best paper" awards. He is the recipient of a 5-year NSF CAREER Award for "Efficient Discovery of Previously Unknown Patterns and Relationships in Massive Time Series Databases". Dr. Keogh has given well-received tutorials on time series, machine learning and data mining all over the world, and his papers have been referenced well over 2,000 times.


An archive of the KDD 2007 Call for Tutorial Proposals can be found here.
