KDD 2011: Tutorials

Tutorials

Note that this year all tutorials are invited -- we will not be soliciting tutorial proposals. The schedule is as follows:

Sunday Morning

Data Mining Problems in Internet Ad Systems
(Muthu Muthukrishnan, Rutgers)

We are inspired by systems that have emerged in the past decade that enable advertisements (ads) on the Internet. Such Internet ad systems handle billions of transactions every day involving millions of users, websites and advertisers, and are the basis of an industry worth billions of dollars. They crucially rely on real-time collection, management and analysis of data for their effectiveness. Further, they present unusual challenges for data analysis: nearly all parties in Internet ad systems, from marketeers to publishers, use active, selfish strategies that both generate new data and distort the data produced. Mining such data while cognizant of the inherent game theory is a great research challenge. The tutorial will provide an overview of Internet ad systems and discuss such challenges, with special emphasis on Ad Exchanges.

S. (Muthu) Muthukrishnan is a Professor at Rutgers University with research interests in databases and algorithms, most recently in data stream management and algorithms for Internet ad systems. This tutorial is based on work with the DoubleClick Ad Exchange and Market Algorithms at Google.

Social Media Analytics
(Jure Leskovec, Stanford)

Tutorial Website

Online social media represent a fundamental shift in how information is produced, transferred and consumed. This tutorial investigates techniques for social media modeling, analytics and optimization. First we present methods for collecting large-scale social media data and then discuss techniques for coping with and correcting for the effects arising from missing and incomplete data. We proceed by discussing methods for extracting and tracking information as it spreads among users. Then we examine methods for extracting the temporal patterns by which information popularity grows and fades over time. We show how to quantify and maximize the influence of media outlets on the popularity of and attention given to a particular piece of content, and how to build predictive models of information diffusion and adoption. As information often spreads through implicit social and information networks, we present methods for inferring networks of influence and diffusion. Last, we discuss methods for tracking the flow of sentiment through networks and the emergence of polarization.

Jure Leskovec is an assistant professor of Computer Science at Stanford University. His research focuses on the analysis and modeling of large real-world social and information networks as the study of phenomena across the social, technological, and natural worlds. Problems he investigates are motivated by large scale data, the Web and Social Media. Jure received his PhD in Machine Learning from Carnegie Mellon University in 2008 and spent a year at Cornell University. His work received six best paper awards, won the ACM KDD cup and topped the Battle of the Sensor Networks competition.

Modeling with Hadoop
(Vijay Narayanan, Y!; Milind Bhandarkar, LinkedIn)

Apache Hadoop has become the platform of choice for developing large-scale data-intensive applications. In this tutorial, we will discuss the design philosophy and architecture of Hadoop, describe how to design and develop Hadoop applications and higher-level application frameworks to crunch several terabytes of data, and describe some uses of Hadoop for practical data mining and modeling applications. We will describe how to run some common data mining algorithms on Hadoop, provide examples of large-scale model training and scoring systems in the Internet domain, and demonstrate a simple modeling task end to end.
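As a rough illustration of the MapReduce programming model that Hadoop applications are typically built on, the sketch below counts words in plain Python. It does not use the Hadoop API and is not taken from the tutorial; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the in-memory shuffle stands in for what a cluster would do across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

Because each map call sees one document and each reduce call sees one key, both phases can in principle run in parallel on separate machines, which is the property such frameworks exploit.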

Dr. Vijay K Narayanan is a Principal Scientist at Yahoo Labs, where he works on computational advertising. He has been working on large-scale user modeling applications on Hadoop. He received a B.Tech degree from IIT, Chennai and a Ph.D. in Astrophysics from The Ohio State University. He has authored or co-authored about 55 peer-reviewed papers in astrophysics, and a few papers on large-scale query categorization and advertising campaign recommendations.

Dr. Milind Bhandarkar was the founding member of the team at Yahoo! that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system, and has been contributing to and working with Hadoop since version 0.1.0. He started the Yahoo! Grid solutions team focused on training, consulting, and supporting hundreds of new migrants to Hadoop. Parallel programming languages and paradigms have been his area of focus for over 20 years. He has worked at the Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), and Yahoo!. Currently, he works on distributed data systems at LinkedIn Corp.

Sunday Afternoon

Finding Bias, and Making Do With Data That Have It
(Diane Lambert, Google)

Much of statistics and machine learning relies on random sampling and designed experiments, but sometimes the only data that we can have were obtained by other means. The data may be just whatever is available in transaction logs. Or, an advertiser may use data on everyone exposed to an ad campaign and a random sample of people who were not exposed to a campaign ad to measure ad effectiveness. Such data may give badly biased estimates. However, even perfect random sampling can give flawed estimates. The time spent in hospital for a random sample taken from patients in a hospital on a given day will overestimate duration of hospital stay, even though the patients are randomly chosen. This tutorial will describe some sources of selection bias, some ways to detect when it is so overwhelming that no valid estimate is possible, and some strategies that can sometimes be used to dilute the influence of selection bias on estimates.
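The hospital example above can be checked with a small simulation. This is a minimal sketch, not material from the tutorial: it assumes admissions spread uniformly over a fixed horizon and exponentially distributed stays with a mean of 5 days, and the function name and all parameter values are hypothetical choices made only for illustration.

```python
import random


def simulate_length_bias(num_patients=200_000, horizon_days=1000.0,
                         census_day=500.0, mean_stay=5.0, seed=0):
    """Compare the true mean stay with the mean among patients present on one day."""
    rng = random.Random(seed)
    stays = []
    for _ in range(num_patients):
        admit = rng.uniform(0.0, horizon_days)      # assumed: uniform admissions
        length = rng.expovariate(1.0 / mean_stay)   # assumed: exponential stays
        stays.append((admit, length))

    # Mean over every patient ever admitted: the quantity we want to estimate.
    true_mean = sum(length for _, length in stays) / len(stays)

    # Patients who are in the hospital on the census day. A long stay covers
    # more days, so it is more likely to include the census day.
    present = [length for admit, length in stays
               if admit <= census_day < admit + length]
    census_mean = sum(present) / len(present)
    return true_mean, census_mean, len(present)


if __name__ == "__main__":
    true_mean, census_mean, count = simulate_length_bias()
    print(f"mean stay over all patients:      {true_mean:.2f} days")
    print(f"mean stay among {count} present:  {census_mean:.2f} days")
```

Under these assumptions the census-day average comes out well above the true mean, because a patient's chance of being in the sample is proportional to the length of the stay. For exponential stays, theory puts the census-day mean at about twice the true mean.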

Diane Lambert is a statistician who has made a career out of learning how to wrestle with, and sometimes tame, data. She is now a research scientist at Google, focused on solving Google-scale problems ranging from network monitoring to display ad effectiveness.

Scaling Up Machine Learning: Parallel and Distributed Approaches
(Ron Bekkerman, LinkedIn; Mikhail Bilenko, MSR; John Langford, Y! Research)

This tutorial gives a broad view of modern approaches for scaling up machine learning and data mining methods on parallel/distributed platforms. Demand for scaling up machine learning is task-specific: for some tasks it is driven by the enormous dataset sizes, for others by model complexity or by the requirement for real-time prediction. Selecting a task-appropriate parallelization platform and algorithm requires understanding their benefits, trade-offs and constraints. This tutorial focuses on providing an integrated overview of state-of-the-art platforms and algorithm choices. These span a range of hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters), programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ), and learning settings (e.g., semi-supervised and online learning). The tutorial is example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., recommender systems and object recognition in vision).

The tutorial is based on (but not limited to) the material from our upcoming Cambridge University Press edited book, which is currently in production.

Visit the tutorial website at http://hunch.net/~large_scale_survey/

Ron Bekkerman is a senior research scientist at LinkedIn where he develops machine learning and data mining algorithms to enhance LinkedIn products. Prior to LinkedIn, he was a researcher at HP Labs. Ron completed his PhD in Computer Science at the University of Massachusetts Amherst in 2007. He holds BSc and MSc degrees from the Technion – Israel Institute of Technology. Ron has published on various aspects of clustering, including multimodal clustering, semi-supervised clustering, interactive clustering, consensus clustering, one-class clustering, and clustering parallelization.

Misha Bilenko is a researcher in the Machine Learning and Intelligence group at Microsoft Research, which he joined in 2006 after receiving his PhD from the University of Texas at Austin. His current research interests include large-scale machine learning methods, adaptive similarity functions and personalized advertising.

John Langford is a senior researcher at Yahoo! Research. He studied Physics and Computer Science at the California Institute of Technology, earning a double bachelor's degree in 1997, and received his PhD from Carnegie Mellon University in 2002. Previously, he was affiliated with the Toyota Technological Institute and IBM's Watson Research Center. He is the author of the popular Machine Learning weblog, hunch.net. John's research focuses on the fundamentals of learning, including sample complexity, learning reductions, active learning, learning with exploration, and the limits of efficient optimization.

Probabilistic Topic Models
(David Blei, Princeton)

Probabilistic topic modeling provides a suite of tools for the unsupervised analysis of large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. This analysis can be used for corpus exploration, document search, and a variety of prediction problems.

In this tutorial, I will review the state of the art in probabilistic topic models. I will describe the three components of topic modeling:

(1) Topic modeling assumptions
(2) Algorithms for computing with topic models
(3) Applications of topic models

In (1), I will describe latent Dirichlet allocation (LDA), which is one of the simplest topic models, and then describe a variety of ways that we can build on it. These include dynamic topic models, correlated topic models, supervised topic models, author-topic models, bursty topic models, Bayesian nonparametric topic models, and others. I will also discuss some of the fundamental statistical ideas that are used in building topic models, such as distributions on the simplex, hierarchical Bayesian modeling, and models of mixed-membership.
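For readers new to LDA, its generative story can be sketched in a few lines: each topic is a distribution over the vocabulary, each document draws its own topic proportions, and each word is drawn by first picking a topic and then picking a word from it. The code below is an illustrative sketch using only the Python standard library, not the tutorial's own material. The function names, the symmetric hyperparameters alpha and eta, and the fixed document length are assumptions made for simplicity.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_length, num_topics, vocab_size,
                    alpha=0.5, eta=0.1, seed=0):
    """Generate word-id documents from the LDA generative process."""
    rng = random.Random(seed)
    # One distribution over the vocabulary per topic.
    topics = [sample_dirichlet([eta] * vocab_size, rng)
              for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Per-document topic proportions (the mixed-membership part).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(doc_length):
            z = sample_categorical(theta, rng)      # topic assignment
            w = sample_categorical(topics[z], rng)  # word drawn from that topic
            doc.append(w)
        corpus.append(doc)
    return topics, corpus


if __name__ == "__main__":
    topics, corpus = generate_corpus(num_docs=3, doc_length=8,
                                     num_topics=2, vocab_size=6)
    for doc in corpus:
        print(doc)
```

Posterior inference runs this story in reverse: given only the words, it estimates the topics and the per-document proportions that plausibly generated them.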

In (2), I will review how we compute with topic models. I will describe approximate posterior inference for directed graphical models using both sampling and variational inference, and I will discuss the practical issues and pitfalls in developing these algorithms for topic models. Finally, I will describe some of our most recent work on building algorithms that can scale to millions of documents and documents arriving in a stream.

In (3), I will discuss applications of topic models. These include applications to images, music, social networks, and other data in which we hope to uncover hidden patterns. I will describe some of our recent work on adapting topic modeling algorithms to collaborative filtering, legislative modeling, and bibliometrics without citations.

Finally, I will discuss some future directions and open research problems in topic models.

David Blei is an assistant professor of Computer Science at Princeton University. He received his PhD in 2004 at U.C. Berkeley and was a postdoctoral fellow at Carnegie Mellon University. His research focuses on probabilistic models, Bayesian nonparametric methods, and approximate posterior inference. He works on a variety of applications, including text, images, music, social networks, and scientific data.