Keynotes

Session 1: Monday 8:45am
Raghu Ramakrishnan, Technical Fellow and CTO, Information Services, Microsoft

Title: Scale-out Beyond Map-Reduce

Abstract: The amount of data being collected is growing at a staggering pace. The default is to capture and store any and all data, in anticipation of potential future strategic value, and vast amounts of data are being generated by instrumenting key customer and systems touchpoints. Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business operations; now, exploratory and predictive analysis is becoming ubiquitous. These differences in data scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on supporting a wide range of data to be stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools like SQL and BI and newer tools for graph analytics and machine learning. These new systems use scale-out architectures for both data storage and computation.

Hadoop has become a key building block in the new generation of scale-out systems. Early versions of analytic tools over Hadoop, such as Hive and Pig for SQL-like queries, were implemented by translation into Map-Reduce computations. This approach has inherent limitations, and the emergence of resource managers such as YARN and Mesos has opened the door for newer analytic tools to bypass the Map-Reduce layer. This trend is especially significant for iterative computations such as graph analytics and machine learning, for which Map-Reduce is widely recognized to be a poor fit. In this talk, I will examine this architectural trend, and argue that resource managers are a first step in refactoring the early implementations of Map-Reduce, and that more work is needed if we wish to support a variety of analytic tools on a common scale-out computational fabric. I will then present REEF, which runs on top of resource managers like YARN and provides support for task monitoring and restart, data movement and communications, and distributed state management. Finally, I will illustrate the value of using REEF to implement iterative algorithms for graph analytics and machine learning.

This is joint work with the CISL team at Microsoft.

Bio: Raghu Ramakrishnan heads the Cloud and Information Services Lab (CISL) in the Data Platforms Group at Microsoft. From 1987 to 2006, he was a professor at the University of Wisconsin-Madison, where he wrote the widely used text “Database Management Systems” and led a wide range of research projects in database systems (e.g., the CORAL deductive database, the DEVise data visualization tool, SQL extensions to handle sequence data) and data mining (scalable clustering, mining over data streams). In 1999, he founded QUIQ, a company that introduced a cloud-based question-answering service. He joined Yahoo! in 2006 as a Yahoo! Fellow, and over the next six years served as Chief Scientist for the Audience (portal), Cloud, and Search divisions, driving content recommendation algorithms (CORE), cloud data stores (PNUTS), and semantic search (“Web of Things”). Ramakrishnan has received several awards, including the ACM SIGKDD Innovations Award, the SIGMOD 10-year Test-of-Time Award, the IIT Madras Distinguished Alumnus Award, and the Packard Fellowship in Science and Engineering. He is a Fellow of the ACM and IEEE.


Session 2: Monday 1:30pm
Andrew Ng, Stanford University and Coursera

Title: The Online Revolution: Education for Everyone

Abstract: In 2011, Stanford University offered three online courses, which anyone in the world could enroll in and take for free. Together, these three courses had enrollments of around 350,000 students, making this one of the largest experiments in online education ever performed. Since the beginning of 2012, we have transitioned this effort into a new venture, Coursera, a social entrepreneurship company whose mission is to make high-quality education accessible to everyone by allowing the best universities to offer courses to everyone around the world, for free. Coursera classes provide a real course experience to students, including video content, interactive exercises with meaningful feedback using both auto-grading and peer-grading, and rich peer-to-peer interaction around the course materials. Currently, Coursera has 62 university partners and over 3 million students enrolled in its more than 300 courses. These courses span a range of topics including computer science, business, medicine, science, humanities, social sciences, and more. In this talk, I’ll report on this far-reaching experiment in education, and why we believe this model can provide both an improved classroom experience for our on-campus students, via a flipped classroom model, and a meaningful learning experience for the millions of students around the world who would otherwise never have access to education of this quality.

Bio: Andrew Ng is a co-founder of Coursera and a Computer Science faculty member at Stanford. In 2011, he led the development of Stanford University’s main MOOC (Massive Open Online Course) platform and also taught an online Machine Learning class offered to over 100,000 students, leading to the founding of Coursera. Ng’s goal is to give everyone in the world access to a high-quality education, for free. Today, Coursera partners with top universities to offer high-quality, free online courses. With 62 university partners, over 300 courses, and more than 3 million students, Coursera is currently the largest MOOC platform in the world. Outside online education, Ng’s research work is in machine learning, with an emphasis on deep learning. He is also the Director of the Stanford Artificial Intelligence Lab.


Session 3: Tuesday 8:45am
Stephen J. Wright, Computer Sciences Dept., University of Wisconsin-Madison

Title: Optimization in Learning and Data Analysis

Abstract: Optimization tools are vital to data analysis and learning. The optimization perspective has provided valuable insights, and optimization formulations have led to practical algorithms with good theoretical properties. In turn, the rich collection of problems in learning and data analysis is providing fresh perspectives on optimization algorithms and is driving new fundamental research in the area. We discuss several areas in this domain, including signal reconstruction, manifold learning, and regression/classification, describing in each case recent research in which optimization algorithms have been developed and applied successfully. A particular focus is asynchronous parallel algorithms for optimization and linear algebra, and their applications in data analysis and learning.
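
As a rough illustration of the asynchronous parallel methods mentioned in the abstract, the following is a minimal sketch of lock-free asynchronous stochastic gradient descent for a least-squares problem: several worker threads read and update a shared parameter vector without any locking. This is not code from the talk; the problem sizes, step size, and all names are illustrative assumptions.

    # Minimal, illustrative sketch of lock-free asynchronous SGD for a
    # least-squares problem, in the spirit of asynchronous parallel optimization.
    # Problem sizes, step size, and names are illustrative assumptions.
    import threading
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 10_000, 50
    A = rng.standard_normal((n, d))
    x_true = rng.standard_normal(d)
    b = A @ x_true + 0.01 * rng.standard_normal(n)

    x = np.zeros(d)              # shared parameter vector, updated without any lock
    step = 1e-3
    updates_per_worker = 20_000

    def worker(x, seed):
        # Each worker repeatedly samples one row and applies a stochastic gradient
        # step to the shared vector in place, without taking a lock.
        local_rng = np.random.default_rng(seed)
        for _ in range(updates_per_worker):
            i = local_rng.integers(n)
            g = (A[i] @ x - b[i]) * A[i]    # gradient of 0.5 * (a_i . x - b_i)^2
            x -= step * g                   # in-place, lock-free update

    threads = [threading.Thread(target=worker, args=(x, s)) for s in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))

In CPython the global interpreter lock serializes these threads, so the sketch illustrates the lock-free access pattern rather than genuine parallel speedup; the point is that workers update shared state without coordination, which is the setting asynchronous parallel analyses address.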

Bio: Stephen J. Wright is a Professor of Computer Sciences at the University of Wisconsin-Madison. His research is on computational optimization and its applications to many areas of science and engineering. Prior to joining UW-Madison in 2001, Wright was a Senior Computer Scientist at Argonne National Laboratory (1990-2001) and a Professor of Computer Science at the University of Chicago (2000-2001). From 2007 to 2010, he served as chair of the Mathematical Optimization Society, and he is on the Board of the Society for Industrial and Applied Mathematics (SIAM). He is a Fellow of SIAM.

Wright is the author or coauthor of widely used text and reference books in optimization, including “Primal-Dual Interior-Point Methods” (SIAM, 1997) and “Numerical Optimization” (2nd Edition, Springer, 2006, with J. Nocedal). He has published widely on optimization theory, algorithms, software, and applications. He is coauthor of widely used software for linear and quadratic programming and for compressed sensing.

Wright currently serves on the editorial boards of the leading journals in optimization (SIAM Journal on Optimization and Mathematical Programming, Series A) as well as SIAM Review. He served a term as editor-in-chief of Mathematical Programming, Series B from 2003 to 2007.


Session 4: Wednesday 8:45am
Hal Varian, Chief Economist at Google

Title: Predicting the Present with Search Engine Data

Abstract: Many businesses now have almost real-time data available about their operations. This data can be helpful in contemporaneous prediction (“nowcasting”) of various economic indicators. We illustrate how one can use Google search data to nowcast economic metrics of interest, and discuss some of the ramifications for research and policy. Our approach combines three Bayesian techniques: Kalman filtering, spike-and-slab regression, and model averaging. We use Kalman filtering to whiten the time series in question by removing the trend and seasonal behavior. Spike-and-slab regression is a Bayesian method for variable selection that works even in cases where the number of predictors is far larger than the number of observations. Finally, we use Markov Chain Monte Carlo methods to sample from the posterior distribution for our model; the final forecast is an average over thousands of draws from the posterior. An advantage of the Bayesian approach is that it allows us to specify informative priors that affect the number and type of predictors in a flexible way.
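
As a rough illustration of the whitening step described in the abstract, the following is a minimal sketch of a local-level Kalman filter applied to a univariate series; the spike-and-slab regression and MCMC model averaging would then operate on the whitened series and the candidate search-query predictors. This is not code from the talk; the model, noise variances, and example data are illustrative assumptions, and the seasonal component is omitted for brevity.

    # Minimal sketch: whiten a univariate series with a local-level Kalman filter,
    # i.e. compute standardized one-step-ahead prediction errors after removing
    # the local trend. Model: y_t = mu_t + eps_t (variance r),
    # mu_t = mu_{t-1} + eta_t (variance q). Variances and data are illustrative.
    import numpy as np

    def local_level_whiten(y, q=0.1, r=1.0):
        a, p = y[0], 1.0                     # initial state mean and variance
        out = []
        for t in range(1, len(y)):
            p_pred = p + q                   # predicted state variance
            v = y[t] - a                     # one-step-ahead prediction error
            f = p_pred + r                   # innovation variance
            k = p_pred / f                   # Kalman gain
            a = a + k * v                    # updated state mean
            p = (1.0 - k) * p_pred           # updated state variance
            out.append(v / np.sqrt(f))       # standardized ("whitened") innovation
        return np.array(out)

    # Toy usage: simulate a series from the assumed local-level model and whiten it.
    rng = np.random.default_rng(0)
    mu = np.cumsum(np.sqrt(0.1) * rng.standard_normal(300))
    y = mu + rng.standard_normal(300)
    z = local_level_whiten(y)
    print(z.mean(), z.std())                 # roughly 0 and 1 after whitening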

Bio: Hal R. Varian is the Chief Economist at Google. He started in May 2002 as a consultant and has been involved in many aspects of the company, including auction design, econometric analysis, finance, corporate strategy and public policy.

He is also an emeritus professor at the University of California, Berkeley, in three departments: business, economics, and information management.

He received his SB degree from MIT in 1969 and his MA in mathematics and Ph.D. in economics from UC Berkeley in 1973. He has also taught at MIT, Stanford, Oxford, Michigan and other universities around the world.

Dr. Varian is a fellow of the Guggenheim Foundation, the Econometric Society, and the American Academy of Arts and Sciences. He was Co-Editor of the American Economic Review from 1987 to 1990 and holds honorary doctorates from the University of Oulu, Finland, and the University of Karlsruhe, Germany.

Professor Varian has published numerous papers in economic theory, industrial organization, financial economics, econometrics, and information economics. He is the author of two major economics textbooks, which have been translated into 22 languages. He is the co-author of a bestselling book on business strategy, Information Rules: A Strategic Guide to the Network Economy, and wrote a monthly column for the New York Times from 2000 to 2007.