Big Data Camp

Saturday, August 10th, 9am-5pm
Chicago Sheraton Hotel

This year KDD 2013 will host a Big Data Camp to give those interested a practical introduction to building analytic models over big data using some of the open source and other tools available.

Big Data Camp Only* registration:
  Early (through June 30): $85
  Late (July 1-31): $105
  On-site (Aug 1 to onsite): $125
*Space is limited to 150
SATURDAY- Aug 10th ~ Room Michigan A/B
Robert Grossman, University of Chicago and Open Data Group
Introduction to Big Data Analytics Using Open Source Tools
Jeffrey Ryan, Lemnica Corporation
Introduction to R
11am-12pm
Dean Wampler, Concurrent Thought
Introduction to NoSQL Databases
Lunch Provided
1:30-2:30pm
Q Ethan McCallum
Integrating R+Hadoop into Your Data Analytics Pipeline
Collin Bennett, Open Data Group
Building and Deploying Predictive Models Using Python, Augustus and PMML
Coffee Break
Andrew Johnson, University of Illinois at Chicago
Introduction to Data Visualization



Collin Bennett and Jim Pivarski
Title: Building and Deploying Predictive Models Using Python, Augustus and PMML

Abstract: Augustus is an open source system for building and scoring statistical models, designed to work with data sets that are too large to fit into memory. Augustus is written in Python and supports the Predictive Model Markup Language (PMML), an XML standard for specifying statistical and data mining models. Efficiently deploying models usually involves separating the estimation of model parameters (building models) from the deployment of the models (scoring data using the models in operational systems). Working with large datasets usually involves building and deploying multiple models. Augustus supports the PMML standards for working with multiple segmented models and for passing information between applications that build and score statistical models. In this talk, we give an introduction to using Python for building and deploying models using Augustus and PMML.

Bio: Collin Bennett is a Partner at Open Data Group. In three and a half years with the company, Collin has worked on the open source Augustus scoring engine and a cloud-based environment for rapid analytic prototyping called RAP. Additionally, he has released open source projects for the Open Cloud Consortium. One of these, MalGen, has been used to benchmark several parallel computation frameworks. Previously, he led software development for the Product Development Team at Acquity Group, an IT consulting firm headquartered in Chicago. He also worked at startups Orbitz (when it was still one) and Business Logic Corporation. He has co-authored papers on Weyl tensors, large data clouds, and high performance wide area cloud testbeds. He holds degrees in English, Mathematics and Computer Science.

Bio: Jim Pivarski is a data scientist at Open Data Group. In two years with the company, Jim has worked on or developed analysis workflows to study network traffic, Twitter sentiment, satellite images, clusters of virtual machines, credit card fraud, and automobile traffic. He redesigned and implemented the open source Augustus scoring engine, and added data visualization capabilities. Previously, he contributed to the commissioning of the Large Hadron Collider physics experiment by aligning the CMS experiment’s muon detectors and leading a search for hypothetical particles that decay into muons. He holds a Ph.D. in Physics from Cornell University.

Robert Grossman
Title: Introduction to Big Data Analytics Using Open Source Tools

Abstract: A successful project that builds predictive models over big data is not just about selecting the right machine learning algorithm, but also about exploring the data to understand it well enough to build appropriate features, deploying the model efficiently into operational systems, evaluating the effectiveness of the model, and continuously improving the model. In this tutorial, we give an introduction to some best practices for each of these phases in the life cycle of a predictive model and give a quick survey of open source tools that are commonly used when working with big data.

Bio: Robert Grossman is a faculty member at the University of Chicago, where he is the Chief Research Informatics Officer of the Biological Sciences Division; a Senior Fellow in the Institute for Genomics and Systems Biology and the Computation Institute; and a Professor in the Department of Medicine in the Section of Genetic Medicine. He is also a Partner at Open Data Group, which he founded in 2002. Open Data Group provides analytic services to help companies and organizations build predictive models over big data. He has been involved in several open source software projects, including the development of the network protocol UDT, which is designed to support large data flows over wide area networks; Augustus, a PMML-compliant predictive model application; and the Sector system, which is a framework for data intensive computing. He founded Magnify, Inc. in 1996, which provides data mining solutions to the insurance industry. Grossman was Magnify’s CEO until 2001 and its Chairman until it was sold to ChoicePoint in 2005. He blogs occasionally about big data, bioinformatics, data science, and data engineering.

Andrew Johnson
Title: Introduction to Data Visualization

Abstract: This talk will give a brief overview of data visualization, explain why it's an important part of the data analysis process, and provide some tips and techniques for creating useful visualizations that take advantage of human perception and cognition.

Bio: Dr. Andrew Johnson is an Associate Professor of Computer Science and member of the Electronic Visualization Laboratory at the University of Illinois at Chicago. His research and teaching focus on interaction and collaboration using advanced visualization displays and the application of those displays to enhance discovery and learning.

Q Ethan McCallum
Title: Integrating R+Hadoop into Your Data Analytics Pipeline

Abstract: As an analytics tool, R strains under modern large-scale datasets. People have devised a number of ways to help R function in the big-data arena, one of which is to drive it with Hadoop. But how does this work, and when is it an appropriate solution? This talk will describe the what and the how of mixing R and Hadoop, and more importantly, the when and why of this approach.

Bio: Q Ethan McCallum (@qethanm) works as a professional-services consultant, with a focus on strategic matters around data and technology. He is especially interested in helping shops build and shape their internal analytics practice.

Q’s published work includes Parallel R: Data Analysis in the Distributed World and Bad Data Handbook: Mapping the World of Data Problems. He is currently working on his next book, Making Analytics Work.

Jeffrey Ryan
Title: Introduction to R

Abstract: Open-source R has experienced explosive growth in the last few years along with the rise of analytics and data. With powerful and expressive syntax, an unrivaled package ecosystem, and near ubiquity within organizations and universities both big and small, R has become the language of data analysis. One common (mis)perception is that R doesn’t handle ‘big data’. In this talk, Jeff Ryan, a noted R consultant and major R package contributor, will discuss the ins and outs of managing all types of data within R. Starting with basic data structures, the talk will cover proper manipulation techniques to minimize memory issues as well as maximize performance of pure R code. From there, contributed packages designed for large data processing will be discussed, such as mmap, xts, data.table, and bigmemory. Finally, the discussion will examine the R API itself, and give tips and techniques for making R as fast as compiled code by crafting solutions involving very simple R and C/C++ code to make R the only language you will need for daily analytics work.

Bio: Jeffrey Ryan is the founder of Lemnica Corporation, a Chicago firm specializing in statistical software, training, and on-demand support. He helps organize the R/Finance conference series, and is a frequent speaker on software-related topics. He is the author or co-author of a variety of R packages involving finance, large data, and visualizations including quantmod, xts, Defaults, IBrokers, RBerkeley, mmap, and indexing. He currently lives in Chicago, Illinois with his wife and three children.

Dean Wampler
Title: Introduction to NoSQL Databases

Abstract: The emergence of Internet companies like Yahoo!, Google, Twitter, and Facebook created the need to manage data sets of unprecedented size. Furthermore, the “always on” nature of the Internet imposed new demands on availability and reliability. These forces drove the emergence of alternatives to relational (SQL-oriented) databases, collectively called “NoSQL” databases. This session will describe the history of NoSQL databases, their features, and the problems they address.

Bio: Dean Wampler works with clients on “Big Data” applications, using Hadoop, NoSQL, and other technologies. He is a Functional Programming enthusiast and a contributor to several open-source projects. Dean is the founder of the Chicago-Area Scala Enthusiasts and the author of “Functional Programming for Java Developers”, the co-author of “Programming Scala”, and the co-author of “Programming Hive”, all from O’Reilly. He pontificates on Twitter, @deanwampler. His consulting company is Concurrent Thought.