The Industry Practice Expo track will comprise technical invited talks and panel discussions / debates by leading experts in the world of applied data mining and knowledge discovery. The expo will feature highly influential speakers who have directly contributed to successful data mining applications in their respective fields. The talks and discussions will focus on innovative and leading-edge, large-scale industry or government applications of data mining in areas such as finance, health-care, bio-informatics, public policy, infrastructure (transportation, utilities, etc.), telecommunications, social media, and computational advertising.
The objective of the Industry Practice Expo track is to bring together leading industry and government practitioners to share insights and experiences that will inspire the KDD community and spread awareness of the variety of seminal, innovative, and proven applications of data mining and knowledge discovery in industry and government. This track will complement the already established Industry and Government track at KDD, which focuses on peer-reviewed publications.
MONDAY - Aug 12th
TUESDAY - Aug 13th
IPE Panel Discussion:
Abstract: Shopping can be decomposed into three basic questions: what, where, and when to buy? In this talk, I’ll describe how we utilize advanced data-mining and text-mining techniques at Decide.com (and earlier at Farecast) to solve these problems for on-line shoppers. Our algorithms have predicted prices utilizing billions of data points, and ranked products based on millions of reviews.
Bio: Oren Etzioni received his PhD from Carnegie Mellon in 1991. He is the WRF Entrepreneurship Professor of Computer Science at the University of Washington. Oren is the author of over 200 technical papers, cited over 18,000 times. He received the NSF Young Investigator Award in 1993, and was selected as a AAAI Fellow a decade later. In 2007, he received the Robert S. Engelmore Memorial Award for “long-standing technical and entrepreneurial contributions to Artificial Intelligence”. Oren is the founder of three companies focused on increased transparency for shoppers. His first company, Netbot, was the first online comparison shopping company (acquired by Excite in 1997). His second company, Farecast, advised travelers when to buy their air tickets. Farecast was acquired by Microsoft in 2008 and became the foundation for Bing Travel. Decide.com, founded in 2010, utilizes cutting-edge data-mining methods to minimize buyer’s remorse. In 2013, Oren was chosen as the “Geek of the Year” by a vote of the Seattle Tech. Community.
IPE Session 1: Monday 10:30-12pm
Title: Mining the digital universe of data to develop personalized cancer therapies
Link to Slides
Abstract: The development of a personalized approach to medical care is now well recognized as an urgent priority. This approach is particularly important in oncology, where it is well understood that each cancer diagnosis is unique at the molecular level, arising from a particular and specific collection of genetic alterations. Furthermore, taking a personalized approach to oncology may expedite the treatment process, pre-empting therapeutic decisions based on fewer data in favor of treatments targeted to an individual’s tumor. This directed course may be key to survival for many patients who are terminal or have failed standard therapies.
I will discuss a personalized cancer therapy program we have initiated that involves DNA and RNA sequencing of a patient’s tumor and germline DNA and the projection of high-dimensional features extracted from these data onto predictive network models constructed by integrating the large-scale, high-dimensional data that exist for the patient’s cancer type. From the causal network inference procedures to the ensemble-based classification methods, big data analytics is front and center for interpreting large-scale patient data in the context of the digital universe of information that exists for the patient’s condition.
Bio: Dr. Eric Schadt is Chairman and Professor of the Department of Genetics and Genomic Sciences at the Icahn School of Medicine at Mount Sinai and the Director of the Institute for Genomics and Multiscale Biology at Mount Sinai. Previously, Dr. Schadt had been the Chief Scientific Officer at Pacific Biosciences, overseeing the scientific strategy for the company, including creating the vision for next-generation sequencing applications of the company’s technology. Dr. Schadt is also a founding member of Sage Bionetworks, an open access genomics initiative designed to build and support databases and an accessible platform for creating innovative, dynamic models of disease. Dr. Schadt’s current efforts at Mount Sinai involve the generation and integration of large-scale, high-dimension molecular, cellular, and clinical data to build more predictive models of disease, a research direction motivated by the genomics and systems biology research he led at Merck to elucidate common human diseases and drug response using novel computational approaches applied to genetic and molecular profiling data. Dr. Schadt received his B.S. in applied mathematics/computer science from California Polytechnic State University, his M.A. in pure mathematics from UCD, and his Ph.D. in bio-mathematics from UCLA (requiring Ph.D. candidacy in molecular biology and mathematics).
Abstract: In the last year deep learning has gone from being a special purpose machine learning technique used mainly for image and speech recognition, to becoming a general purpose machine learning tool. This has broad implications for all organizations that rely on data analysis. It represents the latest development in a general trend towards more automated algorithms, and away from domain specific knowledge. For organizations that rely on domain expertise for their competitive advantage, this trend could be extremely disruptive. For start-ups interested in entering established markets, this trend could be a major opportunity. This talk will be a non-technical introduction to general-purpose deep learning, and its potential business impact.
Bio: Jeremy is a serial entrepreneur, business strategist, developer, and educator. He is the President and Chief Scientist of Kaggle, a funded San Francisco startup that is growing quickly and whose competitions have resulted in many scientific breakthroughs. He was ranked the #1 participant in data science competitions globally in 2010 and 2011. He is also a noted educator, becoming the youngest faculty member at Singularity University, where he teaches data science. He was the founding CEO of two successful self-funded Australian startups (FastMail and Optimal Decisions Group), both of which grew internationally and were sold to large international companies. Both companies are still being operated successfully. The World Economic Forum has awarded him the title “Young Global Leader” in recognition of his achievements.
He previously spent 8 years in management consulting at the world’s most exclusive firms, including McKinsey & Co, and AT Kearney (becoming the youngest engagement manager world-wide, and building a new global practice in what is now called “Big Data”). He is also a keen student, for example developing a new system for learning Chinese, which he used to develop strong Chinese language skills in just one year. Jeremy has mentored and advised many startups, and has also acted as an angel investor. He has contributed to a range of open source projects as a developer, and is also in demand as an expert commentator on various TV news programs.
Abstract: Statistical machine learning / knowledge discovery techniques tend to fail when faced with an adaptive adversary attempting to evade detection in the data. Humans do an excellent job of correctly spotting adaptive adversaries given a good way to digest the data. On the other hand, humans are glacially slow and error-prone when it comes to moving through very large volumes of data, a task best left to the machines.
Fighting complex fraud and cyber-security threats requires a symbiosis between the computers and teams of human analysts. The computers use algorithmic analysis, heuristics, and/or statistical characterization to find interesting ‘simple’ patterns in the data. These candidate events are then queued for in-depth human analysis in rich, expressive, interactive analysis environments.
In this talk, we’ll take a look at case studies of three different systems, using a partnership of automation and human analysis on large scale data to find the clandestine human behavior that these datasets hold, including a discussion of the backend systems architecture and a demo of the interactive analysis environment.
The backend systems architecture is a mix of open source technologies, like Cassandra, Lucene, and Hadoop, and some new components that bind them all together. The interactive analysis environment allows seamless pivoting between semantic, geospatial, and temporal analysis with a powerful GUI that’s usable by people who are not data scientists. The systems are real systems currently in use by commercial banks, pharmaceutical companies, and governments.
Bio: Ari Gesher is a senior engineer and Engineering Ambassador at Palantir Technologies. An alumnus of the University of Illinois computer science department, Ari has worked in the software industry for the past fifteen years, including a stint as the lead engineer for the SourceForge.net open source software archive. At Palantir Technologies, Ari has split his time between working as a backend engineer on Palantir’s analysis platform, thinking and writing about Palantir’s vision for human-driven information data systems, and moonlighting on Palantir’s Philanthropic engineering team. Ari is in demand as a speaker on the topic of big data and the limits of automated decision-making. In the past year, he’s spoken at Harvard Business School, the Institute for the Future’s Tech Horizons conference, multiple O’Reilly Strata Big Data Conferences, the Economist Future Technologies Summit, and PayPal’s TechXploration series.
IPE Session 3: Tuesday 10:30-12pm
Rayid Ghani, University of Chicago / Edgeflip
Title: Targeting and Influencing at Scale: From Presidential Elections to Social Good
Link to Slides
Abstract: If you’re still recovering from the barrage of ads, news, emails, Facebook posts, and newspaper articles that were giving you the latest poll numbers, asking you to volunteer, donate money, and vote, this talk will give you a look behind the scenes on why you were seeing what you were seeing. I will talk about how machine learning and data mining along with randomized experiments were used to target and influence tens of millions of people. Beyond the presidential elections, these methodologies for targeting and influence have the power to solve big problems in education, healthcare, energy, transportation, and related areas. I will talk about some recent work we’re doing at the University of Chicago Data Science for Social Good summer fellowship program working with non-profits and government organizations to tackle some of these challenges.
Bio: Rayid is currently at the Computation Institute & Harris School of Public Policy at the University of Chicago and is the co-founder of Edgeflip, an analytics startup focused on helping non-profits and social good organizations better use social networks and analytics. Rayid was the Chief Scientist at the Obama for America 2012 campaign, focusing on analytics, technology, and data. His work in the campaign focused on improving different functions of the campaign, including fundraising, volunteer, and voter mobilization, using analytics, social media, and machine learning. In addition, Rayid serves as an adviser to several start-ups, non-profits, and corporations, is an active organizer of and participant in academic and industry analytics conferences, and publishes regularly in machine learning and data mining conferences and journals.
IPE Session 3: Tuesday 10:30-12pm
Title: Hadoop: A View from the Trenches
Link to Slides
Abstract: From its beginnings as a framework for building web crawlers for small-scale search engines to being one of the most promising technologies for building datacenter-scale distributed computing and storage platforms, Apache Hadoop has come far in the last seven years. In this talk I will reminisce about the early days of Hadoop, and will give an overview of the current state of the Hadoop ecosystem and some real-world use cases of this open source platform. I will conclude with some crystal gazing into the future of Hadoop and associated technologies.
Bio: Milind Bhandarkar was the founding member of the team at Yahoo! that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system, and has been contributing to and working with Hadoop since version 0.1.0. He started the Yahoo! Grid solutions team focused on training, consulting, and supporting hundreds of new migrants to Hadoop. Parallel programming languages and paradigms have been his area of focus for over 20 years, and were his area of specialization for his PhD (Computer Science) from the University of Illinois at Urbana-Champaign. Previously, he has worked at the Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, and LinkedIn. Currently, he is the Chief Scientist, Machine Learning Platforms at Pivotal.
IPE Session 4: Tuesday 3:00-4:30pm
Raffael Marty, Pixlcloud
Title: Cyber Security – How Visual Analytics Unlock Insight
Link to Slides
Abstract: In the Cyber Security domain, we have been collecting ‘big data’ for almost two decades. The volume and variety of our data are extremely large, but understanding and capturing the semantics of the data is even more of a challenge. Finding the needle in the proverbial haystack has been attempted from many different angles. In this talk we will have a look at what approaches have been explored, what has worked, and what has not. We will see that there is still a large amount of work to be done and data mining is going to play a central role. We’ll argue that, in order to successfully find bad guys, we will have to embrace a solution that not only leverages clever data mining but also employs the right mix of human-computer interfaces, data mining, and scalable data platforms.
Traditionally, cyber security has had its challenges with data mining. We are different. We will explore how to adapt data mining algorithms to the security domain. Some approaches, like predictive analytics, are extremely hard, if not impossible. How would you predict the next cyber attack? Others need to be tailored to the security domain to make them work.
Visualization and visual analytics seem to be extremely promising for solving cyber security issues. Situational awareness, large-scale data exploration, knowledge capture, and forensic investigations are four top use-cases we will discuss. Visualization alone, however, does not solve security problems. We need algorithms that support the visualizations, for example to reduce the amount of data, in both volume and semantics, so that an analyst can deal with it.
Bio: Raffael Marty is one of the world’s most recognized authorities on security data analytics. The author of Applied Security Visualization and creator of the open source DAVIX analytics platform, Raffy is the founder and CEO of PixlCloud, a next-generation data visualization application for big data. With a track record at companies including IBM Research and ArcSight, Raffy is thoroughly familiar with established practices and emerging trends in data analytics. He has served as Chief Security Strategist with Splunk and was a co-founder of Loggly, a cloud-based log management solution. For more than 12 years, Raffy has helped Fortune 500 companies defend themselves against sophisticated adversaries and has trained organizations around the world in the art of data visualization for security. Practicing zen has become an important part of Raffy’s life.
Abstract: The brief history of knowledge discovery is filled with products that promised to bring “BI to the masses”. But how do you build a product that truly bridges the gap between the conceptual simplicity of “questions and answers” and the structure needed to query traditional data stores?
In this talk, Chris Neumann will discuss how DataHero applied the principles of user-centric design and development over a year and a half to create a product with which more than 95% of new users can get answers on their first attempt. He’ll demonstrate the process DataHero uses to determine the best combination of algorithms and user interface concepts needed to create intuitive solutions to potentially complex interactions, including:
- Determining the structure of files uploaded by users
- Accurately identifying data types within files
- Presenting users with an optimal visualization for any combination of data
- Helping users to ask questions of data when they don’t know what to do
Chris will also talk about what it’s like to start a “Big Data” company and how he applied lessons from his time as the first engineer at Aster Data Systems to DataHero.
Bio: Chris is the CEO and Cofounder of DataHero, a data analytics company whose goal is to enable anyone to unmask the answers in the data that matters to them. Chris was previously the first engineer at Big Data pioneer Aster Data Systems, where he held roles in engineering, professional services, and business development. Chris holds an MS in Computer Science from Stanford University and a BS in Computing Science from Simon Fraser University.
Title: Death of the expert? The rise of algorithms and decline of domain experts
Abstract: Machine learning algorithms used to require features to be carefully hand-created and filtered. The algorithms of yesteryear needed us to tell them about interactions, non-linearities, non-normal distributions, etc., and if we added too many features to the model, we would over-fit and end up with something that was useless in practice. That meant that domain experts were vital in manipulating and filtering the data to create just the right set of inputs.
But now that we have deep learning nets, ensembles of decision trees, and so forth, features are created automatically, and over-fitting is avoided even with huge numbers of features. Furthermore, these general purpose algorithms have proven their worth in everything from video object tracking to speech recognition to automated drug discovery to natural language processing. So where does that leave the role of the domain expert? In this panel, we will discuss and debate where domain experts fit into this new world of general purpose machine learning algorithms.
Moderator: Jeremy Howard, Kaggle
Oren Etzioni, University of Washington
John Akred, Silicon Valley Data Science
Robert Munro, Idibon
Chris Neumann, DataHero
- Paul Bradley (Methodcare)
- Rajesh Parekh (Groupon)
- Eric Bloedorn (MITRE)
- Usama Fayyad (ChoozOn)
- Bob Grossman (University of Chicago)
- Ying Li (Concurix Corporation)
- Gregory Piatetsky-Shapiro (KDNuggets)
- Christian Posse (Google)
- Raghu Ramakrishnan (Microsoft)
- Ramasamy Uthurusamy (General Motors, Retd.)
For more information please contact the Industry Practice Expo co-chairs – Paul Bradley and Rajesh Parekh – at firstname.lastname@example.org.