KDD 2010: Program

KDD-2010 Program Schedule

The official KDD-2010 Social Networking Interactive Schedule lets you add talks to your conference calendar in Outlook or other popular calendar programs, leave notes, ask presenters questions, and engage in other ways. For a quick look at the schedule, use the summary below.

Saturday, July 24
9:00AM - 5:00PM	Workshop 1: Mining and Learning with Graphs Workshop 2010 (MLG-2010)	Jefferson
7:30PM - 9:00PM	Registration	Foyer, Independence Foyer
Sunday, July 25
7:30AM - 8:00PM	Registration	Foyer, Independence Foyer
8:00AM-6:00PM	Exhibits	Independence Center B
9:00AM - 12:00PM	Workshop 1: Mining and Learning with Graphs Workshop 2010 (MLG-2010)	Potomac 1
	Workshop 2: Large-scale Data Mining: Theory and Applications (LDMTA-2010)	Potomac 2
	Workshop 3: Useful Patterns (UP)	Potomac 3
	Workshop 4: Social Media Analytics (SOMA 2010)	Potomac 4
	Tutorial 1: Large-scale Data Mining: MapReduce and Beyond	Regency E
	Tutorial 2: New Developments in the Theory of Clustering	Regency F
	Tutorial 3: Temporal Pattern Mining	Potomac 5
	Tutorial 4: Learning through Exploration	Potomac 6
	Tutorial 5: Geometric Tools for Graph Mining of Large Social and Information Networks	Tidewater 2
	Tutorial 8: Mining Web Search and Browse Logs	Regency F
	Workshop 5: KDD Cup 2010: Improving Cognitive Models with Educational Data Mining	Roosevelt
	Workshop 6: 9th International Workshop on Data Mining in Bioinformatics (BIOKDD10)	Lincoln
	Workshop 7: Tenth International Workshop on Multimedia Data Mining (MDMKDD 2010)	Arlington
	Workshop 8: The Fourth International Workshop on Data Mining and Audience Intelligence for Online Advertising (ADKDD'10)	Prince William
	Workshop 9: Human Computation Workshop (HCOMP 2010)	Fairfax
10:00AM-10:30AM	Coffee Break	Foyer, Ballroom 1, AV Wall
12:00PM-2:00PM	Lunch	Independence Center
2:00PM - 5:30PM	Tutorial 7: Introduction to Graphical Models for Data Mining	Regency E
	Tutorial 6: Privacy-aware Data Mining in Information Networks	Kennedy
	Tutorial 9: Mining Heterogeneous Information Networks	Potomac 5
	Tutorial 10: Outlier Detection Techniques	Potomac 6
	Tutorial 11: Recommender Problems for Web Applications	Tidewater 2
	Tutorial 12: Indexing and Mining Time Sequences	Kennedy
	Workshop 10: Intelligence and Security Informatics (ISI-KDD)	Roosevelt
	Workshop 11: The 4th International Workshop on Knowledge Discovery from Sensor Data (SensorKDD)	Lincoln
	Workshop 12: The 4th SNA-KDD Workshop on Social Network Mining and Analysis (SNAKDD 2010)	Arlington
	Workshop 13: Novel Data Stream Pattern Mining Techniques (StreamKDD)	Prince William
	Workshop 14: Discovering, Summarizing and Using Multiple Clusterings (MultiClust)	Fairfax
	Workshop 1: Mining and Learning with Graphs Workshop 2010 (MLG-2010)	Potomac 1
	Workshop 2: Large-scale Data Mining: Theory and Applications (LDMTA-2010)	Potomac 2
	Workshop 3: Useful Patterns (UP)	Potomac 3
	Workshop 4: Social Media Analytics (SOMA 2010)	Potomac 4
3:00PM-3:30PM	Coffee Break	AV Wall
6:00PM-6:15PM	Opening Remarks	Ballroom, Regency EF CTR
6:15PM-6:45PM	Award Presentations	Ballroom, Regency EF CTR
6:45PM-7:45PM	Innovation Award Talk (Christos Faloutsos)	Ballroom, Regency EF CTR
Monday, July 26
7:30AM-8:00PM	Registration	Independence Foyer
8:00AM-6:00PM	Exhibits	Independence Center B
7:30AM-9:00AM	Continental Breakfast	AV Wall
9:00AM-10:00AM	Plenary Invited Talk: Data Mining in the Online Services Industry	Regency EF CTR
10:00AM-10:30AM	Coffee Break	AV Wall
10:30AM-10:50AM	Mining Medical Data to Improve Patient Outcomes (DMCS 2005 and 2009 Winner)	Roosevelt
	Grafting-Light: Fast, Incremental Feature Selection and Structure Learning of Markov Random Fields	Independence Center A
	Mining Advisor-Advisee Relationships from Research Publication Networks	Regency E
	UP-Growth: An Efficient Algorithm for High Utility Itemset Mining	Regency F
	Versatile Publishing For Privacy Preservation	Potomac 3+4
10:50AM-11:10AM	Mining Medical Data to Improve Patient Outcomes (DMCS 2005 and 2009 Winner)	Roosevelt
	A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques	Independence Center A
	Estimating Rates of Rare Events with Multiple Hierarchies through Scalable Log-linear Models	Regency E
	Frequent Regular Itemset Mining	Regency F
	Privacy-Preserving Outsourcing Support Vector Machines with Random Transformation	Potomac 3+4
11:10AM-11:30AM	Interactive Data Mining and its Business Applications (Accenture Technology Labs)	Roosevelt
	An Efficient Algorithm for a Class of Fused Lasso Problems	Independence Center A
	Mining Uncertain Data with Probabilistic Guarantees	Regency F
	On the Quality of Inferring Interests From Social Neighbors	Potomac 3+4
	User Browsing Models: Relevance versus Examination	Regency E
11:30AM-11:50AM	Modeling with networked data	Roosevelt
	DUST: A Generalized Notion of Similarity between Uncertain Time Series	Potomac 3+4
	Mining Top-K Frequent Items in a Data Stream with Flexible Sliding Windows	Regency F
	Suggesting Friends Using the Implicit Social Graph	Regency E
	Unsupervised Feature Selection for Multi-Cluster Data	Independence Center A
11:40AM-11:50AM	Cold Start Link Prediction	Potomac 3+4
11:50AM-12:00PM	Modeling with networked data	Roosevelt
	Feature Selection for Support Vector Regression Using Probabilistic Prediction	Independence Center A
	New Perspectives and Methods in Link Prediction	Regency E
	Probably the Best Itemsets	Regency F
12:40PM-1:40PM	Conference Lunch	Independence Center
1:00PM-1:20PM	KDD Cup Awards Presentation (at the Conference Lunch)	Independence Center
1:20PM-1:50PM	Dissertation Awards Lectures (at the Conference lunch)	Independence Center
2:00PM-2:20PM	Discovering Precursors to Aviation Safety Incidents: from Massive Data to Actionable Information	Roosevelt
	Discovering Significant Relaxed Order-Preserving Submatrices	Regency F
	Fast Nearest-neighbor Search in Disk-resident Graphs	Potomac 3+4
	k-Support Anonymity Based on Pseudo Taxonomy for Outsourcing of Frequent Itemset Mining	Independence Center A
	Learning with Cost Intervals	Regency E
2:20PM-2:40PM	What's in your (customer's) Wallet? (DMCS 2005 Prize winner, Edelman Prize winner)	Roosevelt
	Balanced Allocation with Succinct Representation	Potomac 3+4
	Collusion-Resistant Privacy-Preserving Data Mining	Independence Center A
	The New Iris Data: Modular Data Generators	Regency E
	Topic Dynamics: An Alternative Model of "Bursts" in Streams of Topics	Regency F
2:40PM-3:00PM	Text Mining to Fast-Track Deserving Disability Applicants	Roosevelt
	Data Mining with Differential Privacy	Independence Center A
	Extracting Temporal Signatures for Comprehending Systems Biology Models	Regency F
	Neighbor Query Friendly Compression of Social Networks	Potomac 3+4
	Why Label when you can Search? Alternatives to Active Learning for Applying Human Resources to Build Classification Models Under Extreme Class Imbalance	Regency E
3:00PM-3:20PM	(Privacy friendly!) Social Network Targeting for Online Advertising	Roosevelt
	Discovering Frequent Patterns in Sensitive Data	Independence Center A
	Negative Correlations in Collaboration: Concepts and Algorithms	Regency F
3:10PM-3:20PM	Parallel SimRank Computation on Large Graphs with Iterative Aggregation	Potomac 3+4
3:10PM-3:20PM	Dynamics of Conversations	Potomac 3+4
3:35PM-4:00PM	Coffee Break	AV Wall
4:00PM-4:20PM	Discriminative Topic Modeling based on Manifold Learning	Potomac 3+4
	Evaluating Online Ad Campaigns in a Pipeline: Causal Models At Scale	Independence Center A
	Fast Euclidean Minimum Spanning Tree: Algorithm, Analysis, and Applications	Regency F
	Flexible Constrained Spectral Clustering	Regency E
4:20PM-4:40PM	A Hierarchical Information Theoretic Technique for the Discovery of Non Linear Alternative Clusterings	Regency E
	Mining Program Workflow from Interleaved Traces	Regency F
	Online Multiscale Dynamic Topic Models	Potomac 3+4
	Overlapping Experiment Infrastructure: More, Better, Faster Experimentation (KDD-2010 Best Application Honorable Mention)	Independence Center A
4:30PM-4:40PM	Exploitation and Exploration in a Performance based Contextual Advertising System	Independence Center A
4:40PM-5:00PM	Clustering by Synchronization	Regency E
	Connecting the Dots Between News Articles (KDD-2010 Best Research Paper Innovative Contribution)	Regency F
	MineFleet®: An Overview of a Widely Adopted Distributed Vehicle Performance Data Mining System	Independence Center A
	Topic Models with Power-Law Using Pitman-Yor Process	Potomac 3+4
5:00PM-5:10PM	Discovering Frequent Subgraphs over Uncertain Graph Databases under Probabilistic Semantics	Regency F
5:00PM-5:10PM	Multiple Kernel Learning for Heterogeneous Anomaly Detection: Algorithm and Aviation Safety Case Study	Independence Center A
5:00PM-5:20PM	The Topic-Perspective Model for Social Tagging Systems	Potomac 3+4
5:00PM-5:20PM	Unifying Dependent Clustering and Disparate Clustering for Non-homogeneous Data	Regency E
5:10PM-5:20PM	Boosting with Structure Information in the Functional Space: an Application to Graph Classification	Regency F
5:30PM-7:30PM	Poster Reception I	Independence Center B
8:00PM-9:30PM	Dinner	Independence Center B
Tuesday, July 27
7:30AM-8:00PM	Registration	Independence Foyer
8:00AM-6:00PM	Exhibits	Independence Center B
7:30AM-9:00AM	Continental Breakfast	AV Wall
9:00AM-10:00AM	Plenary Talk: Computational Social Science	Regency EF CTR
10:00AM-10:30AM	Coffee Break	AV Wall
10:30AM-10:50AM	Combining Predictions for Accurate Recommender Systems	Regency E
	Discovery of Significant Emerging Trends	Potomac 3+4
	Learning to Combine Discriminative Classifiers	Regency F
	Semi-supervised Feature Selection for Graph Classification	Independence Center A
10:50AM-11:10AM	Data Mining to Predict and Prevent Errors in Health Insurance Claims Processing	Potomac 3+4
	Fast Online Learning through Offline Initialization for Time-sensitive Recommendation	Regency E
	Mining Positive and Negative Patterns for Relevance Feature Discovery	Regency F
	Modeling Relational Events via Latent Classes	Independence Center A
11:10AM-11:30AM	Document Clustering via Dirichlet Process Mixture Model with Feature Selection	Regency F
	On Community Outliers and their Efficient Detection in Information Networks	Independence Center A
	Optimizing Debt Collections Using Constrained Reinforcement Learning (KDD-2010 Best Application Paper)	Potomac 3+4
	Training and Testing of Recommender Systems on Data Missing Not at Random	Regency E
11:30AM-11:40AM	Detecting Abnormal Coupled Sequences and Sequence Changes in Group-based Manipulative Trading Behaviors	Potomac 3+4
	Semantic Relation Extraction With Kernels Over Typed Dependency Trees	Regency F
	Temporal Recommendation on Graphs via Long- and Short-term Preference Fusion	Regency E
11:30AM-11:50AM	Redefining Class Definitions using Constraint-Based Clustering: An Application to Remote Sensing of the Earth's Surface	Independence Center A
11:40AM-11:50AM	Generative Models for Ticket Resolution in Expert Networks	Regency E
11:40AM-11:50AM	Latent Aspect Rating Analysis on Review Text Data: A Rating Regression Approach	Regency F
12:30PM-1:30PM	SIGKDD Business Lunch	Independence Center
1:00PM-2:00PM	Invited Talk	Independence Center
2:35PM-2:55PM	Fast Query Execution for Retrieval Models Based on Path-Constrained Random Walks	Regency F
	Large Linear Classification When Data Cannot Fit In Memory (KDD-2010 Best Research Paper - Technical Contribution)	Regency E
	PET: A Statistical Model for Popular Events Tracking in Social Communities	Independence Center A
2:45PM-4:05PM	The Next Generation of Transportation Systems, Greenhouse Emissions, and Data Mining	Potomac 3+4
2:55PM-3:15PM	Class-Specific Error Bounds for Ensemble Classifiers	Regency E
	The community-search problem and how to plan a successful cocktail party	Independence Center A
	Trust Network Inference for Online Rating Data Using Generative Models	Regency F
	The Next Generation of Transportation Systems, Greenhouse Emissions, and Data Mining (continued 2:45pm - 4:05pm)	Potomac 3+4
3:15PM-3:35PM	An Energy-Efficient Mobile Recommender System	Regency F
	Designing Efficient Cascaded Classifiers: Tradeoff between Accuracy and Cost	Regency E
	Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata	Independence Center A
	The Next Generation of Transportation Systems, Greenhouse Emissions, and Data Mining (continued 2:45pm - 4:05pm)	Potomac 3+4
3:35PM-3:45PM	A Probabilistic Model for Personalized Tag Prediction	Independence Center A
	Direct Mining of Discriminative Patterns for Classifying Uncertain Data	Regency E
	Mixture Models for Learning Low-dimensional Roles in High-dimensional Data	Regency F
	The Next Generation of Transportation Systems, Greenhouse Emissions, and Data Mining (continued 2:45pm - 4:05pm)	Potomac 3+4
3:45PM-3:55PM	BioSnowball: Automated Population of Wikis	Independence Center A
	Ensemble Pruning via Individual Contribution Ordering	Regency E
	Towards Mobility-based Clustering	Regency F
	The Next Generation of Transportation Systems, Greenhouse Emissions, and Data Mining (continued 2:45pm - 4:05pm)	Potomac 3+4
4:05PM-4:25PM	Coffee Break	AV Wall
4:25PM-4:45PM	Automatic Malware Categorization Using Cluster Ensemble	Independence Center A
	Combined Regression and Ranking	Regency E
	Inferring Networks of Diffusion and Influence	Regency F
4:45PM-4:55PM	Beyond Heuristics: Learning to Classify Vulnerabilities and Predict Exploits	Independence Center A
4:45PM-5:05PM	Mass Estimation and Its Applications	Regency E
4:45PM-5:05PM	Scalable Influence Maximization for Prevalent Viral Marketing in Large-Scale Social Networks	Regency F
4:55PM-5:05PM	Diagnosing Memory Leaks using Graph Mining on Heap Dumps	Independence Center A
5:05PM-5:15PM	Community-based Greedy Algorithm for Mining Top-K Influential Nodes in Mobile Social Networks	Regency F
5:05PM-5:25PM	Multi-Label Learning by Exploiting Label Dependency	Regency E
5:05PM-5:25PM	Using Data Mining Techniques to Address Critical Information Exchange Needs in Disaster Affected Public-Private Networks	Independence Center A
5:15PM-5:25PM	DivRank: the Interplay of Prestige and Diversity in Information Networks	Regency E
5:15PM-5:25PM	Social Action Tracking via Noise Tolerant Time-varying Factor Graphs	Regency F
5:25PM-5:35PM	Finding Effectors in Social Networks	Regency F
5:25PM-5:35PM	Tropical Cyclone Event Sequence Similarity Search via Dimensionality Reduction and Metric Learning	Independence Center A
5:45PM-6:30PM	SIGKDD Transfer Meeting (SIGKDD 2010 / 2011 organizers only)	Regency EF CTR
5:45PM-8:00PM	Poster Reception II & Demo Session	Independence Center B
5:45PM-8:00PM	Small Appetizer Buffet	Independence Center A
Wednesday, July 28
7:30AM-9:00PM	Registration	Independence Foyer
8:00AM-12:30PM	Exhibits	Independence Center B
7:30AM-9:00AM	Continental Breakfast	AV Wall
9:00AM-10:00AM	Plenary Invited Talk: The quantification of advertising and lessons from building a business based on large scale data mining	Regency EF CTR
10:00AM-10:30AM	Coffee Break	AV Wall
10:30AM-10:50AM	An Efficient Causal Discovery Algorithm for Linear Models	Regency F
	GLS-SOD: A Generalized Local Statistical Approach for Spatial Outlier Detection	Regency E
	MalStone: Towards a Benchmark for Analytics on Large Data Clouds	Independence Center A
	Unsupervised Transfer Classification: Application to Text Categorization	Potomac 3+4
10:50AM-11:10AM	Compressed Fisher Linear Discriminant Analysis: Classification of Randomly Projected Data	Regency F
	Evolutionary Hierarchical Dirichlet Processes for Multiple Correlated Time-varying Corpora	Regency E
	Nonnegative Shared Subspace Learning and Its Application to Social Media Retrieval	Potomac 3+4
	TIARA: A Visual Exploratory Text Analytic System	Independence Center A
11:10AM-11:30AM	Learning Incoherent Sparse and Low-Rank Patterns from Multiple Tasks	Potomac 3+4
	Metric Forensics: A Multi-Level Approach for Mining Volatile Graphs	Independence Center A
	Online Discovery and Maintenance of Time Series Motifs	Regency E
	Scalable Similarity Search with Optimized Kernel Hashing	Regency F
11:30AM-11:40AM	Active Learning for Biomedical Citation Screening	Independence Center A
11:30AM-11:50AM	Mining Periodic Behaviors for Moving Objects	Regency E
	Multi-Task Learning for Boosting with Application to Web Search Ranking	Potomac 3+4
	Semi-Supervised Sparse Metric Learning Using Alternating Linearization Optimization	Regency F
11:40AM-11:50AM	An integrated machine learning approach to stroke prediction	Independence Center A
11:50AM-12:00PM	Medical Coding Classification by Leveraging Inter-Code Relationships	Independence Center A
	Transfer Metric Learning by Learning Task Relationships	Potomac 3+4
	Universal Multi-Dimensional Scaling	Regency F
12:10PM-12:30PM	Closing Remarks	Regency EF CTR

Plenary Invited Talks

Online Services Division Strategy Overview

Qi Lu, President of Online Services Division, Microsoft

Abstract The online services industry is a rapidly growing industry with a worldwide online ad market projected to grow from $48 billion in 2011 to $67 billion in 2013, of which 47% will come from display advertising and 53% from search advertising. The Online Services Division (OSD) within Microsoft is a leader in the consumer cloud space today with a strong portfolio of three mutually reinforcing businesses: Search, Portal, and Advertising. They are supported by a shared foundational asset of Intent & Knowledge Stores and a shared technology platform supporting large scale data and high performance systems. MSN (Portal) and Bing (Search) generate the content, traffic, and data that make for an exciting, fertile environment for large scale data mining practice and system development. Our advertisers are thus given more valuable targeting opportunities and better ROI, which in turn provide better economics and usability data, and allow for higher-quality services for our advertisers and a better experience for our users. The ability to transform data into meaningful, actionable insight is an important source of competitive advantage for OSD. The data mining initiatives within the division continue to strive for excellence around the following goals: actionable insights through deep data analysis, data mining and data modeling at scale and with speed, increased productivity from deployed large scale data systems and tools, improved product and service development and decision making gained from effective measurement and experimentation, and a mature data culture in product teams that makes the above possible. With many technical and data challenges ahead of us, we are committed to utilizing our huge data asset well to understand the need, intent, and behavior of our users for the purpose of serving them better.

Bio As president of Microsoft's Online Services Division (OSD), Dr. Qi Lu leads the company's search and online advertising efforts. Dr. Lu oversees the OSD Research & Development team which has responsibility for the evolution of Microsoft's search, portal and advertising services; the Online Audience Business Group; and the Advertiser and Publisher Solutions Business Group. Dr. Lu reports to Microsoft chief executive officer Steve Ballmer. Prior to joining Microsoft, Dr. Lu spent 10 years as a Yahoo! senior executive. His roles included serving as the executive vice president of engineering for the company's Search and Advertising Technology Group where he oversaw the development of Yahoo!'s Web search and monetization platforms and vice president of engineering responsible for the technology development of Yahoo!'s search, e-commerce and local listings of businesses and products. Before joining Yahoo!, Dr. Lu worked as a research staff member at IBM's Almaden Research Center and Carnegie Mellon University and was a faculty member at Fudan University in China. He received his bachelor of science and master of science in computer science from Fudan University and his Ph.D. in computer science from Carnegie Mellon University. Dr. Lu holds 20 U.S. patents.

Computational Social Science

David Jensen, Department of Computer Science, University of Massachusetts Amherst

Abstract Research and applications in knowledge discovery and data mining increasingly address some of the most fundamental questions of social science: What determines the structure and behavior of social networks? What influences consumer and voter preferences? How does participation in social systems affect behaviors such as fraud, technology adoption, or resource allocation? Often for the first time, these questions are being examined by analyzing massive data sets that record the behavior and interactions of individuals in physical and virtual worlds.

A new kind of scientific endeavor - computational social science - is emerging at the intersection of social science and computer science. The field draws from a rich base of existing theory from psychology, sociology, economics, and other social sciences, as well as from the formal languages and algorithms of computer science. The result is an unprecedented opportunity to revolutionize the social sciences, expand the reach and impact of computer science, and enable decision-makers to understand the complex systems and social interactions that we must manage in order to address fundamental challenges of economic welfare, energy production, sustainability, health care, education, and crime.

Computational social science suggests an impressive array of new tasks and technical challenges to researchers and practitioners of KDD. These include modeling complex systems with temporal, spatial, and relational dependence; identifying cause and effect rather than mere association; modeling systems with feedback; and conducting analyses in ways that protect the privacy of individuals. Many of these challenges interact in fundamental ways that are both surprising and encouraging. Together, they point to an exciting new future for knowledge discovery and data mining.

Bio David Jensen is Associate Professor of Computer Science and Director of the Knowledge Discovery Laboratory at the University of Massachusetts Amherst. His current research focuses on causal discovery in relational data, computational social network analysis, fraud detection, and privacy. He serves on the Executive Committee of the ACM Special Interest Group on Knowledge Discovery and Data Mining and on the program committees of the International Conference on Machine Learning and the International Conference on Knowledge Discovery and Data Mining. He is an associate editor of the ACM Transactions on Knowledge Discovery from Data. He serves on DARPA's Information Science and Technology (ISAT) Group. He recently served on a National Research Council panel assessing the research program of the National Institutes of Justice. From 1991 to 1995, he served as an analyst with the Office of Technology Assessment, an agency of the United States Congress. He received his doctorate from Washington University in St. Louis in 1992.

The quantification of advertising and lessons from building a business based on large scale data mining

Konrad Feldman, CEO of Quantcast

Abstract As electronic communication, media and commerce increasingly permeate every aspect of modern life, real-time personalization of consumer experience through data-mining becomes practical. Effective classification, prediction and change modeling of consumer interests, behaviors and purchasing habits using machine learning and statistical methods drives efficiency, insights and consumer relevance that were never before possible. The internet has brought on a rapid evolution in advertising. Everything about behavior on the internet can be quantified and responses to behavior can occur in real time. This dynamic interaction with the user has created opportunities to better understand the way in which individuals move from awareness of a product to considering a purchase, through to intent and ultimately a sale for the marketer. When a marketer can answer the question "did those TV ads cause consumers to switch shampoo brands?" they can model behavior change and adjust marketing strategies accordingly. Underpinning this shift in how the world's trillion dollar marketing budget is spent is transactional data on an unprecedented scale, creating new challenges for software that must interpret this stream and make real time decisions tens, even hundreds of thousands of times every second. I will explore advances in modeling media consumption, advertising response and the real-time evaluation of media opportunities through reference to Quantcast, a business launched in September 2006 which today interprets in excess of 10 billion new digital media consumption records every day.
We will examine the challenges of applying machine learning to non-search advertising and in doing so explore the creation of business environments – organization, infrastructure, tools, processes (and costs considerations) – in which scientists can quickly develop new petabyte scale algorithmic approaches, migrate them rapidly to real-time production and deliver fully customized experiences for marketers, publishers and consumers alike.

Bio Konrad Feldman, CEO, co-founded and launched Quantcast in 2006 along with Paul Sutter to transform the effectiveness of online advertising through the use of science and scalable computing. Prior to co-founding Quantcast, Feldman co-founded Searchspace (now Fortent) the leading provider of terrorist financing detection and anti-money laundering software for the world's financial services industry. As CEO of Searchspace's North American business, he established the business in the US and directed its rapid growth to become a market leader. Prior to Searchspace, Feldman was a Research Fellow in the Intelligent Systems Laboratory at University College London. Feldman holds a Bachelor of Science in Computer Science from University College, London.

Industrial Data Mining Case Studies - Invited Talks

Discovering Precursors to Aviation Safety Incidents: from Massive Data to Actionable Information

Ashok Srivastava, Intelligent Data Understanding group, NASA Ames Research Center

Abstract Modern aircraft are producing data at an unprecedented rate with hundreds of parameters being recorded on a second by second basis. The data can be used for studying the condition of the hardware systems of the aircraft and also for studying the complex interactions between the pilot and the aircraft. NASA is developing novel data mining algorithms to detect precursors to aviation safety incidents from these data sources. This talk will cover the theoretical aspects of the algorithms and practical aspects of implementing these techniques to study one of the most complex dynamical systems in the world: the national airspace.

Bio Ashok N. Srivastava, Ph.D. is the Principal Investigator for the Integrated Vehicle Health Management research project at NASA. His current research focuses on the development of data mining algorithms for anomaly detection in massive data streams, kernel methods in machine learning, and text mining algorithms.

Dr. Srivastava is also the leader of the Intelligent Data Understanding group at NASA Ames Research Center. The group performs research and development of advanced machine learning and data mining algorithms in support of NASA missions. He performs data mining research in a number of areas in aviation safety and application domains such as earth sciences to study global climate processes and astrophysics to help characterize the large-scale structure of the universe.

Dr. Srivastava is the author of many research articles in data mining, machine learning, and text mining, and has edited a book on Text Mining: Classification, Clustering, and Applications (with Mehran Sahami, 2009). He is currently editing two more books: Advances in Machine Learning and Data Mining for Astronomy (with Kamal Ali, Michael Way, and Jeff Scargle) and Data Mining in Systems Health Management (with Jiawei Han).

He has won numerous awards including the IEEE Computer Society Technical Achievement Award for "pioneering work in Intelligent Information Systems," the NASA Exceptional Achievement Medal for contributions to state-of-the-art data mining and analysis, the NASA Distinguished Performance Award, several NASA Group Achievement Awards, the IBM Golden Circle Award, and the Department of Education Merit Fellowship.

Modeling with networked data

Francoise Fogelman-Soulie, VP Strategic Business Development, KXEN

Abstract Social Network Analysis has been one of the hottest topics among data mining scientists in the last 5 years. Meanwhile, more recently, companies, especially in Telco, have progressively started using these techniques to improve their predictive models. Through a few case studies, I will present the questions that SNA can address, the methodology we have used and the results which the companies obtained. I will then present other applications (in retail and social network sites), currently being deployed, with the scientific issues they raise.

Bio Francoise Soulie Fogelman is responsible for leading KXEN business development, identifying new business opportunities for KXEN and working with Product development, Sales and Marketing to help promote KXEN's offer. She is also in charge of managing KXEN's University Program. Ms Soulie Fogelman has over 30 years of experience in data mining and CRM both from an academic and a business perspective. Prior to KXEN, she directed the first French research team on Neural Networks at Paris 11 University where she was a CS Professor. She then co-founded Mimetics, a start-up that processes and sells development environment, optical character recognition (OCR) products and services using neural network technology, and became its Chief Scientific Officer. After that she started the Data Mining and CRM group at Atos Origin and, most recently, she created and managed the CRM Agency for Business & Decision, a French IS company specialized in Business Intelligence and CRM. Ms Soulie Fogelman holds a master’s degree in mathematics from Ecole Normale Superieure and a PhD in Computer Science from University of Grenoble. She was advisor to over 20 PhD students on data mining, has authored more than 100 scientific papers and books and has been an invited speaker at many academic and business events.

Interactive Data Mining and its Business Applications

Rayid Ghani, Researcher, Accenture Technology Labs

Abstract A lot of practical data mining applications deal with settings where the goal is to help human experts find rare cases that are of interest to them. Fraud Detection, Intrusion Detection, Surveillance for security applications, Information Filtering, Recommender Systems are some examples of these applications. A common aspect among all of these problems is that they involve users (or experts) in an interactive classification setting, i.e. the experts are interacting with the results of the data mining system and in turn providing feedback that is valuable for the system. The competing goals of the data mining system are to make these experts more efficient and effective in performing their task as well as getting feedback that would allow it to improve itself over time. In this talk, I will describe this interactive data mining setting, give examples of case studies where this setting applies, and how data mining techniques help manage this tradeoff to build practical interactive systems that are not only useful but also improve over time.

Text Mining to Fast-Track Deserving Disability Applicants

John F. Elder IV, Chief Scientist, Elder Research, Inc.

Abstract If your health and finances are sufficiently poor, the Social Security Administration will send you taxpayer dollars to help out. But, applying and qualifying can be a long and frustrating process - sometimes taking up to two years! In the meantime, your health and finances are undoubtedly worsening. (Likely the reason half of those appealing a rejection eventually get approved; the lack of timely help ensures their deterioration.) Yet, by mining the important text of the applications, the SSA can identify those most likely to be approved upon analyst review, and put them in a much more efficient fast track - helping all applicants. The solution involves text extraction, token collocation, Bayesian inference, and a new way to combine evidence.

Bio
Dr. John Elder heads a data mining consulting team with offices in Charlottesville, Virginia; Washington, DC; Mountain View, California; and Manhasset, New York. Founded in 1995, Elder Research, Inc. focuses on investment, commercial and security applications of advanced analytics, including text mining, forecasting, stock selection, image recognition, process optimization, cross-selling, biometrics, drug efficacy, credit scoring, market timing, and fraud detection.

John obtained a BS and MEE in Electrical Engineering from Rice University, and a PhD in Systems Engineering from the University of Virginia, where he’s an adjunct professor teaching Optimization or Data Mining. Prior to 15 years at ERI, he spent 5 years in aerospace defense consulting, 4 heading research at an investment management firm, and 2 in Rice University's Computational & Applied Mathematics department.

Dr. Elder has authored innovative data mining tools, is a frequent keynote speaker, and was co-chair of the 2009 Knowledge Discovery and Data Mining conference, in Paris. John’s courses on analysis techniques -- taught at dozens of universities, companies, and government labs -- are noted for their clarity and effectiveness. Dr. Elder was honored to serve for 5 years on a panel appointed by the President to guide technology for National Security. His book with Bob Nisbet and Gary Miner, Handbook of Statistical Analysis & Data Mining Applications, won the PROSE award for Mathematics in 2009. His book with Giovanni Seni, Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions, was published in February 2010.

Mining Medical Data to Improve Patient Outcomes

R Bharat Rao, Balaji Krishnapuram, Murat Dundar, Siemens Healthcare

Abstract The last century has seen a massive increase in the accuracy and sensitivity of diagnostic tests: from observing external symptoms, to precise laboratory panels, to complex imaging methods for non-invasive internal examinations, to, in the very near future, the use of genomic and molecular analysis at the bedside. This improved diagnostic accuracy has resulted in an exponential increase in the patient data available to the physician. Furthermore, medical knowledge is continuously growing, with physicians being flooded with an expanding array of new tests, updated clinical guidelines on how to diagnose and treat patients, and evidence-based results from clinical trials. Both these trends – the increase in patient data and medical knowledge – will only intensify, as healthcare transforms into the practice of increasingly personalized medicine.

There is a tremendous opportunity for data mining methods to assist the physician, improve patient care, control costs, and ultimately to save lives. In this talk we will provide an overview of the special challenges faced in launching new healthcare data mining products, and identify a few key takeaways for entrepreneurs who want to create new businesses in this domain. We begin by analyzing the clinical need for products to mine medical images to enable radiologists to identify cancers and other medical conditions in asymptomatic patients, and thus begin treatment as early as possible. The next step is personalized therapy selection, which requires data mining methods to mine different patient data sources, including images, free text, labs, pharmacy, molecular & genomic data. We discuss how to determine the scope and market size for products such as these, and identify the key methodological issues we have tackled. We focus on the clinical, regulatory and marketing challenges that we have had to solve over the last decade, as we have gone from concepts to deployed products that are used today in thousands of patient encounters worldwide. We conclude by highlighting results that demonstrate the impact of data mining on patient care and improved outcomes.

Bio Dr. R. Bharat Rao is the Director of Knowledge Solutions in the Health Services Division in Siemens Healthcare. Headquartered in Malvern, PA, USA, Knowledge Solutions focuses on developing products and services that (a) help improve patient outcomes by integrating medical knowledge with various parts of a patient record (free text, images, labs, pharmacy, genomics, etc.), and (b) support the increasing drive to personalize medicine.

Dr. Rao received a B.Tech in Electronics Engineering from the Indian Institute of Technology, Madras in 1985, and an M.S. and Ph.D. focusing on machine learning from the Dept. of Electrical Engineering, University of Illinois, Urbana-Champaign, in 1993. He joined Siemens Corporate Research in 1993, and formed the Data Mining group there in 1996. In 2002, he moved to Siemens Healthcare to help found the Computer-Aided Diagnosis & Knowledge Solutions group.

Dr. Rao's research interests include probabilistic inference, machine learning, natural language processing, classification, and graphical models, with a focus on developing decision-support systems that can help physicians improve the quality of patient care. He is particularly interested in the development of novel data mining methods to collectively mine the structured and unstructured parts of a patient record and the automatic integration of medical domain knowledge into the mining process. He has published over 100 papers in peer-reviewed scientific journals and conferences in machine learning and medicine and has filed over 50 patents. In 2005, Siemens honored him with its "Inventor of the Year" award for “outstanding contributions related to improving the technical expertise and the economic success of the company” for developing the REMIND™ (Reliable Extraction and Meaningful Inference from Nonstructured Data) Platform. The REMIND Platform supports both the integration of knowledge into medical decision-support, as well as the discovery of novel medical knowledge to support personalized medicine. He has twice received the IEEE Data Mining Practice Prize for the best deployed industrial and government data mining application in 2005 (for the REMIND Platform) and 2009 (for Computer-Aided Diagnosis applications).

(Privacy-friendly!) Social Network Targeting for On-line Advertising

Foster Provost, Professor, Leonard N. Stern School of Business, New York University

Abstract I will discuss privacy-friendly methods for finding good audiences for on-line display advertising, by extracting quasi-social networks from browser behavior on user-generated content sites. Targeting social-network neighbors resonates well with advertisers, and on-line browsing behavior data counterintuitively can allow the identification of good audiences anonymously. I will discuss methods for extracting quasi-social networks from data on visitations to social media pages. The data are completely anonymous with respect to both browser identity and content. I will introduce measures for computing which browsers are "close" to other browsers that in the past have exhibited brand affinity. Results show that audiences with high brand proximity indeed show substantially higher brand affinity themselves, as well as higher propensity to convert. Time permitting, I also will present additional findings relating to whether the quasi-social network actually embeds a true social network, how to gather appropriate training data, and whether on-line advertising actually is effective. This work was done in collaboration with Michael Barnathan, Brian Dalessandro, Rod Hook, Alan Murray, Claudia Perlich, and Xiaohan Zhang.
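
As a rough illustration of the "brand proximity" idea in this abstract, the sketch below links two browsers when they have visited at least one common page, then scores each browser by how many of its neighbors are seed browsers with known brand affinity. This is not the method presented in the talk: the function name `brand_proximity`, the co-visitation link rule, the neighbor-count score, and the toy data are all assumptions made for illustration. Browser and page identifiers are treated as opaque tokens, in keeping with the anonymity described above.

```python
from collections import defaultdict


def brand_proximity(visits, seeds):
    """Score non-seed browsers by their number of seed neighbors.

    visits: dict mapping an opaque browser id to the set of opaque page ids
            it visited (hypothetical input format).
    seeds:  set of browser ids that previously showed brand affinity.
    Two browsers count as neighbors if they share at least one page
    (an assumed, simplified notion of a quasi-social link).
    """
    # Invert the browser -> pages map so each page lists its visitors.
    visitors = defaultdict(set)
    for browser, pages in visits.items():
        for page in pages:
            visitors[page].add(browser)

    scores = {}
    for browser, pages in visits.items():
        if browser in seeds:
            continue  # only score browsers whose affinity is unknown
        neighbors = set()
        for page in pages:
            neighbors |= visitors[page]
        # Proximity = number of distinct seed browsers among the neighbors.
        scores[browser] = len(neighbors & seeds)
    return scores


# Toy data: four unlabeled browsers and two seed browsers.
visits = {
    "b1": {"p1", "p2"},
    "b2": {"p2", "p3"},
    "b3": {"p3"},
    "b4": {"p4"},
    "s1": {"p1", "p2"},
    "s2": {"p3"},
}
seeds = {"s1", "s2"}
scores = brand_proximity(visits, seeds)
```

In this toy example `b2` shares pages with both seeds and so ranks highest, while `b4` shares none. A practical measure would likely weight links, for instance by discounting very popular pages, which this sketch does not attempt.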

Bio Foster Provost is Professor, NEC Faculty Fellow, and Paduano Fellow of Business Ethics (Emeritus) at the NYU Stern School of Business. He is Chief Scientist for Coriolis Ventures, a NYC-based early stage venture and incubation firm. In 2001 he was Program Chair of the KDD Conference, and he just retired as Editor-in-Chief of the journal Machine Learning. His main research interests these days include predictive modeling with (social) network data, and alternative methods for data acquisition for data mining. Foster has applied data mining in practice to applications including on-line advertising, fraud detection, network diagnosis, targeted marketing, counterterrorism, and others. His work has won best paper awards at KDD, IBM Faculty Awards, and a President's Award at NYNEX Science and Technology. Last year his work on social network-based marketing systems won the 2009 INFORMS Design Science Award.

What's in your (customer's) wallet?

Claudia Perlich, Chief Scientist, Media6Degrees

Abstract In 2009 IBM was recognized as a finalist of the INFORMS Edelman competition for its predictive modeling initiative to improve the productivity of its global salesforce, with an estimated business impact of approximately 100 million dollars. The first component implements some traditional propensity modeling to identify new sales opportunities and is currently used by over 13,000 sales reps. The second 'wallet estimation' component is used strategically to allocate sales resources based on validated analytical estimates of revenue opportunity. In this case study we cover the key elements leading to the success, including data integration, data mining and predictive modeling, solution delivery, human-guided model validation, and integration of the business process, and we conclude with an assessment of the bottom-line business impact.

Bio Prior to joining Media6Degrees, Claudia spent five years working at the Data Analytics Research group at the IBM T.J. Watson Research Center, concentrating on research in data analytics and machine learning for complex real-world domains and applications. She has authored over 30 scientific publications and holds multiple patents in the area of machine learning. Claudia has won many data mining competitions, including the prestigious 2007 KDD CUP on movie ratings, the 2008 KDD CUP on breast-cancer detection, and the 2009 KDD CUP on churn and propensity predictions for telecommunication customers. Claudia received her Ph.D. in Information Systems from Stern School of Business, New York University in 2005 and holds a Master of Computer Science from Colorado University.
