Accepted Tutorials | http://www.kdd.org/kdd2013

Morning:

1. Algorithmic techniques for modeling and mining large graphs (AMAzING)

Abstract
Network science has emerged in recent years as an interdisciplinary area spanning traditional domains including mathematics, computer science, sociology, biology and economics. Since complexity in social, biological and economic systems, and more generally in complex systems, arises through pairwise interactions, there is a surging interest in understanding networks.

In this tutorial, we will provide an in-depth presentation of the most popular random-graph models used for modeling real-world networks. We will then discuss efficient algorithmic techniques for mining large graphs, with emphasis on the problems of extracting graph sparsifiers, partitioning graphs into densely connected components, and finding dense subgraphs. We will motivate both the problems and the algorithms with real-world applications.

Our aim is to survey important results in the areas of modeling and mining large graphs, to uncover the intuition behind the key ideas, and to present future research directions.
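
To make the dense-subgraph problem named in the abstract concrete, here is a minimal Python sketch of greedy peeling, a classic heuristic that repeatedly deletes a minimum-degree vertex and keeps the densest intermediate vertex set, with density measured as edges per vertex. It is an illustration only and not necessarily the method presented in the tutorial; the function name `densest_subgraph_peeling` and the edge-list input format are assumptions, and the linear scan for the minimum-degree vertex favors clarity over speed.

```python
from collections import defaultdict


def densest_subgraph_peeling(edges):
    """Greedy peeling sketch for the densest-subgraph problem.

    `edges` is an iterable of (u, v) pairs of an undirected graph
    (hypothetical input format). Returns (vertex_set, density), where
    density is the number of edges divided by the number of vertices.
    """
    # Build an adjacency structure, ignoring self-loops and duplicate edges.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    remaining = set(adj)
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_set = set(remaining)
    best_density = num_edges / len(remaining) if remaining else 0.0

    while len(remaining) > 1:
        # Pick a vertex of minimum current degree (O(n) scan for simplicity).
        u = min(remaining, key=lambda x: len(adj[x]))
        num_edges -= len(adj[u])
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
        remaining.remove(u)

        # Remember the densest vertex set seen during the peeling.
        density = num_edges / len(remaining)
        if density > best_density:
            best_density = density
            best_set = set(remaining)

    return best_set, best_density
```

For example, on a 4-clique over vertices 0 to 3 with a two-edge tail 3-4-5, `densest_subgraph_peeling([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])` returns the clique `{0, 1, 2, 3}` with density 1.5.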

Who Should Attend
The tutorial presents both classic and cutting-edge research topics on networks. We aim to go into depth on the following topics: random graphs, graph sparsifiers, graph partitioning, finding dense subgraphs, and their applications. The tutorial will blend computer science rigor with real-world applications. It should be of theoretical and practical interest to the graph analysis community and to a large part of the data mining community as well.

Prerequisites
Computer science background (B.Sc. or equivalent); familiarity with undergraduate-level concepts covered in probability and algorithms classes.

Instructors
Dr. Alan Frieze is a professor in the Department of Mathematical Sciences at Carnegie Mellon University, Pittsburgh, United States. He graduated from the University of Oxford in 1966, and obtained his Ph.D. from the University of London in 1975. His research interests lie in combinatorics, discrete optimization and theoretical computer science. In 1991, Dr. Frieze received the Fulkerson Prize in Discrete Mathematics awarded by the American Mathematical Society and the Mathematical Programming Society. In 1997, he was a Guggenheim Fellow. In 2000, he received the IBM Faculty Partnership Award. In 2006, he jointly received (with Michael Krivelevich) the Professor Pazy Memorial Research Award from the United States-Israel Binational Science Foundation. In 2011, he was selected as a SIAM Fellow. In 2012, he was selected as an AMS Fellow.

Dr. Aristides Gionis is an associate professor in the Department of Information and Computer Science at Aalto University, Finland. Previously, he was a senior research scientist at Yahoo! Research. He received his Ph.D. from the Computer Science Department of Stanford University in 2003. He currently serves as an associate editor of the IEEE Transactions on Knowledge and Data Engineering (TKDE). He has served on the PC of numerous premier conferences, including as PC co-chair for WSDM 2013 and ECML PKDD 2010. His research interests include data mining, web mining, and algorithmic data analysis.

Dr. Charalampos Tsourakakis is an Aalto Science Fellow. He received his Ph.D. in Algorithms, Combinatorics and Optimization at Carnegie Mellon University. He holds a Diploma in Electrical and Computer Engineering from the National Technical University of Athens and a Master of Science from the Machine Learning Department at Carnegie Mellon University. His research interests include algorithm design, random graphs and data mining.

2. Mining Data from Mobile Devices: A Survey of Smart Sensing and Analytics

Abstract: Mobile connected devices, and smartphones in particular, are rapidly emerging as a dominant computing and sensing platform. This creates several unique opportunities for data collection and analysis, as well as new challenges. In this tutorial, we survey the state of the art in mining data from mobile devices across different application areas such as ads, healthcare, geo-social, and public policy. Our tutorial has three parts. In part one, we summarize data collection in terms of various sensing modalities. In part two, we present cross-cutting challenges such as real-time analysis and security, and we outline cross-cutting methods for mobile data mining such as network inference and streaming algorithms. In the last part, we give an overview of emerging and fast-growing application areas such as those noted above. In conclusion, we briefly highlight the opportunities for joint design of new data collection techniques and analysis methods, suggesting additional directions for future research. mobilemining.clusterhack.net

Speaker bio for presenter 1: Spiros Papadimitriou is mainly interested in data mining for graphs and streaming data, clustering, time series, large-scale data processing, and mobile applications. His interests range from the very small (embedded devices and sensors; Arduino) to the very large (large-scale data processing and analysis; Hadoop). He has published more than forty papers on these topics in refereed conferences and journals. He received the best paper award at SDM 2008, has three invited journal publications in best-paper issues and several book chapters, and has filed multiple patents. He has also been invited to give keynote talks on graph and social network analysis (WAAMD 2008 and ADN 2009) and tutorials on time series stream mining (University of Maine Summer School, 2008) and large-scale analytics (Carnegie Mellon University, 2012). In the past, he has also developed and released a number of Android applications (including live-view mobile OCR and web service clients) that have 50,000 downloads. He is currently an assistant professor at Rutgers University (MSIS-RBS). Prior to that, he was a research scientist at Google and a research staff member at IBM Research. He was a Siebel scholarship recipient in 2005. He obtained his MSc and PhD degrees from Carnegie Mellon University.

Speaker bio for presenter 2: Tina Eliassi-Rad is an Associate Professor of Computer Science at Rutgers University. Before joining academia, she was a Member of Technical Staff and Principal Investigator at Lawrence Livermore National Laboratory. Tina earned her Ph.D. in Computer Sciences (with a minor in Mathematical Statistics) at the University of Wisconsin-Madison. Within data mining and machine learning, Tina’s research has been applied to the World-Wide Web, text corpora, large-scale scientific simulation data, complex networks, and cyber situational awareness. She has published over 50 peer-reviewed papers (including a best paper runner-up award at ICDM’09 and a best interdisciplinary paper award at CIKM’12) and has given over 70 invited presentations. Tina is an action editor for the Data Mining and Knowledge Discovery Journal. In 2010, she received an Outstanding Mentor Award from the US DOE Office of Science and a Directorate Gold Award from Lawrence Livermore National Laboratory for work on cyber situational awareness. For more details, visit http://eliassi.org.

3. Big Data Analytics for Healthcare

Abstract
Large amounts of heterogeneous medical data have become available in various healthcare organizations (payers, providers, pharmaceuticals). These data could be an enabling resource for deriving insights that improve care delivery and reduce waste. The enormity and complexity of these datasets present great challenges for analysis and for subsequent application in a practical clinical environment. In this tutorial, we introduce the characteristics of big medical data and the related mining challenges. Many of these insights come from the medical informatics community, which is closely related to data mining but focuses on biomedical specifics. We survey related papers from data mining venues as well as medical informatics venues to share with the audience key problems and trends in healthcare analytics research, with applications ranging from clinical text mining, predictive modeling, survival analysis, and patient similarity to genetic data analysis and public health. The tutorial will include several case studies dealing with some of the important healthcare applications.

Speaker bio for each presenter
Jimeng Sun is a research staff member at IBM TJ Watson Research Center. Dr. Sun received his PhD in Computer Science from Carnegie Mellon University in fall 2007. He studied in the Computer Science Department at Carnegie Mellon University from 2003 to 2007, advised by Prof. Christos Faloutsos. His research focuses on healthcare analytics and informatics, large-scale data mining, graph mining, high-dimensional data mining such as time series, matrices, and tensors (data cubes), and visual analytics. Dr. Sun received the ICDM best research paper award in 2007, the SDM best research paper award in 2007, and the KDD Dissertation runner-up award in 2008. His personal homepage, http://www.dasfa.net/jimeng, has more details.

Chandan K. Reddy is an Assistant Professor in the Department of Computer Science at Wayne State University. He received his PhD from Cornell University and MS from Michigan State University. His primary research interests are in the areas of data mining and machine learning with applications to healthcare, bioinformatics, and social network analysis. His research is funded by the National Science Foundation, the National Institutes of Health, the Department of Transportation, and the Susan G. Komen for the Cure Foundation. He has published over 45 peer-reviewed articles in leading conferences and journals. He received the Best Application Paper Award at the ACM SIGKDD conference in 2010 and was a finalist of the INFORMS Franz Edelman Award Competition in 2011. He is a member of IEEE, ACM, and SIAM.

Afternoon:

4. Entity Resolution for Big Data

Abstract: Entity resolution (ER), the problem of extracting, matching and resolving entity mentions in structured and unstructured data, is a long-standing challenge in database management, information retrieval, machine learning, natural language processing and statistics. Accurate and fast entity resolution has huge practical implications in a wide variety of commercial, scientific and security domains. Despite the long history of work on entity resolution, there is still a surprising diversity of approaches, and lack of guiding theory. Meanwhile, in the age of big data, the need for high quality entity resolution is growing, as we are inundated with more and more data, all of which needs to be integrated, aligned and matched, before further utility can be extracted. In this tutorial, we bring together perspectives on entity resolution from a variety of fields, including databases, information retrieval, natural language processing and machine learning, to provide, in one setting, a survey of a large body of work. We discuss both the practical aspects and theoretical underpinnings of ER. We describe existing solutions, current challenges and open research problems. In addition to giving attendees a thorough understanding of existing ER models, algorithms and evaluation methods, the tutorial will cover important research topics such as scalable ER, active and lightly supervised ER, and query-driven ER.

Lise Getoor is a professor in the Computer Science Department at the University of Maryland, College Park. Her primary research interests are in machine learning and reasoning with uncertainty, applied to structured and semi-structured data. She also works on data integration, social network analysis and visual analytics. She has six best paper awards and an NSF Career Award; has served as associate editor for the Machine Learning Journal, JAIR, and TKDD; is an elected member of the International Machine Learning Society board and the AAAI Executive Council; was PC co-chair of ICML 2011; and has served on a variety of program committees including AAAI, ICML, IJCAI, ISWC, KDD, SIGMOD, UAI, VLDB, WSDM and WWW. She received her Ph.D. from Stanford University, her M.S. from UC Berkeley, and her B.S. from UC Santa Barbara.

Ashwin Machanavajjhala is an Assistant Professor in the Department of Computer Science, Duke University. Previously, he was a Senior Research Scientist in the Knowledge Management group at Yahoo! Research. His primary research interests lie in data privacy, systems for massive data analytics, and statistical methods for information extraction and entity resolution. He received the NSF CAREER award in 2013 and the ACM SIGMOD Jim Gray Dissertation Award Honorable Mention in 2008. He received his Ph.D. from Cornell University and a B.Tech in Computer Science and Engineering from the Indian Institute of Technology, Madras.

5. Network Sampling

Abstract: Network data appears in various domains, including social, communication, and information sciences. Analysis of such data is crucial for making inferences and predictions about these networks and for understanding the different processes that drive their evolution. However, a major bottleneck in performing such analysis is the massive size of real-life networks, which makes modeling and analyzing these networks simply infeasible. Further, many networks, specifically those that belong to social and communication domains, are not visible to the public due to privacy concerns, and other networks, such as the Web, are only accessible via crawling. To overcome these challenges, researchers overwhelmingly rely on network sampling as a key statistical approach for selecting a sub-population of interest that can be studied thoroughly.

In this tutorial, we aim to cover a diverse collection of methodologies and applications of network sampling. We will begin with a discussion of the problem setting in terms of objectives (such as sampling a representative subgraph or sampling graphlets), population of interest (vertices, edges, motifs), and sampling methodologies (such as Metropolis-Hastings, random walk, and snowball sampling). We will then present a number of applications of these methods, and will outline both the resulting opportunities and possible biases of different methods in each application.
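
As a hedged illustration of one sampling methodology named above, the following Python sketch runs a Metropolis-Hastings random walk over an undirected graph. A plain random walk visits high-degree vertices more often; accepting a proposed move from `u` to a neighbor `v` with probability min(1, deg(u)/deg(v)) corrects that bias so that, on a connected graph, the walk targets the uniform distribution over vertices. The function name `metropolis_hastings_walk`, the adjacency-list dictionary format, and the absence of burn-in or thinning are assumptions made for brevity; this is not taken from the tutorial material.

```python
import random


def metropolis_hastings_walk(adj, start, num_steps, rng=None):
    """Metropolis-Hastings random walk sketch for vertex sampling.

    `adj` maps each vertex to a non-empty list of its neighbors in an
    undirected graph (hypothetical input format). Returns the list of
    vertices visited, starting with `start`, of length num_steps + 1.
    """
    rng = rng or random.Random()
    current = start
    visited = [current]
    for _ in range(num_steps):
        # Propose a uniformly random neighbor of the current vertex.
        proposal = rng.choice(adj[current])
        # Accept with probability min(1, deg(current) / deg(proposal));
        # otherwise stay put, which counts as another visit to `current`.
        accept_prob = min(1.0, len(adj[current]) / len(adj[proposal]))
        if rng.random() < accept_prob:
            current = proposal
        visited.append(current)
    return visited
```

In practice one would typically discard an initial portion of the walk before treating the visited vertices as a sample; that step is omitted here.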

Mohammad A. Hasan is an Assistant Professor of Computer Science at Indiana University–Purdue University, Indianapolis (IUPUI). Before that, he was a Senior Research Scientist at eBay Research Labs, San Jose, CA. He received a Ph.D. degree in Computer Science from Rensselaer Polytechnic Institute (RPI) in 2009, and an MS degree in Computer Science from the University of Minnesota, Twin Cities in 2002. His research focuses on developing novel algorithms in data mining, data management, information retrieval, machine learning, social network analysis, and bioinformatics. One of his particular interests is developing algorithms for sampling small substructures from large networks. He developed methods for (1) sampling frequent subgraphs from a graph database, (2) sampling triangles and graphlets from a large network, and (3) sampling interesting subgraph patterns using interactive feedback, all using Markov chain Monte Carlo (MCMC) sampling algorithms. His doctoral dissertation won the ACM SIGKDD doctoral dissertation award in 2010. He is also a recipient of an NSF CAREER award in 2012.

Jennifer Neville is an assistant professor at Purdue University with a joint appointment in the Departments of Computer Science and Statistics. She received her PhD from the University of Massachusetts Amherst in 2006. She was awarded an NSF Career Award in 2012, was chosen by IEEE as one of “AI’s 10 to Watch” in 2008, and was selected as a member of the DARPA Computer Science Study Group in 2007. Her research focuses on developing data mining and machine learning techniques for relational domains, including citation analysis, fraud detection, and social network analysis.

Nesreen Ahmed is a fifth-year Ph.D. student working with Jennifer Neville in the Computer Science Department at Purdue University. Her Ph.D. research focuses on statistical network sampling and network stream sampling. She has also worked on developing machine learning algorithms for time series forecasting, statistical predictive analysis of social media, and digital marketing. She has worked as a research intern at Adobe ATL labs and Intel Corporation, a research assistant at the data mining and computer modeling center of excellence in Egypt, and a teaching assistant at Cairo University.

6. The Dataminer’s Guide to Scalable Mixed-Membership and Nonparametric Bayesian Models

Abstract: Large amounts of data arise in a multitude of situations, ranging from bioinformatics to astronomy, manufacturing, and medical applications. For concreteness, our tutorial focuses on data obtained in the context of the internet, such as user-generated content (microblogs, e-mails, messages), behavioral data (locations, interactions, clicks, queries), and graphs. Because of the magnitude of the data, much of the challenge is to extract structure and interpretable models without the need for additional labels, i.e. to design effective unsupervised techniques. We present design patterns for hierarchical nonparametric Bayesian models, efficient inference algorithms, and modeling tools to describe salient aspects of the data.

Dr. Amr Ahmed is a Research Scientist at Google. He received his PhD from Carnegie Mellon University in 2011. His thesis “Modeling Users and Content: Structured Probabilistic Representation and Scalable Online Inference Algorithms” was awarded the prestigious ACM SIGKDD Doctoral Dissertation award in 2012. He spent a year as a Research Scientist at Yahoo! Research before joining Google. He authored over 40 papers on topics that are core to this tutorial (including a best-paper runner-up award at WSDM 2012) and co-presented 3 tutorials at web and machine learning conferences.

Dr. Alex Smola received his PhD from the University of Technology in Berlin in 1998. Subsequently, he was a research group leader and professor at the Australian National University and Senior Principal Researcher at National ICT Australia. From 2008 until 2012 he was Principal Research Scientist at Yahoo. Since 2012 he has been a visiting researcher at Google, and since 2013 a full professor in the Machine Learning Department of Carnegie Mellon University. He has written over 180 papers (which have won several best paper awards at ICML, WSDM and SIGIR) and authored or edited 5 books. His work covers a broad range of subjects, from statistical learning theory, convex optimization, and functional analysis to practical algorithms for scalable data classification, regression, clustering, and topic models. His recent work focuses on distributed, very large scale latent variable models for user profiling and content recommendation.