VC-Dimension and Rademacher Averages: From Statistical Learning Theory to Sampling Algorithms
Rademacher Averages and the Vapnik-Chervonenkis dimension are
fundamental concepts from statistical learning theory. They make it
possible to study simultaneous deviation bounds of empirical averages from their
expectations for classes of functions, by considering properties of
the problem, of the dataset, and of the sampling process. In this
tutorial, we survey the use of Rademacher Averages and the
VC-dimension for developing sampling-based algorithms for graph
analysis and pattern mining. We start from their theoretical
foundations at the core of machine learning, then show a generic
recipe for formulating data mining problems in a way that allows using
these concepts in the analysis of efficient randomized algorithms for
those problems. Finally, we show examples of the application of the
recipe to graph problems (connectivity, shortest paths, betweenness
centrality) and pattern mining. Our goal is to expose the usefulness
of these techniques for the data mining researcher, and to encourage
research in the area.
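As a toy illustration of the central quantity, the sketch below (the function class, sample, and all names are invented for illustration) estimates the empirical Rademacher average of a finite function class by Monte Carlo over random sign vectors:

```python
import random

def empirical_rademacher(values, trials=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher average
    R_hat = E_sigma[ max_f (1/n) * sum_i sigma_i * f(x_i) ],
    where values[f][i] holds f(x_i) on a sample of size n."""
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]  # random signs
        total += max(sum(s * v for s, v in zip(sigma, row)) / n
                     for row in values)
    return total / trials

# A tiny class of three 0/1-valued functions evaluated on 8 sample points.
F = [[0, 1, 0, 1, 1, 0, 1, 0],
     [1, 1, 1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 1]]
r = empirical_rademacher(F)  # small for such a simple class
```

A small empirical Rademacher average translates, via standard symmetrization bounds, into small uniform deviations of empirical averages from their expectations, which is what justifies the fixed sample sizes used by the algorithms surveyed.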
Matteo Riondato is a postdoctoral research associate at Brown
University, USA, supervised by Prof. Eli Upfal. He received his Ph.D.
from Brown in May 2014, with a dissertation on sampling-based
randomized algorithms for data analytics, which received the Best
Student Poster Award at SIAM SDM 2014. He presented a nectar talk
about modern sampling algorithms at ECML PKDD 2014. His research
focuses on exploiting theoretical results for practical algorithms in
pattern and graph mining.
Eli Upfal is a professor of computer science at Brown University,
where he was also the department chair from 2002 to 2007. Prior to
joining Brown in 1998, he was a researcher and project manager at the
IBM Almaden Research Center in California, and a professor of Applied
Mathematics and Computer Science at the Weizmann Institute of Science
in Israel. Upfal’s research focuses on the design and analysis of
algorithms. In particular he is interested in randomized algorithms,
probabilistic analysis of algorithms, and computational statistics,
with applications ranging from combinatorial and stochastic
optimization to routing and communication networks, computational
biology, and computational finance. Upfal is a fellow of the IEEE and
the ACM. He received the IBM Outstanding Innovation Award, and the IBM
Research Division Award. His work at Brown has been funded in part by
the National Science Foundation (NSF), the Defense Advanced Research
Projects Agency (DARPA), the Office of Naval Research (ONR), and the
National Institutes of Health (NIH). He is a co-author of the popular
textbook “Probability and Computing: Randomized Algorithms and
Probabilistic Analysis” (with M. Mitzenmacher, Cambridge University Press).
Graph-Based User Behavior Modeling: From Prediction to Fraud Detection
How can we model users' preferences? How do anomalies, fraud, and spam affect our models of normal users? How can we modify our models to catch fraudsters? In this tutorial we will answer these questions, connecting graph analysis tools for user behavior modeling to anomaly and fraud detection. In particular, we will focus on the application of subgraph analysis, label propagation, and latent factor models to static, evolving, and attributed graphs.
For each of these techniques we will give a brief explanation of the algorithms and the intuition behind them. We will then give examples of recent research using these techniques to model, understand, and predict normal behavior. With this intuition for how these methods are applied to graphs and user behavior, we will focus on state-of-the-art research showing how the outcomes of these methods are affected by fraud, and how they have been used to catch fraudsters.
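As a minimal illustration of one of these tools, here is a toy label propagation sketch (the graph, seeds, and averaging scheme are invented for illustration, not taken from any specific paper): known fraudulent and honest nodes are clamped, and the remaining nodes repeatedly take the mean of their neighbors' scores.

```python
def label_propagation(adj, seeds, iters=50):
    """Propagate real-valued labels (+1 honest, -1 fraud) over a graph.
    `adj` maps node -> list of neighbors; `seeds` maps node -> fixed label.
    Unlabeled nodes repeatedly take the mean of their neighbors' scores."""
    score = {v: seeds.get(v, 0.0) for v in adj}
    for _ in range(iters):
        new = {}
        for v in adj:
            if v in seeds:                 # seed labels stay clamped
                new[v] = seeds[v]
            elif adj[v]:
                new[v] = sum(score[u] for u in adj[v]) / len(adj[v])
            else:
                new[v] = score[v]
        score = new
    return score

# Toy graph: u1, u2 interact with a known-fraud page; u2, u3 with an honest one.
adj = {"u1": ["p_bad"], "u2": ["p_bad", "p_good"], "u3": ["p_good"],
       "p_bad": ["u1", "u2"], "p_good": ["u2", "u3"]}
scores = label_propagation(adj, {"p_bad": -1.0, "p_good": +1.0})
```

Here u1 inherits a negative (fraud-like) score, u3 a positive one, and u2, connected to both seeds, lands in between.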
is a fifth-year Ph.D. candidate at Carnegie Mellon University in the Computer Science Department. He previously received his B.S. from Duke University. His Ph.D. research focuses on large-scale user behavior modeling, covering both recommendation systems and fraud detection systems. He has interned at Facebook on both the Site Integrity and News Feed Ranking teams, at Microsoft in the Cloud and Information Services Laboratory, and at Google Research. Alex's research is supported by the National Science Foundation Graduate Research Fellowship Program and a Facebook Fellowship. More details can be found at http://alexbeutel.com
is an Assistant Professor in the Department of Computer Science at Stony Brook University. She received her Ph.D. from the Computer Science Department at Carnegie Mellon University in 2012. She also worked at IBM T. J. Watson Research Labs and Microsoft Research at Redmond during summers. Her research interests span a wide range of data mining and machine learning topics, with a focus on algorithmic problems arising in graph mining, pattern discovery, social and information networks, and especially anomaly mining: outlier, fraud, and event detection. Dr. Akoglu's research has won four publication awards: Best Research Paper at SIAM SDM 2015, Best Paper at ADC 2014, Best Paper at PAKDD 2010, and Best Knowledge Discovery Paper at ECML/PKDD 2009. She also holds three U.S. patents filed by IBM T. J. Watson Research Labs. Dr. Akoglu is a recipient of the NSF CAREER award (2015) and the Army Research Office Young Investigator award (2013). Her research is currently supported by the National Science Foundation, the US Army Research Office, DARPA, and a gift from Northrop Grumman Aerospace Systems. More details can be found at http://www.cs.stonybrook.edu/~leman
is a Professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award from the National Science Foundation (1989), the Research Contributions Award at ICDM 2006, the Innovations Award at KDD 2010, 20 "best paper" awards, and several teaching awards. He has served as a member of the executive committee of SIGKDD; he has published over 200 refereed articles, 11 book chapters, and one monograph. He holds five patents and has given over 30 tutorials and over 10 invited distinguished lectures. His research interests include data mining for graphs and streams, fractals, database performance, and indexing for multimedia and bio-informatics data. More details can be found at http://www.cs.cmu.edu/~christos/
A New Look at the System, Algorithm and Theory Foundations of Large-Scale Distributed Machine Learning
The rise of Big Data has led to new demand for Machine Learning (ML) systems to learn complex models, often with millions to billions of parameters, that promise adequate capacity to analyze massive datasets and offer predictive functions thereupon. For example, in many modern applications such as web-scale content extraction via topic models, genome-wide association mapping via sparse structured regression, and image understanding via deep neural networks, one needs to handle BIG ML problems that threaten to exceed the limit of current architectures and algorithms. In this tutorial, we present a systematic overview of modern scalable ML approaches for such applications --- the insights and challenges of designing scalable and parallelizable algorithms for working with Big Data and Big Model; the principles and architectures of building distributed systems for executing these models and algorithms; and the theory and analysis necessary for understanding the behaviors and providing guarantees of these models, algorithms, and systems.
We present a comprehensive, principled, yet highly unified and application-grounded view of the fundamentals and strategies underlying a wide range of modern ML programs practiced in industry and academia, beginning by introducing the basic algorithmic roadmaps of both optimization-theoretic and probabilistic-inference methods --- the two major workhorse algorithmic engines that power nearly all ML programs --- and the technical developments therein aimed at large scales, built on algorithmic acceleration, stochastic approximation, and parallelization. We then turn to the challenges such algorithms must face in a practical distributed computing environment due to memory/storage limits, communication bottlenecks, resource contention, stragglers, etc., and review and discuss various modern parallelization strategies and distributed frameworks that can actually run these algorithms at Big Data and Big Model scales, while also exposing the theoretical insights that make such systems and strategies possible. We focus on what makes ML algorithms peculiar, and how this can lead to algorithmic and systems designs that are markedly different from today’s Big Data platforms. We discuss these new opportunities in algorithms, systems, and theory for parallel machine learning in real (rather than idealized) distributed communication, storage, and computing environments.
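To make the data-parallel pattern concrete, here is a deliberately simplified sketch (plain Python, invented toy data and function names) of synchronous model averaging: each shard runs local SGD on a least-squares objective and a driver averages the resulting models. This is the simplest consistency scheme, which parameter-server designs relax with asynchronous updates.

```python
import random

def sgd_shard(w, shard, lr=0.05, epochs=30):
    """Plain SGD for least-squares regression on one data shard.
    Each worker in a data-parallel setup would run this locally."""
    w = list(w)
    for _ in range(epochs):
        for x, y in shard:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            g = pred - y                  # gradient of 0.5 * (pred - y)^2
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

def parallel_sgd(shards, dim):
    """Synchronous step: train each shard locally, then average models."""
    models = [sgd_shard([0.0] * dim, s) for s in shards]
    return [sum(ws) / len(models) for ws in zip(*models)]

# Toy noiseless data from y = 2*x1 + 1 (bias encoded as constant feature 1.0).
rng = random.Random(1)
data = [([1.0, x], 2 * x + 1) for x in [rng.uniform(-1, 1) for _ in range(200)]]
shards = [data[:100], data[100:]]
w = parallel_sgd(shards, 2)   # close to [1, 2]
```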
Dr. Eric Xing is a Professor of Machine Learning in the School of Computer Science at Carnegie Mellon University, and Director of the CMU/UPMC Center for Machine Learning and Health. His principal research interests lie in the development of machine learning and statistical methodology, and large-scale computational systems and architectures, especially for solving problems involving automated learning, reasoning, and decision-making in high-dimensional, multimodal, and dynamic possible worlds in artificial, biological, and social systems. Professor Xing received a Ph.D. in Molecular Biology from Rutgers University, and another Ph.D. in Computer Science from UC Berkeley. He serves (or served) as an associate editor of the Annals of Applied Statistics (AOAS), the Journal of the American Statistical Association (JASA), the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), and the PLoS Journal of Computational Biology, and as an Action Editor of the Machine Learning Journal (MLJ) and the Journal of Machine Learning Research (JMLR). He was a member of the DARPA Information Science and Technology (ISAT) Advisory Group, and is a recipient of the NSF CAREER Award, the Sloan Fellowship, the United States Air Force Young Investigator Award, and the IBM Open Collaborative Research Award. He was the Program Chair of ICML 2014.
Dr. Qirong Ho is a scientist at the Institute for Infocomm Research, A*STAR, Singapore, and an adjunct assistant professor at the Singapore Management University School of Information Systems. His primary research focus is distributed cluster software systems for Machine Learning at Big Data scales, with a view towards correctness and performance guarantees. In addition, Dr. Ho has performed research on statistical models for large-scale network analysis --- particularly latent space models for visualization, community detection, user personalization and interest prediction --- as well as social media analysis on hyperlinked documents with text and network data. Dr. Ho received his PhD in 2014, under Eric P. Xing at Carnegie Mellon University's Machine Learning Department. He is a recipient of the 2015 KDD Dissertation Award (runner-up), and the Singapore A*STAR National Science Search Undergraduate and PhD fellowships.
Dense subgraph discovery (DSD)
Finding dense subgraphs is a fundamental graph-theoretic problem that lies at the heart of numerous graph-mining applications, ranging from finding communities in social networks, to detecting regulatory motifs in DNA, to identifying real-time stories in news. The problem of finding dense subgraphs has been studied extensively in theoretical computer science and, recently, due to the relevance of the problem in real-world applications, it has attracted considerable attention in the data-mining community.
In this tutorial we aim to provide a comprehensive overview of (i) major algorithmic techniques for finding dense subgraphs in large graphs and (ii) graph mining applications that rely on dense subgraph extraction. We will present fundamental concepts and algorithms that date back to the 1980s, as well as the latest advances in the area, from both a theoretical and a practical point of view. We will motivate the problem of finding dense subgraphs by discussing how it can be used in real-world applications. We will discuss different density definitions and the complexity of the corresponding optimization problems. We will also present efficient algorithms for different density measures and under different computational models. Specifically, we will focus on scalable streaming, distributed, and MapReduce algorithms. Finally, we will discuss problem variants and extensions, and provide pointers for future research directions.
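One classic algorithm in this area is Charikar's greedy peeling for the average-degree density measure |E|/|V|: repeatedly remove a minimum-degree vertex and keep the best intermediate subgraph, which gives a 1/2-approximation. A compact sketch (variable names and the toy graph are ours):

```python
import heapq

def densest_subgraph_peel(adj):
    """Greedy peeling: remove a minimum-degree vertex at each step and
    return the intermediate subgraph maximizing density |E|/|V|."""
    adj = {v: set(ns) for v, ns in adj.items()}
    deg = {v: len(ns) for v, ns in adj.items()}
    m = sum(deg.values()) // 2
    alive = set(adj)
    best, best_density = set(alive), (m / len(alive) if alive else 0.0)
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    while alive:
        d, v = heapq.heappop(heap)
        if v not in alive or d != deg[v]:
            continue                       # stale heap entry, skip
        alive.discard(v)
        m -= deg[v]                        # v's edges to still-alive nodes
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
        if alive and m / len(alive) > best_density:
            best_density = m / len(alive)
            best = set(alive)
    return best, best_density

# A 4-clique with a pendant path attached: the clique is the densest part.
adj = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4},
       4: {1, 2, 3, 5}, 5: {4, 6}, 6: {5}}
sub, density = densest_subgraph_peel(adj)
```

On this toy graph, peeling strips the pendant path and returns the 4-clique with density 6/4 = 1.5.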
is an associate professor in the Department of Computer Science of Aalto University. He is the director of the Algorithmic Data Analysis (ADA) programme in HIIT and he leads the Data Mining group at Aalto University. Previously he was a senior research scientist at Yahoo! Research. He received his Ph.D. from the Computer Science Department of Stanford University in 2003. He is currently serving as an associate editor of the IEEE Transactions on Knowledge and Data Engineering (TKDE) and the ACM Transactions on Knowledge Discovery from Data (TKDD), and as a managing editor of Internet Mathematics. He has served on the PC of numerous premium conferences, including being the PC chair for WSDM 2013 and ECML PKDD 2010. He has published over 100 papers and his Google Scholar h-index is 40.
Charalampos E. Tsourakakis
is currently a postdoctoral fellow in the Center for Research on Computation and Society (CRCS) at Harvard University. He received his Ph.D. from the Algorithms, Combinatorics and Optimization (ACO) program at Carnegie Mellon University in 2013. He also holds a Master of Science from the Machine Learning Department at CMU. He did his undergraduate studies in the School of Electrical and Computer Engineering (ECE) at the National Technical University of Athens (NTUA). He is the recipient of a best paper award at the IEEE International Conference on Data Mining and has designed two graph mining libraries for tera-scale graphs: the former is officially used by Windows Azure, and the latter was a research highlight of Microsoft Research. He has served on the PC of premium conferences. He has published over 30 papers in the fields of data mining, discrete mathematics, theoretical computer science, and computational biology. He has delivered two tutorials at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining. His Google Scholar h-index is 17. His research interests include algorithm design for large-scale datasets, data science, and mathematical optimization.
Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach
In today's computerized and information-based society, individuals are bombarded with vast amounts of text data from a variety of sources, ranging from news articles, scientific publications, and product reviews to a wide range of information from social media. To better understand and extract value from these large, multi-domain textual sources, it is essential to first identify and gain an understanding of the entities within the data. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically and scalably identify token spans as entity mentions in documents and label their types (e.g., person, product, food). We demonstrate on real-world datasets, including news articles and tweets, how these typed entities aid in effective knowledge discovery and management.
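As a toy flavor of the phrase-mining step behind mention detection, the sketch below (thresholds, corpus, and function names are all invented; real systems use statistical significance tests rather than raw counts) promotes frequent contiguous bigrams to candidate entity-mention phrases:

```python
from collections import Counter

def mine_phrases(docs, min_count=2):
    """Toy frequency-based phrase mining: contiguous bigrams occurring at
    least `min_count` times across the corpus become candidate phrases."""
    counts = Counter()
    for doc in docs:
        toks = doc.lower().split()
        counts.update(zip(toks, toks[1:]))   # count adjacent token pairs
    return {" ".join(bg) for bg, c in counts.items() if c >= min_count}

docs = ["barack obama visited chicago",
        "senator barack obama spoke in chicago",
        "obama visited the city"]
phrases = mine_phrases(docs)   # includes "barack obama"
```

Candidate phrases like these would then be fed to a typing stage that assigns labels such as person or location.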
Jiawei Han is the Abel Bliss Professor in the Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research areas encompass data mining, data warehousing, information network analysis, and database systems, with over 600 conference and journal publications. He is a Fellow of the ACM and a Fellow of the IEEE, and received the ACM SIGKDD Innovation Award (2004), the IEEE Computer Society Technical Achievement Award (2005), and the IEEE Computer Society W. Wallace McDowell Award (2009). His co-authored textbook "Data Mining: Concepts and Techniques", 3rd ed. (Morgan Kaufmann, 2011), has been widely adopted worldwide.
Chi Wang (Ph.D. UIUC, 2014) is a researcher at Microsoft Research, Redmond, Washington. His research focuses on discovering knowledge from unstructured and linked data, such as topics, concepts, relations, communities, and social influence. His book Mining Latent Entity Structures was published by Morgan & Claypool in 2015, in the series Synthesis Lectures on Data Mining and Knowledge Discovery. He is a winner of the Microsoft Research Graduate Research Fellowship.
Ahmed El-Kishky is a Ph.D. candidate in the Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research interests include mining large unstructured data, text mining, and network mining. He is the recipient of both the National Science Foundation Graduate Research Fellowship and the National Defense Science and Engineering Graduate Fellowship.
Xiang Ren is a Ph.D. candidate in the Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research focuses on knowledge acquisition from text data and mining linked data. He is the recipient of the C. L. and Jane W.-S. Liu Award and the Yahoo!-DAIS Research Excellence Gold Award in 2015. He received the Microsoft Young Fellowship from Microsoft Research Asia in 2012.
Big Data Analytics: Optimization and Randomization
As the scale and dimensionality of data continue to grow in many applications of data analytics (e.g., bioinformatics, finance, computer vision, medical informatics), it becomes critical to develop efficient and effective algorithms to solve numerous machine learning and data mining problems. This tutorial will focus on simple yet practically effective techniques and algorithms for big data analytics. In the first part, we will present state-of-the-art large-scale optimization algorithms, including various stochastic gradient descent methods, stochastic coordinate descent methods, and distributed optimization algorithms, for solving various machine learning problems. In the second part, we will focus on randomized approximation algorithms for learning from large-scale data. We will discuss (i) randomized algorithms for low-rank matrix approximation; (ii) approximation techniques for solving kernel learning problems; and (iii) randomized reduction methods for addressing the high-dimensional challenge. Along with the description of the algorithms, we will also present some empirical results to facilitate understanding of the different algorithms and comparisons between them.
Tianbao Yang is currently an assistant professor at the University of Iowa (UI). He received his Ph.D. degree in Computer Science from Michigan State University in 2012. Before joining UI, he was a researcher at NEC Laboratories America in Cupertino (2013-2014) and a Machine Learning Researcher at GE Global Research (2012-2013), mainly focusing on developing distributed optimization systems for various classification and regression problems. Dr. Yang has broad interests in machine learning and has focused on several research topics, including large-scale optimization in machine learning, online optimization, and distributed optimization. His recent research interests revolve around randomized algorithms for solving big data problems. He won the Mark Fulk Best Student Paper Award at the 25th Conference on Learning Theory (COLT) in 2012.
Qihang Lin is an assistant professor in the Department of Management Sciences in the Tippie College of Business at the University of Iowa. Before joining the University of Iowa, he studied at Tsinghua University (B.S. in Math, 2008) in China and obtained his PhD in ACO (Algorithms, Combinatorics and Optimization) in 2013 from the Tepper School of Business at Carnegie Mellon University. Dr. Lin’s research fields include: 1) large-scale optimization for machine learning; 2) Markov decision making with applications in high-frequency trading and crowdsourcing. He is the winner of the Best Student Paper Award of the INFORMS Financial Service Section in 2012.
Rong Jin is a professor in the Computer Science and Engineering Department at Michigan State University. He works in the areas of statistical machine learning and its application to information retrieval. He has extensive research experience with a variety of machine learning algorithms, such as conditional exponential models, support vector machines, boosting, and optimization, for different applications including information retrieval. Dr. Jin is an associate editor of ACM Transactions on Knowledge Discovery from Data, and received the NSF CAREER Award in 2006. Dr. Jin obtained his Ph.D. degree from Carnegie Mellon University in 2003, and received a best paper award from the Conference on Learning Theory (COLT) in 2012.
Social Media Anomaly Detection: Challenges and Solutions
Anomaly detection is of critical importance for preventing malicious activities such as bullying, terrorist attack planning, and fraudulent information dissemination. With the recent popularity of social media, new types of anomalous behaviors have arisen, causing concerns from various parties. While a large amount of work has been dedicated to traditional anomaly detection problems, we observe a surge of research interest in the new realm of social media anomaly detection. In this tutorial, we survey existing work on social media anomaly detection, focusing on the new anomalous phenomena in social media and the recently developed techniques to detect those special types of anomalies. We aim to provide a general overview of the problem domain, common formulations, existing methodologies, and future directions.
Sanjay Chawla is a Professor in the Faculty of Engineering and IT, University of Sydney. He is currently on leave as Principal Scientist at the Qatar Computing Research Institute. He was an academic visitor at Yahoo! Research in 2012. Sanjay’s area of research is data mining and machine learning, with a specialization in spatio-temporal data mining, outlier detection, class-imbalanced classification, and adversarial learning. He is a co-author of a popular text in spatial database management systems, “Spatial Databases: A Tour”, which has been translated into Chinese and Russian. His work has been recognized by several best paper awards at leading conferences, including the SIAM International Conference on Data Mining (2006) and the IEEE International Conference on Data Mining (2010). Sanjay serves on the editorial boards of IEEE TKDE and DMKD. He served as PC Chair of PAKDD in 2012.
Yan Liu has been an assistant professor in the Computer Science Department at the University of Southern California since 2010. She was a Research Staff Member at IBM Research in 2006-2010. She received her Ph.D. degree from Carnegie Mellon University in 2006. Her research interests are in data mining and machine learning with applications to social media, biology, and climate science. She has received several awards, including the NSF CAREER Award, the Okawa Foundation Research Award, an ACM Dissertation Award Honorable Mention, the Best Paper Award at the SIAM Data Mining Conference, and a Yahoo! Faculty Award, and has won several data mining competitions, such as the KDD Cup and the INFORMS data mining competition. She has published over 10 refereed articles on temporal causal models for time series data in top conferences, such as KDD, ICML, ICDM, SDM, and AAAI, and has given invited talks on the topic at many institutions and industrial research labs.
Diffusion in Social and Information Networks: Problems, Models and Machine Learning Methods
In recent years, there has been an increasing effort to develop realistic models, and learning and inference algorithms, to understand, predict, and influence diffusion over networks. This has been in part due to the increasing availability and granularity of large-scale diffusion data, which, in principle, allows for understanding and modeling not only macroscopic diffusion but also microscopic (node-level) diffusion. To this end, a bottom-up approach has typically been considered, which starts by considering how particular ideas, pieces of information, products, or, more generally, contagions spread locally from node to node, apparently at random, to later produce global, macroscopic patterns at a network level. However, this bottom-up approach also raises significant modeling, algorithmic, and computational challenges, which require leveraging methods from machine learning, probabilistic modeling, temporal point processes, and graph theory, as well as the nascent field of network science. In this tutorial, we will present several diffusion models designed for fine-grained large-scale diffusion and social event data, present some canonical research problems in the context of diffusion, and introduce state-of-the-art algorithms to solve some of these problems, in particular network estimation, influence estimation and control, and rumor source identification.
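To fix ideas, here is a minimal simulation of the independent cascade model, one of the basic diffusion models such tutorials build on (the graph, parameters, and the Monte Carlo influence estimator are illustrative choices of ours):

```python
import random

def independent_cascade(graph, seeds, p=0.3, rng=None):
    """One run of the independent cascade model: each newly infected
    node gets a single chance to infect each neighbor, succeeding
    independently with probability p."""
    rng = rng or random.Random(0)
    infected = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for v in frontier:
            for u in graph.get(v, ()):
                if u not in infected and rng.random() < p:
                    infected.add(u)
                    nxt.append(u)
        frontier = nxt
    return infected

def estimate_influence(graph, seeds, p=0.3, runs=1000):
    """Monte Carlo influence estimation: average cascade size over runs."""
    rng = random.Random(42)
    return sum(len(independent_cascade(graph, seeds, p, rng))
               for _ in range(runs)) / runs

# On a directed line 1 -> 2 -> 3 -> 4 with p = 0.3, the expected cascade
# size from seed {1} is 1 + p + p^2 + p^3 = 1.417.
line = {1: [2], 2: [3], 3: [4], 4: []}
est = estimate_influence(line, {1})
```

Influence maximization then asks for the seed set maximizing this expected size, which is exactly where sampling-based estimators like the one above come in.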
Manuel Gomez Rodriguez is a tenure-track independent research group leader at the Max Planck Institute for Software Systems. Manuel develops machine learning and large-scale data mining methods for the analysis and modeling of large real-world networks and processes that take place over them. He is particularly interested in problems motivated by the Web and social media and has received several recognitions for his research, including an Outstanding Paper Award at NIPS '13 and a Best Research Paper Honorable Mention at KDD '10. Manuel holds a PhD in Electrical Engineering from Stanford University and a BS in Electrical Engineering from Carlos III University in Madrid (Spain), and has been a Barrie de la Maza Fellow and a Caja Madrid Fellow.
Le Song is an assistant professor in the College of Computing, Georgia Institute of Technology. His principal research interests lie in the development of machine learning methodology, especially kernel methods, probabilistic graphical models, temporal data, and network analysis. Le Song received his Ph.D. in Computer Science from the University of Sydney in 2008, and then conducted postdoctoral research in the School of Computer Science, Carnegie Mellon University, between 2008 and 2011. Before joining Georgia Institute of Technology, he worked briefly as a research scientist at Google. His current work involves: 1) kernel methods, including kernel embedding of distributions, large-scale kernel methods, and nonparametric graphical models; 2) spectral algorithms for latent variable models, including theory, algorithms, and applications for efficient structure and parameter estimation; 3) computational and statistical analysis of the spatial/temporal dynamics of networked processes, such as those occurring in social media and biological systems. He is the winner of the Outstanding Paper Award at NIPS '13 and the Best Paper Award at ICML ’10.
The year 2015 has seen a proliferation of scientific publications, conferences, and funding programs on KDD for medicine and healthcare - KDD for health. However, medical scholars and practitioners work differently from KDD researchers: their research is mostly hypothesis-driven, not data-driven. It is the KDD researchers who should learn how medical researchers and practitioners work, what questions they have, what methods they use, and how mining methods can fit into their research frame and their everyday business. The purpose of this tutorial is to contribute to this learning process.
We address medicine and healthcare, where the expertise of KDD scholars is needed and familiarity with medical research basics is a prerequisite. We aim to provide basics for (1) mining in epidemiology and (2) mining in the hospital. We also address, to a lesser extent, the subject of (3) preparing and annotating Electronic Health Records for mining.
is Professor of Business Information Systems at the Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Germany. Her main research is on mining dynamic complex data. Her publications are on mining complex streams, mining evolving objects, adapting models to drift, and building models that capture drift. She focuses on two application areas: business (including opinion stream mining and adaptive recommenders) and medical research (including epidemiological mining and learning from clinical studies). She served as PC Co-Chair of ECML PKDD 2006, NLDB 2008, and the 36th Annual Conference of the German Classification Society (GfKl 2012, Hildesheim, August 2012). She has been involved in the organization committees of several conferences: she was Tutorials Co-Chair at ICDM 2010, Workshops Co-Chair at ICDM 2011, and Demo Track Co-Chair of ECML PKDD 2014 and 2015, and is a senior PC member of recent conferences such as ECML PKDD 2014 and 2015 and SIAM Data Mining 2015. She has held tutorials on topics of data mining at KDD 2009, PAKDD 2013, and several ECML PKDD conferences.
Pedro Pereira Rodrigues
is Professor at the Department of Health Information and Decision Sciences, Faculty of Medicine of the University of Porto, and a researcher at the Biostatistics and Intelligent Data Analysis group of the Center for Health Technologies and Services Research. His main research area is machine learning, currently devoted to Bayesian network applications to clinical research and decision support. He has edited 4 conference proceedings and published articles in indexed peer-reviewed journals and conference proceedings. He has helped organize events as general chair (CBMS 2013) and PC chair (ECML PKDD 2015, CBMS 2014-15, and several thematic tracks and workshops since 2007), is a member of the steering committee of CBMS, and has served on the program committees of more than 20 editions of international conferences (e.g., IJCAI, ECML PKDD, ICML, CBMS). He has also co-organized tutorials at IBERAMIA 2012 and ECML PKDD 2014.
is Professor at the Department of Computer Systems Languages and Software Engineering, Faculty of Computer Science of Universidad Politecnica de Madrid (UPM). Her subject area is data mining. She studied Computer Science and holds a PhD in Computer Science. She is currently a member of MIDAS, the “Data Mining and Data Simulation” group, at the Center of Biotechnology at UPM, and a databases and data mining professor at UPM. Her research activities cover various aspects of data mining project development, and in recent years her research has focused on data mining in the medical field. She has participated in different research and development projects related to data integration and mining on mobile devices. She has published three international books on web mining (published by Springer in 2003, 2004, and 2009) and papers in many international journals, including the Data and Knowledge Engineering Journal, Information Sciences, Expert Systems with Applications, the Journal of Medical Systems, and the International Journal of Intelligent Data Analysis.
Large Scale Distributed Data Science using Apache Spark
Apache Spark is an open-source cluster computing framework. It has emerged as the next-generation big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data revolution. Spark maintains MapReduce’s linear scalability and fault tolerance, but extends it in a few important ways: it is much faster (100 times faster for certain applications); it is much easier to program, due to its rich APIs in Python, Java, Scala (and R) and its core data abstraction, the distributed data frame; and it goes far beyond batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph processing.
This tutorial will provide an accessible introduction for those not already familiar with Spark and its potential to revolutionize academic and commercial data science practices. It is divided into two parts: the first part will introduce fundamental Spark concepts, including Spark Core, data frames, the Spark Shell, Spark Streaming, Spark SQL, MLlib, and more; the second part will focus on hands-on algorithmic design and development with Spark (developing algorithms from scratch, such as decision tree learning, graph processing algorithms such as PageRank and shortest path, and gradient descent algorithms such as support vector machines and matrix factorization). Industrial applications and deployments of Spark will also be presented. Example code will be made available in Python (PySpark) notebooks.
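As a taste of the "from scratch" part, the sequential core of contribution-passing PageRank is sketched below in plain Python (our own toy graph; no Spark dependency, so it runs anywhere). In PySpark, the two steps per iteration would become a flatMap that emits contributions along out-links and a reduceByKey that sums them.

```python
def pagerank(links, iters=50, d=0.85):
    """Power-iteration PageRank over an adjacency dict.
    Assumes every node has at least one out-link (no dangling mass)."""
    nodes = set(links) | {u for outs in links.values() for u in outs}
    ranks = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        contribs = {v: 0.0 for v in nodes}
        for v, outs in links.items():
            share = ranks[v] / len(outs)
            for u in outs:                 # the "flatMap" step in Spark
                contribs[u] += share
        ranks = {v: (1 - d) / len(nodes) + d * c   # damping after the sum
                 for v, c in contribs.items()}
    return ranks

# Toy web graph: "c" is linked to by both "a" and "b", so it ranks highest.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)
```

With no dangling nodes, the ranks remain a probability distribution (they sum to 1) at every iteration.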
James G. Shanahan has spent the past 25 years developing and researching cutting-edge information management systems. He is SVP of Data Science and Chief Scientist at NativeX, a mobile ad network. In 2007, he founded a boutique consultancy (Church and Duncan Group Inc.) whose major goal is to help companies leverage their vast data repositories using statistics, machine learning, optimization theory and data mining for big data applications in areas such as web search, local and mobile search, and digital advertising and marketing.
Dr. Shanahan has been affiliated with the University of California since 2009, where he teaches graduate courses on big data analytics, machine learning, and stochastic optimization. He also advises several high-tech startups and is executive VP of science and technology at the Irish Innovation Center (IIC). He has published six books, more than 50 research publications, and over 20 patents in the areas of machine learning and information processing. Dr. Shanahan received his PhD in engineering mathematics from the University of Bristol, U.K., and holds a Bachelor of Science degree from the University of Limerick, Ireland. He is an EU Marie Curie fellow. In 2011 he was selected as a member of the Silicon Valley 50 (Top 50 Irish Americans in Technology).
Liang Dai is a data scientist at NativeX, a leading ad technology company for mobile games. He works on large-scale data mining projects on distributed platforms such as AWS, Hadoop, and Spark. He focuses on the end-to-end data modeling pipeline: preprocessing raw data, designing behavioral and non-behavioral features, selecting features based on experimental results, building predictive models, and deploying models in production to handle large volumes of ad placement requests. Liang is also pursuing a Ph.D. in the Technology Information and Management department at UC Santa Cruz, where he does research in data mining for digital marketing, including campaign evaluation, online experiment design, and customer value improvement. Liang received his B.S. and M.S. from the Information Science and Electronic Engineering department at Zhejiang University.
Data-Driven Product Innovation
Data Science is an increasingly popular area of Knowledge Discovery and
Data Mining. Leading consumer Web companies like Amazon, Facebook, Google
and LinkedIn possess Petabytes of user data. Through effective mining of
this data, they create products and services that benefit millions of
users and generate a tremendous amount of business value. It is widely
acknowledged that Data Scientists play key roles in the creation of these
products, from pattern identification, idea generation and product
prototyping to experiment design and launch decisions. Nonetheless, they
also face common challenges, such as the gap between creating a prototype
and turning it into a scalable product, or the frustration of generating
innovative product ideas that do not get adopted.
The organizers of this tutorial have many years of experience leading Data
Science teams in some of the most successful consumer Web companies. In
this tutorial, we will introduce the framework that we created to nurture
data-driven product innovations. The core of this framework is the focus
on scale and impact. We will take the audience through a discussion on
how to balance between velocity and scale, between product innovation and
product operation, and between theoretical research and practical impact.
We will share some guidelines for successful data-driven product
innovation with real examples from our experiences.
Xin Fu is a Director of Data Science at LinkedIn. He and his team are
responsible for driving product innovation through creative use of data,
ranging from inferential analysis and online experimentation, to creation
of analytics platforms and data products. Xin holds a PhD in
Human-Computer Interaction and an MS in Statistics from the University of
North Carolina at Chapel Hill. Prior to joining LinkedIn, Xin worked at
Google and Microsoft. He has authored over 50 refereed papers and journal
articles in the areas of information organization and retrieval, Web
search interface, and online user behavioral measurement.
Hernán Asorey is the VP of Product Data Science at Salesforce. In this key
leadership position, Hernán and team are responsible for transforming
product ideation, development, and customer engagement to an
evidence-driven culture where intelligence becomes an intrinsic part of
the product to ship. Hernán manages, grows, and inspires a world-class
team of data experts across the full information spectrum: information engines
(digest), analytics engines (visualize), data science (apply),
experimentation and optimization (test and learn). Prior to Salesforce,
Hernán held different roles in fields related to large-scale data
analytics, data warehousing, business intelligence, data mining, social
media analytics, and HR analytics at companies such as Sears Holdings,
Hewlett Packard, eBay, and Microsoft. Hernán has a degree in Information
Systems Engineering with a specialization in Applied Math; additionally, he
graduated in 2011 from Stanford Executive Program, Graduate School of
Business. Hernán also spent 5 years in academia as a lecturer in the Math
Department of the School of Engineering in Buenos Aires, Argentina,
teaching Discrete Math, Calculus, and Fractal Theory.
Web Personalization and Recommender Systems
The quantity of accessible information has been growing rapidly and has far exceeded human processing capabilities. The sheer abundance of information often prevents users from discovering the desired information, or hinders them from making informed and correct choices. This highlights the pressing need for intelligent personalized applications that simplify information access and discovery by taking into account users' preferences and needs. One type of personalized application that has recently become tremendously popular in research and industry is the recommender system. Recommender systems provide users with personalized recommendations about information and products they may be interested in examining or purchasing. Extensive research into recommender systems has yielded a variety of techniques, which have been published at a variety of conferences and adopted by numerous Web sites. This tutorial will provide the participants with a broad overview and understanding of algorithms for personalization and recommendation technologies.
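To make the flavor of these algorithms concrete, here is a minimal, hypothetical sketch of user-based collaborative filtering, one of the classic techniques such tutorials cover: unseen items are scored by the similarity-weighted ratings of the most similar users. The dataset, function names, and the choice of cosine similarity are illustrative assumptions, not any specific system from the tutorial.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = (math.sqrt(sum(x * x for x in u.values())) *
           math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def recommend(ratings, user, k=2):
    """Rank items `user` has not rated, scored by the
    similarity-weighted ratings of the k nearest neighbors."""
    sims = sorted(((cosine(ratings[user], ratings[o]), o)
                   for o in ratings if o != user), reverse=True)[:k]
    scores = {}
    for sim, other in sims:
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

ratings = {
    "alice": {"matrix": 5, "titanic": 1},
    "bob":   {"matrix": 4, "inception": 5},
    "carol": {"titanic": 5, "notebook": 4},
}
print(recommend(ratings, "alice"))
```

Content-based and hybrid recommenders replace the user-user similarity with item features or combine both signals, but the neighborhood-scoring skeleton above is the common starting point.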
Shlomo Berkovsky is a Senior Researcher at the Digital Productivity Flagship, CSIRO. Shlomo received his PhD (summa cum laude) from the University of Haifa, where his research focused on the mediation of user models in recommender systems. At CSIRO he was the Research Leader of the Personalized Information Delivery team and worked on a project focusing on personalized eHealth applications. Shlomo's broad research interests include Web personalization and recommender systems. He is interested in collaborative and content-based recommenders, personalized persuasion, privacy-enhanced personalization, ubiquitous user modeling, personalization on the Social Web, and context-aware recommender systems. Shlomo is the author of more than 80 refereed papers published in highly esteemed journals, books, and conferences. He has published in broad venues (ACM TIST, AI Review, WWW, ECAI) and in venues focused on personalization and recommender systems (UMUAI, ACM TiiS, RecSys, IUI, UMAP). His work has won the Best Paper Award of the AH conference and 2 iAward prizes. Shlomo has presented 3 keynote talks and 9 tutorials on personalization and recommender systems, including at WWW, IJCAI, and WISE. He has served on the organizing committees of 9 conferences and chaired 14 workshops.
Jill Freyne is a Senior Research Scientist at the CSIRO Digital Productivity Flagship. Jill received her PhD from University College Dublin, where her research focused on collaborative, community-based Web search. Jill continued her research into personalization and recommender technology as a Postdoctoral Fellow at the Clarity Center for Sensor Web Technologies and at IBM Research in Cambridge, MA, where she worked primarily on influence in social networking. Since her commencement at CSIRO, Jill's research has focused on recommender and persuasive technologies to impact attitude and behaviour in the health domain. Jill is the author of over 70 publications in top-quality journals and conferences. She holds a strong international reputation and has been invited to speak at highly regarded institutions including MIT, Carnegie Mellon University, and IBM Research. Jill co-chaired the International Conference on Persuasive Technologies 2013, has organized a national conference, served on the organizing committees of the IUI, Hypertext, and Recommender Systems conferences, and run several workshops on Recommender Systems and the Social Web. Jill serves as a senior PC member of RecSys and has reviewed for many other journals and international conferences.