Scaling Up Deep Learning
Deep learning has rapidly moved from a marginal approach in the machine learning community less than ten years ago to one that has strong industrial impact, in particular for high-dimensional perceptual data such as speech and images, but also natural language. The demand for experts in deep learning is growing very fast (faster than we can graduate PhDs), thereby considerably increasing their market value. Deep learning is based on the idea of learning multiple levels of representation, with higher levels computed as a function of lower levels, and corresponding to more abstract concepts automatically discovered by the learner. Deep learning arose out of research on artificial neural networks and graphical models and the literature on that subject has considerably grown in recent years, culminating in the creation of a dedicated conference (ICLR). The tutorial will introduce some of the basic algorithms, both on the supervised and unsupervised sides, as well as discuss some of the guidelines for successfully using them in practice. Finally, it will introduce current research questions regarding the challenge of scaling up deep learning to much larger models that can successfully extract information from huge datasets.
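As a toy illustration of "higher levels computed as a function of lower levels", the sketch below stacks two nonlinear layers in NumPy. The weights are random placeholders standing in for learned parameters, and the layer sizes are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w, b):
    """One level of representation: an affine map followed by a nonlinearity."""
    return np.tanh(x @ w + b)

# Toy input: a batch of 4 "sensory" vectors with 8 raw features.
x = rng.normal(size=(4, 8))

# Randomly initialized weights stand in for parameters a learner would fit.
w1, b1 = rng.normal(size=(8, 5)), np.zeros(5)   # level 1: 8 -> 5 features
w2, b2 = rng.normal(size=(5, 3)), np.zeros(3)   # level 2: 5 -> 3 features

h1 = layer(x, w1, b1)    # first-level representation of the raw input
h2 = layer(h1, w2, b2)   # higher level, computed as a function of the lower one

print(h1.shape, h2.shape)
```

Deep learning is concerned with how many such levels can be trained jointly, so that the higher levels come to encode progressively more abstract concepts.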
Yoshua Bengio is Full Professor in the Department of Computer Science and Operations Research, head of the Machine Learning Laboratory (LISA), CIFAR Fellow in the Neural Computation and Adaptive Perception program, Canada Research Chair in Statistical Learning Algorithms, and holder of the NSERC-Ubisoft industrial chair. His main research ambition is to understand principles of learning that yield intelligence. He teaches a graduate course in Machine Learning (IFT6266) and supervises a large group of graduate students and post-docs. His research is widely cited (over 16000 citations found by Google Scholar in early 2014, with an H-index of 55).
Yoshua Bengio is currently action editor for the Journal of Machine Learning Research, editor for Foundations and Trends in Machine Learning, and has been associate editor for the Machine Learning Journal and the IEEE Transactions on Neural Networks.
Yoshua Bengio was Program Chair for NIPS'2008 and General Chair for NIPS'2009 (NIPS is the flagship conference in the areas of learning algorithms and neural computation). Since 1999, he has been co-organizing the Learning Workshop with Yann Le Cun, with whom he has also created the International Conference on Learning Representations (ICLR). He has also organized or co-organized numerous other events, such as the ICML'2012 Representation Learning Workshop, the NIPS'2011 and NIPS'2010 Deep Learning and Unsupervised Feature Learning Workshops, the ICML'2009 Workshop on Learning Feature Hierarchies and the NIPS'2007 Deep Learning Workshop.
Constructing and mining web-scale knowledge graphs
Recent years have witnessed a proliferation of large-scale knowledge graphs, such as Freebase, YAGO, Facebook's Open Graph, Google's Knowledge Graph, and Microsoft's Satori. Whereas there is a large body of research on mining homogeneous graphs, this new generation of information networks is highly heterogeneous, with thousands of entity and relation types and billions of instances of vertices and edges. In this tutorial, we will present the state of the art in constructing, mining, and growing knowledge graphs. The purpose of the tutorial is to equip newcomers to this exciting field with an understanding of the basic concepts, tools and methodologies, available datasets, and open research challenges. A publicly available knowledge base (Freebase) will be used throughout the tutorial to exemplify the different techniques.
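To make the heterogeneity concrete: a knowledge graph can be viewed as a set of (subject, relation, object) triples over typed entities. The sketch below uses a handful of invented Freebase-style triples and an index for relation lookups; the entity and relation names are illustrative only:

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples; note the mix of
# relation types on the same entities -- the heterogeneity of such graphs.
triples = [
    ("barack_obama", "born_in", "honolulu"),
    ("barack_obama", "profession", "politician"),
    ("honolulu", "located_in", "hawaii"),
    ("hawaii", "type", "us_state"),
]

# Index edges by (subject, relation) for fast lookup.
index = defaultdict(list)
for s, r, o in triples:
    index[(s, r)].append(o)

def query(subject, relation):
    """Return all objects linked to `subject` by `relation`."""
    return index[(subject, relation)]

print(query("barack_obama", "born_in"))
```

Real systems store billions of such triples and must additionally resolve entity types, schema constraints, and confidence scores, which is where the construction and mining techniques surveyed in the tutorial come in.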
Antoine Bordes is a staff research scientist at Facebook Artificial Intelligence Research. Prior to joining Facebook in 2014, he was a CNRS staff researcher in the Heudiasyc laboratory of the University of Technology of Compiegne in France. In 2010, he was a postdoctoral fellow in Yoshua Bengio's lab at the University of Montreal. He received his PhD in machine learning from Pierre & Marie Curie University in Paris in early 2010. From 2004 to 2009, he collaborated regularly with the Machine Learning department of NEC Labs America in Princeton. He received two best-PhD awards, from the French Association for Artificial Intelligence and from the French Armament Agency, as well as a Scientific Excellence Scholarship awarded by CNRS in 2013. Antoine is a pioneer in the use of embedding models for modeling knowledge bases and has co-authored many papers on the topic in recent years. He is also a specialist in natural language processing, deep learning and large-scale learning.
Evgeniy Gabrilovich is a senior staff research scientist at Google, where he works on knowledge discovery from the Web. Prior to joining Google in 2012, he was a director of research and head of the natural language processing and information retrieval group at Yahoo! Research. Evgeniy is an ACM Distinguished Scientist, and is a recipient of the 2010 Karen Sparck Jones Award for his contributions to natural language processing and information retrieval. Evgeniy serves as a PC co-chair of WSDM 2015, and has served as an area chair or senior program committee member at numerous major conferences, including KDD, SIGIR, WWW, WSDM, AAAI, IJCAI, ACL, EMNLP, CIKM, ICDM, and ICWSM. He has organized a number of workshops and taught multiple tutorials at SIGIR, ACL, WWW, WSDM, ICML, IJCAI, AAAI, CIKM, and EC. Evgeniy earned his PhD in computer science from the Technion - Israel Institute of Technology.
Bringing Structure to Text: Mining Phrases, Entity Concepts, Topics, and Hierarchies
Mining phrases, entity concepts, topics, and hierarchies from massive text corpora is an essential problem in the age of big data. Text data in electronic form are ubiquitous, ranging from scientific articles to social networks, enterprise logs, news articles, social media and general web pages. It is highly desirable but challenging to bring structure to unstructured text data, uncover underlying hierarchies, relationships, patterns and trends, and gain knowledge from such data. In this tutorial, we provide a comprehensive survey of the state of the art in data-driven methods that automatically mine phrases, extract and infer latent structures from text corpora, and construct multi-granularity topical groupings and hierarchies of the underlying themes. We study their principles, methodologies, algorithms and applications using several real datasets, including research papers and news articles, and demonstrate how these methods work and how the uncovered latent entity structures may help text understanding, knowledge discovery and management.
Jiawei Han is the Abel Bliss Professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. His research areas encompass data mining, data warehousing, information network analysis, and database systems, with over 600 conference and journal publications. He is a Fellow of ACM and IEEE, and received the ACM SIGKDD Innovation Award (2004) and the IEEE Computer Society W. Wallace McDowell Award (2009). His co-authored textbook "Data Mining: Concepts and Techniques", 3rd ed. (Morgan Kaufmann, 2011) has been widely adopted worldwide.
Chi Wang is a Ph.D. candidate at the University of Illinois at Urbana-Champaign. He is the sole winner of the Microsoft Research Graduate Research Fellowship in the history of CS at UIUC. He received the KDD Cup 2013 runner-up award for the entity-name disambiguation competition, and his work on topic hierarchy construction was nominated as a best paper candidate at CIKM'13 and ICDM'13.
Ahmed El-Kishky is a Ph.D. candidate at Univ. of Illinois at Urbana-Champaign. He is a National Science Foundation Graduate Research Fellow and was twice selected to participate and conduct research at NSF REU sites at The University of Notre Dame and The University of Massachusetts Amherst.
Computational Epidemiology
As recent pandemics such as SARS and the swine flu outbreak have shown, diseases spread very fast in today's interconnected world, making public health an important research area. Some of the basic questions are: How can an outbreak be contained before it becomes an epidemic, and what disease surveillance strategies should be implemented? These problems have traditionally been studied using differential equation methods, which rely on complete-mixing assumptions that do not hold for realistic populations. In this tutorial, we focus on an approach based on diffusion processes on complex networks, which can capture more realistic dynamics. We provide an overview of the state of the art in computational epidemiology, a multi-disciplinary research area that overlaps several areas of computer science, including data mining, machine learning, high performance computing and theoretical computer science, as well as mathematics, economics and statistics.
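The differential-equation approach mentioned above can be illustrated with the classic SIR compartmental model, where the complete-mixing assumption is visible in the β·S·I infection term (every susceptible can meet every infective). A minimal Euler-integration sketch; the parameter values are illustrative, not fitted to any real disease:

```python
# Classic SIR model under the complete-mixing assumption, integrated with
# a simple forward-Euler step. s, i, r are population fractions.
def sir(s, i, r, beta=0.3, gamma=0.1, dt=0.1, steps=1000):
    for _ in range(steps):
        new_inf = beta * s * i * dt   # complete mixing: infections ~ s * i
        new_rec = gamma * i * dt      # infectives recover at rate gamma
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return s, i, r

# Start with 1% infected; with beta/gamma = 3 a large outbreak occurs.
s, i, r = sir(0.99, 0.01, 0.0)
print(round(s, 3), round(i, 3), round(r, 3))
```

Network-based diffusion models replace the s·i mass-action term with transmission along the edges of a contact network, which is precisely the refinement the tutorial focuses on.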
Madhav V. Marathe is a Professor of Computer Science and Deputy Director, Network Dynamics and Simulation Science Laboratory, Virginia Bioinformatics Institute, Virginia Tech. He obtained his Bachelor of Technology in Computer Science and Engineering in 1989 from the Indian Institute of Technology, Madras, and his PhD in Computer Science in 1994 from the University at Albany. He has published more than 150 research articles in peer-reviewed journals, conference proceedings, and books, and has over eight years of experience in project leadership and technology development, specializing in population dynamics, telecommunication systems, epidemiology, design and architecture of the data grid, design and analysis of algorithms for data manipulation, design of service-oriented architectures, and socio-technical systems. He is the recipient of the Distinguished Copyright award for the TRANSIMS software and LANL's achievement award, and a recipient of the University at Albany's Distinguished Alumni Award. His research interests are in R&D of high-performance computing, modeling & simulation of socio-technical systems, service-oriented architectures, computer and communication networks, theoretical computer science, social networks and graph theory, computational epidemiology and computational economics. He is a fellow of the IEEE and ACM.
Naren Ramakrishnan is the Thomas L. Phillips Professor of Engineering at Virginia Tech. His research interests focus on data mining for intelligence analysis, forecasting, sustainability, and health informatics. He currently leads the IARPA OSI EMBERS project on forecasting critical societal events (disease outbreaks, civil unrest, and elections) using open source indicators. His research has been supported by NSF, DHS, NIH, NEH, IARPA, DARPA, DTRA, ONR, General Motors, HP Labs, NEC Labs, and Advance Auto Parts. Ramakrishnan serves on the editorial boards of IEEE Computer, Data Mining and Knowledge Discovery, IEEE Transactions on Knowledge and Data Engineering, and other journals. He was an invited co-organizer of the National Academy of Engineering Frontiers of Engineering symposium in 2009. Ramakrishnan is an ACM Distinguished Scientist.
Anil Kumar S. Vullikanti is an Associate Professor in the Dept. of Computer Science and the Virginia Bioinformatics Institute at Virginia Tech. He received his undergraduate degree from the Indian Institute of Technology, Kanpur, and his Ph.D. from the Indian Institute of Science, Bangalore. He was a post-doctoral researcher at the Max-Planck Institute for Informatics, and a Technical Staff Member at the Los Alamos National Laboratory. His research interests are in the broad areas of approximation and randomized algorithms, dynamical systems, computational epidemiology, wireless networks, social networks, data mining, and the modeling, simulation and analysis of socio-technical systems. His name appears as V.S. Anil Kumar on most publications.
Management and Analytics of Biomedical Big Data with a Cloud-based In-Memory Database and Dynamic Querying
A Hands-on Experience with Real-world Data
Analyzing Biomedical Big Data (BBD) is computationally expensive due to high dimensionality and large data volume. Performance and scalability issues of traditional databases often limit the use of more sophisticated and complex data queries and analytic models. Moreover, in the conventional setting, data management and analytics are carried out on separate software platforms. Exporting and importing large amounts of data across platforms requires a significant amount of computational and I/O resources, and potentially puts sensitive data at risk. In this tutorial, we will explore an in-memory database system as a solution for BBD analysis. The participants will learn the advantages of an in-memory database over traditional ones through hands-on exercises with the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC) database, a large clinical database with over 60,000 ICU stays. In addition, we seek to educate the participants on effective analytic methods for BBD, including dynamic queries and statistical analysis with R.
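As a stand-in illustration of the idea (SQLite rather than the SAP HANA platform used in the tutorial, and an invented toy table rather than MIMIC), SQLite's `:memory:` mode keeps the data entirely in RAM, so analytics run in the same engine that holds the data, with no export/import step:

```python
import sqlite3

# An in-memory connection: the table lives entirely in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE icu_stays (stay_id INTEGER, age INTEGER, los_days REAL)")
conn.executemany(
    "INSERT INTO icu_stays VALUES (?, ?, ?)",
    [(1, 67, 3.2), (2, 54, 1.1), (3, 71, 7.9), (4, 49, 2.4)],  # invented records
)

# The analytic query runs where the data lives -- no export/import round-trip,
# and no sensitive rows leave the database engine.
(avg_los,) = conn.execute(
    "SELECT AVG(los_days) FROM icu_stays WHERE age >= 60"
).fetchone()
print(round(avg_los, 2))
```

A production in-memory column store applies the same principle at far larger scale, with columnar compression and parallel query execution.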
Roger G. Mark earned the SB and PhD degrees in electrical engineering from the Massachusetts Institute of Technology and the MD degree from Harvard Medical School, and trained in internal medicine with the Harvard Medical Unit at Boston City Hospital. At present, Dr. Mark is Distinguished Professor of Health Sciences and Technology and Professor of Electrical Engineering at MIT. He remains active in the part-time practice of internal medicine with a focus on geriatrics, is Senior Physician at the Beth Israel Deaconess Medical Center, and is Assistant Professor of Medicine at Harvard Medical School. His current research activities include Integrating Data, Models, and Reasoning in Critical Care (http://mimic.mit.edu) and PhysioNet (http://www.physionet.org), both of which involve physiological signal processing, cardiovascular modeling, and the development, use and open distribution of large physiological and clinical databases.
John Ellenberger currently works in the "Chairman's Special Projects" group at SAP Labs, where he serves as the liaison to MIT and focuses on Big Data topics including medical analytics, machine learning, data privacy and architecture. John has worked in SAP's research function for 10 years in both strategic and technical research roles. Prior to that, John spent 30 years managing software development teams working on first-generation products in the areas of multimedia services, online product and catalog management, network-based speech services, software tools and VLSI layout/verification. Most recently he led the engineering team at Nokia that developed the first generation of multimedia messaging services.
Dr. Feng is currently a visiting scholar at MIT in the Harvard-MIT Division of Health Sciences and Technology. He is also one of the faculty members of the MIT course HST.936, "Global Health Informatics". Dr. Feng obtained both his Bachelor's and PhD degrees from the School of Electrical and Electronic Engineering, Nanyang Technological University. Dr. Feng's PhD study focused on developing data mining methods to discover meaningful knowledge that impacts real-life practices. Before his current affiliation with MIT, Dr. Feng joined the Data Analytics Department of the Institute for Infocomm Research (I2R) as a research scientist. Dr. Feng was awarded the Ministry of Education Scholarship for his undergraduate studies and the A*STAR Graduate Scholarship for his PhD study. His work was also recognized with the biannual Best Paper Award from the Institute for Infocomm Research. Dr. Feng's research focus is to develop data mining and machine learning methods to discover or infer causal phenomena in real-life practice and strategic planning.
Mohammad Ghassemi is a PhD student in Electrical and Computer Engineering at the Massachusetts Institute of Technology with a research interest in statistical signal processing and medical informatics. In 2010, Mohammad received the Gates-Cambridge Scholarship to fund his MPhil at the University of Cambridge in Information Engineering. He was also awarded the Goldwater scholarship while pursuing two undergraduate degrees in Electrical Engineering and Applied Mathematics. He holds two patents, and has several years of experience working in both research and industrial settings in North America, the Middle East and Europe. Mohammad's prior research experience spans machine learning, signal processing and neuroscience.
Thomas Brennan is currently a post-doctoral Research Engineer in the Laboratory for Computational Physiology at MIT. With a background in electrical and computer engineering, Thomas Brennan was awarded the Rhodes Scholarship from South Africa in 2004. In 2009 he completed his D.Phil. in Biomedical Engineering at the University of Oxford. He then worked for the Vodafone Foundation developing a mobile-based platform to monitor and support community health workers in South Africa. In 2010, he accepted a Wellcome Trust post-doctoral research fellowship at the Institute of Biomedical Engineering in Oxford to develop and assess mobile health solutions for monitoring chronic disease in resource-constrained settings. He has over 10 years' experience in cardiac modeling and biomedical signal processing, with a special focus on machine learning.
Ishrar Hussain is a Machine Learning Researcher at SAP Research Labs, Montreal, developing state-of-the-art machine learning solutions for Big Data applications on the SAP HANA in-memory database platform. He received his Master's degree in Computer Science in 2007 from Concordia University in Montreal, Canada. He is now a Ph.D. candidate at the same university and expects to graduate in 2014. Ishrar has been researching Natural Language Processing, Machine Learning and Requirements Engineering for the last eight years. He is proficient in different platforms for linguistic analysis, e.g., GATE, UIMA and Stanford NLP, and has developed several standalone data science applications. Ishrar is also the author of nine research papers.
The Recommender Problem Revisited
In 2006, Netflix announced a $1M prize competition to advance recommendation algorithms. The recommendation problem was reduced to accurately predicting a user's rating, as measured by the Root Mean Squared Error. While that formulation helped attract the research community's attention to the area, it may have put excessive focus on what is just one of many possible approaches to recommendation. In this tutorial we will describe the different components of modern recommender systems, such as personalized ranking, similarity, explanations, context awareness, and search as recommendation. We will use the Netflix case as a driving example of a prototypical industrial-scale recommender system. We will also review modern algorithmic approaches, including Factorization Machines, Restricted Boltzmann Machines, SimRank, Deep Neural Networks, and Listwise Learning-to-Rank.
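The Netflix Prize's accuracy criterion can be stated in a few lines; the ratings below are invented for illustration:

```python
import math

def rmse(predicted, actual):
    """Root Mean Squared Error between predicted and observed ratings."""
    errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(errors) / len(errors))

# Hypothetical ratings on a 1-5 scale for four user-item pairs.
actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 2.5]
print(round(rmse(predicted, actual), 4))  # -> 0.6124
```

Optimizing this single number says nothing about ranking quality, diversity, or explanation, which is exactly why the tutorial argues for looking beyond rating prediction.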
Xavier Amatriain (PhD) is Director of Algorithms Engineering at Netflix. He leads a team of researchers and engineers designing and implementing the next wave of machine learning approaches to power the Netflix product. Before that, he was a researcher in recommender systems and neighboring areas such as data mining, machine learning, information retrieval, and multimedia. He has authored more than 50 publications, including book chapters, journal papers, and articles in international conferences. He has also lectured at different universities, including the University of California, Santa Barbara and UPF in Barcelona, Spain, where he is originally from.
Bamshad Mobasher is a Professor of Computer Science and the director of the Center for Web Intelligence at the School of Computing of DePaul University in Chicago. His research areas include Web mining, Web personalization, recommender systems, predictive user modeling, and information retrieval. He has published five edited books as well as more than 170 scientific articles, including several seminal papers in Web mining and Web personalization that are among the most cited in these areas. Most recently, he has served as program chair and steering committee member of the ACM International Conference on Recommender Systems, as program chair for the International Conference on User Modeling, Adaptation and Personalization, and as local organizing chair for the ACM Conference on Knowledge Discovery and Data Mining. As the director of the Center for Web Intelligence, Dr. Mobasher is directing research in Web mining, predictive analytics, and recommender systems, as well as overseeing several related joint projects with industry. Dr. Mobasher serves as an associate editor for the ACM Transactions on the Web, the ACM Transactions on Interactive Intelligent Systems and the ACM Transactions on Internet Technology. He also serves on the editorial board of User Modeling and User-Adapted Interaction: The Journal of Personalization Research.
Correlation clustering: from theory to practice
Correlation clustering is arguably the most natural formulation of clustering. Given a set of objects and a pairwise similarity measure between them, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters. As it just needs a definition of similarity, its broad generality makes it applicable to a wide range of problems in different contexts, and in particular makes it naturally suitable to clustering structured objects for which feature vectors can be difficult to obtain. Despite its simplicity, generality and wide applicability, correlation clustering has so far received much more attention from the algorithmic theory community than from the data mining community. The goal of this tutorial is to show how correlation clustering can be a powerful addition to the toolkit of the data mining researcher and practitioner, and to encourage discussions and further research in the area. In the tutorial we will survey the problem and its most common variants, with an emphasis on the algorithmic techniques and key ideas developed to derive efficient solutions. We will motivate the problems and discuss real-world applications, the scalability issues that may arise, and the existing approaches to handle them.
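A classic algorithmic idea in this area is the randomized pivot algorithm of Ailon, Charikar and Newman (often called KwikCluster), which gives an expected constant-factor approximation: pick a random unclustered object as pivot, place it in a cluster together with all remaining objects similar to it, and recurse. A minimal sketch on an invented instance (similarity given as a set of "similar" pairs; all other pairs are treated as dissimilar):

```python
import random

def kwik_cluster(nodes, similar, seed=0):
    """Pivot algorithm: pick a random pivot, cluster it with all remaining
    nodes similar to it, then recurse on the rest."""
    rng = random.Random(seed)
    remaining = list(nodes)
    clusters = []
    while remaining:
        pivot = remaining.pop(rng.randrange(len(remaining)))
        cluster = [pivot] + [
            v for v in remaining
            if (pivot, v) in similar or (v, pivot) in similar
        ]
        remaining = [v for v in remaining if v not in cluster]
        clusters.append(sorted(cluster))
    return clusters

# Toy instance: a clique of three similar objects, a similar pair, one loner.
sim = {("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")}
print(kwik_cluster("abcdef", sim))
```

Note that the algorithm never asks for feature vectors, only for pairwise similarity judgments, which is the generality the abstract emphasizes.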
Francesco Bonchi is leading the Web Mining research group at Yahoo Labs in Barcelona, Spain. He is a member of the ECML PKDD Steering Committee, a member of the Editorial Board of ACM Transactions on Intelligent Systems and Technology (TIST), and an Associate Editor of IEEE Transactions on Knowledge and Data Engineering (TKDE). He has been program co-chair of ECML PKDD 2010, PinKDD 2007 and 2008, PADM 2006, and KDID 2005. He has co-authored several papers on variants of correlation clustering. At KDD 2013 he had 4 papers accepted in the research track. More information can be found at http://www.francescobonchi.com/
David García-Soriano received his undergraduate degrees in Computer Science and Mathematics from the Complutense University of Madrid. After engineering internships at CERN and Google, he joined the Algorithms and Complexity group at CWI Amsterdam, obtaining his PhD from the University of Amsterdam under the supervision of Harry Buhrman. Currently he is a postdoctoral researcher at Yahoo Labs in Barcelona. His research interests include sublinear-time algorithms, learning, approximation algorithms, and large-scale problems in data mining and machine learning. More information can be found at https://sites.google.com/site/elhipercubo/
Edo Liberty is leading the "Scalable Machine Learning" research group at Yahoo Labs in New York, focusing on the theory and practice of (very) large-scale data mining and machine learning; in particular, on theoretical foundations of machine learning, optimization, scalable scientific computing, and machine learning systems and platforms. He received his PhD in Computer Science from Yale University under the supervision of Steven Zucker. During his PhD he spent time at both UCLA and Google as an engineering intern and a researcher. After that, he joined the Program in Applied Mathematics at Yale as a post-doctoral fellow. In 2009 he joined Yahoo Labs. He has published important papers on correlation clustering [1, 4] and given several lectures on the topic. He received best paper awards at SODA 2011 and KDD 2013. More information can be found at http://www.cs.yale.edu/homes/el327/
Deep Learning
Building intelligent systems that are capable of extracting high-level representations from high-dimensional sensory data lies at the core of solving many AI related tasks, including visual object or pattern recognition, speech perception, and language understanding. Theoretical and biological arguments strongly suggest that building such systems requires deep architectures that involve many layers of nonlinear processing. Many existing learning algorithms use shallow architectures, including neural networks with only one hidden layer, support vector machines, kernel logistic regression, and many others. The internal representations learned by such systems are necessarily simple and are incapable of extracting some types of complex structure from high-dimensional input. In the past few years, researchers across many different communities, from applied statistics to engineering, computer science and neuroscience, have proposed several deep (hierarchical) models that are capable of extracting useful, high-level structured representations. An important property of these models is that they can extract complex statistical dependencies from high-dimensional sensory input and efficiently learn high-level representations by re-using and combining intermediate concepts, allowing these models to generalize well across a wide variety of tasks. The learned high-level representations have been shown to give state-of-the-art results in many challenging learning problems, where data patterns often exhibit a high degree of variations, and have been successfully applied in a wide variety of application domains, including visual object recognition, information retrieval, natural language processing, and speech perception. A few notable examples of such models include Deep Belief Networks, Deep Boltzmann Machines, Deep Autoencoders, and sparse coding-based methods.
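As a miniature of the autoencoder idea mentioned above, the sketch below trains a linear autoencoder with a single latent unit by gradient descent on invented rank-one data. It is far from a deep model, but it shows the same objective in miniature: learn a compact representation that minimizes reconstruction error. All sizes, data, and learning-rate choices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data lying on a 1-D line in 2-D: one latent unit can
# in principle reconstruct it perfectly.
t = np.linspace(-1, 1, 20)
x = np.outer(t, [1.0, 2.0])               # shape (20, 2)

W1 = rng.normal(scale=0.5, size=(2, 1))   # encoder weights
W2 = rng.normal(scale=0.5, size=(1, 2))   # decoder weights

def loss(x, W1, W2):
    """Mean squared reconstruction error."""
    return np.mean((x @ W1 @ W2 - x) ** 2)

losses = [loss(x, W1, W2)]
lr = 0.05
for _ in range(2000):
    h = x @ W1                  # latent representation (the learned "code")
    e = h @ W2 - x              # reconstruction error
    g = 2 * e / e.size          # gradient of the loss w.r.t. the output
    gW1 = x.T @ (g @ W2.T)      # backpropagate to encoder
    gW2 = h.T @ g               # backpropagate to decoder
    W1 -= lr * gW1
    W2 -= lr * gW2
    losses.append(loss(x, W1, W2))

print(losses[0], losses[-1])
```

Deep autoencoders stack many such nonlinear encode/decode layers, so intermediate codes become progressively more abstract representations of the input.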
Ruslan Salakhutdinov received his PhD in machine learning (computer science) from the University of Toronto in 2009. After spending two post-doctoral years at the Massachusetts Institute of Technology Artificial Intelligence Lab, he joined the University of Toronto as an Assistant Professor in the Department of Computer Science and Department of Statistics. Dr. Salakhutdinov's primary interests lie in statistical machine learning, Deep Learning, probabilistic graphical models, and large-scale optimization. He is an action editor of the Journal of Machine Learning Research and has served on the senior programme committees of several learning conferences, including NIPS and ICML. He is the recipient of the Early Researcher Award, Connaught New Researcher Award, Alfred P. Sloan Research Fellowship, Microsoft Research Faculty Fellowship, Google Faculty Research Award, and is a Fellow of the Canadian Institute for Advanced Research. A major focus of Dr. Salakhutdinov's work is Deep Learning, an active research area that stems, in large part, from his 2006 Science article co-authored with Geoffrey Hinton. This work developed a formulation for stacked Restricted Boltzmann Machines (RBMs), showing an algorithm capable of learning large-scale multi-layer networks using a combination of unsupervised pre-training followed by supervised training. Dr. Salakhutdinov subsequently pioneered a new class of deep generative models, called Deep Boltzmann Machines (DBMs). These are probabilistic graphical models that contain multiple layers of latent variables. Each nonlinear layer captures progressively more complex patterns of data, which is a promising way of solving visual object recognition, language understanding, and speech perception problems. Dr. Salakhutdinov's contributions to Deep Learning have already received over 5000 citations according to Google Scholar, and have been applied broadly in speech, language, and image analysis.
Network Mining and Analysis for Social Applications
The recent blossoming of social network and communication services in both public and corporate settings has generated a staggering amount of network data of all kinds. Unlike the bio-networks and chemical compound graph data often used in traditional network mining and analysis, the new network data grown out of these social applications are characterized by their rich attributes, high heterogeneity, enormous sizes and complex patterns of various semantic meanings, all of which have posed significant research challenges to the graph/network mining community. In this tutorial, we aim to examine some recent advances in network mining and analysis for social applications, covering a diverse collection of methodologies and applications from the perspectives of event, relationship, collaboration and network pattern. We will present the problem setting, the research challenges, the recent research advances and some future directions for each perspective.
Feida Zhu is an assistant professor at the School of Information Systems of Singapore Management University (SMU). He obtained his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign (UIUC) in 2009 and his B.Sc. in Computer Science from Fudan University, China, in 2001. His current research interests include large-scale data mining, graph/network mining and social network analysis. His research on large-scale frequent pattern mining won Best Student Paper Awards at the 2007 IEEE International Conference on Data Engineering (ICDE) and the 2007 Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). He founded the Pinnacle Lab, a joint lab with China's Ping An Insurance Group, focusing on social media mining and analysis for the finance industry.
Huan Sun is currently a fourth-year Ph.D. student working with Prof. Xifeng Yan in the Department of Computer Science at the University of California, Santa Barbara. Her research focuses on statistical machine learning and data mining, with an emphasis on network analysis and text mining. She received the Regents' Special Fellowship to support her graduate study in 2010. Before that, she received her B.S. degree in Electronic Engineering and Information Science from the University of Science and Technology of China.
Xifeng Yan is an associate professor at the University of California at Santa Barbara. He holds the Venkatesh Narayanamurti Chair of Computer Science. He received his Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2006. He was a research staff member at the IBM T. J. Watson Research Center between 2006 and 2008. He has been working on modeling, managing, and mining graphs in information networks, computer systems, social media and bioinformatics. His work is extensively cited, with over 7,000 citations per Google Scholar and thousands of software downloads. He received the NSF CAREER Award, the IBM Invention Achievement Award, the ACM-SIGMOD Dissertation Runner-Up Award, and the IEEE ICDM 10-year Highest Impact Paper Award.
Sampling for Big Data
One response to the proliferation of large datasets has been to develop ingenious ways to throw resources at the problem, using massive fault-tolerant storage architectures and parallel and graph-based computation models such as MapReduce, Pregel and Giraph. However, not all environments can support this scale of resources, and not all queries need an exact response. This motivates the use of sampling to generate summary datasets that support rapid queries and prolong the useful life of the data in storage. To be effective, sampling must mediate the tensions between resource constraints, data characteristics, and the required query accuracy. The state of the art in sampling goes far beyond simple uniform selection of elements to maximize the usefulness of the resulting sample. This tutorial reviews progress in sample design for large datasets, including streaming and graph-structured data. Applications to sampling network traffic and social networks are discussed.
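A classic building block for sampling a stream of unknown length in a single pass is reservoir sampling (Vitter's Algorithm R), which maintains a uniform random sample of fixed size in constant memory. A minimal sketch:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniform random sample of k items from a stream of unknown length.
    Item t (1-indexed) replaces a random reservoir slot with probability k/t."""
    rng = random.Random(seed)
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(t)            # uniform in [0, t)
            if j < k:
                reservoir[j] = item         # keep item with probability k/t
    return reservoir

# One pass over a million-element stream, keeping only 10 items in memory.
sample = reservoir_sample(range(1_000_000), 10)
print(sample)
```

The tutorial's weighted, stratified and graph-aware schemes refine exactly this kind of one-pass design to match data characteristics and query accuracy requirements.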
Graham Cormode is a Professor in Computer Science at the University of Warwick in the UK. He works on research topics in data management, privacy and big data analysis. Previously, he was a principal member of technical staff at AT&T Labs-Research from 2006 to 2013, and before this he was at Bell Labs and Rutgers University. His PhD is from the University of Warwick. In 2013, he was recognized as a Distinguished Scientist by the Association for Computing Machinery (ACM). His work has appeared in over 80 conference papers and 30 journal papers, and has been awarded 25 US patents. His work has received two best paper awards, and a ten-year "Test of Time" award for his work on sketching algorithms. He has edited two books on applications of algorithms to different areas, and coauthored a third. Cormode currently serves as an associate editor for the IEEE Transactions on Knowledge and Data Engineering (TKDE) and the ACM Transactions on Database Systems (TODS).
Nick Duffield is a Research Professor at Rutgers University / DIMACS, New Jersey, USA. From 1995 until 2013 he was at AT&T Labs-Research, Florham Park, NJ, where he was a Distinguished Member of Technical Staff and an AT&T Fellow. He previously held post-doctoral and faculty positions in Dublin, Ireland, and Heidelberg, Germany. He received a BA in Natural Sciences in 1982 and an MMath (Part III Maths) in 1983 from the University of Cambridge, UK, and a PhD in Mathematical Physics from the University of London, UK, in 1987. His research focuses on data and network science, particularly applications of probability, statistics, algorithms and machine learning to the acquisition, management and analysis of large datasets in communications networks and beyond. He is a co-inventor of the Smart Sampling technologies that lie at the heart of AT&T's scalable Traffic Analysis Service. He was Charter Chair of the IETF working group on Packet Sampling, and was an Associate Editor for the IEEE/ACM Transactions on Networking from 2007 to 2011. Duffield is an IEEE Fellow, and was a co-recipient of the ACM Sigmetrics Test of Time Award in both 2012 and 2013 for work in Network Tomography. He was recently an invited speaker and panelist at the workshop on Big Data in the Mathematical Sciences at Warwick University, UK.
Statistically Sound Pattern Discovery
Pattern discovery is a core data mining activity. Initial approaches were dominated by the frequent pattern discovery paradigm, in which only frequent patterns were explored. Now that this paradigm has been thoroughly researched and its limitations are well understood, it is giving way to two emerging alternatives: the information-theoretic minimum message length paradigm and the statistically sound paradigm. This tutorial covers the latter. In this paradigm, patterns are required to pass statistical tests with respect to user-defined null hypotheses, providing great flexibility about the properties that are sought, and strict control over the risk of false discoveries and overfitting. We cover the theoretical foundations, practical issues, limitations and future directions of this growing area of research, and explore in detail how this approach to pattern discovery resolves many of the limitations of the frequent pattern discovery paradigm and can deliver efficient and effective discovery of small sets of interesting patterns.
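To make the idea of testing patterns against a null hypothesis concrete, a common building block is Fisher's exact test on the 2x2 contingency table of a candidate association (e.g. "transactions containing X also contain Y"), with a multiple-testing correction such as Bonferroni to control false discoveries across all candidates. The sketch below is a generic illustration of that building block, not the specific machinery presented in the tutorial:

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]],
    e.g. a = #transactions with X and Y, b = X without Y,
         c = Y without X,                d = neither.
    Small p means the observed co-occurrence is unlikely under independence."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    # Upper tail of the hypergeometric distribution: tables at least as
    # extreme (as many or more joint occurrences) as the one observed.
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        p += comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    return p

# Bonferroni correction: with m candidate patterns, a pattern is accepted
# only if its p-value is below alpha / m.
def is_significant(p_value, alpha, num_patterns):
    return p_value < alpha / num_patterns
```

A perfectly associated table such as (10, 0, 0, 10) yields a tiny p-value, while a table consistent with independence does not; only the former would survive a corrected threshold. Holm and layered corrections, which the statistically sound literature favors for greater power, refine this basic scheme.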
Geoff Webb is a Professor of Information Technology Research in the Faculty of Information Technology at Monash University, where he heads the Centre for Research in Intelligent Systems. His primary research areas are machine learning, data mining, user modelling and computational structural biology. He is known for his contribution to the debate about the application of Occam's razor in machine learning and for the development of numerous methods, algorithms and techniques for machine learning, data mining, user modelling and computational structural biology. His commercial data mining software, Magnum Opus, incorporates many techniques from his association discovery research. Many of his learning algorithms are included in the widely used Weka machine learning workbench. He is editor-in-chief of Data Mining and Knowledge Discovery, co-editor of the Springer Encyclopedia of Machine Learning, a foundation member of the advisory board of Statistical Analysis and Data Mining, a member of the editorial board of Machine Learning, and was a foundation member of the editorial board of ACM Transactions on Knowledge Discovery from Data. He is PC Co-Chair of the 2015 ACM SIGKDD Conference on Knowledge Discovery and Data Mining, was PC Co-Chair of the 2010 IEEE International Conference on Data Mining, and was co-General Chair of the 2012 IEEE International Conference on Data Mining. He has received the 2013 IEEE ICDM Service Award and a 2014 Australian Research Council Discovery Outstanding Researcher Award (one of only seven awarded across all research disciplines).
Wilhelmiina Hamalainen is a postdoctoral researcher funded by the Academy of Finland, currently working in the School of Computing, University of Eastern Finland. She received an M.Th. degree in 1998 and an M.Sc. degree in 2002, both from the University of Helsinki, a Ph.Lic. degree in 2006 from the University of Joensuu, and a Ph.D. degree in Computer Science in 2010 from the University of Helsinki. She has worked as a teacher, lecturer, and researcher in the university since 1996, including two years as a university researcher in biology (applied data mining). She has often worked on interdisciplinary problems involving computer science, statistics, and mathematics. Her main achievements in data mining are efficient algorithms for finding reliable statistical dependency patterns, for which she received an award from the Finnish Society for Computer Science and the Research Foundation of the Finnish Information Processing Association. Her areas of expertise cover statistical dependency analysis and significance testing, optimization algorithms, and applied knowledge discovery (biology, educational technology). Her research interests include statistically sound data mining, mathematics, algorithmics, and general number crunching.
Recommendation in Social Media
The pervasive use of social media generates massive data at an unprecedented rate, and the information overload problem is becoming increasingly severe for social media users. Recommendation has been proven effective in mitigating the information overload problem, has demonstrated its strength in improving the quality of user experience, and has positively impacted the success of social media. New types of data introduced by social media not only provide more information to advance traditional recommender systems but also open new research possibilities for recommendation. In this tutorial, we aim to provide a comprehensive overview of various recommendation tasks in social media, especially their recent advances and new frontiers. We introduce basic concepts, review state-of-the-art algorithms, and discuss the emerging challenges and opportunities. Finally, we summarize the tutorial with discussions on open issues and challenges in recommendation for social media.
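As background for the traditional recommender systems the tutorial builds on, the classic collaborative filtering baseline scores unseen items for a user by the similarity-weighted ratings of other users. The toy sketch below illustrates that baseline only; it is our own illustration, not an algorithm from the tutorial, and social-media-specific signals (trust, social links, content) extend well beyond it:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    shared = set(u) & set(v)
    num = sum(u[i] * v[i] for i in shared)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def recommend(target, others, top_n=3):
    """Rank items the target user has not rated by the similarity-weighted
    ratings of other users (user-based collaborative filtering)."""
    scores = {}
    for other in others:
        sim = cosine(target, other)
        for item, rating in other.items():
            if item not in target:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

In social media settings, the neighbor weights need not come from rating overlap alone: social-recommendation methods covered by the tutorial replace or augment them with explicit trust and friendship relations.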
Jiliang Tang is a senior PhD student in Computer Science and Engineering at Arizona State University. He obtained his Bachelor's degree in Software Engineering and his Master's degree in Computer Science at Beijing Institute of Technology in 2008 and 2010, respectively. His research interests are in computing with online trust, mining social media data, social computing, feature selection and data mining. He has published innovative works in highly ranked journals and top conference proceedings such as IEEE TKDE, ACM TKDD, DMKD, ACM SIGKDD, WWW, WSDM, SDM, ICDM, IJCAI, and CIKM.
Jie Tang is an associate professor at the Department of Computer Science and Technology, Tsinghua University. He serves as the director of the Scientific Office of the Department of Computer Science. His main research interests include social network analysis and data mining. He has been a visiting scholar at Cornell University, the Chinese University of Hong Kong, Hong Kong University of Science and Technology, and Leuven University. He has published over 100 research papers in major international journals and conferences. He serves as PC Co-Chair of WSDM'15, ADMA'11, and SocInfo'12, Poster Co-Chair of SIGKDD'14, Workshop Co-Chair of SIGKDD'13, Local Chair of SIGKDD'12, and Publications Co-Chair of SIGKDD'11, and also serves as a PC member of more than 50 international conferences. He is now leading the Arnetminer.org project for academic social network analysis and mining, which has attracted millions of independent IP accesses from 220 countries/regions in the world. He was honored with the CCF Young Scientist Award, the NSFC Excellent Young Scholar award, and the IBM Innovation Faculty Award.
Huan Liu is a professor of Computer Science and Engineering at Arizona State University. He obtained his Ph.D. in Computer Science at the University of Southern California and his B.Eng. in Computer Science and Electrical Engineering at Shanghai Jiao Tong University. Before he joined ASU, he worked at Telecom Australia Research Labs and was on the faculty at the National University of Singapore. He was recognized for excellence in teaching and research in Computer Science and Engineering at Arizona State University. His research interests are in data mining, machine learning, social computing, and artificial intelligence, investigating problems that arise in many real-world, data-intensive applications with high-dimensional data of disparate forms, such as social media. His well-cited publications include books, book chapters, and encyclopedia entries, as well as conference and journal papers. He serves on journal editorial boards and numerous conference program committees, and is a founding organizer of the International Conference Series on Social Computing, Behavioral-Cultural Modeling, and Prediction (http://sbp.asu.edu/). He is an IEEE Fellow. All presenters are active researchers in social network analysis, social media mining, and data mining. They have presented tutorials on relevant topics at WWW'14, ICDM'13, WWW'13, WSDM'13, and KDD'08. Their years of research and extensive experience in recommendation and social media mining make them uniquely qualified to deliver this timely tutorial on recommendation in social media.