Lecture-Style Tutorials

We have a fantastic lineup of lecture-style tutorials to be held in conjunction with KDD 2020. Check back as we get closer to the conference for more detailed program information.

  • Advances in Recommender Systems: From Multi-stakeholder Marketplaces to Automated RecSys

    The tutorial focuses on two major themes of recent advances in recommender systems: multi-stakeholder marketplace and automated RecSys.

    Part A: Recommendations in a Marketplace
    Multi-sided marketplaces are steadily emerging as viable business models in many applications (e.g. Amazon, Airbnb, YouTube), wherein the platforms have customers not only on the demand side (e.g. users), but also on the supply side (e.g. retailers). In the first part of the tutorial, we consider a number of research problems that need to be addressed when developing a search & recommendation framework powering a multi-stakeholder marketplace. We highlight the importance of multi-objective ranking/recommendation, discuss different ways in which stakeholders specify their objectives, discuss user-specific characteristics (e.g. user receptivity) that could be leveraged when developing joint optimization modules, and finally present a number of real-world case studies of such multi-stakeholder search and recommendation systems.

    Part B: Automated Recommendation System
    As recommendation tasks grow more diverse and recommendation models more complicated, it is increasingly challenging to develop a proper recommendation system that can adapt well to a new recommendation task. In this tutorial, we will focus on how automated machine learning (AutoML) techniques can benefit the design and usage of recommendation systems. Specifically, we will start with a full scope describing what can be automated for recommendation systems. Then, we will elaborate on three important topics under this scope, i.e., feature engineering, hyperparameter optimization/neural architecture search, and algorithm selection. The core issues and recent works under these topics will be introduced, summarized, and discussed. Finally, we will conclude the tutorial with some future directions.

    Presenter(s):

    Rishabh Mehrotra (Spotify); Benjamin Carterette (Spotify); Yong Li (Tsinghua University); Quanming Yao (4th Paradigm); James Tin-Yau Kwok (The Hong Kong University of Science and Technology); Isabelle Guyon (Clopinet); Qiang Yang (Hong Kong UST)

    Timeslot: All day Website
  • Causal Inference Meets Machine Learning

    Part 1: Causal inference has numerous real-world applications in many domains such as health care, marketing, political science, and online advertising. Treatment effect estimation, a fundamental problem in causal inference, has been extensively studied in statistics for decades. However, traditional treatment effect estimation methods may not handle large-scale, high-dimensional heterogeneous data well. In recent years, an emerging research direction in the broad artificial intelligence field has attracted increasing attention: combining the advantages of traditional treatment effect estimation approaches (e.g., matching estimators) with advanced representation learning approaches (e.g., deep neural networks). In this tutorial, we will introduce both traditional and state-of-the-art representation learning algorithms for treatment effect estimation. Background on causal inference, counterfactuals, and matching estimators will be covered as well. We will also showcase promising applications of these methods in different application domains.

    Part 2: Predicting future outcome values based on observed features, using a model estimated on a training data set, is a common machine learning problem. Many learning algorithms have been proposed and shown to be successful when the test data and training data come from the same distribution. However, the best-performing models for a given distribution of training data typically exploit subtle statistical relationships among features, making them potentially more prone to prediction error when applied to test data whose distribution differs from that of the training data. How to develop learning models that are stable and robust to shifts in data is of paramount importance for both academic research and real applications. Causal inference, which refers to the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect, is a powerful statistical modeling tool for explanatory and stable learning. In this tutorial, we focus on causal inference and stable learning, aiming to explore causal knowledge from observational data to improve the interpretability and stability of machine learning algorithms. First, we will give an introduction to causal inference and present some recent data-driven approaches to estimating causal effects from observational data, especially in the high-dimensional setting. Aiming to bridge the gap between causal inference and machine learning for stable learning, we then define the stability and robustness of learning algorithms and introduce some recent stable learning algorithms for improving the stability and interpretability of prediction. Finally, we will discuss the applications and future directions of stable learning, and provide benchmarks for stable learning.
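As a rough illustration of the matching estimators mentioned above, the sketch below implements a 1-nearest-neighbor covariate-matching estimator of the average treatment effect on synthetic data. This is a minimal didactic sketch under assumed names and a made-up data-generating process; practical matching estimators add refinements such as calipers, bias correction, and propensity-score matching.

```python
import numpy as np

def matching_ate(X, t, y):
    """Estimate the average treatment effect (ATE) via 1-nearest-neighbor
    covariate matching: each unit's counterfactual outcome is imputed from
    its closest neighbor (in Euclidean distance) in the opposite group."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=int)
    y = np.asarray(y, dtype=float)
    effects = []
    for i in range(len(y)):
        # candidate matches: all units with the opposite treatment status
        opposite = np.where(t != t[i])[0]
        j = opposite[np.argmin(np.linalg.norm(X[opposite] - X[i], axis=1))]
        # signed difference: treated outcome minus (imputed) control outcome
        effects.append(y[i] - y[j] if t[i] == 1 else y[j] - y[i])
    return float(np.mean(effects))

# Synthetic example: outcome depends on a covariate plus a true effect of 2.0
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
t = (rng.random(500) < 0.5).astype(int)
y = X[:, 0] + 2.0 * t + 0.1 * rng.normal(size=500)
estimate = matching_ate(X, t, y)   # should land near the true effect of 2.0
```

The representation learning approaches covered in the tutorial can be viewed as replacing the raw covariate space used for matching here with a learned, balanced representation.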

    Presenter(s):

    Peng Cui (Tsinghua University); Zheyan Shen (Tsinghua University); Sheng Li (University of Georgia); Liuyi Yao (University at Buffalo, USA); Yaliang Li (Alibaba Group); Zhixuan Chu (University of Georgia); Jing Gao (University at Buffalo)

    Timeslot: All day Website
  • Fairness in Machine Learning for Healthcare

    Responsible machine learning is central to driving adoption of machine learning in healthcare. While the deployment of responsible machine learning systems has largely focused on robustness and interpretable machine learning, fairness is now becoming a pivotal issue in healthcare AI/ML. Even though there is already a large and growing body of literature on fairness in machine learning in general, a focused emphasis on the requirements for fair and unbiased systems deployed in healthcare settings is lacking. This tutorial is motivated by the need to comprehensively study fairness in the context of applied machine learning in healthcare.

    Presenter(s):

    Muhammad A Ahmad (KenSci / University of Washington Tacoma); Arpit Patel (University of Washington); Carly Miller (University of Washington); Vikas Kumar (KenSci); Ankur Teredesai (University of Washington Tacoma /KenSci)

    Timeslot: PM Website
  • Learning from All Types of Experiences: A Unifying Machine Learning Perspective

    Machine learning is about computational methods that enable machines to learn concepts from experiences. In practice, experiences can take many forms, such as data examples, abstract knowledge, feedback from the environment, auxiliary models, etc. Solving complex problems (e.g. in healthcare, manufacturing, social science) requires integrating all possible sources of experiences in learning. On the other hand, several decades of machine learning research have resulted in a multitude of algorithms in different paradigms, with names such as (un-)supervised learning, reinforcement learning, adversarial learning, constraint-driven learning, and so forth. These algorithms have distinctly different mathematical formulations and, crucially, ingest different forms of experiences. With such a bewildering marketplace of algorithms, solving a real-world problem using ML is often challenging, demanding strong ML expertise and creativity and requiring much bespoke effort to craft solutions. This tutorial will present a systematic, unified perspective of machine learning, offering both a refreshing holistic understanding of the diverse learning algorithms and guidance for operationalizing machine learning to create problem solutions by integrating all sources of experiences.

    Presenter(s):

    Zhiting Hu (CMU); Eric Xing (Petuum Inc. and CMU)

    Timeslot: PM Website
  • Physics Inspired Models in Artificial Intelligence

    As artificial intelligence and machine learning systems are being integrated into almost all facets of society, the need to understand such models is becoming paramount. Despite the phenomenal success of deep learning in many application domains, the explainability of such models has proved elusive. Recent work applying physics-inspired models and techniques to understand machine learning models and AI has yielded promising results. Machine learning has a long history of cross-fertilization with other domains, e.g., Shapley value models from game theory, generalized linear models from statistics, Bayesian rule lists from frequent pattern mining [Wang 2016], and functional analysis in social networks. Ideas from physics have time and again provided fodder for conceptual developments in machine learning. In this tutorial we provide an overview of how ideas from physics have informed progress in machine learning. We also explore the history of the influence of physics on machine learning, which is oft neglected in the computer science community, and how recent insights from physics hold the promise of opening the black box of deep learning. This history shows that mapping a machine learning problem formulation to a class of physical models with already known behavior and solutions can be used to explain the inner workings of machine learning models. Lastly, we will describe current and future trends in this area and suggest a research agenda on how physics-inspired models can benefit machine learning.

    Presenter(s):

    Muhammad A Ahmad (KenSci / University of Washington Tacoma); Sener Ozonder (Istinye University)

    Timeslot: AM Website
  • Scientific Text Mining and Knowledge Graphs

    Unstructured scientific text, in various forms of textual artifacts including manuscripts, publications, patents, and proposals, is used to store the tremendous wealth of knowledge discovered after weeks, months, and years of developing hypotheses, working in the lab or clinic, and analyzing results. A grand challenge in data mining research is to develop effective methods for transforming scientific text into well-structured forms (e.g., ontologies, taxonomies, knowledge graphs), so that machine intelligence systems can build on them for hypothesis generation and validation. In this tutorial, we provide a comprehensive overview of recent research and development in this direction. First, we introduce a series of text mining methods that extract phrases, entities, scientific concepts, relations, claims, and experimental evidence. Then we discuss methods that construct and learn from scientific knowledge graphs for accurate search, document classification, and exploratory analysis. Specifically, we focus on scalable, effective, weakly supervised methods that work on text in the sciences (e.g., chemistry, biology).

    Presenter(s):

    Meng Jiang (University of Notre Dame); Jingbo Shang (University of California, San Diego)

    Timeslot: AM Website
  • Learning with Small Data

    In the era of big data, data-driven methods have become increasingly popular in various applications, such as image recognition, traffic signal control, and fake news detection. The superior performance of these data-driven approaches relies on large-scale labeled training data, which may be inaccessible in real-world applications, i.e., the “small (labeled) data” challenge. Examples include predicting emergent events in a city, detecting emerging fake news, and forecasting the progression of rare diseases. In most scenarios, people care most about exactly these small-data cases, and thus improving the effectiveness of machine learning algorithms with small labeled data has become a popular research topic. In this tutorial, we will review trending state-of-the-art machine learning techniques for learning with small (labeled) data. These techniques are organized along two aspects: (1) a comprehensive review of recent studies on knowledge generalization, transfer, and sharing, covering transfer learning, multi-task learning, and meta-learning. In particular, we will focus on meta-learning, which improves model generalization ability and has recently proven to be an effective approach; (2) cutting-edge techniques that incorporate domain knowledge into machine learning models. Different from model-based knowledge transfer techniques, in real-world applications domain knowledge (e.g., physical laws) provides a new angle for dealing with the small-data challenge. Specifically, domain knowledge can be used to optimize learning strategies and/or guide model design. In the data mining field, we believe that learning with small data is a trending topic with important social impact, which will attract both researchers and practitioners from academia and industry.

    Presenter(s):

    Huaxiu Yao (Pennsylvania State University); Xiaowei Jia (University of Minnesota); Vipin Kumar (University of Minnesota); Zhenhui (Jessie) Li (Penn State University)

    Timeslot: AM Website
  • Adversarial Attacks and Defenses: Frontiers, Advances and Practice

    Basics and frontiers of research in the field of adversarial attacks and defenses, with a hands-on tutorial on attacking and defending deep learning models.

    Presenter(s):

    Han Xu (Michigan State University); Yaxin Li (Michigan State University); Wei Jin (Michigan State University); Jiliang Tang (Michigan State University)

    Timeslot: AM Website
  • Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web

    The World Wide Web contains vast quantities of textual information in several forms: unstructured text, template-based semi-structured webpages (which present data in key-value pairs and lists), and tables. Methods for extracting information from these sources and converting it to a structured form have been a target of research from the natural language processing (NLP), data mining, and database communities. While these researchers have largely separated extraction from web data into different problems based on the modality of the data, they have faced similar problems such as learning with limited labeled data, defining (or avoiding defining) ontologies, and making use of prior knowledge.

    In this tutorial we take a holistic view toward information extraction, exploring the commonalities in the challenges and solutions developed to address these different forms of text. We will explore the approaches targeted at unstructured text that largely rely on learning syntactic or semantic textual patterns, approaches targeted at semi-structured documents that learn to identify structural patterns in the template, and approaches targeting web tables which rely heavily on entity linking and type information.

    While these different data modalities have largely been considered separately in the past, recent research has started taking a more inclusive approach toward textual extraction, in which the multiple signals offered by textual, layout, and visual clues are combined into a single extraction model made possible by new deep learning approaches. At the same time, trends within purely textual extraction have shifted toward full-document understanding rather than considering sentences as independent units. With this in mind, it is worth considering the information extraction problem as a whole to motivate solutions that harness textual semantics along with visual and semi-structured layout information. We will discuss these approaches and suggest avenues for future work.

    Presenter(s):

    Luna Dong (Amazon.com); Hannaneh Hajishirzi (University of Washington); Colin Lockard (University of Washington); Prashant Shiralkar (Amazon)

    Timeslot: AM Website
  • Recent Advances on Graph Analytics and Its Applications in Healthcare

    A graph is a natural representation encoding both the features of data samples and the relationships among them. Analysis with graphs is a classic topic in data mining, and many techniques have been proposed in the past. In recent years, because of the rapid development of data mining and knowledge discovery, many novel graph analytics algorithms have been proposed and successfully applied in a variety of areas. The goal of this tutorial is to summarize recently developed graph analytics algorithms and how they have been applied in healthcare. In particular, our tutorial will cover both the technical advances and the applications in healthcare. On the technical side, we will introduce deep network embedding techniques, graph neural networks, knowledge graph construction and inference, graph generative models, and graph neural ordinary differential equation models. On the healthcare side, we will introduce how these methods can be applied in predictive modeling of clinical risks (e.g., chronic disease onset, in-hospital mortality, condition exacerbation, etc.) and disease subtyping with multi-modal patient data (e.g., electronic health records, medical images, and multi-omics), knowledge discovery from biomedical literature and its integration with data-driven models, as well as pharmaceutical research and development (e.g., de-novo chemical compound design and optimization, patient similarity for clinical trial recruitment, and pharmacovigilance). We will conclude the tutorial with a set of potential issues and challenges such as interpretability, fairness, and security. In particular, considering the global COVID-19 pandemic, we will also summarize existing research that has already leveraged graph analytics to help understand the mechanism, transmission, treatment, and prevention of COVID-19, as well as point out available resources and potential opportunities for future research.

    Presenter(s):

    Fei Wang (Cornell University); Peng Cui (Tsinghua University); Jian Pei (Simon Fraser University); Yangqiu Song (Hong Kong University of Science and Technology); Chengxi Zang (Cornell University)

    Timeslot: AM Website
  • Human-Centered Explainability for Healthcare

    In recent years, rapid advances in Artificial Intelligence (AI) techniques along with an ever-increasing availability of healthcare data have made many novel analyses possible. Significant successes have been observed in a wide range of tasks such as next-diagnosis prediction, AKI prediction, and adverse event prediction, including mortality and unexpected hospital re-admissions. However, these methods have seen limited adoption and use in clinical practice due to their black-box nature. A significant amount of research is currently focused on making such methods more interpretable or making post-hoc explanations more accessible. However, most of this work is done at a very low level and as a result may not have direct impact at the point of care. This tutorial will provide an overview of the landscape of approaches that have been developed for explainability in healthcare. Specifically, we will present the problem of explainability as it pertains to the various personas involved in healthcare, viz. data scientists, clinical researchers, and clinicians. We will chart out the requirements for such personas and present an overview of the different approaches that can address their needs. We will also walk through several use cases for these approaches. In the process, we will demonstrate how to perform such analysis using open source tools such as the IBM AI Explainability 360 Open Source Toolkit.

    Presenter(s):

    Prithwish Chakraborty (IBM Research); Bum Chul Kwon (IBM Research); Sanjoy Dey (IBM Research); Amit Dhurandhar (IBM Research); Daniel Gruen (IBM Research); Kenney Ng (IBM Research); Daby Sow (IBM Research); Kush R Varshney (IBM Research)

    Timeslot: AM Website
  • Recent Advances in Multimodal Educational Data Mining in K-12 Education

    Recently we have seen a rapid rise in the amount of education data available through the digitization of education. This huge amount of education data usually comes in a mixture of forms: images, videos, speech, texts, etc. It is crucial to consider data from different modalities to build successful applications in AI in education (AIED). This tutorial targets AI researchers and practitioners who are interested in applying state-of-the-art multimodal machine learning techniques to tackle some of the hard-core AIED tasks, such as automatic short answer grading, student assessment, class quality assurance, and knowledge tracing.

    In this tutorial, we will comprehensively review recent developments in applying multimodal learning approaches in AIED, with a focus on classroom multimodal data. Beyond introducing the recent advances of computer vision, speech, and natural language processing in education respectively, we will discuss how to combine data from different modalities and build AI-driven educational applications on top of these data. More specifically, we will talk about (1) representation learning; (2) algorithmic assessment & evaluation; and (3) personalized feedback. Participants will learn about recent trends and emerging challenges in this topic, representative tools and learning resources for obtaining ready-to-use models, and how related models and techniques benefit real-world AIED applications.

    Presenter(s):

    Zitao Liu (TAL Education Group); Songfan Yang (TAL Education Group); Jiliang Tang (Michigan State University); Neil Heffernan (Worcester Polytechnic Institute); Rose Luckin (University College London)

    Timeslot: AM Website
  • Online User Engagement: Metrics and Optimization

    User engagement plays a central role in companies operating online services, such as search engines, news portals, e-commerce sites, entertainment services, and social networks. A main challenge is to leverage collected knowledge about the daily online behavior of millions of users to understand what engages them in the short term and, more importantly, in the long term. Two critical steps in improving user engagement are defining metrics and optimizing them. The most common way engagement is measured is through various online metrics acting as proxy measures of user engagement. This tutorial will review these metrics, their advantages and drawbacks, and their appropriateness to various types of online services. Once metrics are defined, how to optimize them becomes the key issue. We will survey methodologies, including machine learning models and experimental designs, that are utilized to optimize these metrics directly or indirectly. As case studies, we will focus on four types of services: news, search, entertainment, and e-commerce.

    Presenter(s):

    Liangjie Hong (LinkedIn); Mounia Lalmas (Spotify)

    Timeslot: AM Website
  • Data Pricing – From Economics to Data Science

    It is well recognized that data are invaluable. How can we assess the value of data objectively and numerically? Pricing data has been studied and practiced in dispersed areas and disciplines, such as economics, data management and data mining, electronic commerce, and marketing. In this tutorial, we try to present a unified and comprehensive overview of this important and long-overdue pillar of data science and engineering. We will examine various motivations behind data pricing, understand the economics of data pricing, review the development and evolution of pricing models, and compare proposals for data marketplaces. We will also connect data pricing with several highly related areas, such as cloud service pricing, privacy pricing, and decentralized privacy-preserving infrastructure like blockchain. In addition, we will align with industry practice where real business is running. This tutorial will be highly interdisciplinary, a fusion of academic foundations and industry practice. We do not assume any background knowledge and will make the essential ideas highly accessible to practitioners and graduate students.

    Presenter(s):

    Jian Pei (Simon Fraser University)

    Timeslot: AM Website
  • Advanced Deep Graph Learning: Deeper, Faster, Robuster, and Unsupervised

    Many real data come in the form of non-grid objects, i.e., graphs, from social networks to molecules. Adapting deep learning from grid-like data (e.g. images) to graphs has recently received unprecedented attention from both the machine learning and data mining communities, leading to a new cross-domain field: Deep Graph Learning (DGL). Instead of painstaking feature engineering, DGL aims to learn informative representations of graphs in an end-to-end manner. It has exhibited remarkable success in various tasks, such as node/graph classification, link prediction, etc.

    While several previous KDD tutorials have introduced Graph Neural Networks (GNNs), there has seldom been a focus on the expressivity, trainability, and generalization of DGL algorithms. To go further, this tutorial mainly covers the key achievements of DGL in recent years. Specifically, we will discuss four essential topics: how to design and train deep GNNs in an efficient manner, how to adapt GNNs to cope with large-scale graphs, adversarial attacks on GNNs, and the unsupervised training of GNNs. Meanwhile, we will introduce applications of DGL in various domains, including but not limited to drug discovery, computer vision, and social network analysis.

    Presenter(s):

    Yu Rong (Tencent AI Lab); Wenbing Huang (Tsinghua University); Tingyang Xu (Tencent AI Lab); Hong Cheng (Chinese University of Hong Kong); Junzhou Huang (Tencent AI Lab); Yao Ma (Michigan State University); Yiqi Wang (Michigan State University); Tyler Derr (Michigan State University); Lingfei Wu (IBM T. J. Watson Research Center); Tengfei Ma (IBM Research)

    Timeslot: All day Website
  • Multi-modal Network Representation Learning: Methods and Applications

    In today’s information and computational society, complex systems are often modeled as multi-modal networks associated with heterogeneous structural relations, unstructured attributes/content, temporal context, or their combinations. The abundant information in multi-modal networks requires both domain understanding and a large exploratory search space when doing feature engineering to build customized intelligent solutions for different purposes. Therefore, automating feature discovery through representation learning in multi-modal networks has become essential for many applications. In this tutorial, we systematically review the area of multi-modal network representation learning, including a series of recent methods and applications. These methods will be categorized and introduced from the perspectives of unsupervised, semi-supervised, and supervised learning, with corresponding real applications. In the end, we conclude the tutorial and raise open discussions. The authors of this tutorial are active and productive researchers in this area.

    Presenter(s):

    Chuxu Zhang (Brandeis University); Meng Jiang (University of Notre Dame); Xiangliang Zhang (King Abdullah University of Science and Technology, Saudi Arabia); Yanfang Ye (Case Western Reserve University); Nitesh Chawla (Notre Dame)

    Timeslot: PM Website
  • Data Science for the Real Estate Industry

    The multi-trillion-dollar real estate industry is one of the oldest and largest industries in the world, and is notoriously resistant to change. However, over the past few years many new real estate technology solutions have come to market, and over the next decade data science will bring massive change to this huge industry. This tutorial’s primary goal is to introduce the world of real estate data science to non-real-estate data scientists. The tutorial will start with a short “Real Estate 101” course, introducing basic concepts, terminology, and the many different real estate businesses. Then, we will move through the different opportunities and challenges in the industry that can be addressed by data science and data scientists. We’ll touch on the different methods that are vital to applying data science to real estate, and finally we will introduce the real estate industry-wide Knowledge Graph and get into the details of its construction process, its characteristics, its challenges, and its role in developing novel methodologies for making macro- and micro-level market predictions.

    Presenter(s):

    Ron Bekkerman (Cherre Inc.); Foster Provost (NYU); Ali Rauh (Airbnb); Vanja Josifovski (Airbnb)

    Timeslot: AM Website
  • Overview and Importance of Data Quality for Machine Learning Tasks

    It is well understood from the literature that the performance of a machine learning (ML) model is upper bounded by the quality of the data. While researchers and practitioners have focused on improving the quality of models (e.g., neural architecture search and automated feature selection), there have been limited efforts towards improving data quality. One of the crucial requirements before consuming a dataset for any application is to understand the dataset at hand; failure to do so can result in inaccurate analytics and unreliable decisions. Assessing data quality across intelligently designed metrics, and developing corresponding transformation operations to address the quality gaps, reduces the effort a data scientist spends on iterative debugging of the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for machine learning applications. It surveys the important data-quality-related approaches discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems. Finally, we will discuss the work IBM Research is doing in this space.

    Presenter(s):

    Hima Patel (IBM Research); Nitin Gupta (IBM Research); Shazia Afzal (IBM Research); Shashank Mujumdar (IBM Research, India)

    Timeslot: AM Website
  • Interpreting and Explaining Deep Neural Networks: A Perspective on Time Series Data

    Explainable and interpretable machine learning models and algorithms are important topics which have received growing attention from research, application, and administration. Many advanced Deep Neural Networks (DNNs) are often perceived as black boxes. Researchers would like to be able to interpret what a DNN has learned in order to identify biases and failure modes and to improve the model. In this tutorial, we will provide a comprehensive overview of methods to analyze deep neural networks and insight into how these XAI methods help us understand time series data.

    Presenter(s):

    Jaesik Choi (KAIST)

    Timeslot: PM Website
  • Edge AI: Systems Design and ML for IoT Data Analytics

    In this tutorial, we introduce the network science and ML techniques relevant to edge computing, discuss systems for ML (e.g., model compression, quantization, HW/SW co-design, etc.) and ML for systems design (e.g., run-time resource optimization, power management for training and inference on edge devices), and illustrate their impact in addressing concrete IoT applications.

    Presenter(s):

    Radu Marculescu (The Univ. of Texas at Austin) radum@utexas.edu; Diana Marculescu (The Univ. of Texas at Austin) dianam@utexas.edu; Umit Ogras (Univ. of Wisconsin-Madison) uogras@wisc.edu.

    Timeslot: PM Website
  • Data Sketching for Real Time Analytics: Theory and Practice

    Speed, cost, and scale: these are three of the biggest challenges in analyzing big data. While modern data systems continue to push the boundaries of scale, the problems of speed and cost are fundamentally tied to the size of the data being scanned or processed. Processing thousands of queries that each access terabytes of data with sub-second latency remains infeasible. Data sketching techniques provide a means to drastically reduce this size, allowing for real-time or interactive data analysis at reduced cost, in exchange for approximate answers.

    This tutorial covers a number of useful data sketching and sampling methods and demonstrates their use with the Apache DataSketches project.
    We focus particularly on common analytic problems such as counting distinct items, quantiles, histograms, heavy hitters, and aggregations with large group-bys. For these, we cover algorithms, techniques, and theory that can aid both practitioners and theorists in constructing sketches and designing systems that achieve desired error guarantees. For practitioners and implementers, we show how some of these sketches can be easily instantiated using the Apache DataSketches project.
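To make the distinct-counting problem concrete, here is a minimal K-Minimum-Values (KMV) sketch in plain Python. It is an illustrative sketch of the principle only, not the Apache DataSketches implementation (which provides production-grade Theta, HLL, and quantile sketches); the stream and parameter choices below are made up for the example.

```python
import hashlib
import heapq

def kmv_estimate(stream, k=256):
    """Estimate the number of distinct items with a K-Minimum-Values sketch.

    Keep only the k smallest distinct hash values; if hashes are uniform
    in [0, 1), the k-th smallest is roughly k / n_distinct.
    """
    heap = []          # max-heap (values negated) of the k smallest hashes
    members = set()    # the same k hashes, for O(1) duplicate checks
    for item in stream:
        digest = hashlib.sha1(item.encode()).digest()
        h = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))   # fewer than k distinct items: exact count
    return (k - 1) / -heap[0]     # classic KMV estimator

# 200,000 events over 10,000 distinct users; the sketch stores only k values.
stream = (f"user-{i % 10_000}" for i in range(200_000))
estimate = kmv_estimate(stream)
```

With k = 256 the relative standard error is about 1/sqrt(k), so the estimate lands within a few percent of 10,000 while using constant memory.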

    Presenter(s):

    Daniel Ting (Tableau Software); Jonathan Malkin (Verizon); Lee Rhodes (Verizon)

    Timeslot: PM Website
  • Deep Learning for Anomaly Detection

    Anomaly detection has been widely studied and used in diverse applications. Building an effective anomaly detection system requires researchers and developers to learn complex structure from noisy data, identify dynamic anomaly patterns, and detect anomalies with limited labels. Recent advances in deep learning techniques have greatly improved anomaly detection performance in comparison with classical approaches, and have extended anomaly detection to a wide variety of applications. This tutorial will help the audience gain a comprehensive understanding of deep learning-based anomaly detection techniques in various application domains. First, we give an overview of the anomaly detection problem, introducing the approaches taken before the deep model era and the challenges they faced. Then we survey state-of-the-art deep learning models, ranging from building-block neural network structures such as MLPs, CNNs, and LSTMs, through more complex structures such as autoencoders and generative models (VAEs, GANs, flow-based models), to deep one-class detection models. In addition, we illustrate how techniques such as transfer learning and reinforcement learning can help mitigate the label-sparsity issue in anomaly detection, and how to collect and make the best use of user labels in practice. Next, we discuss real-world use cases from inside and outside LinkedIn. The tutorial concludes with a discussion of future trends.
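As a concrete reference point for the pre-deep-era baselines mentioned above, here is a minimal Mahalanobis-distance anomaly scorer in plain Python on toy 2-D data (the data and the injected outlier are made up for illustration). The deep models surveyed in the tutorial generalize the same principle: score each point by how far it deviates from learned normal behavior.

```python
import math
import random

def mahalanobis_scores(points):
    """Score each 2-D point by Mahalanobis distance to the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # 2x2 sample covariance.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    # Inverse covariance (closed form for 2x2).
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    scores = []
    for x, y in points:
        dx, dy = x - mx, y - my
        scores.append(math.sqrt(dx * dx * ixx + 2 * dx * dy * ixy + dy * dy * iyy))
    return scores

random.seed(0)
normal = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]
data = normal + [(8.0, -8.0)]     # one injected anomaly
scores = mahalanobis_scores(data)  # the injected point gets the top score
```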

    Presenter(s):

    Ruoying Wang (LinkedIn); Kexin Nie (LinkedIn); Yen-Jung Chang (LinkedIn); Xinwei Gong (LinkedIn); Tie Wang (LinkedIn Corporation); Yang Yang (LinkedIn Corporation); Bo Long (LinkedIn Corporation)

    Timeslot: PM Website
  • Deep Learning for Industrial AI: Challenges, New Methods and Best Practices

    Applying deep learning techniques to industrial applications poses a set of unique challenges, which include, but are not limited to: (1) limited data, highly skewed class distributions, and the occurrence of rare classes such as failures; (2) multi-modal data (sensors, events, images, text, etc.) indexed over space and time; (3) the need for explainable decisions; (4) the need for consistency between different but “related” models and between multiple generations of the same model; and (5) decision making to optimize business outcomes where the cost of a mistake can be very high. This tutorial presents an overview of these challenges, along with new methods and best practices to address them. Examples of these methods include using sequence DL models and Functional Neural Networks (FNNs) for modeling sensor and spatiotemporal measurements; using multi-task learning, graph models, and ensemble learning to improve the consistency of DL models; using deep RL for health-indicator learning and dynamic dispatching; cost-based decision making for prognostics; and using GANs to generate sensor data for prognostics. Finally, we will present some open problems in Industrial AI and how the research community can shape the future of the next industrial and societal revolution.

    Presenter(s):

    Chetan Gupta (Industrial AI Lab, Hitachi America, Ltd. R&D); Ahmed Farahat (Industrial AI Lab, Hitachi America, Ltd. R&D)

    Timeslot: PM Website
  • Embedding-Driven Multi-Dimensional Topic Mining and Text Analysis

    People nowadays are immersed in a wealth of text data, ranging from news articles to social media, academic publications, advertisements, and economic reports. A grand challenge of data mining is to develop effective, scalable, and weakly-supervised methods for extracting actionable structures and knowledge from massive text data. By not requiring extensive, corpus-specific human annotation, such methods can serve people’s diverse applications and needs for comprehending and making good use of large-scale corpora.

    In this tutorial, we will introduce recent advances in text embeddings and their applications to a wide range of text mining tasks that facilitate multi-dimensional analysis of massive text corpora. Specifically, we first overview a set of recently developed unsupervised and weakly-supervised text embedding methods including state-of-the-art context-free embeddings and pre-trained language models that serve as the fundamentals for downstream tasks. We then present several embedding-driven text mining techniques that are weakly-supervised, domain-independent, language-agnostic, effective and scalable for mining and discovering structured knowledge, in the form of multi-dimensional topics and multi-faceted taxonomies, from large-scale text corpora. We finally show that the topics and taxonomies so discovered will naturally form a multi-dimensional TextCube structure, which greatly enhances text exploration and analysis for various important applications, including text classification, retrieval and summarization. We will demonstrate on the most recent real-world datasets (including political news articles as well as scientific publications related to the coronavirus) how multi-dimensional analysis of massive text corpora can be conducted with the introduced embedding-driven text mining techniques.
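The core operation behind embedding-driven topic discovery can be illustrated with a toy example: ranking candidate terms by cosine similarity to a seed term in embedding space. The vectors below are invented for illustration, not taken from any trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 3-d embeddings (illustrative values only).
emb = {
    "virus":    [0.9, 0.1, 0.0],
    "vaccine":  [0.8, 0.2, 0.1],
    "election": [0.1, 0.9, 0.1],
    "senate":   [0.0, 0.8, 0.3],
}

def expand_topic(seed, k=1):
    """Rank the other terms by similarity to the seed's embedding."""
    ranked = sorted((w for w in emb if w != seed),
                    key=lambda w: cosine(emb[seed], emb[w]),
                    reverse=True)
    return ranked[:k]

related = expand_topic("virus")  # ranks "vaccine" closest to "virus"
```

Seed-guided similarity search of this kind, scaled to real corpora and contextualized embeddings, is what lets the methods above grow multi-dimensional topics from a handful of seed terms.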

    Presenter(s):

    Yu Meng (University of Illinois at Urbana-Champaign); Jiaxin Huang (University of Illinois Urbana-Champaign); Jiawei Han (UIUC)

    Timeslot: AM Website
  • Learning by Exploration: New Challenges in Real-World Environments

    Learning is a predominant theme for any intelligent system, human or machine. Moving beyond the classical paradigm of learning from past experience (e.g., supervised learning from given labels), a learner needs to actively collect exploratory feedback to learn from the unknown, i.e., to learn through exploration. This tutorial will introduce the learning-by-exploration paradigm, a key ingredient in many interactive online learning problems, including the multi-armed bandit problem and, more generally, reinforcement learning.

    In this tutorial, we will first motivate the need for exploration in machine learning algorithms and highlight its importance in many real-world problems where online sequential decision making is involved. In such real-world application scenarios, considerable challenges arise, including sample complexity, costly and even outdated feedback, and ethical considerations around exploration (such as fairness and privacy). We will introduce several classical exploration strategies, then highlight these three fundamental challenges in the learning-by-exploration paradigm and introduce recent research developments on addressing each of them.
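One of the classical exploration strategies for the multi-armed bandit setting is UCB1, which pulls the arm with the highest optimistic upper confidence bound. A minimal sketch in plain Python on simulated Bernoulli arms (the arm means and horizon are made up for the example):

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms; return per-arm pull counts."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k   # pulls per arm
    sums = [0.0] * k   # total reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            # Empirical mean plus exploration bonus: arms pulled rarely
            # get a large bonus and are tried again before being ruled out.
            arm = max(range(k),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
# The best arm (mean 0.8) accumulates the vast majority of the pulls,
# while suboptimal arms are sampled only logarithmically often.
```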

    Presenter(s):

    Qingyun Wu (University of Virginia); Huazheng Wang (University of Virginia); Hongning Wang (University of Virginia)

    Timeslot: AM Website
  • Image and Video Understanding for Recommendation and Spam Detection Systems

    This tutorial provides an overview of image and video understanding and their practical applications in industry. We focus on state-of-the-art deep learning techniques for image and video understanding, including tasks such as image classification and segmentation, image-based content retrieval, and video classification. We also cover applications of these technologies to large-scale recommendation and low-quality content detection systems, present concrete examples from various LinkedIn production systems, and discuss the associated practical challenges.

    Presenter(s):

    Aman Gupta (LinkedIn); Sirjan Kafle (LinkedIn); Di Wen (LinkedIn Corporation); Dylan Wang (LinkedIn); Sumit Srivastava (LinkedIn); Suhit Sinha (LinkedIn); Nikita Gupta (LinkedIn); Bharat Jain (LinkedIn); Ananth Sankar (LinkedIn); Liang Zhang (LinkedIn)

    Timeslot: AM Website
  • Data-Driven Never-Ending Learning Question Answering Systems

    This tutorial explores two research areas: Never-Ending Learning (NEL) and Question Answering (QA). NEL systems [2] are, at a very high level, computer systems that learn over time to become better at solving a task. Different NEL approaches have been proposed and applied in different tasks and domains, with results that are not yet generalizable to every domain but that encourage us to keep addressing the problem of how to build computer systems that take advantage of NEL principles. It is not always straightforward to apply NEL principles to ML models. In this tutorial we want to show (with hands-on examples and supporting theory, algorithms, and models) how to model a problem in a NEL fashion and help the KDD community become familiar with such approaches.

    The presence of many question answering systems in our daily lives (such as IBM Watson, Amazon Alexa, Apple Siri, MS Cortana, Google Home, etc.), and the recent release of new and bigger datasets focused on Open Domain Question Answering, have contributed to an increased interest in Question Answering and the systems that perform it. But, despite the advances of recent years, Open Domain Question Answering models cannot yet achieve results comparable to human performance. Thus, Open Domain QA is a good candidate to be modeled with a NEL approach. This tutorial aims at enabling attendees to: 1) better understand the current state of the art in NEL and QA; 2) learn how to model an ML problem using a NEL approach; 3) implement and deploy (with hands-on sessions and open code/data available on GitHub) a simple QA system capable of iteratively evolving and improving its own performance based on NEL principles; and 4) be prepared to follow the NEL-QA idea and propose new approaches to boost the performance of QA systems.

    Presenter(s):

    Estevam Hruschka (Megagon Labs)

    Timeslot: AM Website

How can we assist you?

We'll be updating the website as information becomes available. If you have a question that requires immediate attention, please feel free to contact us. Thank you!