KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

SESSION: Keynote & Invited Talks

AI for Intelligent Financial Services: Examples and Discussion

There are many opportunities to pursue AI and ML in the financial domain. In this talk, I will overview several research directions we are pursuing in engagement with the lines of business, ranging from data and knowledge, learning from experience, and reasoning and planning to multi-agent systems and secure and private AI. I will offer concrete examples of projects, and conclude with the many challenges and opportunities that AI can offer in the financial domain.

Keynote Speaker: Emery N. Brown

Emery Brown, M.D., Ph.D. is an American statistician, neuroscientist, and anesthesiologist. He is the Warren M. Zapol Professor of Anesthesia at Harvard Medical School and at Massachusetts General Hospital (MGH), and a practicing anesthesiologist at MGH. At MIT he is the Edward Hood Taplin Professor of Medical Engineering and professor of computational neuroscience, the Associate Director of the Institute for Medical Engineering and Science, and the Director of the Harvard-MIT Program in Health Sciences and Technology. Brown is one of only 19 individuals who have been elected to all three branches of the National Academies of Sciences, Engineering, and Medicine. He is also the first African American and first anesthesiologist to be elected to all three National Academies.

Keynote Speaker: Yolanda Gil

Dr. Yolanda Gil is Director of Knowledge Technologies and Associate Division Director at the Information Sciences Institute of the University of Southern California, and Research Professor in Computer Science and in Spatial Sciences. She is also Associate Director of Interdisciplinary Programs in Informatics. She received her M.S. and Ph.D. degrees in Computer Science from Carnegie Mellon University, with a focus on artificial intelligence. Her research is on intelligent interfaces for knowledge capture and discovery, which she investigates in a variety of projects concerning knowledge-based planning and problem solving, information analysis and assessment of trust, semantic annotation and metadata, and community-wide development of knowledge bases. Dr. Gil collaborates with scientists in different domains on semantic workflows and metadata capture, social knowledge collection, computer-mediated collaboration, and automated discovery. Dr. Gil has served on the Advisory Committee of the Computer Science and Engineering Directorate of the National Science Foundation. She initiated and chaired the W3C Provenance Group that led to a community standard in this area. Dr. Gil is a Fellow of the Association for Computing Machinery (ACM), and Past Chair of its Special Interest Group in Artificial Intelligence. She is also a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), and was elected as its 24th President in 2016.

Keynote Speaker: Alessandro Vespignani

Alessandro Vespignani's research activity is focused on the study of "techno-social" systems, where infrastructures composed of different technological layers are interoperating within the social component that drives their use and development. In this context we aim at understanding how the very same elements assembled in large numbers can give rise - according to the various forces and elements at play - to different macroscopic and dynamical behaviors, opening the path to quantitative computational approaches and forecasting power. The main research lines pursued at the moment are: developing analytical and computational models for the co-evolution and interdependence of large-scale social, technological and biological networks; modeling contagion processes in structured populations; developing predictive computational tools for the analysis of the spatial spread of emerging diseases; analyzing the dynamics and evolution of information and social networks; and modeling the adaptive behavior of social systems. Prof. Vespignani holds a joint appointment between the College of Science, the College of Computer and Information Science, and the Bouvé College of Health Sciences.

SESSION: Research Track Papers

Learning Effective Road Network Representation with Hierarchical Graph Neural Networks

The road network is the core component of urban transportation, and it is widely useful in various traffic-related systems and applications. Due to its important role, it is essential to develop general, effective, and robust road network representation models. Although several efforts have been made in this direction, they cannot fully capture the complex characteristics of road networks.

In this paper, we propose a novel Hierarchical Road Network Representation model, named HRNR, by constructing a three-level neural architecture, corresponding to "functional zones", "structural regions" and "road segments", respectively. To associate the three kinds of nodes, we introduce two matrices consisting of probability distributions for modeling segment-to-region assignment or region-to-zone assignment. Based on the two assignment matrices, we carefully devise two reconstruction tasks, either based on network structure or human moving patterns. In this way, our node representations are able to capture both structural and functional characteristics. Finally, we design a three-level hierarchical update mechanism for learning the node embeddings through the entire network. Extensive experimental results on three real-world datasets for four tasks have shown the effectiveness of the proposed model.

Interpretability is a Kind of Safety: An Interpreter-based Ensemble for Adversary Defense

While having achieved great success in rich real-life applications, deep neural network (DNN) models have long been criticized for their vulnerability to adversarial attacks. Tremendous research efforts have been dedicated to mitigating the threats of adversarial attacks, but the essential trait of adversarial examples is not yet clear, and most existing methods are still vulnerable to hybrid attacks and suffer from counterattacks. In light of this, we first reveal a gradient-based correlation between sensitivity analysis-based DNN interpreters and the generation process of adversarial examples, which indicates the Achilles' heel of adversarial attacks and sheds light on linking together the two long-standing challenges of DNNs: fragility and unexplainability. We then propose an interpreter-based ensemble framework called X-Ensemble for robust adversary defense. X-Ensemble adopts a novel detection-rectification process and builds multiple sub-detectors and a rectifier upon various types of interpretation information toward target classifiers. Moreover, X-Ensemble employs the Random Forests (RF) model to combine sub-detectors into an ensemble detector for defense against adversarial hybrid attacks. The non-differentiable property of RF further makes it a valuable choice against the counterattack of adversaries. Extensive experiments under various types of state-of-the-art attacks and diverse attack scenarios demonstrate the advantages of X-Ensemble over competitive baseline methods.

Higher-order Clustering in Complex Heterogeneous Networks

Heterogeneous networks are seemingly ubiquitous in the real world. Yet, most graph mining methods such as clustering have focused on homogeneous graphs, ignoring semantic information in real-world systems. Moreover, most methods are based on first-order connectivity patterns (edges) even though higher-order connectivity patterns are known to be important in understanding the structure and organization of such networks. In this work, we propose a framework for higher-order spectral clustering in heterogeneous networks through the notions of typed graphlets and typed-graphlet conductance. The proposed method builds clusters that preserve the connectivity of higher-order structures built up from typed graphlets. The approach generalizes previous work on higher-order spectral clustering. We theoretically prove a number of important results including a Cheeger-like inequality for typed-graphlet conductance that shows near-optimal bounds for the method. The theoretical results greatly simplify previous work while providing a unifying theoretical framework for analyzing higher-order spectral methods. Empirically, we demonstrate the effectiveness of the framework quantitatively for three important applications including clustering, compression, and link prediction.

Preserving Dynamic Attention for Long-Term Spatial-Temporal Prediction

Effective long-term predictions have been increasingly demanded in urban-wise data mining systems. Many practical applications, such as accident prevention and resource pre-allocation, require an extended period for preparation. However, challenges come as long-term prediction is highly error-sensitive, which becomes more critical when predicting urban-wise phenomena with complicated and dynamic spatial-temporal correlation. Specifically, since the amount of valuable correlation is limited, enormous irrelevant features introduce noises that trigger increased prediction errors. Besides, after each time step, the errors can traverse through the correlations and reach the spatial-temporal positions in every future prediction, leading to significant error propagation. To address these issues, we propose a Dynamic Switch-Attention Network (DSAN) with a novel Multi-Space Attention (MSA) mechanism that measures the correlations between inputs and outputs explicitly. To filter out irrelevant noises and alleviate the error propagation, DSAN dynamically extracts valuable information by applying self-attention over the noisy input and bridges each output directly to the purified inputs via implementing a switch-attention mechanism. Through extensive experiments on two spatial-temporal prediction tasks, we demonstrate the superior advantage of DSAN in both short-term and long-term predictions. The source code can be obtained from https://github.com/hxstarklin/DSAN.

Learning to Extract Attribute Value from Product via Question Answering: A Multi-task Approach

Attribute value extraction refers to the task of identifying values of an attribute of interest from product information. It is an important research topic which has been widely studied in e-Commerce and relation learning. There are two main limitations in existing attribute value extraction methods: scalability and generalizability. Most existing methods treat each attribute independently and build separate models for each of them, which is not suitable for large-scale attribute systems in real-world applications. Moreover, very limited research has focused on generalizing extraction to new attributes.

In this work, we propose a novel approach for Attribute Value Extraction via Question Answering (AVEQA) using a multi-task framework. In particular, we build a question answering model which treats each attribute as a question and identifies the answer span corresponding to the attribute value in the product context. A unique BERT contextual encoder is adopted and shared across all attributes to encode both the context and the question, which makes the model scalable. A distilled masked language model with knowledge distillation loss is introduced to improve the model generalization ability. In addition, we employ a no-answer classifier to explicitly handle the cases where there are no values for a given attribute in the product context. The question answering, distilled masked language model and no-answer classification are then combined into a unified multi-task framework. We conduct extensive experiments on a public dataset. The results demonstrate that the proposed approach outperforms several state-of-the-art methods by a large margin.

Kernel Assisted Learning for Personalized Dose Finding

An individualized dose rule recommends a dose level within a continuous safe dose range based on patient level information such as physical conditions, genetic factors and medication histories. Traditionally, the personalized dose finding process requires repeated clinical visits by the patient and frequent adjustments of the dosage. Thus the patient is constantly exposed to the risk of underdosing and overdosing during the process. Statistical methods for finding an optimal individualized dose rule can lower the costs and risks for patients. In this article, we propose a kernel assisted learning method for estimating the optimal individualized dose rule. The proposed methodology can also be applied to all other continuous decision-making problems. Advantages of the proposed method include robustness to model misspecification and the capability of providing statistical inference for the estimated parameters. In the simulation studies, we show that this method is capable of identifying the optimal individualized dose rule and produces favorable expected outcomes in the population. Finally, we illustrate our approach using data from a warfarin dosing study for thrombosis patients.

Graph Structure Learning for Robust Graph Neural Networks

Graph Neural Networks (GNNs) are powerful tools in representation learning for graphs. However, recent studies show that GNNs are vulnerable to carefully-crafted perturbations, called adversarial attacks. Adversarial attacks can easily fool GNNs in making predictions for downstream tasks. The vulnerability to adversarial attacks has raised increasing concerns for applying GNNs in safety-critical applications. Therefore, developing robust algorithms to defend against adversarial attacks is of great significance. A natural idea for defending against adversarial attacks is to clean the perturbed graph. It is evident that real-world graphs share some intrinsic properties. For example, many real-world graphs are low-rank and sparse, and the features of two adjacent nodes tend to be similar. In fact, we find that adversarial attacks are likely to violate these graph properties. Therefore, in this paper, we explore these properties to defend against adversarial attacks on graphs. In particular, we propose a general framework Pro-GNN, which can jointly learn a structural graph and a robust graph neural network model from the perturbed graph guided by these properties. Extensive experiments on real-world graphs demonstrate that the proposed framework achieves significantly better performance compared with the state-of-the-art defense methods, even when the graph is heavily perturbed. We release the implementation of Pro-GNN in our DeepRobust repository for adversarial attacks and defenses. The specific experimental settings to reproduce our results can be found at https://github.com/ChandlerBang/Pro-GNN.
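One of the graph properties named in this abstract, that adjacent nodes tend to have similar features, can be illustrated with a minimal sketch. The function name, the toy features, and the inserted edge below are assumptions made for illustration; this is not the Pro-GNN objective, only a simple smoothness score that grows when an edge links dissimilar nodes.

```python
def smoothness(edges, features):
    """Sum of squared feature distances over all undirected edges.

    A small value means adjacent nodes have similar features.
    """
    total = 0.0
    for i, j in edges:
        total += sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
    return total


# Toy graph (hypothetical): nodes 0 and 1 are alike, nodes 2 and 3 are alike.
features = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [0.1, 0.9],
}
clean_edges = [(0, 1), (2, 3)]

# Hypothetical adversarial perturbation: connect two dissimilar nodes.
perturbed_edges = clean_edges + [(0, 2)]

clean_score = smoothness(clean_edges, features)
perturbed_score = smoothness(perturbed_edges, features)
```

In this toy case the single inserted edge dominates the score, which is the kind of violation a structure-cleaning defense could penalize.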

An Efficient Neighborhood-based Interaction Model for Recommendation on Heterogeneous Graph

There has been an influx of heterogeneous information network (HIN) based recommender systems in recent years, since HINs are capable of characterizing complex graphs and contain rich semantics. Although the existing approaches are practical and have achieved performance improvements, they still face the following problems. On one hand, most existing HIN-based methods rely on explicit path reachability to leverage path-based semantic relatedness between users and items, e.g., metapath-based similarities. These methods are hard to use and integrate since path connections are sparse or noisy, and are often of different lengths. On the other hand, other graph-based methods aim to learn effective heterogeneous network representations by compressing a node together with its neighborhood information into a single embedding before prediction. This weakly coupled manner of modeling overlooks the rich interactions among nodes, which introduces an early summarization issue. In this paper, we propose an end-to-end Neighborhood-based Interaction Model for Recommendation (NIRec) to address the above problems. Specifically, we first analyze the significance of learning interactions in HINs and then propose a novel formulation to capture the interactive patterns between each pair of nodes through their metapath-guided neighborhoods. Then, to explore complex interactions between metapaths and deal with the learning complexity on large-scale networks, we formulate interaction in a convolutional way and learn efficiently with the fast Fourier transform. Extensive experiments on four different types of heterogeneous graphs demonstrate the performance gains of NIRec compared with state-of-the-art methods. To the best of our knowledge, this is the first work providing an efficient neighborhood-based interaction model for HIN-based recommendation.

Directional Multivariate Ranking

User-provided multi-aspect evaluations manifest users' detailed feedback on the recommended items and enable fine-grained understanding of their preferences. Extensive studies have shown that modeling such data greatly improves the effectiveness and explainability of the recommendations. However, as ranking is essential in recommendation, there is no principled solution yet for collectively generating multiple item rankings over different aspects.

In this work, we propose a directional multi-aspect ranking criterion to enable a holistic ranking of items with respect to multiple aspects. Specifically, we view multi-aspect evaluation as an integral effort from a user that forms a vector of his/her preferences over aspects. Our key insight is that the direction of the difference vector between two multi-aspect preference vectors reveals the pairwise order of comparison. Hence, it is necessary for a multi-aspect ranking criterion to preserve the observed directions from such pairwise comparisons. We further derive a complete solution for the multi-aspect ranking problem based on a probabilistic multivariate tensor factorization model. Comprehensive experimental analysis on a large TripAdvisor multi-aspect rating dataset and a Yelp review text dataset confirms the effectiveness of our solution.

Truth Discovery against Strategic Sybil Attack in Crowdsourcing

Crowdsourcing is an information system for recruiting online workers to perform human intelligence tasks (HITs) that are hard for computers. Due to the openness of crowdsourcing, dynamic online workers with different knowledge backgrounds might give conflicting labels to a task. With the assumption that workers provide their labels independently, most existing works aggregate worker labels in a voting manner, which is vulnerable to the Sybil attack, in which the attacker earns easy rewards by coordinating several Sybil workers to share a randomized label on each task in order to dominate the aggregation result. A strategic Sybil attacker also attempts to evade Sybil detection. In this paper, we propose a novel approach, called TDSSA (Truth Discovery against Strategic Sybil Attack), to defend against the strategic Sybil attack. Experimental results on real-world and synthetic datasets indicate that TDSSA ensures more accurate inference of true labels under various Sybil attacking scenarios, as compared to state-of-the-art methods.
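The weakness of vote-based aggregation described in this abstract can be shown with a minimal sketch. The labels and worker counts are hypothetical, and the code shows only plain majority voting, not TDSSA itself.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among the workers' answers for one task."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical answers from three independent workers on one task.
honest_labels = ["cat", "cat", "dog"]

# Hypothetical attacker: three coordinated Sybil workers share the same label.
sybil_labels = ["dog"] * 3

result_without_attack = majority_vote(honest_labels)
result_with_attack = majority_vote(honest_labels + sybil_labels)
```

Because voting assumes independent workers, the three coordinated accounts are enough to flip the aggregated label from "cat" to "dog".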

Partial Multi-Label Learning via Probabilistic Graph Matching Mechanism

Partial Multi-Label learning (PML) learns from ambiguous data where each instance is associated with a candidate label set, of which only a part is correct. The key to solving such a problem is to disambiguate the candidate label sets and identify the correct assignments between instances and their ground-truth labels. In this paper, we interpret such assignments as instance-to-label matchings, and formulate the task of PML as a matching selection problem. To model this problem, we propose a novel grapH mAtching based partial muLti-label lEarning (HALE) framework, in which a graph matching scheme is incorporated owing to its good performance in exploiting the instance and label relationship. Meanwhile, since the conventional one-to-one graph matching algorithm does not satisfy the constraint of the PML problem that multiple instances may correspond to multiple labels, we extend the traditional probabilistic graph matching algorithm from a one-to-one constraint to a many-to-many constraint, so that the proposed framework accommodates the PML problem. Moreover, to improve the performance of the predictive model, both minimum error reconstruction and a k-nearest-neighbor weight voting scheme are employed to assign more accurate labels to unseen instances. Extensive experiments on various data sets demonstrate the superiority of our proposed method.

Spectrum-Guided Adversarial Disparity Learning

It has been a significant challenge to portray intraclass disparity precisely in the area of activity recognition, as it requires a robust representation of the correlation between subject-specific variation for each activity class. In this work, we propose a novel end-to-end knowledge directed adversarial learning framework, which portrays the class-conditioned intraclass disparity using two competitive encoding distributions and learns the purified latent codes by denoising learned disparity. Furthermore, the domain knowledge is incorporated in an unsupervised manner to guide the optimization and further boost the performance. The experiments on four HAR benchmark datasets demonstrate the robustness and generalization of our proposed methods over a set of state-of-the-art methods. We further prove the effectiveness of automatic domain knowledge incorporation in performance enhancement.

Attention and Memory-Augmented Networks for Dual-View Sequential Learning

In recent years, sequential learning has been of great interest due to the advance of deep learning with applications in time-series forecasting, natural language processing, and speech recognition. Recurrent neural networks (RNNs) have achieved superior performance in single-view and synchronous multi-view sequential learning compared to traditional machine learning models. However, the method remains less explored in asynchronous multi-view sequential learning, and the unaligned nature of multiple sequences poses a great challenge to learning the inter-view interactions. We develop an AMANet (Attention and Memory-Augmented Networks) architecture by integrating both attention and memory to solve the asynchronous multi-view learning problem in general, and we focus on experiments in dual-view sequences in this paper. Self-attention and inter-attention are employed to capture intra-view interaction and inter-view interaction, respectively. History attention memory is designed to store the historical information of a specific object, which serves as local knowledge storage. Dynamic external memory is used to store global knowledge for each view. We evaluate our model in three tasks: medication recommendation from a patient's medical records, diagnosis-related group (DRG) classification from a hospital record, and invoice fraud detection through a company's taxation behaviors. The results demonstrate that our model outperforms all baselines and other state-of-the-art models in all tasks. Moreover, the ablation study of our model indicates that the inter-attention mechanism plays a key role in the model and it can boost the predictive power by effectively capturing the inter-view interactions from asynchronous views.

Semantic Search in Millions of Equations

Given the increase of publications, search for relevant papers becomes tedious. In particular, search across disciplines or schools of thinking is not supported. This is mainly due to the retrieval with keyword queries: technical terms differ in different sciences or at different times. Relevant articles might better be identified by their mathematical problem descriptions. Just looking at the equations in a paper already gives a hint as to whether the paper is relevant. Hence, we propose a new approach for retrieval of mathematical expressions based on machine learning. We design an unsupervised representation learning task that combines embedding learning with self-supervised learning. Using graph convolutional neural networks we embed mathematical expressions into low-dimensional vector spaces that allow efficient nearest neighbor queries. To train our models, we collect a huge dataset with over 29 million mathematical expressions from over 900,000 publications published on arXiv.org. The math is converted into an XML format, which we view as graph data. Our empirical evaluations involving a new dataset of manually annotated search queries show the benefits of using embedding models for mathematical retrieval.

SSumM: Sparse Summarization of Massive Graphs

Given a graph G and the desired size k in bits, how can we summarize G within k bits, while minimizing the information loss?

Large-scale graphs have become omnipresent, posing considerable computational challenges. Analyzing such large graphs can be fast and easy if they are compressed sufficiently to fit in main memory or even cache. Graph summarization, which yields a coarse-grained summary graph with merged nodes, stands out with several advantages among graph compression techniques. Thus, a number of algorithms have been developed for obtaining a concise summary graph with little information loss or equivalently small reconstruction error. However, the existing methods focus solely on reducing the number of nodes, and they often yield dense summary graphs, failing to achieve better compression rates. Moreover, due to their limited scalability, they can be applied only to moderate-size graphs.

In this work, we propose SSumM, a scalable and effective graph-summarization algorithm that yields a sparse summary graph. SSumM not only merges nodes together but also sparsifies the summary graph, and the two strategies are carefully balanced based on the minimum description length principle. Compared with state-of-the-art competitors, SSumM is (a) Concise: yields up to 11.2X smaller summary graphs with similar reconstruction error, (b) Accurate: achieves up to 4.2X smaller reconstruction error with similarly concise outputs, and (c) Scalable: summarizes 26X larger graphs while exhibiting linear scalability. We validate these advantages through extensive experiments on 10 real-world graphs.

Rethinking Pruning for Accelerating Deep Inference At the Edge

There is a growing trend to deploy deep neural networks at the edge for high-accuracy, real-time data mining and user interaction. Applications such as speech recognition and language understanding often apply a deep neural network to encode an input sequence and then use a decoder to generate the output sequence. A promising technique to accelerate these applications on resource-constrained devices is network pruning, which compresses the size of the deep neural network without severe drop in inference accuracy. However, we observe that although existing network pruning algorithms prove effective to speed up the prior deep neural network, they lead to dramatic slowdown of the subsequent decoding and may not always reduce the overall latency of the entire application. To rectify such drawbacks, we propose entropy-based pruning, a new regularizer that can be seamlessly integrated into existing network pruning algorithms. Our key theoretical insight is that reducing the information entropy of the deep neural network outputs decreases the upper bound of the subsequent decoding search space. We validate our solution with two state-of-the-art network pruning algorithms on two model architectures. Experimental results show that compared with existing network pruning algorithms, our entropy-based pruning method notably suppresses and even eliminates the increase of decoding time, and achieves shorter overall latency with only negligible extra accuracy loss in the applications.
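The stated link between output entropy and decoding cost can be made concrete with a small sketch. It computes Shannon entropy for two hypothetical output distributions and counts how many tokens survive a simple threshold-based pruning step. The threshold rule, the function names, and the probability values are assumptions for illustration; they are not the paper's decoder or its regularizer.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def candidates_above(probs, threshold):
    """Count tokens a hypothetical threshold-pruned decoder would keep."""
    return sum(1 for p in probs if p >= threshold)


# Hypothetical per-step output distributions over four tokens.
uniform_output = [0.25, 0.25, 0.25, 0.25]   # high entropy
peaked_output = [0.97, 0.01, 0.01, 0.01]    # low entropy

uniform_entropy = entropy_bits(uniform_output)
peaked_entropy = entropy_bits(peaked_output)

uniform_candidates = candidates_above(uniform_output, threshold=0.05)
peaked_candidates = candidates_above(peaked_output, threshold=0.05)
```

Under these assumptions the low-entropy output leaves one candidate per step instead of four, which is the intuition behind a smaller decoding search space.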

Compositional Embeddings Using Complementary Partitions for Memory-Efficient Recommendation Systems

Modern deep learning-based recommendation systems exploit hundreds to thousands of different categorical features, each with millions of different categories ranging from clicks to posts. To respect the natural diversity within the categorical data, embeddings map each category to a unique dense representation within an embedded space. Since each categorical feature could take on as many as tens of millions of different possible categories, the embedding tables form the primary memory bottleneck during both training and inference. We propose a novel approach for reducing the embedding size in an end-to-end fashion by exploiting complementary partitions of the category set to produce a unique embedding vector for each category without explicit definition. By storing multiple smaller embedding tables based on each complementary partition and combining embeddings from each table, we define a unique embedding for each category at smaller cost. This approach may be interpreted as using a specific fixed codebook to ensure uniqueness of each category's representation. Our experimental results demonstrate the effectiveness of our approach over the hashing trick for reducing the size of the embedding tables in terms of model loss and accuracy, while retaining a similar reduction in the number of parameters.
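A minimal sketch of the idea of combining small tables follows. It assumes one particular pair of complementary partitions, the quotient and the remainder of the category index with respect to a fixed integer m, and it assumes element-wise multiplication as the combining operation. The class name, table sizes, and random initialization are hypothetical and chosen only for illustration.

```python
import random


def make_table(rows, dim, rng):
    """Create a rows x dim table of random values standing in for trained weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


class QuotientRemainderEmbedding:
    """Two small tables indexed by (category // m) and (category % m)."""

    def __init__(self, num_categories, m, dim, seed=0):
        rng = random.Random(seed)
        self.m = m
        num_quotients = (num_categories + m - 1) // m  # ceiling division
        self.quotient_table = make_table(num_quotients, dim, rng)
        self.remainder_table = make_table(m, dim, rng)

    def indices(self, category):
        # Each category maps to a distinct (quotient, remainder) pair.
        return category // self.m, category % self.m

    def lookup(self, category):
        q, r = self.indices(category)
        # Assumed combining operation: element-wise product of the two rows.
        return [a * b for a, b in zip(self.quotient_table[q], self.remainder_table[r])]

    def stored_rows(self):
        return len(self.quotient_table) + len(self.remainder_table)


embedding = QuotientRemainderEmbedding(num_categories=1000, m=32, dim=4)
vector = embedding.lookup(999)
index_pairs = {embedding.indices(c) for c in range(1000)}
```

With these hypothetical sizes, 1000 categories are covered by 64 stored rows rather than 1000, while every category still receives its own pair of table indices.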

Structural Patterns and Generative Models of Real-world Hypergraphs

Graphs have been utilized as a powerful tool to model pairwise relationships between people or objects. Such structure is a special type of a broader concept referred to as hypergraph, in which each hyperedge may consist of an arbitrary number of nodes, rather than just two. A large number of real-world datasets are of this form - for example, lists of recipients of emails sent from an organization, users participating in a discussion thread or subject labels tagged in an online question. However, due to complex representations and lack of adequate tools, little attention has been paid to exploring the underlying patterns in these interactions.

In this work, we empirically study a number of real-world hypergraph datasets across various domains. In order to enable thorough investigations, we introduce the multi-level decomposition method, which represents each hypergraph by a set of pairwise graphs. Each pairwise graph, which we refer to as a k-level decomposed graph, captures the interactions between pairs of subsets of k nodes. We empirically find that at each decomposition level, the investigated hypergraphs obey five structural properties. These properties serve as criteria for evaluating how realistic a hypergraph is, and establish a foundation for the hypergraph generation problem. We also propose a hypergraph generator that is remarkably simple but capable of fulfilling these evaluation metrics, which are hardly achieved by other baseline generator models.
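The multi-level decomposition can be sketched as follows, under one assumed reading of the abstract: the nodes of the k-level decomposed graph are subsets of k original nodes, and two such subsets are linked when a single hyperedge contains both. The function name and the toy hypergraph are hypothetical.

```python
from itertools import combinations


def decompose(hyperedges, k):
    """Return the edge set of an assumed k-level decomposed graph.

    Each node is a frozenset of k original nodes; each edge is a frozenset
    holding the two k-subsets it connects.
    """
    edges = set()
    for hyperedge in hyperedges:
        subsets = [frozenset(s) for s in combinations(sorted(hyperedge), k)]
        for a, b in combinations(subsets, 2):
            edges.add(frozenset((a, b)))
    return edges


# Toy hypergraph (hypothetical): one hyperedge of size 3 and one of size 2.
hyperedges = [{"a", "b", "c"}, {"c", "d"}]

level1_edges = decompose(hyperedges, 1)  # ordinary pairwise graph
level2_edges = decompose(hyperedges, 2)  # graph over pairs of nodes
```

At k = 1 this reduces to linking every pair of nodes that share a hyperedge, so standard pairwise graph measurements can then be applied at each level.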

Efficient Algorithm for the b-Matching Graph

The b-matching graph is a useful approach to computing a graph from high-dimensional data. Unlike the k-NN graph that greedily connects each data point to its k nearest neighbors and typically has more than k edges, each data point in the b-matching graph uniformly has b edges; the idea is to reduce edges between cross-clusters that have different semantics. In addition, edge weights are obtained from regression results of each data point and restricted to be non-negative to improve the robustness to data noise. The b-matching graph can more effectively model high-dimensional data than the traditional k-NN graph. However, the construction cost of the b-matching graph is impractical for large-scale data sets. This is because, to determine edges in the graph, it needs to iteratively update messages between all pairs of data points until convergence, and it computes non-negative edge weights of each data point by applying a solver intended for quadratic programming problems. Our proposal, b-dash, can efficiently construct a b-matching graph because of its two key techniques: (1) it prunes unnecessary update messages in determining edges and (2) it incrementally computes edge weights by exploiting the Sherman-Morrison formula. Experiments show that our approach is up to 58.6 times faster than the previous approaches while guaranteeing result optimality.
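The Sherman-Morrison formula mentioned here updates a known inverse after a rank-one change: (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u). The sketch below shows only this generic identity on a hypothetical 2 x 2 matrix; how b-dash applies it to edge weights is not described in the abstract and is not reproduced here.

```python
def mat_vec(matrix, vector):
    """Matrix times column vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def vec_mat(vector, matrix):
    """Row vector times matrix."""
    cols = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector))) for j in range(cols)]


def sherman_morrison(a_inv, u, v):
    """Return the inverse of (A + u v^T), given a_inv = A^{-1}."""
    a_inv_u = mat_vec(a_inv, u)      # A^{-1} u
    v_a_inv = vec_mat(v, a_inv)      # v^T A^{-1}
    denom = 1.0 + sum(vi * wi for vi, wi in zip(v, a_inv_u))
    if abs(denom) < 1e-12:
        raise ValueError("the rank-one update makes the matrix singular")
    n = len(a_inv)
    return [
        [a_inv[i][j] - a_inv_u[i] * v_a_inv[j] / denom for j in range(n)]
        for i in range(n)
    ]


# Hypothetical example: A = diag(2, 4), so A^{-1} = diag(0.5, 0.25).
a_inv = [[0.5, 0.0], [0.0, 0.25]]
u = [1.0, 0.0]
v = [0.0, 1.0]

# A + u v^T = [[2, 1], [0, 4]]; its inverse is obtained without re-inverting.
updated_inv = sherman_morrison(a_inv, u, v)
```

The update costs a few matrix-vector products instead of a fresh inversion, which is why the identity suits incremental computations.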

Isolation Distributional Kernel: A New Tool for Kernel based Anomaly Detection

We introduce Isolation Distributional Kernel as a new way to measure the similarity between two distributions. Existing approaches based on kernel mean embedding, which converts a point kernel to a distributional kernel, have two key issues: the point kernel employed has a feature map with intractable dimensionality; and it is data independent. This paper shows that Isolation Distributional Kernel (IDK), which is based on a data dependent point kernel, addresses both key issues. We demonstrate IDK's efficacy and efficiency as a new tool for kernel based anomaly detection. Without explicit learning, using IDK alone outperforms the existing kernel based anomaly detector OCSVM and other kernel mean embedding methods that rely on the Gaussian kernel. We reveal for the first time that an effective kernel based anomaly detector based on kernel mean embedding must employ a characteristic kernel which is data dependent.
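
As background, the conventional kernel mean embedding that the abstract contrasts with IDK can be sketched as follows: the similarity of two samples is the average of a point kernel over all cross pairs. The sketch uses the data independent Gaussian kernel, not the Isolation Kernel; the function names and sample points are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Data independent point kernel: depends only on the distance between x and y."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def mean_embedding_similarity(sample_p, sample_q, kernel=gaussian_kernel):
    """Average the point kernel over every pair (x in P, y in Q)."""
    total = sum(kernel(x, y) for x in sample_p for y in sample_q)
    return total / (len(sample_p) * len(sample_q))


p = [(0.0, 0.0), (0.1, 0.0)]
q_near = [(0.0, 0.1), (0.1, 0.1)]
q_far = [(5.0, 5.0), (5.1, 5.0)]
near = mean_embedding_similarity(p, q_near)
far = mean_embedding_similarity(p, q_far)
```

Swapping in a different point kernel only requires passing another callable as `kernel`, which is the sense in which a point kernel is converted to a distributional kernel.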

NodeAug: Semi-Supervised Node Classification with Data Augmentation

By using Data Augmentation (DA), we present a new method to enhance Graph Convolutional Networks (GCNs), which are the state-of-the-art models for semi-supervised node classification. DA for graph data remains under-explored. Due to the connections built by edges, DA operations on different nodes influence each other and lead to undesired results, such as uncontrollable DA magnitudes and changes of ground-truth labels. To address this issue, we present the NodeAug (Node-Parallel Augmentation) scheme, which creates a 'parallel universe' for each node to conduct DA, blocking the undesired effects from other nodes. NodeAug regularizes the model prediction of every node (including unlabeled nodes) to be invariant with respect to changes induced by DA, so as to improve the effectiveness. To augment the input features from different aspects, we propose three DA strategies that modify both node attributes and the graph structure. In addition, we introduce subgraph mini-batch training for the efficient implementation of NodeAug. The approach takes the subgraph corresponding to the receptive fields of a batch of nodes as the input per iteration, rather than the whole graph that the prior full-batch training takes. Empirically, NodeAug yields significant gains for strong GCN models on the Cora, Citeseer, Pubmed, and two co-authorship networks, with a more efficient training process thanks to the proposed subgraph mini-batch training approach.

An Embarrassingly Simple Approach for Trojan Attack in Deep Neural Networks

With the widespread use of deep neural networks (DNNs) in high-stakes applications, the security problem of the DNN models has received extensive attention. In this paper, we investigate a specific security problem called trojan attack, which aims to attack deployed DNN systems relying on the hidden trigger patterns inserted by malicious hackers. We propose a training-free attack approach which is different from previous work, in which trojaned behaviors are injected by retraining the model on a poisoned dataset. Specifically, we do not change parameters in the original model but insert a tiny trojan module (TrojanNet) into the target model. The infected model with a malicious trojan can misclassify inputs into a target label when the inputs are stamped with the special trigger. The proposed TrojanNet has several nice properties including (1) it is activated by tiny trigger patterns and keeps silent for other signals, (2) it is model-agnostic and could be injected into most DNNs, dramatically expanding its attack scenarios, and (3) the training-free mechanism saves massive training efforts compared to conventional trojan attack methods. The experimental results show that TrojanNet can inject the trojan into all labels simultaneously (all-label trojan attack) and achieves 100% attack success rate without affecting model accuracy on original tasks. Experimental analysis further demonstrates that state-of-the-art trojan detection algorithms fail to detect TrojanNet attack. The code is available at https://github.com/trx14/TrojanNet.

Kronecker Attention Networks

Attention operators have been applied on both 1-D data like texts and higher-order data such as images and videos. Use of attention operators on high-order data requires flattening of the spatial or spatial-temporal dimensions into a vector, which is assumed to follow a multivariate normal distribution. This not only incurs excessive requirements on computational resources, but also fails to preserve structures in data. In this work, we propose to avoid flattening by assuming the data follow matrix-variate normal distributions. Based on this new view, we develop Kronecker attention operators (KAOs) that operate on high-order tensor data directly. More importantly, the proposed KAOs lead to dramatic reductions in computational resources. Experimental results show that our methods reduce the amount of required computational resources by a factor of hundreds, with larger factors for higher-dimensional and higher-order data. Results also show that networks with KAOs outperform models without attention, while achieving performance competitive with those using the original attention operators.

GRACE: Generating Concise and Informative Contrastive Sample to Explain Neural Network Model's Prediction

Despite the recent development in the topic of explainable AI/ML for image and text data, the majority of current solutions are not suitable to explain the prediction of neural network models when the datasets are tabular and their features are in high-dimensional vectorized formats. To mitigate this limitation, we borrow two notable ideas (i.e., "explanation by intervention" from causality and "explanation are contrastive" from philosophy) and propose a novel solution, named GRACE, that better explains neural network models' predictions for tabular datasets. In particular, given a model's prediction as label X, GRACE intervenes and generates a minimally-modified contrastive sample to be classified as Y, with an intuitive textual explanation, answering the question of "Why X rather than Y?" We carry out comprehensive experiments using eleven public datasets of different scales and domains (e.g., # of features ranges from 5 to 216) and compare GRACE with competing baselines on different measures: fidelity, conciseness, info-gain, and influence. The user studies show that our generated explanation is not only more intuitive and easy to understand but also helps end-users make as much as 60% more accurate post-explanation decisions than Lime does.

Hierarchical Attention Propagation for Healthcare Representation Learning

Medical ontologies are widely used to represent and organize medical terminologies. Examples include ICD-9, ICD-10, and UMLS. The ontologies are often constructed in hierarchical structures, encoding the multi-level subclass relationships among different medical concepts, allowing very fine distinctions between concepts. Medical ontologies provide a great source for incorporating domain knowledge into a healthcare prediction system, which might alleviate the data insufficiency problem and improve predictive performance with rare categories. To incorporate such domain knowledge, Gram, a recent graph attention model, represents a medical concept as a weighted sum of its ancestors' embeddings in the ontology using an attention mechanism. Although showing improved performance, Gram only considers the unordered ancestors of a concept, which does not fully leverage the hierarchy and thus has limited expressibility. In this paper, we propose Hierarchical Attention Propagation (HAP), a novel medical ontology embedding model that hierarchically propagates attention across the entire ontology structure, where a medical concept adaptively learns its embedding from all other concepts in the hierarchy instead of only its ancestors. We prove that HAP learns more expressive medical concept embeddings -- from any medical concept embedding we are able to fully recover the entire ontology structure. Experimental results on two sequential procedure/diagnosis prediction tasks demonstrate HAP's better embedding quality than Gram and other baselines. Furthermore, we find that it is not always best to use the full ontology. Sometimes using only lower levels of the hierarchy outperforms using all levels.

SCE: Scalable Network Embedding from Sparsest Cut

Large-scale network embedding aims to learn a latent representation for each node in an unsupervised manner, which captures inherent properties and structural information of the underlying graph. In this field, many popular approaches are influenced by the skip-gram model from natural language processing. Most of them use a contrastive objective to train an encoder which forces the embeddings of similar pairs to be close and embeddings of negative samples to be far apart. A key to the success of such contrastive learning methods is how to draw positive and negative samples. While negative samples that are generated by straightforward random sampling are often satisfactory, methods for drawing positive examples remain a hot topic.

In this paper, we propose SCE for unsupervised network embedding only using negative samples for training. Our method is based on a new contrastive objective inspired by the well-known sparsest cut problem. To solve the underlying optimization problem, we introduce a Laplacian smoothing trick, which uses graph convolutional operators as low-pass filters for smoothing node representations. The resulting model consists of a GCN-type structure as the encoder and a simple loss function. Notably, our model does not use positive samples but only negative samples for training, which not only makes the implementation and tuning much easier, but also reduces the training time significantly.

Finally, extensive experimental studies on real world data sets are conducted. The results clearly demonstrate the advantages of our new model in both accuracy and scalability compared to strong baselines such as GraphSAGE, G2G and DGI.

Local Community Detection in Multiple Networks

Local community detection aims to find a set of densely-connected nodes containing given query nodes. Most existing local community detection methods are designed for a single network. However, a single network can be noisy and incomplete. Multiple networks are more informative in real-world applications. There are multiple types of nodes and multiple types of node proximities. Complementary information from different networks helps to improve detection accuracy. In this paper, we propose a novel RWM (Random Walk in Multiple networks) model to find relevant local communities in all networks for a given query node set from one network. RWM sends a random walker in each network to obtain the local proximity w.r.t. the query nodes (i.e., node visiting probabilities).

Walkers with similar visiting probabilities reinforce each other. They restrict the probability propagation around the query nodes to identify relevant subgraphs in each network and disregard irrelevant parts. We provide rigorous theoretical foundations for RWM and develop two speeding-up strategies with performance guarantees. Comprehensive experiments are conducted on synthetic and real-world datasets to evaluate the effectiveness and efficiency of RWM.
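
The node visiting probabilities mentioned above can be illustrated on a single network with a plain random walk with restart. This is only the single-network building block, not the RWM model, whose cross-network reinforcement is not specified in the abstract. The function name, the restart probability, and the toy graph are assumptions, and every node is assumed to have at least one neighbor.

```python
def random_walk_with_restart(adj, query, restart=0.3, iters=100):
    """Approximate visiting probabilities of a walker that restarts at the query nodes.

    adj maps each node to a list of neighbours (every node needs at least one).
    """
    nodes = list(adj)
    # The walker restarts uniformly over the query nodes.
    start = {n: (1.0 / len(query) if n in query else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(iters):
        # With probability `restart`, jump back to the query set.
        nxt = {n: restart * start[n] for n in nodes}
        # Otherwise move to a uniformly chosen neighbour.
        for n in nodes:
            share = (1.0 - restart) * prob[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        prob = nxt
    return prob


# Two triangles (a, b, c) and (d, e, f) joined by the bridge c-d.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
scores = random_walk_with_restart(graph, query={"a"})
```

Nodes in the query's own triangle receive noticeably more probability mass than nodes across the bridge, which is the locality that a local community detector exploits.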

A Block Decomposition Algorithm for Sparse Optimization

Sparse optimization is a central problem in machine learning and computer vision. However, this problem is inherently NP-hard and thus difficult to solve in general. Combinatorial search methods find the global optimal solution but are confined to small-sized problems, while coordinate descent methods are efficient but often suffer from poor local minima. This paper considers a new block decomposition algorithm that combines the effectiveness of combinatorial search methods and the efficiency of coordinate descent methods. Specifically, we consider a random strategy and/or a greedy strategy to select a subset of coordinates as the working set, and then perform a global combinatorial search over the working set based on the original objective function. We show that our method finds stronger stationary points than Amir Beck et al.'s coordinate-wise optimization method. In addition, we establish the convergence rate of our algorithm. Our experiments on solving sparse regularized and sparsity constrained least squares optimization problems demonstrate that our method achieves state-of-the-art performance in terms of accuracy. For example, our method generally outperforms the well-known greedy pursuit method.

Adversarial Infidelity Learning for Model Interpretation

Model interpretation is essential in data mining and knowledge discovery. It can help understand the intrinsic model working mechanism and check if the model has undesired characteristics. A popular way of performing model interpretation is Instance-wise Feature Selection (IFS), which provides an importance score of each feature representing the data samples to explain how the model generates the specific output. In this paper, we propose a Model-agnostic Effective Efficient Direct (MEED) IFS framework for model interpretation, mitigating concerns about sanity, combinatorial shortcuts, model identifiability, and information transmission. Also, we focus on the following setting: using selected features to directly predict the output of the given model, which serves as a primary evaluation metric for model-interpretation methods. Apart from the features, we involve the output of the given model as an additional input to learn an explainer based on more accurate information. To learn the explainer, besides fidelity, we propose an Adversarial Infidelity Learning (AIL) mechanism to boost the explanation learning by screening relatively unimportant features. Through theoretical and experimental analysis, we show that our AIL mechanism can help learn the desired conditional distribution between selected features and targets. Moreover, we extend our framework by integrating efficient interpretation methods as proper priors to provide a warm start. Comprehensive empirical evaluation results are provided by quantitative metrics and human evaluation to demonstrate the effectiveness and superiority of our proposed method. Our code is publicly available online at https://github.com/langlrsw/MEED.

Grounding Visual Concepts for Zero-Shot Event Detection and Event Captioning

The flourishing of social media platforms requires techniques for understanding the content of media on a large scale. However, state-of-the-art video event understanding approaches remain very limited in terms of their ability to deal with data sparsity, semantically unrepresentative event names, and lack of coherence between visual and textual concepts. Accordingly, in this paper, we propose a method of grounding visual concepts for large-scale Multimedia Event Detection (MED) and Multimedia Event Captioning (MEC) in a zero-shot setting. More specifically, our framework comprises the following: (1) deriving novel semantic representations of events from their textual descriptions, rather than event names; (2) aggregating the ranks of grounded concepts for MED tasks, where a statistical mean-shift outlier rejection model is proposed to remove the outlying concepts which are incorrectly grounded; and (3) defining MEC tasks and augmenting the MEC training set by the videos detected in MED in a zero-shot setting. To the best of our knowledge, this work is the first to define and solve the MEC task, which is a further step towards understanding video events. We conduct extensive experiments and achieve state-of-the-art performance on the TRECVID MEDTest dataset, as well as our newly proposed TRECVID-MEC dataset.

How to Count Triangles, without Seeing the Whole Graph

Triangle counting is a fundamental problem in the analysis of large graphs. There is a rich body of work on this problem, in varying streaming and distributed models, yet all these algorithms require reading the whole input graph. In many scenarios, we do not have access to the whole graph, and can only sample a small portion of the graph (typically through crawling). In such a setting, how can we accurately estimate the triangle count of the graph?

We formally study triangle counting in the random walk access model introduced by Dasgupta et al. (WWW '14) and Chierichetti et al. (WWW '16). We have access to an arbitrary seed vertex of the graph, and can only perform random walks. This model is restrictive in access and captures the challenges of collecting real-world graphs. Even sampling a uniform random vertex is a hard task in this model.

Despite these challenges, we design a provable and practical algorithm, TETRIS, for triangle counting in this model. TETRIS is the first provably sublinear algorithm (for most natural parameter settings) that approximates the triangle count in the random walk model, for graphs with low mixing time. Our result builds on recent advances in the theory of sublinear algorithms. The final sample built by TETRIS is a careful mix of random walks and degree-biased sampling of neighborhoods. Empirically, TETRIS accurately counts triangles on a variety of large graphs, getting estimates within 5% relative error by looking at 3% of the number of edges.
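
For reference, the quantity being estimated can be computed exactly when the whole edge list is available, which is precisely what the random walk access model rules out. The sketch below is such a full-access baseline, not TETRIS; the function name and the example edges are hypothetical.

```python
def count_triangles(edges):
    """Exact triangle count of an undirected simple graph given as an edge list."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                # Common neighbours of u and v close a triangle on this edge.
                total += len(adj[u] & adj[v])
    # Every triangle was counted once for each of its three edges.
    return total // 3


triangle_with_tail = [(1, 2), (2, 3), (1, 3), (3, 4)]
complete_graph_4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

An estimate's relative error, as quoted in the abstract, is measured against a ground-truth count of this kind.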

Incremental Lossless Graph Summarization

Given a fully dynamic graph, represented as a stream of edge insertions and deletions, how can we obtain and incrementally update a lossless summary of its current snapshot? As large-scale graphs are prevalent, concisely representing them is inevitable for efficient storage and analysis. Lossless graph summarization is an effective graph-compression technique with many desirable properties. It aims to compactly represent the input graph as (a) a summary graph consisting of supernodes (i.e., sets of nodes) and superedges (i.e., edges between supernodes), which provide a rough description, and (b) edge corrections which fix errors induced by the rough description. While a number of batch algorithms, suited for static graphs, have been developed for rapid and compact graph summarization, they are highly inefficient in terms of time and space for dynamic graphs, which are common in practice.

In this work, we propose MoSSo, the first incremental algorithm for lossless summarization of fully dynamic graphs. In response to each change in the input graph, MoSSo updates the output representation by repeatedly moving nodes among supernodes. MoSSo decides nodes to be moved and their destinations carefully but rapidly based on several novel ideas. Through extensive experiments on 10 real graphs, we show MoSSo is (a) Fast and 'any time': processing each change in near-constant time (less than 0.1 millisecond), up to 7 orders of magnitude faster than running state-of-the-art batch methods, (b) Scalable: summarizing graphs with hundreds of millions of edges, requiring sub-linear memory during the process, and (c) Effective: achieving comparable compression ratios even to state-of-the-art batch methods.
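
The output representation described above (a summary graph plus edge corrections) is lossless because the original edges can be restored from it. The sketch below shows one way such a restoration could work; it is not MoSSo itself. The split of corrections into additions and deletions and the use of a self-loop superedge to mean "all pairs inside a supernode" are assumptions made for illustration.

```python
from itertools import combinations


def restore(supernodes, superedges, additions, deletions):
    """Rebuild the edge set from a summary graph plus edge corrections.

    supernodes: dict mapping a supernode name to its set of nodes.
    superedges: iterable of (name, name) pairs; (a, a) stands for all pairs inside a.
    """
    edges = set()
    for a, b in superedges:
        if a == b:
            pairs = combinations(sorted(supernodes[a]), 2)
        else:
            pairs = ((u, v) for u in supernodes[a] for v in supernodes[b])
        # A superedge is a rough description: it implies every member pair.
        for u, v in pairs:
            edges.add(frozenset((u, v)))
    # Corrections fix the errors introduced by the rough description.
    edges |= {frozenset(e) for e in additions}
    edges -= {frozenset(e) for e in deletions}
    return edges


original = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4)]}
restored = restore(
    supernodes={"A": {1, 2, 3}, "B": {4}},
    superedges=[("A", "A"), ("A", "B")],
    additions=[],
    deletions=[(3, 4)],
)
```

Here five edges are described by two superedges and one correction; the compression gain grows when many nodes share similar neighborhoods.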

From Online to Non-i.i.d. Batch Learning

This paper initiates the study of online-to-batch conversion when the samples in batch learning are not i.i.d. Our motivation originates from two facts. First, sample sets in reality are seldom i.i.d., thus preventing the application of the existing conversions. Second, the online model of learning permits an adversarial stream of samples that almost surely violates the i.i.d. assumption, raising the possibility of adapting an online algorithm effectively to learn from a non-i.i.d. sample set. We present a set of techniques to utilize an online algorithm as a black box to perform batch learning in the absence of the i.i.d. assumption. Our techniques are generic, and are applicable to virtually any online algorithm for classification. This provides strong evidence that the great variety of known algorithms in the online-learning literature can indeed be harnessed to learn from sufficiently representative non-i.i.d. samples.

Towards Deeper Graph Neural Networks

Graph neural networks have shown significant success in the field of graph representation learning. Graph convolutions perform neighborhood aggregation and represent one of the most important graph operations. Nevertheless, one layer of these neighborhood aggregation methods only considers immediate neighbors, and the performance decreases when going deeper to enable larger receptive fields. Several recent studies attribute this performance deterioration to the over-smoothing issue, which states that repeated propagation makes node representations of different classes indistinguishable. In this work, we study this observation systematically and develop new insights towards deeper graph neural networks. First, we provide a systematic analysis of this issue and argue that the key factor compromising the performance significantly is the entanglement of representation transformation and propagation in current graph convolution operations. After decoupling these two operations, deeper graph neural networks can be used to learn graph node representations from larger receptive fields. We further provide a theoretical analysis of the above observation when building very deep models, which can serve as a rigorous and gentle description of the over-smoothing issue. Based on our theoretical and empirical analysis, we propose Deep Adaptive Graph Neural Network (DAGNN) to adaptively incorporate information from large receptive fields. A set of experiments on citation, co-authorship, and co-purchase datasets has confirmed our analysis and insights and demonstrated the superiority of our proposed methods.
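
The over-smoothing effect of repeated propagation can be seen in a toy setting with no learned transformation at all. The sketch below is not DAGNN; it assumes mean aggregation over a node and its neighbors, scalar node features, and a hypothetical four-node path graph, purely to show representations collapsing as the number of hops grows.

```python
def propagate(adj, features, hops):
    """Average each node's feature with its neighbours' features, `hops` times."""
    current = dict(features)
    for _ in range(hops):
        current = {
            n: (current[n] + sum(current[m] for m in adj[n])) / (1 + len(adj[n]))
            for n in adj
        }
    return current


def spread(values):
    """Gap between the largest and smallest node representation."""
    return max(values.values()) - min(values.values())


# Path graph 0 - 1 - 2 - 3 with two "classes" of nodes (feature 0.0 vs 1.0).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = {0: 0.0, 1: 0.0, 2: 1.0, 3: 1.0}
shallow = propagate(path, features, hops=1)
deep = propagate(path, features, hops=50)
```

After one hop the two ends of the path are still clearly separated, while after fifty hops all four values are nearly identical, so a classifier applied afterwards can no longer tell the classes apart.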

Laplacian Change Point Detection for Dynamic Graphs

Dynamic and temporal graphs are rich data structures that are used to model complex relationships between entities over time. In particular, anomaly detection in temporal graphs is crucial for many real world applications such as intrusion identification in network systems, detection of ecosystem disturbances and detection of epidemic outbreaks. In this paper, we focus on change point detection in dynamic graphs and address two main challenges associated with this problem: I) how to compare graph snapshots across time, II) how to capture temporal dependencies. To solve the above challenges, we propose Laplacian Anomaly Detection (LAD) which uses the spectrum of the Laplacian matrix of the graph structure at each snapshot to obtain low dimensional embeddings. LAD explicitly models short term and long term dependencies by applying two sliding windows. In synthetic experiments, LAD outperforms the state-of-the-art method. We also evaluate our method on three real dynamic networks: UCI message network, US senate co-sponsorship network and Canadian bill voting network. In all three datasets, we demonstrate that our method can more effectively identify anomalous time points according to significant real world events.

Learning Transferrable Parameters for Long-tailed Sequential User Behavior Modeling

Sequential user behavior modeling plays a crucial role in online user-oriented services, such as product purchasing, news feed consumption, and online advertising. The performance of sequential modeling heavily depends on the scale and quality of historical behaviors. However, the number of user behaviors inherently follows a long-tailed distribution, which has been seldom explored. In this work, we argue that focusing on tail users could bring more benefits, and we address the long-tail issue by learning transferrable parameters from both optimization and feature perspectives. Specifically, we propose a gradient alignment optimizer and adopt an adversarial training scheme to facilitate knowledge transfer from the head to the tail. Such methods can also deal with the cold-start problem of new users. Moreover, they can be directly adapted to various well-established sequential models. Extensive experiments on four real-world datasets verify the superiority of our framework compared with the state-of-the-art baselines.

TranSlider: Transfer Ensemble Learning from Exploitation to Exploration

In transfer learning, what and where to transfer has been widely studied. Nevertheless, the learned transfer strategies are at high risk of over-fitting, especially when only a few annotated instances are available in the target domain. In this paper, we introduce the concept of transfer ensemble learning, a new direction to tackle the over-fitting of transfer strategies. Intuitively, models with different transfer strategies offer various perspectives on what and where to transfer. Therefore a core problem is to search these diversely transferred models for ensemble so as to achieve better generalization. Towards this end, we propose the Transferability Slider (TranSlider) for transfer ensemble learning. By decreasing the transferability, we obtain a spectrum of base models ranging from pure exploitation of the source model to unconstrained exploration for the target domain. Furthermore, the manner of decreasing transferability with parameter sharing guarantees fast optimization at no additional training cost. Finally, we conduct extensive experiments with various analyses, which demonstrate that TranSlider achieves the state-of-the-art on comprehensive benchmark datasets.

InFoRM: Individual Fairness on Graph Mining

The study of algorithmic bias and fairness in the context of graph mining remains nascent. The sparse literature on fair graph mining has almost exclusively focused on group-based fairness notions. However, the notion of individual fairness, which promises fairness at a much finer granularity, has not been well studied. This paper presents the first principled study of Individual Fairness on gRaph Mining (InFoRM). First, we present a generic definition of individual fairness for graph mining which naturally leads to a quantitative measure of the potential bias in graph mining results. Second, we propose three mutually complementary algorithmic frameworks to mitigate the proposed individual bias measure, namely debiasing the input graph, debiasing the mining model and debiasing the mining results. Each algorithmic framework is formulated from the optimization perspective, using effective and efficient solvers, which are applicable to multiple graph mining tasks. Third, accommodating individual fairness is likely to change the original graph mining results without the fairness consideration. We conduct a thorough analysis to develop an upper bound to characterize the cost (i.e., the difference between the graph mining results with and without the fairness consideration). We perform extensive experimental evaluations on real-world datasets to demonstrate the efficacy and generality of the proposed methods.

Local Motif Clustering on Time-Evolving Graphs

Graph motifs are subgraph patterns that occur in complex networks, which are of key importance for gaining deep insights into the structure and functionality of the graph. Motif clustering aims at finding clusters consisting of dense motif patterns. It is commonly used in various application domains, ranging from social networks to collaboration networks, from market-basket analysis to neuroscience applications. More recently, local clustering techniques have been proposed for motif-aware clustering, which focuses on a small neighborhood of the input seed node instead of the entire graph. However, most of these techniques are designed for static graphs and may render sub-optimal results when applied to large time-evolving graphs. To bridge this gap, in this paper, we propose a novel framework, Local Motif Clustering on Time-Evolving Graphs (L-MEGA), which provides the evolution pattern of the local motif cluster in an effective and efficient way. The core of L-MEGA is approximately tracking the temporal evolution of the local motif cluster via novel techniques such as edge filtering, motif push operation, and incremental sweep cut. Furthermore, we theoretically analyze the efficiency and effectiveness of these techniques on time-evolving graphs. Finally, we evaluate the L-MEGA framework via extensive experiments on both synthetic and real-world temporal networks.

A Data-Driven Graph Generative Model for Temporal Interaction Networks

Deep graph generative models have recently received a surge of attention due to their superiority in modeling realistic graphs in a variety of domains, including biology, chemistry, and social science. Despite the initial success, most, if not all, of the existing works are designed for static networks. Nonetheless, many realistic networks are intrinsically dynamic and presented as a collection of system logs (i.e., timestamped interactions/edges between entities), which poses a new research direction for us: how can we synthesize realistic dynamic networks by directly learning from the system logs? In addition, how can we ensure the generated graphs preserve both the structural and temporal characteristics of the real data?

To address these challenges, we propose an end-to-end deep generative framework named TagGen. In particular, we start with a novel sampling strategy for jointly extracting structural and temporal context information from temporal networks. On top of that, TagGen parameterizes a bi-level self-attention mechanism together with a family of local operations to generate temporal random walks. Finally, a discriminator gradually selects generated temporal random walks that are plausible in the input data and feeds them to an assembling module for generating temporal networks. The experimental results on seven real-world data sets across a variety of metrics demonstrate that (1) TagGen outperforms all baselines in the temporal interaction network generation problem, and (2) TagGen significantly boosts the performance of the prediction models in the tasks of anomaly detection and link prediction.

Recurrent Networks for Guided Multi-Attention Classification

Attention-based image classification has gained increasing popularity in recent years. State-of-the-art methods for attention-based classification typically require a large training set and operate under the assumption that the label of an image depends solely on a single object (i.e. region of interest) in the image. However, in many real-world applications (e.g. medical imaging), it is very expensive to collect a large training set. Moreover, the label of each image is usually determined jointly by multiple regions of interest (ROIs). Fortunately, for such applications, it is often possible to collect the locations of the ROIs in each training image. In this paper, we study the problem of guided multi-attention classification, the goal of which is to achieve high accuracy under the dual constraints of (1) small sample size, and (2) multiple ROIs for each image. We propose a model, called Guided Attention Recurrent Network (GARN), for multi-attention classification. Different from existing attention-based methods, GARN utilizes guidance information regarding multiple ROIs thus allowing it to work well even when sample size is small. Empirical studies on three different visual tasks show that our guided attention approach can effectively boost model performance for multi-attention image classification.

Vulnerability vs. Reliability: Disentangled Adversarial Examples for Cross-Modal Learning

The vulnerability of deep neural networks has attracted a great upsurge of research attention, in which well-designed examples with small added perturbations are used to fool a well-performing network. Meanwhile, progress has been made in leveraging adversarial examples to boost the robustness of deep cross-modal networks. However, for cross-modal learning, both the causes of adversarial examples and their latent advantages in learning cross-modal correlations are under-explored. In this paper, we propose novel Disentangled Adversarial examples for Cross-Modal learning, dubbed DACM. Specifically, we first divide cross-modal data into two aspects, namely a modality-related component and a modality-unrelated counterpart, and then learn to improve the reliability of the network using the modality-related component. To achieve this goal, we apply the generation of adversarial perturbations to strengthen cross-modal correlations, wherein the modality-related component is acquired through gradually detaching the modality-unrelated component. Finally, the proposed DACM is employed to create modality-related examples towards the application of cross-modal hashing retrieval. Extensive experiments carried out on two cross-modal benchmarks show that the adversarial examples learned by DACM are efficient at fooling a target deep cross-modal hashing network. On the other hand, training this target model by merely leveraging our created modality-related examples in turn significantly promotes the robustness of this model itself.

XGNN: Towards Model-Level Explanations of Graph Neural Networks

Graph neural networks (GNNs) learn node features by aggregating and combining neighbor information, and have achieved promising performance on many graph tasks. However, GNNs are mostly treated as black boxes and lack human-intelligible explanations. Thus, they cannot be fully trusted and used in certain application domains if GNN models cannot be explained. In this work, we propose a novel approach, known as XGNN, to interpret GNNs at the model level. Our approach can provide high-level insights and generic understanding of how GNNs work. In particular, we propose to explain GNNs by training a graph generator so that the generated graph patterns maximize a certain prediction of the model. We formulate the graph generation as a reinforcement learning task, where for each step, the graph generator predicts how to add an edge into the current graph. The graph generator is trained via a policy gradient method based on information from the trained GNNs. In addition, we incorporate several graph rules to encourage the generated graphs to be valid. Experimental results on both synthetic and real-world datasets show that our proposed methods help understand and verify the trained GNNs. Furthermore, our experimental results indicate that the generated graphs can provide guidance on how to improve the trained GNNs.

CAST: A Correlation-based Adaptive Spectral Clustering Algorithm on Multi-scale Data

We study the problem of applying spectral clustering to cluster multi-scale data, which is data whose clusters are of various sizes and densities. Traditional spectral clustering techniques discover clusters by processing a similarity matrix that reflects the proximity of objects. For multi-scale data, distance-based similarity is not effective because objects of a sparse cluster could be far apart while those of a dense cluster have to be sufficiently close. Following [16], we solve the problem of spectral clustering on multi-scale data by integrating the concept of objects' "reachability similarity" with a given distance-based similarity to derive an objects' coefficient matrix. We propose the algorithm CAST that applies trace Lasso to regularize the coefficient matrix. We prove that the resulting coefficient matrix has the "grouping effect" and that it exhibits "sparsity". We show that these two characteristics imply very effective spectral clustering. We evaluate CAST and 10 other clustering methods on a wide range of datasets w.r.t. various measures. Experimental results show that CAST provides excellent performance and is highly robust across test cases of multi-scale data.

INPREM: An Interpretable and Trustworthy Predictive Model for Healthcare

Building a predictive model based on historical Electronic Health Records (EHRs) for personalized healthcare has become an active research area. Benefiting from the powerful ability of feature extraction, deep learning (DL) approaches have achieved promising performance in many clinical prediction tasks. However, due to the lack of interpretability and trustworthiness, it is difficult to apply DL in real clinical cases of decision making. To address this, in this paper, we propose an interpretable and trustworthy predictive model (INPREM) for healthcare. Firstly, INPREM is designed as a linear model for interpretability while encoding non-linear relationships into the learning weights for modeling the dependencies between and within each visit. This enables us to obtain the contribution matrix of the input variables, which serves as the evidence of the prediction result(s) and helps physicians understand why the model gives such a prediction, thereby making the model more interpretable. Secondly, for trustworthiness, we place a random gate (which follows a Bernoulli distribution to turn on or off) over each weight of the model, as well as an additional branch to estimate data noises. With the help of Monte Carlo sampling and an objective function accounting for data noises, the model can capture the uncertainty of each prediction. The captured uncertainty, in turn, allows physicians to know how confident the model is, thus making the model more trustworthy. We empirically demonstrate that the proposed INPREM outperforms existing approaches by a significant margin. A case study is also presented to show how the contribution matrix and the captured uncertainty are used to assist physicians in making robust decisions.
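
The gating idea in the abstract above can be illustrated with a small sketch. It is not the INPREM model: it assumes a plain linear predictor, one independent Bernoulli gate per weight, and no noise-estimation branch. The function name and the parameters `keep_prob`, `n_samples`, and `seed` are hypothetical and chosen only for illustration.

```python
import random
import statistics

def mc_gate_predict(weights, x, keep_prob=0.8, n_samples=20000, seed=0):
    """Monte Carlo estimate of a gated linear prediction and its spread.

    Assumption: each weight is kept with probability keep_prob and
    dropped otherwise, independently for every sample.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_samples):
        # One Bernoulli gate is sampled per weight in each forward pass.
        y = sum(w * xi for w, xi in zip(weights, x) if rng.random() < keep_prob)
        draws.append(y)
    # The mean is the prediction; the standard deviation is the uncertainty.
    return statistics.fmean(draws), statistics.pstdev(draws)

mean, std = mc_gate_predict([1.0, 2.0, -1.0], [1.0, 1.0, 1.0])
```

For this toy input the sample mean should approach 0.8 * (1 + 2 - 1) = 1.6 and the sample variance 0.8 * 0.2 * (1 + 4 + 1) = 0.96, so a wide spread relative to the mean signals a low-confidence prediction.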

Policy-GNN: Aggregation Optimization for Graph Neural Networks

Graph data are pervasive in many real-world applications. Recently, increasing attention has been paid to graph neural networks (GNNs), which aim to model the local graph structures and capture the hierarchical patterns by aggregating the information from neighbors with stackable network modules. Motivated by the observation that different nodes often require different iterations of aggregation to fully capture the structural information, in this paper, we propose to explicitly sample diverse iterations of aggregation for different nodes to boost the performance of GNNs. It is a challenging task to develop an effective aggregation strategy for each node, given complex graphs and sparse features. Moreover, it is not straightforward to derive an efficient algorithm since we need to feed the sampled nodes into different numbers of network layers. To address the above challenges, we propose Policy-GNN, a meta-policy framework that models the sampling procedure and message passing of GNNs into a combined learning process. Specifically, Policy-GNN uses a meta-policy to adaptively determine the number of aggregations for each node. The meta-policy is trained with deep reinforcement learning (RL) by exploiting the feedback from the model. We further introduce parameter sharing and a buffer mechanism to boost the training efficiency. Experimental results on three real-world benchmark datasets suggest that Policy-GNN significantly outperforms the state-of-the-art alternatives, showing the promise of aggregation optimization for GNNs.

Malicious Attacks against Deep Reinforcement Learning Interpretations

The past years have witnessed the rapid development of deep reinforcement learning (DRL), which is a combination of deep learning and reinforcement learning (RL). However, the adoption of deep neural networks makes the decision-making process of DRL opaque. Motivated by this, various interpretation methods for DRL have been proposed. However, those interpretation methods make an implicit assumption that they are performed in a reliable and secure environment. In practice, sequential agent-environment interactions expose the DRL algorithms and their corresponding downstream interpretations to extra adversarial risk. In spite of the prevalence of malicious attacks, there is no existing work studying the possibility and feasibility of malicious attacks against DRL interpretations. To bridge this gap, in this paper, we investigate the vulnerability of DRL interpretation methods. Specifically, we introduce the first study of adversarial attacks against DRL interpretations, and propose an optimization framework based on which the optimal adversarial attack strategy can be derived. In addition, we study the vulnerability of DRL interpretation methods to model poisoning attacks, and present an algorithmic framework to rigorously formulate the proposed model poisoning attack. Finally, we conduct both theoretical analysis and extensive experiments to validate the effectiveness of the proposed malicious attacks against DRL interpretations.

Disentangled Self-Supervision in Sequential Recommenders

To learn a sequential recommender, the existing methods typically adopt the sequence-to-item (seq2item) training strategy, which supervises a sequence model with a user's next behavior as the label and the user's past behaviors as the input. The seq2item strategy, however, is myopic and usually produces non-diverse recommendation lists. In this paper, we study the problem of mining extra signals for supervision by looking at the longer-term future. There exist two challenges: i) reconstructing a future sequence containing many behaviors is exponentially harder than reconstructing a single next behavior, which can lead to difficulty in convergence, and ii) the sequence of all future behaviors can involve many intentions, not all of which may be predictable from the sequence of earlier behaviors. To address these challenges, we propose a sequence-to-sequence (seq2seq) training strategy based on latent self-supervision and disentanglement. Specifically, we perform self-supervision in the latent space, i.e., reconstructing the representation of the future sequence as a whole, instead of reconstructing the items in the future sequence individually. We also disentangle the intentions behind any given sequence of behaviors and construct seq2seq training samples using only pairs of sub-sequences that involve a shared intention. Results on real-world benchmarks and synthetic data demonstrate the improvement brought by seq2seq training.

DETERRENT: Knowledge Guided Graph Attention Network for Detecting Healthcare Misinformation

To provide accurate and explainable misinformation detection, it is often useful to take an auxiliary source (e.g., social context and knowledge base) into consideration. Existing methods use social contexts such as users' engagements as complementary information to improve detection performance and derive explanations. However, due to the lack of sufficient professional knowledge, users seldom respond to healthcare information, which makes these methods less applicable. In this work, to address these shortcomings, we propose a novel knowledge guided graph attention network for better detection of health misinformation. Our proposal, named DETERRENT, leverages the additional information from a medical knowledge graph by propagating information along the network: it incorporates a Medical Knowledge Graph and an Article-Entity Bipartite Graph, and propagates the node embeddings through Knowledge Paths. In addition, an attention mechanism is applied to calculate the importance of entities to each article, and the knowledge guided article embeddings are used for misinformation detection. DETERRENT addresses the limitation on social contexts in the healthcare domain and is capable of providing useful explanations for the results of detection. Empirical validation using two real-world datasets demonstrated the effectiveness of DETERRENT. Compared with the best results of eight competing methods, in terms of F1 Score, DETERRENT outperforms all methods by at least 4.78% on the diabetes dataset and 12.79% on the cancer dataset. We release the source code of DETERRENT at: https://github.com/cuilimeng/DETERRENT.

MultiImport: Inferring Node Importance in a Knowledge Graph from Multiple Input Signals

Given multiple input signals, how can we infer node importance in a knowledge graph (KG)? Node importance estimation is a crucial and challenging task that can benefit a lot of applications including recommendation, search, and query disambiguation. A key challenge towards this goal is how to effectively use input from different sources. On the one hand, a KG is a rich source of information, with multiple types of nodes and edges. On the other hand, there are external input signals, such as the number of votes or pageviews, which can directly tell us about the importance of entities in a KG. While several methods have been developed to tackle this problem, their use of these external signals has been limited as they are not designed to consider multiple signals simultaneously. In this paper, we develop an end-to-end model MultiImport, which infers latent node importance from multiple, potentially overlapping, input signals. MultiImport is a latent variable model that captures the relation between node importance and input signals, and effectively learns from multiple signals with potential conflicts. Also, MultiImport provides an effective estimator based on attentive graph neural networks. We ran experiments on real-world KGs to show that MultiImport handles several challenges involved with inferring node importance from multiple input signals, and consistently outperforms existing methods, achieving up to 23.7% higher NDCG@100 than the state-of-the-art method.

Geodesic Forests

Together with the curse of dimensionality, nonlinear dependencies in large data sets persist as major challenges in data mining tasks. A reliable way to accurately preserve nonlinear structure is to compute geodesic distances between data points. Manifold learning methods, such as Isomap, aim to preserve geodesic distances in a Riemannian manifold. However, as manifold learning algorithms operate on the ambient dimensionality of the data, the essential step of geodesic distance computation is sensitive to high-dimensional noise. Therefore, a direct application of these algorithms to high-dimensional, noisy data often yields unsatisfactory results and does not accurately capture nonlinear structure.

We propose an unsupervised random forest approach called geodesic forests (GF) to geodesic distance estimation in linear and nonlinear manifolds with noise. GF operates on low-dimensional sparse linear combinations of features, rather than the full observed dimensionality. To choose the optimal split in a computationally efficient fashion, we developed Fast-BIC, a fast Bayesian Information Criterion statistic for Gaussian mixture models.

We additionally propose geodesic precision and geodesic recall as novel evaluation metrics that quantify how well the geodesic distances of a latent manifold are preserved. Empirical results on simulated and real data demonstrate that GF is robust to high-dimensional noise, whereas other methods, such as Isomap, UMAP, and FLANN, quickly deteriorate in such settings. Notably, GF is able to estimate geodesic distances better than other approaches on a real connectome dataset.

Z-Miner: An Efficient Method for Mining Frequent Arrangements of Event Intervals

Mining frequent patterns of event intervals from a large collection of interval sequences is a problem that appears in several application domains. In this paper, we propose Z-Miner, a novel algorithm for solving this problem that addresses the deficiencies of existing competitors by employing two novel data structures: Z-Table, a hierarchical hash-based data structure for time-efficient candidate generation and support count, and Z-Arrangement, a data structure for efficient memory consumption. The proposed algorithm is able to handle patterns with repetitions of the same event label, allowing for gap and error tolerance constraints, as well as keeping track of the exact occurrences of the extracted frequent patterns. Our experimental evaluation on eight real-world and six synthetic datasets demonstrates the superiority of Z-Miner against four state-of-the-art competitors in terms of runtime efficiency and memory footprint.

Imputing Various Incomplete Attributes via Distance Likelihood Maximization

Missing values may appear in various attributes. By "various", we mean (1) different types of values in a tuple, such as numerical or categorical, and (2) different attributes in a tuple, either the dependent or determinant attributes of regression models or dependency rules. Such variety unfortunately hinders imputation. In this paper, we propose to study distance models that predict distances between tuples for missing data imputation. The immediate benefits are twofold: (1) the distances on all the attributes with various types of values are processed uniformly and utilized collaboratively, and (2) rather than enumerating the combinations of imputation candidates on various attributes, we can directly calculate the most likely distances of missing values to other complete ones and thus infer the corresponding imputations. Our major technical highlights include (1) introducing an imputation that is statistically explainable by the likelihood on distances, (2) proving the NP-hardness of finding the maximum likelihood imputation, and (3) devising an approximation algorithm with performance guarantees. Experiments over datasets with real missing values demonstrate the superiority of the proposed method compared to 11 existing approaches in 5 categories. Our proposal improves not only the imputation accuracy but also downstream applications such as classification, clustering and record matching.

WeightGrad: Geo-Distributed Data Analysis Using Quantization for Faster Convergence and Better Accuracy

High network communication cost for synchronizing weights and gradients in geo-distributed data analysis consumes the benefits of advances in computation and optimization techniques. Many quantization methods for weights, gradients, or both have been proposed in recent years, where weight-quantized models suffer from error related to the weight dimension and gradient-quantized methods suffer from a slow convergence rate by a factor related to the gradient quantization resolution and gradient dimension. All these methods have proved to be infeasible for distributed training across multiple data centers all over the world. Moreover, recent studies show that communicating over WANs can significantly degrade DNN model performance by up to 53.7x because of unstable and limited WAN bandwidth. Our goal in this work is to design a geo-distributed deep learning system that (1) ensures efficient and faster communication over LAN and WAN and (2) maintains accuracy and convergence for complex DNNs with billions of parameters. In this paper, we introduce WeightGrad, which acknowledges the limitations of quantization and provides loss-aware weight-quantized networks with quantized gradients for local convergence; for global convergence, it dynamically eliminates insignificant communication between data centers while still guaranteeing the correctness of DNN models. Our experiments on our prototypes of WeightGrad running across 3 Amazon EC2 global regions and on a cluster that emulates EC2 WAN bandwidth show that WeightGrad provides a 1.06% gain in top-1 accuracy, a 5.36x speedup over the baseline, and 1.4x-2.26x over the four state-of-the-art distributed ML systems.

Feature-Induced Manifold Disambiguation for Multi-View Partial Multi-label Learning

In the conventional multi-label learning framework, each example is assumed to be represented by a single feature vector and associated with multiple valid labels simultaneously. Nonetheless, real-world objects usually exhibit complicated properties which can have multi-view feature representation as well as false positive labeling. Accordingly, the problem of multi-view partial multi-label learning (MVPML) is studied in this paper, where each example is assumed to be represented by multiple feature vectors while associated with multiple candidate labels which are only partially valid. To learn from MVPML examples, a novel approach named FIMAN is proposed which makes use of multi-view feature representation to tackle the noisy labeling information. Firstly, an aggregate manifold structure over training examples is generated by adaptively fusing affinity information conveyed by feature vectors of different views. Then, candidate labels of each training example are disambiguated by preserving the feature-induced manifold structure in label space. Finally, the resulting predictive models are learned by fitting modeling outputs with the disambiguated labels. Extensive experiments on a number of real-world data sets show that FIMAN achieves highly competitive performance against state-of-the-art approaches in solving the MVPML problem.

MinSearch: An Efficient Algorithm for Similarity Search under Edit Distance

We study a fundamental problem in data analytics: similarity search under edit distance (or, edit similarity search for short). In this problem we try to build an index on a set of n strings S = {s1, ..., sn}, with the goal of answering the following two types of queries: (1) the threshold query: given a query string t and a threshold K, output all si ∈ S such that the edit distance between si and t is at most K; (2) the top-k query: given a query string t, output the k strings in S that are closest to t in terms of edit distance. Edit similarity search has numerous applications in bioinformatics, databases, data mining, information retrieval, etc., and has been studied extensively in the literature. In this paper we propose a novel algorithm for edit similarity search named MinSearch. The algorithm is randomized, and we can show mathematically that it outputs the correct answer with high probability for both types of queries. We have conducted an extensive set of experiments on MinSearch, and compared it with the best existing algorithms for edit similarity search. Our experiments show that MinSearch has a clear advantage (often by orders of magnitude) over the best previous algorithms in query time, and MinSearch is always one of the best among all competitors in indexing time and space usage. Finally, MinSearch achieves perfect accuracy for both types of queries on all datasets that we have tested.
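
The two query types defined above can be made concrete with a brute-force baseline. This is not MinSearch: it builds no index and simply scans every string, which is the cost an index is meant to avoid. The function names are hypothetical.

```python
from typing import List

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def threshold_query(strings: List[str], t: str, K: int) -> List[str]:
    """All strings whose edit distance to t is at most K."""
    return [s for s in strings if edit_distance(s, t) <= K]

def top_k_query(strings: List[str], t: str, k: int) -> List[str]:
    """The k strings closest to t; ties keep their input order."""
    return sorted(strings, key=lambda s: edit_distance(s, t))[:k]

S = ["kitten", "sitten", "mitten", "fitting"]
near = threshold_query(S, "sitting", 2)
closest = top_k_query(S, "sitting", 2)
```

Each query here costs one full dynamic program per stored string, so the running time grows linearly with n and quadratically with string length.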

Mining Large Quasi-cliques with Quality Guarantees from Vertex Neighborhoods

Mining dense subgraphs is an important primitive across a spectrum of graph-mining tasks. In this work, we formally establish that two recurring characteristics of real-world graphs, namely heavy-tailed degree distributions and large clustering coefficients, imply the existence of substantially large vertex neighborhoods with high edge-density. This observation suggests a very simple approach for extracting large quasi-cliques: simply scan the vertex neighborhoods, compute the clustering coefficient of each vertex, and output the best such subgraph. The implementation of such a method requires counting the triangles in a graph, which is a well-studied problem in graph mining. When empirically tested across a number of real-world graphs, this approach reveals a surprise: vertex neighborhoods include maximal cliques of non-trivial sizes, and the density of the best neighborhood often compares favorably to subgraphs produced by dedicated algorithms for maximizing subgraph density. For graphs with small clustering coefficients, we demonstrate that small vertex neighborhoods can be refined using a local-search method to grow larger cliques and near-cliques. Our results indicate that, contrary to worst-case theoretical results, mining cliques and quasi-cliques of non-trivial sizes from real-world graphs is often not a difficult problem, and provide motivation for further work geared towards a better explanation of these empirical successes.
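
The neighborhood scan described above can be sketched in a few lines. The abstract does not say how the "best" neighborhood is chosen, so this sketch assumes the highest local clustering coefficient, with ties broken by larger degree. The `min_degree` parameter and the choice to return the neighborhood together with its center vertex are also assumptions, and the naive pair enumeration stands in for a proper triangle-counting routine.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of neighbor pairs of v that are themselves adjacent."""
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (d * (d - 1) / 2)

def best_neighborhood(adj, min_degree=2):
    """Scan all vertices and return (center, coefficient, vertex set)."""
    best = None
    for v in adj:
        if len(adj[v]) < min_degree:
            continue
        score = (local_clustering(adj, v), len(adj[v]))
        if best is None or score > best[0]:
            best = (score, v)
    if best is None:
        return None
    (cc, _), v = best
    return v, cc, set(adj[v]) | {v}

# Undirected graph as a dict of neighbor sets: a 4-clique plus a short tail.
graph = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
         3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4}}
center, cc, members = best_neighborhood(graph)
```

A coefficient of 1.0 means the neighborhood plus its center forms a clique, which is the case for the 4-clique in this toy graph.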

Residual Correlation in Graph Neural Network Regression

A graph neural network transforms features in each vertex's neighborhood into a vector representation of the vertex. Afterward, each vertex's representation is used independently for predicting its label. This standard pipeline implicitly assumes that vertex labels are conditionally independent given their neighborhood features. However, this is a strong assumption, and we show that it is far from true on many real-world graph datasets. Focusing on regression tasks, we find that this conditional independence assumption severely limits predictive power. This should not be that surprising, given that traditional graph-based semi-supervised learning methods such as label propagation work in the opposite fashion by explicitly modeling the correlation in predicted outcomes.

Here, we address this problem with an interpretable and efficient framework that can improve any graph neural network architecture simply by exploiting correlation structure in the regression residuals. In particular, we model the joint distribution of residuals on vertices with a parameterized multivariate Gaussian, and estimate the parameters by maximizing the marginal likelihood of the observed labels. Our framework achieves substantially higher accuracy than competing baselines, and the learned parameters can be interpreted as the strength of correlation among connected vertices. Furthermore, we develop linear time algorithms for low-variance, unbiased model parameter estimates, allowing us to scale to large networks. We also provide a basic version of our method that makes stronger assumptions on correlation structure but is painless to implement, often leading to great practical performance with minimal overhead.

Towards Fair Truth Discovery from Biased Crowdsourced Answers

Crowdsourcing systems have gained considerable interest and adoption in recent years. One important research problem for crowdsourcing systems is truth discovery, which aims to aggregate noisy answers contributed by the workers to obtain the correct answer (truth) of each task. However, since the collected answers are highly prone to the workers' biases, aggregating these biased answers without proper treatment will unavoidably lead to discriminatory truth discovery results for particular race, gender and political groups. To address this challenge, in this paper, first, we define a new fairness notion named θ-disparity for truth discovery. Intuitively, θ-disparity bounds the difference between the probabilities that the truths of the protected and unprotected groups are predicted to be positive. Second, we design three fairness-enhancing methods, namely Pre-TD, FairTD, and Post-TD, for truth discovery. Pre-TD is a pre-processing method that removes the bias in workers' answers before truth discovery. FairTD is an in-processing method that incorporates fairness into the truth discovery process. Post-TD is a post-processing method that applies additional treatment to the discovered truth to make it satisfy θ-disparity. We perform an extensive set of experiments on both synthetic and real-world crowdsourcing datasets. Our results demonstrate that among the three fairness-enhancing methods, FairTD produces the best accuracy with θ-disparity. In some settings, the accuracy of FairTD is even better than truth discovery without fairness, as it removes some low-quality answers as a side effect.
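
The intuition behind θ-disparity can be checked numerically with a short sketch. It is one reading of the single-sentence description above, not the paper's formal definition: it assumes binary truths, a boolean protected-group flag per task, and empirical frequencies in place of probabilities. The function names are hypothetical.

```python
def disparity(truths, protected):
    """|P(truth = 1 | protected) - P(truth = 1 | unprotected)|, empirically."""
    in_group = [t for t, g in zip(truths, protected) if g]
    out_group = [t for t, g in zip(truths, protected) if not g]
    if not in_group or not out_group:
        raise ValueError("both groups must contain at least one task")
    return abs(sum(in_group) / len(in_group) - sum(out_group) / len(out_group))

def satisfies_theta_disparity(truths, protected, theta):
    """True when the positive-rate gap between groups is at most theta."""
    return disparity(truths, protected) <= theta

# Discovered truths for 8 tasks; the first 4 concern the protected group.
truths = [1, 0, 1, 1, 0, 0, 1, 0]
protected = [True, True, True, True, False, False, False, False]
gap = disparity(truths, protected)
```

In this toy input the positive rates are 3/4 and 1/4, so the gap is 0.5 and the result satisfies θ-disparity only for θ of at least 0.5.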

AutoShuffleNet: Learning Permutation Matrices via an Exact Lipschitz Continuous Penalty in Deep Convolutional Neural Networks

ShuffleNet is a state-of-the-art lightweight convolutional neural network architecture. Its basic operations include group, channel-wise convolution and channel shuffling. However, channel shuffling is manually designed on empirical grounds. Mathematically, shuffling is a multiplication by a permutation matrix. In this paper, we propose to automate channel shuffling by learning permutation matrices in network training. We introduce an exact Lipschitz continuous non-convex penalty so that it can be incorporated in stochastic gradient descent to approximate permutations at high precision. Exact permutations are obtained by simple rounding at the end of training and are used in inference. The resulting network, referred to as AutoShuffleNet, achieved improved classification accuracies on data from CIFAR-10, CIFAR-100 and ImageNet while preserving the inference costs of ShuffleNet. In addition, we found experimentally that the standard convex relaxation of permutation matrices into stochastic matrices leads to poor performance. We prove theoretically the exactness (error bounds) in recovering permutation matrices when our penalty function is zero (very small). We present examples of permutation optimization through graph matching and two-layer neural network models where the loss functions are calculated in closed analytical form. In the examples, convex relaxation failed to capture permutations whereas our penalty succeeded.
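
The statement that shuffling is a multiplication by a permutation matrix can be verified on a toy example. The sketch below assumes the usual hand-designed shuffle (view the channels as a groups-by-n grid, transpose, flatten) and does not cover the learned permutations or the penalty proposed in the paper. All function names are hypothetical.

```python
def shuffle_permutation(channels, groups):
    """Source index for each output channel of a fixed channel shuffle."""
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    n = channels // groups
    # Output position i * groups + g reads input channel g * n + i.
    return [g * n + i for i in range(n) for g in range(groups)]

def permutation_matrix(perm):
    """0/1 matrix P with P[row][perm[row]] = 1, so (P x)[row] = x[perm[row]]."""
    size = len(perm)
    return [[1 if perm[row] == col else 0 for col in range(size)]
            for row in range(size)]

def matvec(matrix, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

perm = shuffle_permutation(6, 2)      # two groups of three channels
P = permutation_matrix(perm)
x = [10, 11, 12, 13, 14, 15]          # one value per channel
shuffled = matvec(P, x)
```

Every row and column of `P` contains exactly one 1, which is the constraint a learned matrix has to satisfy after rounding.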

MoFlow: An Invertible Flow Model for Generating Molecular Graphs

Generating molecular graphs with desired chemical properties driven by deep graph generative models provides a very promising way to accelerate the drug discovery process. Such graph generative models usually consist of two steps: learning latent representations and generating molecular graphs. However, generating novel and chemically valid molecular graphs from latent representations is very challenging because of the chemical constraints and combinatorial complexity of molecular graphs. In this paper, we propose MoFlow, a flow-based graph generative model to learn invertible mappings between molecular graphs and their latent representations. To generate molecular graphs, our MoFlow first generates bonds (edges) through a Glow-based model, then generates atoms (nodes) given bonds by a novel graph conditional flow, and finally assembles them into a chemically valid molecular graph with a post-hoc validity correction. Our MoFlow has merits including exact and tractable likelihood training, efficient one-pass embedding and generation, chemical validity guarantees, 100% reconstruction of training data, and good generalization ability. We validate our model on four tasks: molecular graph generation and reconstruction, visualization of the continuous latent space, property optimization, and constrained property optimization. Our MoFlow achieves state-of-the-art performance, which implies its potential efficiency and effectiveness in exploring large chemical space for drug discovery.

Parallel DNN Inference Framework Leveraging a Compact RISC-V ISA-based Multi-core System

RISC-V is an open-source instruction set and has now been examined as a universal standard to unify heterogeneous platforms. However, current research focuses primarily on the design and fabrication of general-purpose processors based on RISC-V, despite the fact that in the era of IoT (Internet of Things), the fusion of heterogeneous platforms should also take application-specific processors into account. Accordingly, this paper proposes a collaborative RISC-V multi-core system for Deep Neural Network (DNN) accelerators. To the best of our knowledge, this is the first time that a multi-core scheduling architecture for DNN acceleration is formulated and RISC-V is explored as the ISA of a multi-core system to bridge the gap between the memory and the DNN Processor in order to increase the entire system throughput. The experiment realizes a four-stage design of the RISC-V core, and further reveals that a multi-core design along with an appropriate scheduling algorithm can efficiently decrease the runtime and elevate the throughput. Moreover, the experiment also yields a constructive suggestion regarding the ideal proportion of cores to Process Engines (PEs), which is of significant assistance in building highly efficient AI System-on-Chips (SoCs) in resource-aware situations.

Missing Value Imputation for Mixed Data via Gaussian Copula

Missing data imputation forms the first critical step of many data analysis pipelines. The challenge is greatest for mixed data sets, including real, Boolean, and ordinal data, where standard techniques for imputation fail basic sanity checks: for example, the imputed values may not follow the same distributions as the data. This paper proposes a new semiparametric algorithm to impute missing values, with no tuning parameters. The algorithm models mixed data as a Gaussian copula. This model can fit arbitrary marginals for continuous variables and can handle ordinal variables with many levels, including Boolean variables as a special case. We develop an efficient approximate EM algorithm to estimate copula parameters from incomplete mixed data. The resulting model reveals the statistical associations among variables. Experimental results on several synthetic and real datasets show the superiority of our proposed algorithm to state-of-the-art imputation algorithms for mixed data.

HiTANet: Hierarchical Time-Aware Attention Networks for Risk Prediction on Electronic Health Records

Deep learning methods, especially recurrent neural network based models, have demonstrated early success in disease risk prediction on longitudinal patient data. Existing works implicitly make the strong assumption that disease progression is stationary during each time period, and thus decay the information from previous time steps in the same way for all patients. However, in reality, disease progression is non-stationary. Besides, the key time steps for a target disease vary among patients. To leverage time information for risk prediction in a more reasonable way, we propose a new hierarchical time-aware attention network, named HiTANet, which imitates the decision making process of doctors in risk prediction. In particular, HiTANet models time information in local and global stages. The local evaluation stage has a time-aware Transformer that embeds time information into the visit-level embedding and generates a local attention weight for each visit. The global synthesis stage further adopts a time-aware key-query attention mechanism to assign global weights to different time steps. Finally, the two types of attention weights are dynamically combined to generate the patient representations for further risk prediction. We evaluate HiTANet on three real-world datasets. Compared with the best results among twelve competing baselines, HiTANet achieves an improvement of over 7% in terms of F1 score on all datasets, which demonstrates the effectiveness of the proposed model and the necessity of modeling time information in the risk prediction task.

Personalized PageRank to a Target Node, Revisited

Personalized PageRank (PPR) is a widely used node proximity measure in graph mining and network analysis. Given a source node s and a target node t, the PPR value π(s,t) represents the probability that a random walk from s terminates at t, and thus indicates the bidirectional importance between s and t. The majority of the existing work focuses on single-source queries, which ask for the PPR values between a given source node s and every node t ∈ V. However, the single-source query only reflects the importance of each node t with respect to s. In this paper, we consider the single-target PPR query, which measures the opposite direction of importance for PPR. Given a target node t, the single-target PPR query asks for the PPR value of every node s ∈ V with respect to t. We propose RBS, a novel algorithm that answers approximate single-target queries with optimal computational complexity. We show that RBS improves three concrete applications: heavy hitters PPR query, single-source SimRank computation, and scalable graph neural networks. We conduct experiments to demonstrate that RBS outperforms the state-of-the-art algorithms in terms of both efficiency and precision on real-world benchmark datasets.
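
To make the single-target query concrete, here is a minimal sketch that computes π(s,t) for every s and a fixed t by plain fixed-point iteration. It is not the RBS algorithm from the abstract. It assumes a walk that stops at its current node with probability `alpha` at each step, and that every node has at least one out-neighbor. The function name and parameters are hypothetical.

```python
def single_target_ppr(out_neighbors, target, alpha=0.2, iters=200):
    """Return {s: pi(s, target)} for every node s (illustrative, not RBS).

    out_neighbors maps each node to a non-empty list of its out-neighbors.
    The recurrence used is
        pi(s, t) = alpha * [s == t] + (1 - alpha) * mean(pi(u, t) for u in N_out(s)).
    """
    pi = {s: 0.0 for s in out_neighbors}
    for _ in range(iters):
        new_pi = {}
        for s, nbrs in out_neighbors.items():
            # Average PPR of the out-neighbors: one more step of the walk.
            avg = sum(pi[u] for u in nbrs) / len(nbrs)
            new_pi[s] = (alpha if s == target else 0.0) + (1 - alpha) * avg
        pi = new_pi
    return pi


# Directed 3-cycle 0 -> 1 -> 2 -> 0, querying target node 0.
cycle = {0: [1], 1: [2], 2: [0]}
scores = single_target_ppr(cycle, target=0)
```

Each iteration touches every edge once, so this baseline costs time proportional to the number of edges times the number of iterations. That cost is what motivates approximate methods such as the one the abstract proposes.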

Edge-consensus Learning: Deep Learning on P2P Networks with Nonhomogeneous Data

An effective Deep Neural Network (DNN) optimization algorithm that can use decentralized data sets over a peer-to-peer (P2P) network is proposed. In applications such as medical data analysis, the aggregation of data in one location may not be possible due to privacy issues. Hence, we formulate an algorithm to reach a global DNN model that does not require transmission of data among nodes. An existing solution for this issue is gossip stochastic gradient descent (SGD), which updates by averaging node models over a P2P network. However, in practical situations where the data are statistically heterogeneous across the nodes and/or where communication is asynchronous, gossip SGD often gets trapped in a local minimum since the model gradients are noticeably different. To overcome this issue, we solve a linearly constrained DNN cost minimization problem, which results in variable update rules that restrict differences among all node models. Our approach can be based on the Primal-Dual Method of Multipliers (PDMM) or the Alternating Direction Method of Multipliers (ADMM), but the cost function is linearized to be suitable for deep learning. It facilitates asynchronous communication. The results of our numerical experiments using CIFAR-10 indicate that the proposed algorithms converge to a global recognition model even though statistically heterogeneous data sets are placed on the nodes.
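
The gossip SGD baseline mentioned above can be sketched as follows. This illustrates only the baseline (a local gradient step followed by neighbor averaging), not the proposed PDMM/ADMM-based method. It assumes synchronous rounds, and the function name, the list-of-floats model representation, and the `grad_fns` callables are hypothetical.

```python
def gossip_sgd_round(models, grad_fns, neighbors, lr=0.1):
    """One synchronous gossip SGD round (illustrative baseline only).

    models:    list of parameter vectors (lists of floats), one per node
    grad_fns:  list of callables; grad_fns[i](w) returns node i's local gradient
    neighbors: neighbors[i] is the list of node indices adjacent to node i
    """
    # 1) Each node takes a gradient step on its own local data.
    stepped = [
        [w - lr * g for w, g in zip(model, grad_fns[i](model))]
        for i, model in enumerate(models)
    ]
    # 2) Each node averages its model with its neighbors' models.
    averaged = []
    for i in range(len(stepped)):
        group = [stepped[i]] + [stepped[j] for j in neighbors[i]]
        averaged.append([sum(column) / len(group) for column in zip(*group)])
    return averaged


# Four nodes on a ring; zero gradients isolate the averaging behaviour.
ring = [[1, 3], [0, 2], [1, 3], [2, 0]]
zero_grads = [lambda w: [0.0] * len(w)] * 4
models = [[0.0], [4.0], [8.0], [12.0]]
for _ in range(50):
    models = gossip_sgd_round(models, zero_grads, ring)
```

Only model parameters cross the network in step 2, never raw data, which is the privacy property the abstract relies on.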

Deep Learning of High-Order Interactions for Protein Interface Prediction

Protein interactions are important in a broad range of biological processes. Traditionally, computational methods have been developed to automatically predict protein interface from hand-crafted features. Recent approaches employ deep neural networks and predict the interaction of each amino acid pair independently. However, these methods do not incorporate the important sequential information from amino acid chains and the high-order pairwise interactions. Intuitively, the prediction of an amino acid pair should depend on both their features and the information of other amino acid pairs. In this work, we propose to formulate the protein interface prediction as a 2D dense prediction problem. In addition, we propose a novel deep model to incorporate the sequential information and high-order pairwise interactions to perform interface predictions. We represent proteins as graphs and employ graph neural networks to learn node features. Then we propose the sequential modeling method to incorporate the sequential information and reorder the feature matrix. Next, we incorporate high-order pairwise interactions to generate a 3D tensor containing different pairwise interactions. Finally, we employ convolutional neural networks to perform 2D dense predictions. Experimental results on multiple benchmarks demonstrate that our proposed method can consistently improve the protein interface prediction performance.

MAMO: Memory-Augmented Meta-Optimization for Cold-start Recommendation

A common challenge for most current recommender systems is the cold-start problem. Due to the lack of user-item interactions, fine-tuned recommender systems are unable to handle situations with new users or new items. Recently, some works have introduced the meta-optimization idea into recommendation scenarios, i.e., predicting the user preference from only a few past interacted items. The core idea is to learn a globally shared initialization parameter for all users and then learn the local parameters for each user separately. However, most meta-learning based recommendation approaches adopt model-agnostic meta-learning for parameter initialization, where the globally shared parameter may lead the model into local optima for some users. In this paper, we design two memory matrices that can store task-specific memories and feature-specific memories. Specifically, the feature-specific memories are used to guide the model with personalized parameter initialization, while the task-specific memories are used to guide the model in quickly predicting the user preference. We adopt a meta-optimization approach for optimizing the proposed method. We test the model on two widely used recommendation datasets and consider four cold-start situations. The experimental results show the effectiveness of the proposed methods.

Finding Effective Geo-social Group for Impromptu Activities with Diverse Demands

Geo-social group search aims to find a group of people who are close to a location and socially related. One of the driving applications for geo-social group search is organizing an impromptu activity. This is because the social cohesiveness of a found geo-social group ensures a good communication atmosphere for the activity and the spatial closeness of the geo-social group reduces the preparation time for the activity. Most existing works treat geo-social group search as a problem that finds a group satisfying a single social constraint while optimizing the spatial proximity. However, since different impromptu activities have diverse demands on attendees, e.g., an activity could require (or prefer) the attendees to have skills (or favorites) related to the activity, the existing works cannot find this kind of geo-social group effectively. In this paper, we propose a novel geo-social group model, equipped with elegant keyword constraints, to fill this gap. We propose a novel search framework which first significantly narrows down the search space with theoretical guarantees and then efficiently finds the optimum result. To evaluate the effectiveness, we conduct experiments on real datasets, demonstrating the superiority of our proposed model. We also conduct extensive experiments on large semi-synthetic datasets to justify the efficiency of the proposed search algorithms.

Representing Temporal Attributes for Schema Matching

Temporal data, in which one or several time attributes are present, are prevalent. It is challenging to identify the temporal attributes from heterogeneous sources. The reason is that the same attribute could contain distinct values in different time spans, whereas different attributes may have highly similar timestamps and similar values. Existing studies on schema matching seldom explore the temporal information for matching attributes. In this paper, we propose ordering the values in an attribute A by some time attribute T as a time series. To learn deep temporal features in the attribute pair (T, A), we devise an auto-encoder to embed the transitions of values in the time series into a vector. Temporal attribute matching (TAM) thus evaluates the matching distance of two temporal attribute pairs by comparing their transition vectors. We show that computing the optimal matching distance is NP-hard, and present an approximation algorithm. Experiments on real datasets demonstrate the superiority of our proposal in matching temporal attributes compared to the generic schema matching approaches.

Estimating Properties of Social Networks via Random Walk considering Private Nodes

Accurately analyzing graph properties of social networks is a challenging task because of access limitations to the graph data. To address this challenge, several algorithms to obtain unbiased estimates of properties from few samples via a random walk have been studied. However, existing algorithms do not consider private nodes who hide their neighbors in real social networks, leading to some practical problems. Here we design random walk-based algorithms to accurately estimate properties without any problems caused by private nodes. First, we design a random walk-based sampling algorithm that comprises the neighbor selection to obtain samples having the Markov property and the calculation of weights for each sample to correct the sampling bias. Further, for two graph property estimators, we propose the weighting methods to reduce not only the sampling bias but also estimation errors due to private nodes. The proposed algorithms improve the estimation accuracy of the existing algorithms by up to 92.6% on real-world datasets.
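
As background for the re-weighting step described above, here is a minimal sketch of the classical random-walk estimator of average degree. It does not handle private nodes, which is the contribution of the paper, and it assumes an undirected, connected graph in which every node exposes its neighbor list. The function name and the toy graph are hypothetical.

```python
import random


def estimate_average_degree(neighbors, start, steps, seed=0):
    """Estimate the average degree from a simple random walk (illustrative).

    A simple random walk visits nodes with probability proportional to their
    degree, so each sample is weighted by 1 / degree to correct that bias.
    The resulting estimator is steps / sum(1 / degree of each visited node).
    """
    rng = random.Random(seed)
    node = start
    inv_degree_sum = 0.0
    for _ in range(steps):
        node = rng.choice(neighbors[node])        # move to a uniform neighbor
        inv_degree_sum += 1.0 / len(neighbors[node])
    return steps / inv_degree_sum


# Toy undirected graph with degrees 4, 2, 2, 1, 1 (true average degree 2.0).
toy_graph = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}
estimate = estimate_average_degree(toy_graph, start=0, steps=20000)
```

An unweighted mean of the sampled degrees would overestimate the average, because high-degree nodes are visited more often. The 1/degree weights remove that bias.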

ASGN: An Active Semi-supervised Graph Neural Network for Molecular Property Prediction

Molecular property prediction (e.g., energy) is an essential problem in chemistry and biology. Unfortunately, many supervised learning methods suffer from the scarcity of labeled molecules in the chemical space, where such property labels are generally obtained by Density Functional Theory (DFT) calculation, which is extremely computationally costly. An effective solution is to incorporate the unlabeled molecules in a semi-supervised fashion. However, learning semi-supervised representations for large amounts of molecules is challenging, with issues including the joint representation of both molecular essence and structure, and the conflict between representation learning and property learning. Here we propose a novel framework called Active Semi-supervised Graph Neural Network (ASGN) that incorporates both labeled and unlabeled molecules. Specifically, ASGN adopts a teacher-student framework. In the teacher model, we propose a novel semi-supervised learning method to learn a general representation that jointly exploits information from molecular structure and molecular distribution. Then in the student model, we target the property prediction task to deal with the learning loss conflict. Finally, we propose a novel active learning strategy in terms of molecular diversities to select informative data during the whole framework learning. We conduct extensive experiments on several public datasets. Experimental results show the remarkable performance of our ASGN framework.

Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks

Modeling multivariate time series has long been a subject that has attracted researchers from a diverse range of fields including economics, finance, and traffic. A basic assumption behind multivariate time series forecasting is that its variables depend on one another but, upon looking closely, it is fair to say that existing methods fail to fully exploit latent spatial dependencies between pairs of variables. In recent years, meanwhile, graph neural networks (GNNs) have shown high capability in handling relational dependencies. GNNs require well-defined graph structures for information propagation which means they cannot be applied directly for multivariate time series where the dependencies are not known in advance. In this paper, we propose a general graph neural network framework designed specifically for multivariate time series data. Our approach automatically extracts the uni-directed relations among variables through a graph learning module, into which external knowledge like variable attributes can be easily integrated. A novel mix-hop propagation layer and a dilated inception layer are further proposed to capture the spatial and temporal dependencies within the time series. The graph learning, graph convolution, and temporal convolution modules are jointly learned in an end-to-end framework. Experimental results show that our proposed model outperforms the state-of-the-art baseline methods on 3 of 4 benchmark datasets and achieves on-par performance with other approaches on two traffic datasets which provide extra structural information.

Learning Opinion Dynamics From Social Traces

Opinion dynamics, the research field dealing with how people's opinions form and evolve in a social context, traditionally uses agent-based models to validate the implications of sociological theories. These models encode the causal mechanism that drives the opinion formation process, and have the advantage of being easy to interpret. However, as they do not exploit the availability of data, their predictive power is limited. Moreover, parameter calibration and model selection are manual and difficult tasks.

In this work we propose an inference mechanism for fitting a generative, agent-like model of opinion dynamics to real-world social traces. Given a set of observables (e.g., actions and interactions between agents), our model can recover the most-likely latent opinion trajectories that are compatible with the assumptions about the process dynamics. This type of model retains the benefits of agent-based ones (i.e., causal interpretation), while adding the ability to perform model selection and hypothesis testing on real data.

We showcase our proposal by translating a classical agent-based model of opinion dynamics into its generative counterpart. We then design an inference algorithm based on online expectation maximization to learn the latent parameters of the model. This algorithm can recover the latent opinion trajectories from traces generated by the classical agent-based model. In addition, it can identify the most likely set of macro parameters used to generate a data trace, thus allowing testing of sociological hypotheses. Finally, we apply our model to real-world data from Reddit to explore the long-standing question about the impact of the backfire effect. Our results suggest a low prominence of the effect in Reddit's political conversation.

Enterprise Cooperation and Competition Analysis with a Sign-Oriented Preference Network

The development of effective cooperative and competitive strategies has been recognized as the key to the success of many companies in a globalized world. Therefore, many efforts have been made on the analysis of cooperation and competition among companies. However, existing studies either rely on labor intensive empirical analysis with specific cases or do not consider the heterogeneous company information when quantitatively measuring company relationships in a company network. More importantly, it is not clear how to generate a unified representation for cooperative and competitive strategies in a data driven way. To this end, in this paper, we provide a large-scale data driven analysis on the cooperative and competitive relationships among companies in a Sign-oriented Preference Network (SOPN). Specifically, we first exploit a Relational Graph Convolutional Network (RGCN) for generating a deep representation of the heterogeneous company features and a company relation network. Then, based on the representation, we generate two sets of preference vectors for each company by utilizing the attention mechanism to model the importance of different relations, representing their cooperative and competitive strategies respectively. Also, we design a sign constraint to model the dependency between cooperation and competition relations. Finally, we conduct extensive experiments on a real-world dataset, and verify the effectiveness of our approach. Moreover, we provide a case study to show some interesting patterns and their potential business value.

BLOB: A Probabilistic Model for Recommendation that Combines Organic and Bandit Signals

A common task for recommender systems is to build a profile of the interests of a user from items in their browsing history and later to recommend items to the user from the same catalog. The users' behavior consists of two parts: the sequence of items that they viewed without intervention (the organic part) and the sequences of items recommended to them and their outcome (the bandit part).

In this paper, we propose the Bayesian Latent Organic Bandit model (BLOB), a probabilistic approach to combine the 'organic' and 'bandit' signals in order to improve the estimation of recommendation quality. The bandit signal is valuable as it gives direct feedback on recommendation performance, but the signal quality is very uneven, as it is highly concentrated on the recommendations deemed optimal by the past version of the recommender system. In contrast, the organic signal is typically strong and covers most items, but is not always relevant to the recommendation task. In order to leverage the organic signal to efficiently learn the bandit signal in a Bayesian model, we identify three fundamental types of distances, namely action-history, action-action and history-history distances. We implement a scalable approximation of the full model using variational auto-encoders and the local reparameterization trick. We show using extensive simulation studies that our method outperforms or matches the value of both state-of-the-art organic-based recommendation algorithms and bandit-based methods (both value- and policy-based), in both organic-rich and bandit-rich environments.

AutoST: Efficient Neural Architecture Search for Spatio-Temporal Prediction

Spatio-temporal (ST) prediction (e.g., crowd flow prediction) is of great importance in a wide range of smart city applications such as urban planning, intelligent transportation and public safety. Recently, many deep neural network models have been proposed to make accurate predictions. However, manually designing neural networks requires a substantial amount of expert effort and ST domain knowledge. How can a general neural network be constructed automatically for diverse spatio-temporal prediction tasks in cities? In this paper, we study Neural Architecture Search (NAS) for spatio-temporal prediction and propose an efficient spatio-temporal neural architecture search method, entitled AutoST. To the best of our knowledge, the search space is an important human prior for the success of NAS in different applications, while current NAS models concentrate on optimizing the search strategy in a fixed search space. Thus, we design a novel search space tailored for the ST domain which consists of two categories of components: (i) optional convolution operations at each layer to automatically extract multi-range spatio-temporal dependencies; (ii) learnable skip connections among layers to dynamically fuse low- and high-level ST features. We conduct extensive experiments on four real-world spatio-temporal prediction tasks, including taxi flow and crowd flow, showing that the learned network architectures can significantly improve the performance of representative ST neural network models. Furthermore, our proposed efficient NAS approach searches 8-10x faster than state-of-the-art NAS approaches, demonstrating the efficiency and effectiveness of AutoST.

COMPOSE: Cross-Modal Pseudo-Siamese Network for Patient Trial Matching

Clinical trials play important roles in drug development but often suffer from expensive, inaccurate and insufficient patient recruitment. The availability of massive electronic health records (EHR) data and trial eligibility criteria (EC) brings a new opportunity for data-driven patient recruitment. One key task, named patient-trial matching, is to find qualified patients for clinical trials given structured EHR and unstructured EC text (both inclusion and exclusion criteria). How to match complex EC text with longitudinal patient EHRs? How to embed many-to-many relationships between patients and trials? How to explicitly handle the difference between inclusion and exclusion criteria? In this paper, we propose the CrOss-Modal PseudO-SiamEse network (COMPOSE) to address these challenges for patient-trial matching. One path of the network encodes EC using a convolutional highway network. The other path processes EHR with a multi-granularity memory network that encodes structured patient records into multiple levels based on medical ontology. Using the EC embedding as query, COMPOSE performs attentional record alignment and thus enables dynamic patient-trial matching. COMPOSE also introduces a composite loss term to maximize the similarity between patient records and inclusion criteria while minimizing the similarity to the exclusion criteria. Experiment results show COMPOSE can reach 98.0% AUC on patient-criteria matching and 83.7% accuracy on patient-trial matching, which is a 24.3% improvement over the best baseline on real-world patient-trial matching tasks.

Discovering Succinct Pattern Sets Expressing Co-Occurrence and Mutual Exclusivity

Pattern mining is one of the core topics of data mining. We consider the problem of mining a succinct set of patterns that together explain the data in terms of mutual exclusivity and co-occurrence. That is, we extend the traditional pattern languages beyond conjunctions, enabling us to capture more complex relationships, such as replaceable sub-components or antagonists in biological pathways.

We formally define this problem in terms of the Minimum Description Length principle, by which we identify the best set of patterns as the one that most succinctly describes the data. To avoid spurious results (in sparse data, mutual exclusivity is likely just due to chance), we propose an efficient statistical test for K-ary mutual exclusivity. As the search space for the optimal model is enormous and unstructured, we propose Mexican, a heuristic algorithm to efficiently discover high quality sets of patterns of co-occurrence and mutual exclusivity. Through extensive experiments we show that Mexican recovers the ground truth on synthetic data, and meaningful results on real-world data. Both are in stark contrast to the state of the art, which results in millions of spurious patterns.

TIPRDC: Task-Independent Privacy-Respecting Data Crowdsourcing Framework for Deep Learning with Anonymized Intermediate Representations

The success of deep learning partially benefits from the availability of various large-scale datasets. These datasets are often crowdsourced from individual users and contain private information like gender, age, etc. The emerging privacy concerns from users on data sharing hinder the generation or use of crowdsourced datasets and lead to a shortage of training data for new deep learning applications. One naive solution is to pre-process the raw data to extract features on the user side, and then send only the extracted features to the data collector. Unfortunately, attackers can still exploit these extracted features to train an adversary classifier to infer private attributes. Some prior work leveraged game theory to protect private attributes. However, these defenses are designed for known primary learning tasks, and the extracted features work poorly for unknown learning tasks. To tackle the case where the learning task may be unknown or changing, we present TIPRDC, a task-independent privacy-respecting data crowdsourcing framework with anonymized intermediate representation. The goal of this framework is to learn a feature extractor that can hide the private information from the intermediate representations while maximally retaining the original information embedded in the raw data for the data collector to accomplish unknown learning tasks. We design a hybrid training method to learn the anonymized intermediate representation: (1) an adversarial training process for hiding private information from features; (2) a neural-network-based mutual information estimator for maximally retaining the original information. We extensively evaluate TIPRDC and compare it with existing methods using two image datasets and one text dataset. Our results show that TIPRDC substantially outperforms other existing methods. Our work is the first task-independent privacy-respecting data crowdsourcing framework.

AutoGrow: Automatic Layer Growing in Deep Convolutional Networks

Depth is a key component of Deep Neural Networks (DNNs); however, designing depth is heuristic and requires much human effort. We propose AutoGrow to automate depth discovery in DNNs: starting from a shallow seed architecture, AutoGrow grows new layers if the growth improves the accuracy; otherwise, it stops growing and thus discovers the depth. We propose robust growing and stopping policies to generalize to different network architectures and datasets. Our experiments show that by applying the same policy to different network architectures, AutoGrow can always discover near-optimal depth on various datasets of MNIST, FashionMNIST, SVHN, CIFAR10, CIFAR100 and ImageNet. For example, in terms of accuracy-computation trade-off, AutoGrow discovers a better depth combination in ResNets than human experts. AutoGrow is also efficient: it discovers depth within a time similar to that of training a single DNN. Our code is available at https://github.com/wenwei202/autogrow.

Curb-GAN: Conditional Urban Traffic Estimation through Spatio-Temporal Generative Adversarial Networks

Given an urban development plan and the historical traffic observations over the road network, the Conditional Urban Traffic Estimation problem aims to estimate the resulting traffic status prior to the deployment of the plan. This problem is of great importance to urban development and transportation management, yet is very challenging because the plan would change the local travel demands drastically and the new travel demand pattern might be unprecedented in the historical data. To tackle these challenges, we propose a novel Conditional Urban Traffic Generative Adversarial Network (Curb-GAN), which provides traffic estimations in consecutive time slots based on different (unprecedented) travel demands, thereby enabling urban planners to accurately evaluate urban plans before deploying them. The proposed Curb-GAN adopts and advances the conditional GAN structure through a few novel ideas: (1) dealing with various travel demands as the "conditions" and generating corresponding traffic estimations, (2) integrating dynamic convolutional layers to capture the local spatial auto-correlations along the underlying road networks, (3) employing a self-attention mechanism to capture the temporal dependencies of the traffic across different time slots. Extensive experiments on two real-world spatio-temporal datasets demonstrate that our Curb-GAN outperforms major baseline methods in estimation accuracy under various conditions and can produce more meaningful estimations.

Incremental Mobile User Profiling: Reinforcement Learning with Spatial Knowledge Graph for Modeling Event Streams

We study the integration of reinforcement learning and a spatial knowledge graph for incremental mobile user profiling, which aims to map mobile users to dynamically updated profile vectors by incremental learning from a mixed-user event stream. After exploring many profiling methods, we identify a new imitation-based criterion to better evaluate and optimize profiling accuracy. Considering the objective of teaching an autonomous agent to imitate a mobile user in planning the next visit based on the user's profile, the user profile is the most accurate when the agent can perfectly mimic the activity patterns of the user. We propose to formulate the problem as a reinforcement learning task, where an agent is a next-visit planner, an action is a POI that a user will visit next, and the state of the environment is a fused representation of a user and spatial entities (e.g., POIs, activity types, functional zones). An event in which a user takes an action to visit a POI will change the environment, resulting in a new state of user profiles and spatial entities, which helps the agent to predict the next visit more accurately. After analyzing such interactions among events, users, and spatial entities, we identify (1) semantic connectivity among spatial entities, and thus introduce a spatial Knowledge Graph (KG) to characterize the semantics of user visits over connected locations, activities, and zones. Besides, we identify (2) mutual influence between users and the spatial KG, and thus develop a mutual-updating strategy between users and the spatial KG, mixed with temporal context, to quantify the state representation that evolves over time. Along these lines, we develop a reinforcement learning framework integrated with the spatial KG. The proposed framework can achieve incremental learning in multi-user profiling given a mixed-user event stream. Finally, we apply our approach to human mobility activity prediction and present extensive experiments to demonstrate improved performance.

Identifying Sepsis Subphenotypes via Time-Aware Multi-Modal Auto-Encoder

Sepsis is a heterogeneous clinical syndrome that is the leading cause of mortality in hospital intensive care units (ICUs). Identification of sepsis subphenotypes may allow for more precise treatments and lead to more targeted clinical interventions. Recently, sepsis subtyping on electronic health records (EHRs) has attracted interest from healthcare researchers. However, most sepsis subtyping studies ignore the temporality of EHR data and suffer from missing values. In this paper, we propose a new sepsis subtyping framework to address the two issues. Our subtyping framework consists of a novel Time-Aware Multi-modal auto-Encoder (TAME) model which introduces a time-aware attention mechanism and incorporates multi-modal inputs (e.g., demographics, diagnoses, medications, lab tests and vital signs) to impute missing values, a dynamic time warping (DTW) method to measure patients' temporal similarity based on the imputed EHR data, and a weighted k-means algorithm to cluster patients. Comprehensive experiments on real-world datasets show TAME outperforms the baselines on imputation accuracy. After analyzing TAME-imputed EHR data, we identify four novel subphenotypes of sepsis patients, paving the way for improved personalization of sepsis management.
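
The dynamic time warping step named above is a standard dynamic program, sketched below in its textbook form. The sketch assumes univariate series and an absolute-difference local cost. The abstract does not state which local cost or variant the framework uses on multi-modal EHR data, so those choices and the function name are assumptions.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences.

    cost[i][j] holds the cheapest alignment cost of a[:i] with b[:j]; each
    cell extends the best of a match, an insertion, or a deletion.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])          # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],       # a[i-1] aligned with an earlier b as well
                cost[i][j - 1],       # b[j-1] aligned with an earlier a as well
                cost[i - 1][j - 1],   # one-to-one match
            )
    return cost[n][m]


# A series and a time-stretched copy of it align with zero cost.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

Because the alignment may repeat points, two patients whose measurements follow the same trajectory at different speeds can still come out as similar, which a pointwise distance would miss.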

A Causal Look at Statistical Definitions of Discrimination

Predictive parity and error rate balance are both widely accepted and adopted criteria for assessing fairness of classifiers. The realization that these equally reasonable criteria can lead to contradictory results has, nonetheless, generated a lot of debate and controversy, and has motivated the development of mathematical results establishing the impossibility of concomitantly satisfying predictive parity and error rate balance. Here, we investigate these fairness criteria from a causality perspective. By taking into consideration the data generation process giving rise to the observed data, as well as the data generation process giving rise to the predictions, and assuming faithfulness, we prove that when the base rates differ across the protected groups and there is no perfect separation, then a standard classifier cannot achieve exact predictive parity. (By standard classifier we mean a classifier trained in the usual way, without adopting pre-processing, in-processing, or post-processing fairness techniques.) This result holds in general, irrespective of the data generation process giving rise to the observed data. Furthermore, we show that the amount of disparate mistreatment for the positive predictive value metric is proportional to the difference between the base rates. For error rate balance, as well as the closely related equalized odds and equality of opportunity criteria, we show that there are, nonetheless, data generation processes that can still satisfy these criteria when the base rates differ by protected group, and we characterize the conditions under which these criteria hold. We illustrate our results using synthetic data and a re-analysis of the COMPAS data.
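
The tension between the two criteria can be seen from the standard identity that links positive predictive value (PPV) to the base rate through Bayes' rule. The sketch below is a plain numerical illustration, not the paper's causal argument; the function name and the example rates are assumptions chosen for round numbers.

```python
def positive_predictive_value(base_rate, tpr, fpr):
    """PPV = P(Y=1 | prediction=1), obtained from Bayes' rule.

    base_rate: P(Y=1) within the group
    tpr:       true positive rate, P(prediction=1 | Y=1)
    fpr:       false positive rate, P(prediction=1 | Y=0)
    """
    true_pos = tpr * base_rate
    false_pos = fpr * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)

if __name__ == "__main__":
    # Two hypothetical groups with identical error rates (error rate balance holds)
    # but different base rates.
    tpr, fpr = 0.8, 0.2
    for name, base_rate in [("group A", 0.5), ("group B", 0.2)]:
        print(name, positive_predictive_value(base_rate, tpr, fpr))
```

With equal error rates the two hypothetical groups end up with different PPVs, so predictive parity fails whenever the base rates differ and the classifier is imperfect.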

Targeted Data-driven Regularization for Out-of-Distribution Generalization

Due to biases introduced by large real-world datasets, deviations of deep learning models from their expected behavior on out-of-distribution test data are worrisome, especially when data come from imbalanced or heavy-tailed label distributions, or from minority groups of a sensitive feature. Classical approaches to address these biases are mostly data- or application-dependent, and hence are burdensome to tune. Some meta-learning approaches, on the other hand, aim to learn hyperparameters in the learning process using different objective functions on training and validation data. However, these methods suffer from high computational complexity and are not scalable to large datasets. In this paper, we propose a unified data-driven regularization approach to learn a generalizable model from biased data. The proposed framework, named targeted data-driven regularization (TDR), is model- and dataset-agnostic, and employs a target dataset that resembles the desired nature of test data in order to guide the learning process in a coupled manner. We cast the problem as a bilevel optimization and propose an efficient stochastic gradient descent based method to solve it. The framework can be utilized to alleviate various types of biases in real-world applications. We empirically show, on both synthetic and real-world datasets, the superior performance of TDR in resolving issues stemming from these biases.

Neural Dynamics on Complex Networks

Learning continuous-time dynamics on complex networks is crucial for understanding, predicting, and controlling complex systems in science and engineering. However, this task is very challenging due to the combinatorial complexities in the structures of high dimensional systems, their elusive continuous-time nonlinear dynamics, and their structural-dynamic dependencies. To address these challenges, we propose to combine Ordinary Differential Equation Systems (ODEs) and Graph Neural Networks (GNNs) to learn continuous-time dynamics on complex networks in a data-driven manner. We model differential equation systems by GNNs. Instead of mapping through a discrete number of neural layers in the forward process, we integrate GNN layers over continuous time numerically, leading to capturing continuous-time dynamics on graphs. Our model can be interpreted as a Continuous-time GNN model or a Graph Neural ODEs model. Our model can be utilized for continuous-time network dynamics prediction, structured sequence prediction (a regularly-sampled case), and node semi-supervised classification tasks (a one-snapshot case) in a unified framework. We validate our model by extensive experiments in the above three scenarios. The promising experimental results demonstrate our model's capability of jointly capturing the structure and dynamics of complex systems in a unified framework.
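
The idea of integrating a graph layer over continuous time, rather than stacking a fixed number of discrete layers, can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' model: the vector field `tanh(A_norm @ X @ W)`, the forward-Euler solver, and all function names are hypothetical choices made here for brevity.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetrically normalized adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # degrees are >= 1 due to self-loops
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def integrate_graph_ode(x0, adj, weight, t_end=1.0, steps=100):
    """Forward-Euler integration of dX/dt = tanh(A_norm @ X @ W) from t=0 to t_end.

    Assumption: a single GCN-style layer serves as the vector field; a real
    system would learn `weight` and typically use an adaptive ODE solver.
    """
    a_norm = normalized_adjacency(adj)
    x = np.asarray(x0, dtype=float).copy()
    dt = t_end / steps
    for _ in range(steps):
        dx_dt = np.tanh(a_norm @ x @ weight)  # GNN layer evaluated at the current state
        x = x + dt * dx_dt                     # one Euler step
    return x

if __name__ == "__main__":
    adj = np.array([[0.0, 1.0], [1.0, 0.0]])
    x0 = np.ones((2, 3))
    print(integrate_graph_ode(x0, adj, np.eye(3), t_end=0.5, steps=50))
```

The integration horizon `t_end` plays the role that depth plays in a discrete GNN, and the state can be read out at arbitrary, irregularly spaced times.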

Grammatically Recognizing Images with Tree Convolution

Similar to language, understanding an image can be considered as a hierarchical decomposition process from scenes to objects, parts, pixels, and the corresponding spatial/contextual relations. However, the existing convolutional networks concentrate on stacking redundant convolutional layers with a large number of kernels in a hierarchical organization to implicitly approximate this decomposition. This may limit the network's ability to learn the semantic information conveyed in the internal feature maps that may reveal minor yet crucial differences for visual understanding. Attempting to tackle this problem, this paper proposes a simple yet effective tree convolution (TreeConv) operation for deep neural networks. Specifically, inspired by the image grammar techniques [73] that serve as a unified framework of object representation, learning, and recognition, our TreeConv designs a generative image grammar, i.e., a tree generation rule, to parse the hierarchy of internal feature maps by generating tree structures and implicitly learning the specific visual grammars for each object category. Extensive experiments on a variety of benchmarks, i.e., classification (ImageNet / CIFAR), detection & segmentation (COCO 2017), and person re-identification (CUHK03), demonstrate the superiority of our TreeConv in both boosting the accuracy and reducing the computational cost. The source code will be available at: https://github.com/wanggrun/TreeConv.

Generic Outlier Detection in Multi-Armed Bandit

In this paper, we study the problem of outlier arm detection in multi-armed bandit settings, which finds plenty of applications in many high-impact domains such as finance, healthcare, and online advertising. For this problem, a learner aims to identify the arms whose expected rewards deviate significantly from most of the other arms. Different from existing work, we target the generic outlier arms or outlier arm groups whose expected rewards can be larger, smaller, or even in between those of normal arms. To this end, we start by providing a comprehensive definition of such generic outlier arms and outlier arm groups. Then we propose a novel pulling algorithm named GOLD to identify such generic outlier arms. It builds a real-time neighborhood graph based on upper confidence bounds and catches the behavior pattern of outliers from normal arms. We also analyze its performance from various aspects. In the experiments conducted on both synthetic and real-world data sets, the proposed algorithm achieves 98% accuracy while saving 83% exploration cost on average compared with state-of-the-art techniques.

Robust Spammer Detection by Nash Reinforcement Learning

Online reviews provide product evaluations for customers to make decisions. Unfortunately, the evaluations can be manipulated using fake reviews ("spams") by professional spammers, who have learned increasingly insidious and powerful spamming strategies by adapting to the deployed detectors. Spamming strategies are hard to capture, as they can vary quickly over time, differ across spammers and target products, and, more critically, remain unknown in most cases. Furthermore, most existing detectors focus on detection accuracy, which is not well-aligned with the goal of maintaining the trustworthiness of product evaluations. To address these challenges, we formulate a minimax game where the spammers and spam detectors compete with each other on their practical goals that are not solely based on detection accuracy. Nash equilibria of the game lead to stable detectors that are agnostic to any mixed detection strategies. However, the game has no closed-form solution and is not differentiable, so it does not admit the typical gradient-based algorithms. We turn the game into two dependent Markov Decision Processes (MDPs) to allow efficient stochastic optimization based on multi-armed bandit and policy gradient. We experiment on three large review datasets using various state-of-the-art spamming and detection strategies and show that the optimization algorithm can reliably find an equilibrial detector that can robustly and effectively prevent spammers with any mixed spamming strategies from attaining their practical goal. Our code is available at https://github.com/YingtongDou/Nash-Detect.

Mining Persistent Activity in Continually Evolving Networks

Frequent pattern mining is a key area of study that gives insights into the structure and dynamics of evolving networks, such as social or road networks. However, not only does a network evolve, but often the way that it evolves, itself evolves. Thus, knowing, in addition to patterns' frequencies, for how long and how regularly they have occurred (i.e., their persistence) can add to our understanding of evolving networks. In this work, we propose the problem of mining activity that persists through time in continually evolving networks, i.e., activity that repeatedly and consistently occurs. We extend the notion of temporal motifs to capture activity among specific nodes, in what we call activity snippets, which are small sequences of edge-updates that reoccur. We propose axioms and properties that a measure of persistence should satisfy, and develop such a persistence measure. We also propose PENminer, an efficient framework for mining activity snippets' Persistence in Evolving Networks, and design both offline and streaming algorithms. We apply PENminer to numerous real, large-scale evolving networks and edge streams, and find activity that is surprisingly regular over a long period of time, but too infrequent to be discovered by aggregate count alone, and bursts of activity exposed by their lack of persistence. Our findings with PENminer include neighborhoods in NYC where taxi traffic persisted through Hurricane Sandy, the opening of new bike-stations, characteristics of social network users, and more. Moreover, we use PENminer towards identifying anomalies in multiple networks, outperforming baselines at identifying subtle anomalies by 9.8-48% in AUC.

Towards Automated Neural Interaction Discovery for Click-Through Rate Prediction

Click-Through Rate (CTR) prediction is one of the most important machine learning tasks in recommender systems, driving personalized experience for billions of consumers. Neural architecture search (NAS), as an emerging field, has demonstrated its capabilities in discovering powerful neural network architectures, which motivates us to explore its potential for CTR prediction. Due to 1) diverse unstructured feature interactions, 2) heterogeneous feature space, and 3) high data volume and intrinsic data randomness, it is challenging to construct, search, and compare different architectures effectively for recommendation models. To address these challenges, we propose an automated interaction architecture discovery framework for CTR prediction named AutoCTR. By modularizing simple yet representative interactions as virtual building blocks and wiring them into a space of directed acyclic graphs, AutoCTR performs evolutionary architecture exploration with learning-to-rank guidance at the architecture level and achieves acceleration using a low-fidelity model. Empirical analysis demonstrates the effectiveness of AutoCTR on different datasets compared to human-crafted architectures. The discovered architecture also enjoys generalizability and transferability among different datasets.

High-Dimensional Similarity Search with Quantum-Assisted Variational Autoencoder

Recent progress in quantum algorithms and hardware indicates the potential importance of quantum computing in the near future. However, finding suitable application areas remains an active area of research. Quantum machine learning is touted as a potential approach to demonstrate quantum advantage within both the gate-model and the adiabatic schemes. For instance, the Quantum-assisted Variational Autoencoder (QVAE) has been proposed as a quantum enhancement to the discrete VAE. We extend previous work and study the real-world applicability of a QVAE by presenting a proof-of-concept for similarity search in large-scale high-dimensional datasets. While exact and fast similarity search algorithms are available for low-dimensional datasets, scaling to high-dimensional data is non-trivial. We show how to construct a space-efficient search index based on the latent space representation of a QVAE. Our experiments show a correlation between the Hamming distance in the embedded space and the Euclidean distance in the original space on the Moderate Resolution Imaging Spectroradiometer (MODIS) dataset. Further, we find real-world speedups compared to linear search and demonstrate memory-efficient scaling to half a billion data points.

Off-policy Bandits with Deficient Support

Learning effective contextual-bandit policies from past actions of a deployed system is highly desirable in many settings (e.g., voice assistants, recommendation, search), since it enables the reuse of large amounts of log data. State-of-the-art methods for such off-policy learning, however, are based on inverse propensity score (IPS) weighting. A key theoretical requirement of IPS weighting is that the policy that logged the data has "full support", which typically translates into requiring non-zero probability for any action in any context. Unfortunately, many real-world systems produce support-deficient data, especially when the action space is large, and we show how existing methods can fail catastrophically. To overcome this gap between theory and applications, we identify three approaches that provide various guarantees for IPS-based learning despite the inherent limitations of support-deficient data: restricting the action space, reward extrapolation, and restricting the policy space. We systematically analyze the statistical and computational properties of these three approaches, and we empirically evaluate their effectiveness. In addition to providing the first systematic analysis of support-deficiency in contextual-bandit learning, we conclude with recommendations that provide practical guidance.
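
As background, the standard IPS estimator reweights each logged reward by the ratio of the target policy's probability to the logging policy's probability for the logged action. The sketch below is a generic textbook version, not one of the three approaches analyzed in the paper; the function name and the toy log are assumptions.

```python
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs):
    """Inverse propensity score estimate of a target policy's expected reward.

    rewards[i]:       observed reward of the i-th logged action
    logging_probs[i]: probability the logging policy assigned to that action
    target_probs[i]:  probability the target policy assigns to that action
    """
    rewards = np.asarray(rewards, dtype=float)
    logging_probs = np.asarray(logging_probs, dtype=float)
    target_probs = np.asarray(target_probs, dtype=float)
    if np.any(logging_probs <= 0.0):
        # A logged action always has positive propensity; zero here signals bad data.
        raise ValueError("logged actions must have positive logging probability")
    weights = target_probs / logging_probs  # importance weights
    return float(np.mean(weights * rewards))

if __name__ == "__main__":
    # Hypothetical three-record log.
    print(ips_estimate([1.0, 0.0, 1.0], [0.5, 0.5, 0.25], [0.25, 0.5, 0.5]))
```

Note what the estimator cannot see: an action the logging policy never takes never appears in the log, so any probability the target policy places on it contributes nothing to the average. That missing mass is the support deficiency the abstract refers to.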

Adaptive Graph Encoder for Attributed Graph Embedding

Attributed graph embedding, which learns vector representations from graph topology and node features, is a challenging task for graph analysis. Recently, methods based on graph convolutional networks (GCNs) have made great progress on this task. However, existing GCN-based methods have three major drawbacks. Firstly, our experiments indicate that the entanglement of graph convolutional filters and weight matrices harms both performance and robustness. Secondly, we show that the graph convolutional filters in these methods are special cases of generalized Laplacian smoothing filters, but they do not preserve optimal low-pass characteristics. Finally, the training objectives of existing algorithms usually recover the adjacency matrix or feature matrix, which is not always consistent with real-world applications. To address these issues, we propose Adaptive Graph Encoder (AGE), a novel attributed graph embedding framework. AGE consists of two modules: (1) To better alleviate the high-frequency noise in the node features, AGE first applies a carefully-designed Laplacian smoothing filter. (2) AGE employs an adaptive encoder that iteratively strengthens the filtered features for better node embeddings. We conduct experiments using four public benchmark datasets to validate AGE on node clustering and link prediction tasks. Experimental results show that AGE consistently and considerably outperforms state-of-the-art graph embedding methods on these tasks.
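
A generalized Laplacian smoothing filter of the kind mentioned above can be written as H = I - kL, applied t times to the feature matrix. The following is a minimal sketch under assumptions: it uses the symmetric normalized Laplacian, and the values of `k` and `t` are illustrative defaults, not the settings designed in the paper.

```python
import numpy as np

def laplacian_smoothing(features, adj, k=0.5, t=1):
    """Apply the filter H = I - k * L_sym to `features` t times.

    L_sym = I - D^{-1/2} A D^{-1/2} is the symmetric normalized Laplacian.
    Assumption: k and t are free parameters chosen by the caller.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros(n)
    connected = deg > 0
    d_inv_sqrt[connected] = 1.0 / np.sqrt(deg[connected])  # isolated nodes keep 0
    a_norm = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    h = np.eye(n) - k * (np.eye(n) - a_norm)
    x = np.asarray(features, dtype=float)
    for _ in range(t):
        x = h @ x  # each pass attenuates high-frequency components
    return x

if __name__ == "__main__":
    adj = np.array([[0, 1], [1, 0]])
    # Column 0 is constant across the edge (low frequency),
    # column 1 flips sign across the edge (high frequency).
    feats = np.array([[1.0, 1.0], [1.0, -1.0]])
    print(laplacian_smoothing(feats, adj, k=0.5, t=1))
```

On this two-node graph the Laplacian has eigenvalues 0 and 2, so with k = 0.5 the constant column passes through unchanged while the sign-flipping column is removed entirely, which is the low-pass behavior the filter is meant to have.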

NetTrans: Neural Cross-Network Transformation

Finding node associations across different networks is the cornerstone behind a wealth of high-impact data mining applications. Traditional approaches are often, explicitly or implicitly, built upon the linearity and/or consistency assumptions. On the other hand, the recent network embedding based methods promise a natural way to handle the non-linearity, yet they could suffer from the disparate node embedding space of different networks. In this paper, we address these limitations and tackle cross-network node associations from a new angle, i.e., cross-network transformation. We ask a generic question: Given two different networks, how can we transform one network to another? We propose an end-to-end model that learns a composition of nonlinear operations so that one network can be transformed to another in a hierarchical manner. The proposed model bears three distinctive advantages. First (composite transformation), it goes beyond the linearity/consistency assumptions and performs the cross-network transformation through a composition of nonlinear computations. Second (representation power), it can learn the transformation of both network structures and node attributes at different resolutions while identifying the cross-network node associations. Third (generality), it can be applied to various tasks, including network alignment, recommendation, cross-layer dependency inference. Extensive experiments on different tasks validate and verify the effectiveness of the proposed model.

Redundancy-Free Computation for Graph Neural Networks

Graph Neural Networks (GNNs) are based on repeated aggregations of information from nodes' neighbors in a graph. However, because nodes share many neighbors, a naive implementation leads to repeated and inefficient aggregations and represents significant computational overhead. Here we propose Hierarchically Aggregated computation Graphs (HAGs), a new GNN representation technique that explicitly avoids redundancy by managing intermediate aggregation results hierarchically and eliminates repeated computations and unnecessary data transfers in GNN training and inference. HAGs perform the same computations and give the same models/accuracy as traditional GNNs, but in a much shorter time due to optimized computations. To identify redundant computations, we introduce an accurate cost function and use a novel search algorithm to find optimized HAGs. Experiments show that the HAG representation significantly outperforms the standard GNN by increasing the end-to-end training throughput by up to 2.8× and reducing the aggregations and data transfers in GNN training by up to 6.3× and 5.6×, with only 0.1% memory overhead. Overall, our results represent an important advancement in speeding-up and scaling-up GNNs without any loss in model predictive performance.

Improving Conversational Recommender Systems via Knowledge Graph based Semantic Fusion

Conversational recommender systems (CRS) aim to recommend high-quality items to users through interactive conversations. Although several efforts have been made for CRS, two major issues remain to be solved. First, the conversation data itself lacks sufficient contextual information for accurately understanding users' preferences. Second, there is a semantic gap between natural language expression and item-level user preference.

To address these issues, we incorporate both word-oriented and entity-oriented knowledge graphs (KGs) to enhance the data representations in CRSs, and adopt Mutual Information Maximization to align the word-level and entity-level semantic spaces. Based on the aligned semantic representations, we further develop a KG-enhanced recommender component for making accurate recommendations, and a KG-enhanced dialog component that can generate informative keywords or entities in the response text. Extensive experiments have demonstrated the effectiveness of our approach in yielding better performance on both recommendation and conversation tasks.

Sliding Sketches: A Framework using Time Zones for Data Stream Processing in Sliding Windows

Data stream processing has become a hot issue in recent years due to the arrival of the big data era. There are three fundamental stream processing tasks: membership query, frequency query and heavy hitter query. While most existing solutions address these queries in fixed windows, this paper focuses on a more challenging task: answering these queries in sliding windows. Moreover, while most existing solutions address different kinds of queries by using different algorithms, we propose a generic framework, namely Sliding sketches, which can be applied to many existing solutions for the above three queries and enables them to support queries in sliding windows. We apply our framework to five state-of-the-art sketches for the above three kinds of queries. Theoretical analysis and extensive experimental results show that after using our framework, the accuracy of existing sketches that do not support sliding windows becomes much higher than the corresponding best prior art. We released all the source code on GitHub.

STEAM: Self-Supervised Taxonomy Expansion with Mini-Paths

Taxonomies are important knowledge ontologies that underpin numerous applications on a daily basis, but many taxonomies used in practice suffer from the low coverage issue. We study the taxonomy expansion problem, which aims to expand existing taxonomies with new concept terms. We propose a self-supervised taxonomy expansion model named STEAM, which leverages natural supervision in the existing taxonomy for expansion. To generate natural self-supervision signals, STEAM samples mini-paths from the existing taxonomy, and formulates a node attachment prediction task between anchor mini-paths and query terms. To solve the node attachment task, it learns feature representations for query-anchor pairs from multiple views and performs multi-view co-training for prediction. Extensive experiments show that STEAM outperforms state-of-the-art methods for taxonomy expansion by 11.6% in accuracy and 7.0% in mean reciprocal rank on three public benchmarks. The code and data for STEAM can be found at https://github.com/yueyu1030/STEAM.

Probabilistic Metric Learning with Adaptive Margin for Top-K Recommendation

Personalized recommender systems are playing an increasingly important role as more content and services become available and users struggle to identify what might interest them. Although matrix factorization and deep learning based methods have proved effective in user preference modeling, they violate the triangle inequality and fail to capture fine-grained preference information. To tackle this, we develop a distance-based recommendation model with several novel aspects: (i) each user and item are parameterized by Gaussian distributions to capture the learning uncertainties; (ii) an adaptive margin generation scheme is proposed to generate the margins regarding different training triplets; (iii) explicit user-user/item-item similarity modeling is incorporated in the objective function. The Wasserstein distance is employed to determine preferences because it obeys the triangle inequality and can measure the distance between probabilistic distributions. Via a comparison using five real-world datasets with state-of-the-art methods, the proposed model outperforms the best existing models by 4-22% in terms of recall@K on Top-K recommendation.
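
When users and items are represented as Gaussians, the Wasserstein distance has a closed form that makes the triangle-inequality property easy to check numerically. The sketch below assumes the 2-Wasserstein distance and diagonal covariances, neither of which is stated in the abstract; the function name is likewise hypothetical.

```python
import numpy as np

def wasserstein2_diag_gaussians(mu1, sigma1, mu2, sigma2):
    """2-Wasserstein distance between N(mu1, diag(sigma1^2)) and N(mu2, diag(sigma2^2)).

    For diagonal covariances the closed form reduces to
        W2^2 = ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2,
    where sigma1 and sigma2 are vectors of standard deviations.
    """
    mu1, sigma1 = np.asarray(mu1, dtype=float), np.asarray(sigma1, dtype=float)
    mu2, sigma2 = np.asarray(mu2, dtype=float), np.asarray(sigma2, dtype=float)
    squared = np.sum((mu1 - mu2) ** 2) + np.sum((sigma1 - sigma2) ** 2)
    return float(np.sqrt(squared))

if __name__ == "__main__":
    # Hypothetical user, item, and second item embeddings: (mean, std).
    user = ([0.0], [1.0])
    item_a = ([1.0], [2.0])
    item_b = ([3.0], [1.0])
    d_ua = wasserstein2_diag_gaussians(*user, *item_a)
    d_ab = wasserstein2_diag_gaussians(*item_a, *item_b)
    d_ub = wasserstein2_diag_gaussians(*user, *item_b)
    print(d_ub <= d_ua + d_ab)  # triangle inequality holds
```

In this diagonal case the distance is simply the Euclidean distance between the concatenated (mean, standard deviation) vectors, which is why it behaves as a proper metric, unlike a dot-product score.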

Re-identification Attack to Privacy-Preserving Data Analysis with Noisy Sample-Mean

In mining sensitive databases, access to sensitive class attributes of individual records is often prohibited by enforcing field-level security, while only aggregate class-specific statistics are allowed to be released. We consider a common privacy-preserving data analytics scenario where only a noisy sample mean of the class of interest can be queried. Such practice is widely found in medical research and business analytics settings. This paper studies the hazard of re-identification of an entire class caused by revealing a noisy sample mean of the class. With a novel formulation of the re-identification attack as a generalized positive-unlabeled learning problem, we prove that the risk function of the re-identification problem is closely related to that of learning with complete data. We demonstrate that with a one-sided noisy sample mean, an effective re-identification attack can be devised with existing PU learning algorithms. We then propose a novel algorithm, growPU, that exploits the unique property of the sample mean and consistently outperforms existing PU learning algorithms on the re-identification task. GrowPU achieves re-identification accuracy of 93.6% on the MNIST dataset and 88.1% on an online behavioral dataset with a noiseless sample mean. With noise that guarantees 0.01-differential privacy, growPU achieves 91.9% on the MNIST dataset and 84.6% on the online behavioral dataset.

BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision

We study the open-domain named entity recognition (NER) problem under distant supervision. Distant supervision, though it does not require large amounts of manual annotation, yields highly incomplete and noisy distant labels via external knowledge bases. To address this challenge, we propose a new computational framework, BOND, which leverages the power of pre-trained language models (e.g., BERT and RoBERTa) to improve the prediction performance of NER models. Specifically, we propose a two-stage training algorithm: in the first stage, we adapt the pre-trained language model to the NER tasks using the distant labels, which can significantly improve the recall and precision; in the second stage, we drop the distant labels and propose a self-training approach to further improve the model performance. Thorough experiments on 5 benchmark datasets demonstrate the superiority of BOND over existing distantly supervised NER methods. The code and distantly labeled data have been released at https://github.com/cliang1453/BOND.

Graph Structural-topic Neural Network

Graph Convolutional Networks (GCNs) have achieved tremendous success by effectively gathering local features for nodes. However, GCNs commonly focus more on node features and less on graph structures within the neighborhood, especially higher-order structural patterns, even though such local structural patterns are shown to be indicative of node properties in numerous fields. In addition, it is not just single patterns but the distribution over all these patterns that matters, because networks are complex and the neighborhood of each node consists of a mixture of various nodes and structural patterns. Correspondingly, in this paper, we propose Graph Structural topic Neural Network, abbreviated GraphSTONE, a GCN model that utilizes topic models of graphs, such that the structural topics capture indicative graph structures broadly from a probabilistic aspect rather than merely a few structures. Specifically, we build topic models upon graphs using anonymous walks and Graph Anchor LDA, an LDA variant that selects significant structural patterns first, so as to alleviate the complexity and generate structural topics efficiently. In addition, we design multi-view GCNs to unify node features and structural topic features and utilize structural topics to guide the aggregation. We evaluate our model through both quantitative and qualitative experiments, where our model exhibits promising performance, high efficiency, and clear interpretability.

Correlation Networks for Extreme Multi-label Text Classification

This paper develops the Correlation Networks (CorNet) architecture for the extreme multi-label text classification (XMTC) task, where the objective is to tag an input text sequence with the most relevant subset of labels from an extremely large label set. XMTC can be found in many real-world applications, such as document tagging and product annotation. Recently, deep learning models have achieved outstanding performances in XMTC tasks. However, these deep XMTC models ignore the useful correlation information among different labels. CorNet addresses this limitation by adding an extra CorNet module at the prediction layer of a deep model, which is able to learn label correlations, enhance raw label predictions with correlation knowledge and output augmented label predictions. We show that CorNet can be easily integrated with deep XMTC models and generalize effectively across different datasets. We further demonstrate that CorNet can bring significant improvements over the existing deep XMTC models in terms of both performance and convergence rate. The models and datasets are available at: https://github.com/XunGuangxu/CorNet.

Predicting Temporal Sets with Deep Neural Networks

Given a sequence of sets, where each set contains an arbitrary number of elements, the problem of temporal sets prediction aims to predict the elements in the subsequent set. In practice, temporal sets prediction is much more complex than predictive modelling of temporal events and time series, and is still an open problem. Many existing methods, if adapted for the problem of temporal sets prediction, usually follow a two-step strategy by first projecting temporal sets into latent representations and then learning a predictive model with the latent representations. The two-step approach often leads to information loss and unsatisfactory prediction performance. In this paper, we propose an integrated solution based on deep neural networks for temporal sets prediction. A unique perspective of our approach is to learn element relationships by constructing set-level co-occurrence graphs and then performing graph convolutions on the dynamic relationship graphs. Moreover, we design an attention-based module to adaptively learn the temporal dependency of elements and sets. Finally, we provide a gated updating mechanism to find the hidden shared patterns in different sequences and fuse both static and dynamic information to improve the prediction performance. Experiments on real-world data sets demonstrate that our approach can achieve competitive performance even with a portion of the training data and can outperform existing methods by a significant margin.

FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

Extracting structured data from HTML documents is a long-studied problem with a broad range of applications like augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like shopping and movies. Previous approaches have either required a small number of examples for each target site or relied on carefully handcrafted heuristics built over visual renderings of websites. In this paper, we present a novel two-stage neural approach, named FreeDOM, which overcomes both these limitations. The first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer range distance and semantic relatedness using a relational neural network. By combining these stages, FreeDOM is able to generalize to unseen sites after training on a small number of seed sites from that vertical without requiring expensive hand-crafted features over visual renderings of the page. Through experiments on a public dataset with 8 different verticals, we show that FreeDOM beats the previous state of the art by nearly 3.7 F1 points on average without requiring features over rendered pages or expensive hand-crafted features.

SEAL: Learning Heuristics for Community Detection with Generative Adversarial Networks

Community detection is an important task with many applications. However, there is no universal definition of communities, and a variety of algorithms have been proposed based on different assumptions. In this paper, we instead study the semi-supervised community detection problem where we are given several communities in a network as training data and aim to discover more communities. This setting makes it possible to learn concepts of communities from data without any prior knowledge. We propose the Seed Expansion with generative Adversarial Learning (SEAL), a framework for learning heuristics for community detection. SEAL contains a generative adversarial network, where the discriminator predicts whether a community is real or fake, and the generator generates communities that cheat the discriminator by implicitly fitting characteristics of real ones. The generator is a graph neural network specialized in sequential decision processes and gets trained by policy gradient. Moreover, a locator is proposed to avoid well-known free-rider effects by forming a dual learning task with the generator. Last but not least, a seed selector is utilized to provide promising seeds to the generator. We evaluate SEAL on 5 real-world networks and prove its effectiveness.

Matrix Profile XXI: A Geometric Approach to Time Series Chains Improves Robustness

Time series motifs have become a fundamental tool to characterize repeated and conserved structure in systems, such as manufacturing telemetry, economic activities, and both human physiological and cultural behaviors. Recently time series chains were introduced as a generalization of time series motifs to represent evolving patterns in time series, in order to characterize the evolution of systems. Time series chains are a very promising primitive; however, we have observed that the original definition can be brittle in the sense that a small fluctuation in time series may "cut" a chain. Furthermore, the original definition does not provide a measure of the "significance" of a chain, and therefore cannot support top-K search for chains or provide a mechanism to discard spurious chains that might be discovered when searching large datasets. Inspired by observations from dynamical systems theory, this paper introduces two novel quality metrics for time series chains, directionality and graduality, to improve robustness and to enable top-K search. With extensive empirical work we show that our proposed definition is much more robust to the vagaries of real-world datasets and allows us to find unexpected regularities in time series datasets.

Retrospective Loss: Looking Back to Improve Training of Deep Neural Networks

Deep neural networks (DNNs) are powerful learning machines that have enabled breakthroughs in several domains. In this work, we introduce a new retrospective loss to improve the training of deep neural network models by utilizing the prior experience available in past model states during training. Minimizing the retrospective loss, along with the task-specific loss, pushes the parameter state at the current training step towards the optimal parameter state while pulling it away from the parameter state at a previous training step. Although the idea is simple, we analyze the method and conduct comprehensive sets of experiments across domains (images, speech, text, and graphs) to show that the proposed loss results in improved performance across input domains, tasks, and architectures.

Average Sensitivity of Spectral Clustering

Spectral clustering is one of the most popular clustering methods for finding clusters in a graph, which has found many applications in data mining. However, the input graph in those applications may have many missing edges due to error in measurement, withholding for a privacy reason, or arbitrariness in data conversion. To make reliable and efficient decisions based on spectral clustering, we assess the stability of spectral clustering against edge perturbations in the input graph using the notion of average sensitivity, which is the expected size of the symmetric difference of the output clusters before and after we randomly remove edges. We first prove that the average sensitivity of spectral clustering is proportional to $\lambda_2/\lambda_3^2$, where $\lambda_i$ is the $i$-th smallest eigenvalue of the (normalized) Laplacian. We also prove an analogous bound for k-way spectral clustering, which partitions the graph into k clusters. Then, we empirically confirm our theoretical bounds by conducting experiments on synthetic and real networks. Our results suggest that spectral clustering is stable against edge perturbations when there is a cluster structure in the input graph.
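The eigenvalue ratio named in the abstract can be computed directly for small graphs. The sketch below is only an illustration: it assumes the symmetric normalized Laplacian $I - D^{-1/2} A D^{-1/2}$ and a graph without isolated vertices, and it returns the bare ratio without any of the constants or problem-size factors that the paper's bound may include. The function names are hypothetical.

```python
import numpy as np


def normalized_laplacian(adj):
    """Symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2}.

    Assumes an undirected graph with no isolated vertices.
    """
    adj = np.asarray(adj, dtype=float)
    inv_sqrt_deg = 1.0 / np.sqrt(adj.sum(axis=1))
    return np.eye(len(adj)) - inv_sqrt_deg[:, None] * adj * inv_sqrt_deg[None, :]


def sensitivity_ratio(adj):
    """Return lambda_2 / lambda_3^2 for the normalized Laplacian."""
    eigenvalues = np.linalg.eigvalsh(normalized_laplacian(adj))  # ascending order
    lam2, lam3 = eigenvalues[1], eigenvalues[2]
    return lam2 / lam3**2


# Two triangles joined by a single bridge edge: a clear two-cluster structure.
barbell = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    barbell[i, j] = barbell[j, i] = 1.0

# Complete graph on six vertices: no cluster structure at all.
complete = np.ones((6, 6)) - np.eye(6)

ratio_clustered = sensitivity_ratio(barbell)
ratio_unclustered = sensitivity_ratio(complete)
```

On these two toy graphs the clustered graph yields the smaller ratio, which is in line with the abstract's conclusion that spectral clustering is stable when the input graph has cluster structure.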

Semi-Supervised Multi-Label Learning from Crowds via Deep Sequential Generative Model

Multi-label classification (MLC) is pervasive in real-world applications. Conventional MLC algorithms assume that enough ground truth labels are available for training a classifier. In reality, however, obtaining ground truth labels is expensive and time-consuming. In the field of data mining, it is more efficient to use crowdsourcing for label collection. In this setting, an MLC algorithm needs to deal with the noisiness of the crowdsourced labels as well as the remaining massive unlabeled data. In this paper, we propose a deep generative model to describe the label generation process for this semi-supervised multi-label learning problem. Although deep generative models are widely used for MLC problems, no previous work could address the noisy crowdsourced multi-labels and unlabeled data simultaneously. To address this challenging problem, our novel generative model incorporates latent variables to describe the labeled/unlabeled data as well as the labeling process of crowdsourcing. We introduce an efficient sequential inference model to approximate the model posterior and infer the ground truth labels. Our experimental results on datasets of various scales demonstrate the effectiveness of our proposed model. It performs favorably against four state-of-the-art deep generative models.

GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training

Graph representation learning has emerged as a powerful technique for addressing real-world problems. Various downstream graph learning tasks have benefited from its recent developments, such as node classification, similarity search, and graph classification. However, prior work on graph representation learning focuses on domain-specific problems and trains a dedicated model for each graph dataset, which is usually non-transferable to out-of-domain data. Inspired by the recent advances in pre-training from natural language processing and computer vision, we design Graph Contrastive Coding (GCC), a self-supervised graph neural network pre-training framework, to capture the universal network topological properties across multiple networks. We design GCC's pre-training task as subgraph instance discrimination in and across networks and leverage contrastive learning to empower graph neural networks to learn the intrinsic and transferable structural representations. We conduct extensive experiments on three graph learning tasks and ten graph datasets. The results show that GCC pre-trained on a collection of diverse datasets can achieve performance competitive with or better than that of its task-specific and trained-from-scratch counterparts. This suggests that the pre-training and fine-tuning paradigm presents great potential for graph representation learning.

HGCN: A Heterogeneous Graph Convolutional Network-Based Deep Learning Model Toward Collective Classification

Collective classification, as an important technique to study networked data, aims to exploit the label autocorrelation for a group of inter-connected entities with complex dependencies. With the emergence of various heterogeneous information networks (HINs), collective classification now faces several severe challenges stemming from the heterogeneity of HINs, such as complex relational hierarchy, potentially incompatible semantics, and node-context relational semantics. To address these challenges, we propose a novel heterogeneous graph convolutional network-based deep learning model, called HGCN, to collectively categorize the entities in HINs. Our work involves three primary contributions: i) HGCN not only learns the latent relations from the relation-sophisticated HINs via multi-layer heterogeneous convolutions, but also captures the semantic incompatibility among relations with properly learned edge-level filter parameters; ii) to preserve the fine-grained relational semantics of different-type nodes, we propose a heterogeneous graph convolution that operates directly on the original HINs without first transforming the network from heterogeneous to homogeneous; iii) we perform extensive experiments using four real-world datasets to validate HGCN, and the multi-facet results show that it can significantly improve the performance of collective classification compared with the state-of-the-art baseline methods.

Handling Information Loss of Graph Neural Networks for Session-based Recommendation

Recently, graph neural networks (GNNs) have gained increasing popularity due to their convincing performance in various applications. Many previous studies also attempted to apply GNNs to session-based recommendation and obtained promising results. However, we observe two information loss problems in these GNN-based methods for session-based recommendation, namely the lossy session encoding problem and the ineffective long-range dependency capturing problem. In the first, some sequential information about item transitions is ignored because of the lossy encoding from sessions to graphs and the permutation-invariant aggregation during message passing. In the second, some long-range dependencies within sessions cannot be captured due to the limited number of layers. To solve the first problem, we propose a lossless encoding scheme and an edge-order preserving aggregation layer based on GRU that is specifically designed to process the losslessly encoded graphs. To solve the second problem, we propose a shortcut graph attention layer that effectively captures long-range dependencies by propagating information along shortcut connections. By combining the two kinds of layers, we are able to build a model that does not have the information loss problems and outperforms the state-of-the-art models on three public datasets.

Ultrafast Local Outlier Detection from a Data Stream with Stationary Region Skipping

Real-time outlier detection from a data stream is an increasingly important problem, especially as sensor-generated data streams abound in many applications owing to the prevalence of IoT and emergence of digital twins. Several density-based approaches have been proposed to address this problem, but arguably none of them is fast enough to meet the performance demand of real applications. This paper is founded upon a novel observation that, in many regions of the data space, data distributions hardly change across window slides. We propose a new algorithm, STARE, which identifies local regions in which data distributions hardly change and then skips updating the densities in those regions, a notion called stationary region skipping. Two techniques, data distribution approximation and cumulative net-change-based skip, are employed to efficiently and effectively implement the notion. Extensive experiments using synthetic and real data streams as well as a case study show that STARE is several orders of magnitude faster than the existing algorithms while achieving comparable or higher accuracy.

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Pre-training techniques have proven successful in a variety of NLP tasks in recent years. Despite the widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation, while neglecting layout and style information that is vital for document image understanding. In this paper, we propose LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42). The code and pre-trained LayoutLM models are publicly available at https://aka.ms/layoutlm.

Block Model Guided Unsupervised Feature Selection

Feature selection is a core area of data mining with a recent innovation of graph-driven unsupervised feature selection for linked data. In this setting we have a dataset $Y$ consisting of $n$ instances each with $m$ features and a corresponding $n$-node graph (whose adjacency matrix is $A$) with an edge indicating that the two instances are similar. Existing efforts for unsupervised feature selection on attributed networks have explored either directly regenerating the links by solving for $f$ such that $f(y_i, y_j) \approx A_{i,j}$ or finding community structure in $A$ and using the features in $Y$ to predict these communities. However, graph-driven unsupervised feature selection remains an understudied area with respect to exploring more complex guidance. Here we take the novel approach of first building a block model on the graph and then using the block model for feature selection. That is, we discover $F M F^T \approx A$ and then find a subset of features $S$ that induces another graph to preserve both $F$ and $M$. We call our approach Block Model Guided Unsupervised Feature Selection (BMGUFS). Experimental results show that our method outperforms the state of the art on several real-world public datasets in finding high-quality features for clustering.
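To make the block model factorization of the adjacency matrix A into a membership matrix F and an image matrix M concrete, here is a minimal sketch. It assumes hard (one-hot) block memberships with no empty blocks, and it fits M by averaging edge weights within each pair of blocks, which is the least-squares choice for a fixed one-hot F. How BMGUFS actually learns F and M is not stated in the abstract, so the function names and fitting procedure are illustrative assumptions.

```python
import numpy as np


def block_model(adj, labels, n_blocks):
    """Build a one-hot membership matrix F and block image matrix M.

    M[k, l] is the mean of adj over rows in block k and columns in block l.
    Assumes every block contains at least one node.
    """
    adj = np.asarray(adj, dtype=float)
    F = np.eye(n_blocks)[np.asarray(labels)]  # shape (n, n_blocks)
    sizes = F.sum(axis=0)
    M = (F.T @ adj @ F) / np.outer(sizes, sizes)
    return F, M


def reconstruction_error(adj, F, M):
    """Frobenius norm of A - F M F^T."""
    return float(np.linalg.norm(np.asarray(adj, dtype=float) - F @ M @ F.T))


# Toy graph with two perfectly separated blocks (self-loops included).
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
F, M = block_model(A, labels=[0, 0, 1, 1], n_blocks=2)
error = reconstruction_error(A, F, M)
```

A feature subset could then be scored by how well a graph induced from those features reproduces the same F and M, which is the guidance idea the abstract describes.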

Data Compression as a Comprehensive Framework for Graph Drawing and Representation Learning

Embedding a graph into feature space is a promising approach to understand its structure. Embedding into 2D or 3D space enables visualization; representation in higher-dimensional vector space (typically >100D) enables the application of data mining techniques. For the success of knowledge discovery it is essential that the distances between the embedded vertices truly reflect the structure of the graph. Our fundamental idea is to compress the adjacency matrix by predicting the existence of an edge from the Euclidean distance between the corresponding vertices in the embedding, and to use the achieved compression as a quality measure for the embedding. We call this quality measure Predictive Entropy (PE). PE uses a sigmoid function to define the probability which is monotonically decreasing with the Euclidean distance. We use this sigmoid probability to compress the adjacency matrix of the graph by an entropy coding. While PE could be used to assess the result of any graph drawing or representation learning method we particularly use it as objective function in our new method GEMPE (Graph Embedding by Minimizing the Predictive Entropy). We demonstrate in our experiments that GEMPE clearly outperforms comparison methods with respect to quality of the visual result, clustering and node-labeling accuracy on the discovered coordinates.
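The idea of scoring an embedding by how cheaply it lets us encode the adjacency matrix can be sketched as follows. This is an illustration under stated assumptions, not the paper's definition: the sigmoid is parameterized here by a hypothetical `slope` and `offset`, and the coding cost is taken as the usual cross-entropy in bits over all unordered vertex pairs.

```python
import numpy as np


def predictive_entropy_bits(coords, adj, slope=1.0, offset=1.0):
    """Bits needed to encode adj given edge probabilities from distances.

    The edge probability is a sigmoid that decreases with Euclidean
    distance: p = 1 / (1 + exp(slope * (distance - offset))), slope > 0.
    """
    coords = np.asarray(coords, dtype=float)
    adj = np.asarray(adj)
    rows, cols = np.triu_indices(len(adj), k=1)  # each unordered pair once
    dist = np.linalg.norm(coords[rows] - coords[cols], axis=1)
    prob = 1.0 / (1.0 + np.exp(slope * (dist - offset)))
    prob = np.clip(prob, 1e-12, 1.0 - 1e-12)  # avoid log(0)
    edges = adj[rows, cols]
    cost = -(edges * np.log2(prob) + (1 - edges) * np.log2(1.0 - prob))
    return float(cost.sum())


# Path graph 0 - 1 - 2, embedded once faithfully and once with nodes swapped.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
good_layout = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
bad_layout = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]]

good_cost = predictive_entropy_bits(good_layout, path, slope=4.0, offset=1.5)
bad_cost = predictive_entropy_bits(bad_layout, path, slope=4.0, offset=1.5)
```

The layout that places adjacent vertices close together needs fewer bits, so minimizing this quantity over the coordinates favors embeddings whose distances reflect the graph structure.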

Joint Policy-Value Learning for Recommendation

Conventional approaches to recommendation often do not explicitly take into account information on previously shown recommendations and their recorded responses. One reason is that, since we do not know the outcome of actions the system did not take, learning directly from such logs is not a straightforward task. Several methods for off-policy or counterfactual learning have been proposed in recent years, but their efficacy for the recommendation task remains understudied. Due to the limitations of offline datasets and the lack of access of most academic researchers to online experiments, this is a non-trivial task. Simulation environments can provide a reproducible solution to this problem.

In this work, we conduct the first broad empirical study of counterfactual learning methods for recommendation, in a simulated environment. We consider various different policy-based methods that make use of the Inverse Propensity Score (IPS) to perform Counterfactual Risk Minimisation (CRM), as well as value-based methods based on Maximum Likelihood Estimation (MLE). We highlight how existing off-policy learning methods fail due to stochastic and sparse rewards, and show how a logarithmic variant of the traditional IPS estimator can solve these issues, whilst convexifying the objective and thus facilitating its optimisation. Additionally, under certain assumptions the value- and policy-based methods have an identical parameterisation, allowing us to propose a new model that combines both the MLE and CRM objectives. Extensive experiments show that this "Dual Bandit" approach achieves state-of-the-art performance in a wide range of scenarios, for varying logging policies, action spaces and training sample sizes.
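For readers unfamiliar with the Inverse Propensity Score estimator referred to above, the following minimal sketch shows the standard (non-logarithmic) form: each logged reward is reweighted by the ratio of the target policy's probability to the logging policy's probability for the action that was actually shown. The logarithmic variant and the "Dual Bandit" objective proposed in the paper are not reproduced here, and the function and argument names are illustrative.

```python
import numpy as np


def ips_estimate(rewards, target_probs, logging_probs):
    """Standard IPS estimate of a target policy's expected reward.

    rewards[i]        observed reward for the i-th logged action
    target_probs[i]   probability the target policy assigns to that action
    logging_probs[i]  probability the logging policy assigned to it (> 0)
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_probs, dtype=float) / np.asarray(logging_probs, dtype=float)
    return float(np.mean(rewards * weights))


# Hypothetical log of four impressions with binary click rewards.
clicks = [1, 0, 1, 0]
logging = [0.5, 0.5, 0.5, 0.5]
target = [0.8, 0.2, 0.8, 0.2]

value = ips_estimate(clicks, target, logging)
```

With sparse rewards most terms in the mean are zero while a few carry large weights, which is one way to see why the abstract reports that plain IPS-based learning struggles in this setting.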

FedFast: Going Beyond Average for Faster Training of Federated Recommender Systems

Federated learning (FL) is quickly becoming the de facto standard for the distributed training of deep recommendation models, using on-device user data and reducing server costs. In a typical FL process, a central server tasks end-users to train a shared recommendation model using their local data. The local models are trained over several rounds on the users' devices and the server combines them into a global model, which is sent to the devices for the purpose of providing recommendations. Standard FL approaches use randomly selected users for training at each round, and simply average their local models to compute the global model. The resulting federated recommendation models require significant client effort to train and many communication rounds before they converge to a satisfactory accuracy. Users are left with poor quality recommendations until the late stages of training. We present a novel technique, FedFast, to accelerate distributed learning which achieves good accuracy for all users very early in the training process. We achieve this by sampling from a diverse set of participating clients in each training round and applying an active aggregation method that propagates the updated model to the other clients. Consequently, with FedFast the users benefit from far lower communication costs and more accurate models that can be consumed anytime during the training process even at the very early stages. We demonstrate the efficacy of our approach across a variety of benchmark datasets and in comparison to state-of-the-art recommendation techniques.
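The baseline that FedFast improves upon, random client selection followed by simple averaging of local models, can be sketched in a few lines. This illustrates only the standard procedure described in the abstract, not FedFast's sampling or active aggregation; representing a model as a dictionary of NumPy arrays and the function names are assumptions made for the example.

```python
import random

import numpy as np


def sample_clients(client_ids, n_selected, rng):
    """Pick clients uniformly at random for one training round."""
    return rng.sample(list(client_ids), n_selected)


def average_models(local_models):
    """Average local models parameter by parameter (unweighted)."""
    names = local_models[0].keys()
    return {name: np.mean([model[name] for model in local_models], axis=0)
            for name in names}


# Two hypothetical local models returned by clients after a round.
local_a = {"item_embedding": np.array([1.0, 3.0]), "bias": np.array([0.0])}
local_b = {"item_embedding": np.array([3.0, 5.0]), "bias": np.array([2.0])}

global_model = average_models([local_a, local_b])
selected = sample_clients(range(10), 3, random.Random(0))
```

Practical systems often weight each client by its number of training examples; the unweighted mean is used here because the abstract describes the standard approach as simply averaging.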

AM-GCN: Adaptive Multi-channel Graph Convolutional Networks

Graph Convolutional Networks (GCNs) have gained great popularity in tackling various analytics tasks on graph and network data. However, some recent studies raise concerns about whether GCNs can optimally integrate node features and topological structures in a complex graph with rich information. In this paper, we first present an experimental investigation. Surprisingly, our experimental results clearly show that the capability of the state-of-the-art GCNs in fusing node features and topological structures is far from optimal or even satisfactory. This weakness may severely hinder the capability of GCNs in some classification tasks, since GCNs may not be able to adaptively learn some deep correlation information between topological structures and node features. Can we remedy the weakness and design a new type of GCN that retains the advantages of the state-of-the-art GCNs and, at the same time, substantially enhances the capability of fusing topological structures and node features? We tackle the challenge and propose adaptive multi-channel graph convolutional networks for semi-supervised classification (AM-GCN). The central idea is that we extract the specific and common embeddings from node features, topological structures, and their combinations simultaneously, and use the attention mechanism to learn adaptive importance weights of the embeddings. Our extensive experiments on benchmark data sets clearly show that AM-GCN extracts the most correlated information from both node features and topological structures, and improves the classification accuracy by a clear margin.

Discovering Approximate Functional Dependencies using Smoothed Mutual Information

We consider the task of discovering the top-K reliable approximate functional dependencies X -> Y from high dimensional data. While naively maximizing mutual information involving high dimensional entropies over empirical data is subject to false discoveries, correcting the empirical estimator against data sparsity can lead to efficient exact algorithms for robust dependency discovery. Previous approaches focused on correcting by subtracting expected values of different null hypothesis models. In this paper, we consider a different correction strategy and counter data sparsity using uniform priors and smoothing techniques, which leads to an efficient and robust estimation process. In addition, we derive an admissible and tight bounding function for the smoothed estimator that allows us to efficiently solve, via branch-and-bound, the hard search problem for the top-K dependencies. Our experiments show that our approach is much faster than previous proposals, and leads to the discovery of sparse and informative functional dependencies.

Competitive Analysis for Points of Interest

The competitive relationship of Points of Interest (POIs) refers to the degree of competition between two POIs for business opportunities from third parties in an urban area. Existing studies for competitive analysis usually focus on mining competitive relationships of entities, such as companies or products, from textual data. However, few studies focus on competitive analysis for POIs. Indeed, the growing availability of user behavior data about POIs, such as POI reviews and human mobility data, enables a new paradigm for understanding the competitive relationships among POIs. To this end, in this paper, we study how to predict the POI competitive relationship. Along this line, the first challenge is how to integrate heterogeneous user behavior data with the spatial features of POIs. As a solution, we first build a heterogeneous POI information network (HPIN) from POI reviews and map search data. Then, we develop a graph neural network-based deep learning framework, named DeepR, for POI competitive relationship prediction based on HPIN. Specifically, DeepR contains two components: a spatial adaptive graph neural network (SA-GNN) and a POI pairwise knowledge extraction learning (PKE) model. The SA-GNN is a novel GNN architecture that incorporates POIs' spatial information and location distribution through a specially designed spatially oriented aggregation layer and a spatial-dependency attentive propagation mechanism. In addition, PKE is devised to distill the POI pairwise knowledge in HPIN that is useful for relationship prediction into condensate vectors with relational graph convolution and cross attention. Finally, extensive experiments on two real-world datasets demonstrate the effectiveness of our method.

HOPS: Probabilistic Subtree Mining for Small and Large Graphs

Frequent subgraph mining, i.e., the identification of relevant patterns in graph databases, is a well-known data mining problem with high practical relevance, since next to summarizing the data, the resulting patterns can also be used to define powerful domain-specific similarity functions for prediction. In recent years, significant progress has been made towards subgraph mining algorithms that scale to complex graphs by focusing on tree patterns and probabilistically allowing a small amount of incompleteness in the result. Nonetheless, the complexity of the pattern matching component used for deciding subtree isomorphism on arbitrary graphs has significantly limited the scalability of existing approaches. In this paper, we adapt sampling techniques from mathematical combinatorics to the problem of probabilistic subtree mining in arbitrary databases of many small to medium-size graphs or a single large graph. By restricting ourselves to tree patterns, we provide an algorithm that approximately counts or decides subtree isomorphism for arbitrary transaction graphs in sub-linear time with one-sided error. Our empirical evaluation on a range of benchmark graph datasets shows that the novel algorithm substantially outperforms state-of-the-art approaches both in the task of approximate counting of embeddings in single large graphs and in probabilistic frequent subtree mining in large databases of small to medium-sized graphs.

The NodeHopper: Enabling Low Latency Ranking with Constraints via a Fast Dual Solver

Modern recommender systems need to deal with multiple objectives like balancing user engagement with recommending diverse and fresh content. An appealing way to optimally trade these off is by imposing constraints on the ranking according to which items are presented to a user. This results in a constrained ranking optimization problem that can be solved as a linear program (LP). However, off-the-shelf LP solvers are unable to meet the severe latency constraints in systems that serve live traffic. To address this challenge, we exploit the structure of the dual optimization problem to develop a fast solver. We analyze theoretical properties of our solver and show experimentally that it is able to solve constrained ranking problems on synthetic and real-world recommendation datasets an order of magnitude faster than off-the-shelf solvers, thereby enabling their deployment under severe latency constraints.

HGMF: Heterogeneous Graph-based Fusion for Multimodal Data with Incompleteness

With the advances in data collection techniques, large amounts of multimodal data collected from multiple sources are becoming available. Such multimodal data can provide complementary information that can reveal fundamental characteristics of real-world subjects. Thus, multimodal machine learning has become an active research area. Much work has been done to exploit multimodal interactions and integrate multi-source information. However, multimodal data in the real world usually comes with missing modalities due to various reasons, such as sensor damage, data corruption, and human mistakes in recording. Effectively integrating and analyzing multimodal data with incompleteness remains a challenging problem. We propose a Heterogeneous Graph-based Multimodal Fusion (HGMF) approach to enable multimodal fusion of incomplete data within a heterogeneous graph structure. The proposed approach develops a unique strategy for learning on incomplete multimodal data without data deletion or data imputation. More specifically, we construct a heterogeneous hypernode graph to model the multimodal data having different combinations of missing modalities, and then we formulate a graph neural network based transductive learning framework to project the heterogeneous incomplete data onto a unified embedding space, fusing the modalities along the way. The learning framework captures modality interactions from available data, and leverages the relationships between different incompleteness patterns. Our experimental results demonstrate that the proposed method outperforms existing graph-based as well as non-graph-based baselines on three different datasets.

ST-SiameseNet: Spatio-Temporal Siamese Networks for Human Mobility Signature Identification

Given the historical movement trajectories of a set of individual human agents (e.g., pedestrians, taxi drivers) and a set of new trajectories claimed to be generated by a specific agent, the Human Mobility Signature Identification (HuMID) problem aims at validating if the incoming trajectories were indeed generated by the claimed agent. This problem is important in many real-world applications such as driver verification in ride-sharing services, risk analysis for auto insurance companies, and criminal identification. Prior work on identifying human mobility behaviors requires additional data from other sources besides the trajectories, e.g., sensor readings in the vehicle for driving behavior identification. However, these data might not be universally available and are costly to obtain. To deal with this challenge, in this work, we make the first attempt to match identities of human agents only from the observed location trajectory data by proposing a novel and efficient framework named Spatio-temporal Siamese Networks (ST-SiameseNet). For each human agent, we extract a set of profile and online features from his/her trajectories. We train ST-SiameseNet to predict the mobility signature similarity between each pair of agents, where each agent is represented by his/her trajectories and the extracted features. Experimental results on a real-world taxi trajectory dataset show that our proposed ST-SiameseNet can achieve an $F_1$ score of $0.8508$, which significantly outperforms the state-of-the-art techniques.

A Novel Deep Learning Model by Stacking Conditional Restricted Boltzmann Machine and Deep Neural Network

A real-world system often exhibits complex dynamics arising from interaction among its subunits. In machine learning and data mining, these interactions are usually formulated as dependency and correlation among system variables. Just as convolutional neural networks deal with spatially correlated features and recurrent neural networks with temporally correlated features, in this paper we present a novel deep learning model to tackle functionally interactive features by stacking a Conditional Restricted Boltzmann Machine and a Deep Neural Network (CRBM-DNN). Variables with their dependency relationships are organized into a bipartite graph, which is further converted into a Restricted Boltzmann Machine conditioned by domain knowledge. We integrate this CRBM and a DNN into one deep learning model constrained by one overall cost function. CRBM-DNN can solve both supervised and unsupervised learning problems. Compared to a regular neural network of the same size, CRBM-DNN has fewer parameters and therefore requires fewer training samples. We perform extensive comparative studies with a large number of supervised learning and unsupervised learning methods using several challenging real-world datasets, and achieve significantly superior performance.

InfiniteWalk: Deep Network Embeddings as Laplacian Embeddings with a Nonlinearity

The skip-gram model for learning word embeddings (Mikolov et al. 2013) has been widely popular, and DeepWalk (Perozzi et al. 2014), among other methods, has extended the model to learning node representations from networks. Recent work of Qiu et al. (2018) provides a closed-form expression for the DeepWalk objective, obviating the need for sampling for small datasets and improving accuracy. In these methods, the "window size" T within which words or nodes are considered to co-occur is a key hyperparameter. We study the objective in the limit as T goes to infinity, which allows us to simplify the expression of Qiu et al. We prove that this limiting objective corresponds to factoring a simple transformation of the pseudoinverse of the graph Laplacian, linking DeepWalk to extensive prior work in spectral graph embeddings. Further, we show that by applying a simple nonlinear entrywise transformation to this pseudoinverse, we recover a good approximation of the finite-T objective and embeddings that are competitive with those from DeepWalk and other skip-gram methods in multi-label classification. Surprisingly, we find that even simple binary thresholding of the Laplacian pseudoinverse is often competitive, suggesting that the core advancement of recent methods is a nonlinearity on top of the classical spectral embedding approach.
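The binary-thresholding variant mentioned at the end of the abstract can be sketched as: compute the Laplacian pseudoinverse, threshold it entrywise, and factor the result. Several details below are assumptions made for illustration and are not taken from the paper: the unnormalized Laplacian D - A is used, the threshold defaults to zero, and the factorization is a plain truncated SVD. The function name is hypothetical.

```python
import numpy as np


def thresholded_pinv_embedding(adj, dim, threshold=0.0):
    """Embed nodes by factoring a binarized Laplacian pseudoinverse.

    Steps: L = D - A, pseudoinverse of L, entrywise threshold to {0, 1},
    then a rank-`dim` factor from the singular value decomposition.
    """
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj
    pseudoinverse = np.linalg.pinv(laplacian)
    binary = (pseudoinverse > threshold).astype(float)
    left, singular_values, _ = np.linalg.svd(binary)
    return left[:, :dim] * np.sqrt(singular_values[:dim])


# Two triangles joined by one edge, used as a small test graph.
graph = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    graph[i, j] = graph[j, i] = 1.0

embedding = thresholded_pinv_embedding(graph, dim=2)
```

The resulting rows can be fed to any downstream classifier; the threshold is the single nonlinearity applied on top of an otherwise classical spectral quantity.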

xGAIL: Explainable Generative Adversarial Imitation Learning for Explainable Human Decision Analysis

To make daily decisions, human agents devise their own "strategies" governing their mobility dynamics (e.g., taxi drivers have preferred working regions and times, and urban commuters have preferred routes and transit modes). Recent research such as generative adversarial imitation learning (GAIL) demonstrates successes in learning human decision-making strategies from their behavior data using deep neural networks (DNNs), which can accurately mimic how humans behave in various scenarios, e.g., playing video games. However, such DNN-based models are "black box" models in nature, making it hard to explain what knowledge the models have learned from humans and how the models make such decisions, a question that has not been addressed in the imitation learning literature. This paper addresses this research gap by proposing xGAIL, the first explainable generative adversarial imitation learning framework. The proposed xGAIL framework consists of two novel components, Spatial Activation Maximization (SpatialAM) and Spatial Randomized Input Sampling Explanation (SpatialRISE), which extract both global and local knowledge from a well-trained GAIL model that explains how a human agent makes decisions. In particular, we take taxi drivers' passenger-seeking strategy as an example to validate the effectiveness of the proposed xGAIL framework. Our analysis on large-scale real-world taxi trajectory data shows promising results from two aspects: i) global explainable knowledge of what nearby traffic condition impels a taxi driver to choose a particular direction to find the next passenger, and ii) local explainable knowledge of what key (sometimes hidden) factors a taxi driver considers when making a particular decision.

Catalysis Clustering with GAN by Incorporating Domain Knowledge

Clustering is an important unsupervised learning method with serious challenges when data is sparse and high-dimensional. Generated clusters are often evaluated with general measures, which may not be meaningful or useful for practical applications and domains. Using a distance metric, a clustering algorithm searches through the data space, groups close items into one cluster, and assigns far away samples to different clusters. In many real-world applications, the number of dimensions is high and data space becomes very sparse. Selection of a suitable distance metric is very difficult and becomes even harder when categorical data is involved. Moreover, existing distance metrics are mostly generic, and clusters created based on them will not necessarily make sense to domain-specific applications. One option to address these challenges is to integrate domain-defined rules and guidelines into the clustering process. In this work we propose a GAN-based approach called Catalysis Clustering to incorporate domain knowledge into the clustering process. With GANs we generate catalysts, which are special synthetic points drawn from the original data distribution and verified to improve clustering quality when measured by a domain-specific metric. We then perform clustering analysis using both catalysts and real data. Final clusters are produced after catalyst points are removed. Experiments on two challenging real-world datasets clearly show that our approach is effective and can generate clusters that are meaningful and useful for real-world applications.

Prediction and Profiling of Audience Competition for Online Television Series

Understanding the target audience for popular television series is valuable for online video platforms to manage advertising sales, purchase video copyrights, and compete with other video service platforms. Existing studies in this domain generally focus on using data mining and machine learning techniques to recommend television series to individual users or predict the popularity of television series. Knowing only the popularity of television series may, however, limit our ability to answer more in-depth questions and develop more intelligent applications. In this paper, we develop a data-driven framework to model and predict audience competition patterns for popular online television series. Specifically, we first construct a sequence of dynamic competition networks of television series by mining the detailed viewership records. Then, we design the Dynamic Deep Network Factorization (DDNF), a hybrid modeling framework for predicting the future competition networks. Our framework adopts the deep neural network (DNN) and the knowledge-base (KB) embedding to incorporate static features, and integrates the Long Short-Term Memory (LSTM) network to learn dynamic features of the television series. Finally, extensive experiments on real-world data sets validate the effectiveness of our approach compared with state-of-the-art baselines in predicting the audience competition for existing and new television series.

Multi-Class Data Description for Out-of-distribution Detection

The capability of reliably detecting out-of-distribution samples is one of the key factors in deploying a good classifier, as the test distribution often does not match the training distribution in real-world applications. In this work, we present a deep multi-class data description, termed Deep-MCDD, which is effective at detecting out-of-distribution (OOD) samples as well as classifying in-distribution (ID) samples. Unlike the softmax classifier, which only focuses on the linear decision boundary partitioning its latent space into multiple regions, Deep-MCDD aims to find a spherical decision boundary for each class that determines whether a test sample belongs to the class or not. By integrating the concept of Gaussian discriminant analysis into deep neural networks, we propose a deep learning objective to learn class-conditional distributions that are explicitly modeled as separable Gaussian distributions. We can thereby define the confidence score by the distance of a test sample from each class-conditional distribution, and utilize it for identifying OOD samples. Our empirical evaluation on multi-class tabular and image datasets demonstrates that Deep-MCDD achieves the best performance in distinguishing OOD samples while attaining classification accuracy as high as that of competing methods.
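The distance-based confidence score described above can be illustrated with a small sketch. It assumes isotropic Gaussians with a known center and scale per class and works directly on given latent vectors; the function names, the threshold, and the example numbers are hypothetical, and the sketch does not reproduce the paper's training objective.

```python
import math


def class_distance(z, center, sigma):
    """Negative log-density of an isotropic Gaussian at z, up to a constant."""
    dim = len(z)
    squared = sum((a - b) ** 2 for a, b in zip(z, center))
    return squared / (2.0 * sigma ** 2) + dim * math.log(sigma)


def classify_with_ood(z, centers, sigmas, threshold):
    """Return (predicted class, confidence, is_ood) for latent vector z.

    The threshold is an assumed, user-chosen cutoff on the smallest
    class distance; samples farther than that from every class are
    flagged as out-of-distribution.
    """
    distances = [class_distance(z, c, s) for c, s in zip(centers, sigmas)]
    best = min(range(len(distances)), key=distances.__getitem__)
    confidence = -distances[best]
    return best, confidence, distances[best] > threshold


# Two hypothetical classes in a 2-D latent space.
centers = [(0.0, 0.0), (5.0, 5.0)]
sigmas = [1.0, 1.0]
near = classify_with_ood((0.2, -0.1), centers, sigmas, threshold=4.5)
far = classify_with_ood((20.0, -20.0), centers, sigmas, threshold=4.5)
```

A sample close to a class center is assigned to that class with high confidence, while a sample far from every center is flagged as OOD regardless of which class happens to be nearest.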

In and Out: Optimizing Overall Interaction in Probabilistic Graphs under Clustering Constraints

We study two novel clustering problems in which the pairwise interactions between entities are characterized by probability distributions and conditioned by external factors within the environment where the entities interact. This covers any scenario where a set of actions can alter the entities' interaction behavior. In particular, we consider the case where the interaction conditioning factors can be modeled as cluster memberships of entities in a graph and the goal is to partition a set of entities so as to maximize the overall vertex interactions or, equivalently, minimize the loss of interactions in the graph. We show that both problems are NP-hard and that they are equivalent in terms of optimality. However, we focus on the minimization formulation, as it makes it possible to devise both practical and efficient approximation algorithms and heuristics. Experimental evaluation of our algorithms, on both synthetic and real network datasets, provides evidence of their meaningfulness as well as their superiority over competing methods, in terms of both effectiveness and efficiency.

Recurrent Halting Chain for Early Multi-label Classification

Early multi-label classification of time series, the assignment of a label set to a time series before the series is entirely observed, is critical for time-sensitive domains such as healthcare. In such cases, waiting too long to classify can render predictions useless, regardless of their accuracy, while predicting prematurely can result in potentially costly erroneous results. When predicting multiple labels (for example, types of infections), dependencies between labels can be learned and leveraged to improve overall accuracy. Together, reliably predicting the correct label set of a time series while observing as few timesteps as possible is challenging because these goals are contradictory in that fewer timesteps often means worse accuracy. To achieve early yet sufficiently accurate predictions, correlations between labels must be accounted for since direct evidence of some labels may only appear late in the series. We design an effective solution to this open problem, the Recurrent Halting Chain (RHC), that for the first time integrates key innovations in both Early and Multi-label Classification into one multi-objective model. RHC uses a recurrent neural network to jointly model raw time series as well as correlations between labels, resulting in a novel order-free classifier chain that tackles this time-sensitive multi-label learning task. Further, RHC employs a reinforcement learning-based halting network to decide at each timestep which, if any, classes should be predicted, learning to build the label set over time. Using two real-world time-sensitive datasets and popular multi-label metrics, we show that RHC outperforms recent alternatives by predicting more-accurate label sets earlier.

Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks

Sampling methods (e.g., node-wise, layer-wise, or subgraph) have become an indispensable strategy to speed up the training of large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on the graph structural information and ignore the dynamicity of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of empirical risk, the variance of any sampling method can be decomposed into embedding approximation variance in the forward stage and stochastic gradient variance in the backward stage, which necessitates mitigating both types of variance to obtain a faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and better generalization than existing methods.

Discovering Functional Dependencies from Mixed-Type Data

Given complex data collections, practitioners can perform non-parametric functional dependency discovery (FDD) to uncover relationships between variables that were previously unknown. However, known FDD methods are applicable to nominal data, and in practice non-nominal variables are discretized, e.g., in a pre-processing step. This is problematic because, as soon as a mix of discrete and continuous variables is involved, the interaction of discretization with the various dependency measures from the literature is poorly understood. In particular, it is unclear whether a given discretization method even leads to a consistent dependency estimate. In this paper, we analyze these fundamental questions and derive formal criteria as to when a discretization process applied to a mixed set of random variables leads to consistent estimates of mutual information. With these insights, we derive an estimator framework applicable to any task that involves estimating mutual information from multivariate and mixed-type data. Finally, we use this framework to extend a previously proposed FDD approach for reliable dependencies. Experimental evaluation shows that the derived reliable estimator is both computationally and statistically efficient, and leads to effective FDD algorithms for mixed-type data.

Attackability Characterization of Adversarial Evasion Attack on Discrete Data

Evasion attack on discrete data is a challenging yet practically interesting research topic. It is intrinsically an NP-hard combinatorial optimization problem. Characterizing the conditions that guarantee the solvability of an evasion attack task thus becomes key to understanding the adversarial threat. Our study is inspired by weak submodularity theory. We characterize the attackability of a targeted classifier on discrete data in evasion attack by bridging the attackability measurement and the regularity of the targeted classifier. Based on our attackability analysis, we propose a computationally efficient orthogonal matching pursuit-guided attack method for evasion attack on discrete data. It offers provable computational efficiency and attack performance. Substantial experimental results on real-world datasets validate the proposed attackability conditions and the effectiveness of the proposed attack method.

The Spectral Zoo of Networks: Embedding and Visualizing Networks with Spectral Moments

Network embedding methods have been widely and successfully used in network-based applications such as node classification and link prediction. However, an ideal network embedding should not only be useful for machine learning, but also interpretable. We introduce a spectral embedding method for a network, its Spectral Point, which consists of the first few spectral moments of the network. Spectral moments are interpretable: we prove their close relationships to network structure (e.g., number of triangles and squares) and various network properties (e.g., degree distribution, clustering coefficient, and network connectivity). Using spectral points, we introduce a visualizable and bounded 3D embedding space for all possible graphs, in which one can characterize various types of graphs (e.g., cycles), or real-world networks from different categories (e.g., social or biological networks). We demonstrate that spectral points can be used for network identification (i.e., what network is this subgraph sampled from?) and that by using just the first few moments one does not lose much predictive power.
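The link between spectral moments and structural counts can be seen in a minimal sketch. It assumes the plain adjacency matrix of an undirected simple graph as the underlying operator, which is an illustrative choice; the paper's exact operator and normalization may differ. The k-th moment is taken here as tr(A^k)/n, i.e., the average number of closed walks of length k per node, so tr(A^3)/6 counts triangles.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(adj, max_order):
    """Return [tr(A^1)/n, ..., tr(A^max_order)/n] for adjacency matrix A."""
    n = len(adj)
    power = [row[:] for row in adj]  # holds A^k, starting at k = 1
    moments = []
    for _ in range(max_order):
        moments.append(sum(power[i][i] for i in range(n)) / n)
        power = mat_mul(power, adj)
    return moments


# Hypothetical 4-node graph: a triangle 0-1-2 plus a pendant edge 2-3.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
moments = spectral_moments(adj, 3)
# Each triangle contributes 6 closed walks of length 3.
triangles = round(moments[2] * len(adj) / 6)
```

In this example the second moment equals the average degree and the third moment recovers the single triangle, which is the kind of structural reading the abstract refers to.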

Unsupervised Differentiable Multi-aspect Network Embedding

Network embedding is an influential graph mining technique for representing nodes in a graph as distributed vectors. However, the majority of network embedding methods focus on learning a single vector representation for each node, which has been recently criticized for not being capable of modeling multiple aspects of a node. To capture the multiple aspects of each node, existing studies mainly rely on offline graph clustering performed prior to the actual embedding, which leaves the cluster membership of each node (i.e., node aspect distribution) fixed throughout training of the embedding model. We argue that this not only makes each node always have the same aspect distribution regardless of its dynamic context, but also hinders end-to-end training of the model, so that the final embedding quality depends largely on the clustering. In this paper, we propose a novel end-to-end framework for multi-aspect network embedding, called asp2vec, in which the aspects of each node are dynamically assigned based on its local context. More precisely, among multiple aspects, we dynamically assign a single aspect to each node based on its current context, and our aspect selection module is end-to-end differentiable via the Gumbel-Softmax trick. We also introduce the aspect regularization framework to capture the interactions among the multiple aspects in terms of relatedness and diversity. We further demonstrate that our proposed framework can be readily extended to heterogeneous networks. Extensive experiments on various downstream tasks over various types of homogeneous networks and a heterogeneous network demonstrate the superiority of asp2vec.
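For readers unfamiliar with the Gumbel-Softmax trick mentioned above, the following is a minimal sketch of a single relaxed sample. The aspect logits, the temperature, and the function name are hypothetical; in asp2vec the logits would come from a node's context, and a real implementation would use an automatic differentiation framework rather than plain Python.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed one-hot sample over the categories in `logits`."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    top = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
aspect_logits = [2.0, 0.0, 0.0]  # hypothetical scores for three aspects
sample = gumbel_softmax(aspect_logits, temperature=0.5, rng=rng)
```

Lower temperatures push the sample toward a one-hot vector (a hard aspect choice), while the softmax form keeps the operation differentiable with respect to the logits.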

AutoML Pipeline Selection: Efficiently Navigating the Combinatorial Space

Data scientists seeking a good supervised learning model on a dataset have many choices to make: they must preprocess the data, select features, possibly reduce the dimension, select an estimation algorithm, and choose hyperparameters for each of these pipeline components. With new pipeline components comes a combinatorial explosion in the number of choices! In this work, we design a new AutoML system TensorOboe to address this challenge: an automated system to design a supervised learning pipeline. TensorOboe uses low rank tensor decomposition as a surrogate model for efficient pipeline search. We also develop a new greedy experiment design protocol to gather information about a new dataset efficiently. Experiments on large corpora of real-world classification problems demonstrate the effectiveness of our approach.

Towards Physics-informed Deep Learning for Turbulent Flow Prediction

While deep learning has shown tremendous success in a wide range of domains, it remains a grand challenge to incorporate physical principles in a systematic manner to the design, training, and inference of such models. In this paper, we aim to predict turbulent flow by learning its highly nonlinear dynamics from spatiotemporal velocity fields of large-scale fluid flow simulations of relevance to turbulence modeling and climate modeling. We adopt a hybrid approach by marrying two well-established turbulent flow simulation techniques with deep learning. Specifically, we introduce trainable spectral filters in a coupled model of Reynolds-averaged Navier-Stokes (RANS) and Large Eddy Simulation (LES), followed by a specialized U-net for prediction. Our approach, which we call Turbulent-Flow Net, is grounded in a principled physics model, yet offers the flexibility of learned representations. We compare our model with state-of-the-art baselines and observe significant reductions in error for predictions 60 frames ahead. Most importantly, our method predicts physical fields that obey desirable physical characteristics, such as conservation of mass, whilst faithfully emulating the turbulent kinetic energy field and spectrum, which are critical for accurate prediction of turbulent flows.

Evaluating Fairness Using Permutation Tests

Machine learning models are central to people's lives and impact society in ways as fundamental as determining how people access information. The gravity of these models imparts a responsibility to model developers to ensure that they are treating users in a fair and equitable manner. Before deploying a model into production, it is crucial to examine the extent to which its predictions demonstrate biases. This paper deals with the detection of bias exhibited by a machine learning model through statistical hypothesis testing. We propose a permutation testing methodology that performs a hypothesis test that a model is fair across two groups with respect to any given metric. There are increasingly many notions of fairness that can speak to different aspects of model fairness. Our aim is to provide a flexible framework that empowers practitioners to identify significant biases in any metric they wish to study. We provide a formal testing mechanism as well as extensive experiments to show how this method works in practice.
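The general idea of testing a metric gap between two groups by permutation can be sketched as follows. This is a basic, unstudentized permutation test on the absolute difference of a user-supplied metric; the paper's methodology may use a different test statistic, and the function names and data below are hypothetical.

```python
import random


def permutation_test(values_a, values_b, metric, num_permutations=2000, seed=0):
    """Return (observed gap, p-value) for the null of equal metric values."""
    rng = random.Random(seed)
    observed = abs(metric(values_a) - metric(values_b))
    pooled = list(values_a) + list(values_b)
    n_a = len(values_a)
    extreme = 0
    for _ in range(num_permutations):
        # Reassign group labels at random by shuffling the pooled data.
        rng.shuffle(pooled)
        gap = abs(metric(pooled[:n_a]) - metric(pooled[n_a:]))
        if gap >= observed:
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (num_permutations + 1)


def mean(values):
    return sum(values) / len(values)


# Hypothetical per-example correctness indicators for two groups.
group_a = [1] * 90 + [0] * 10
group_b = [1] * 60 + [0] * 40
gap, p_value = permutation_test(group_a, group_b, mean)
```

Because the metric is passed in as a function, the same routine can be reused for accuracy, false positive rate, or any other per-group statistic a practitioner wishes to compare.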

Leveraging Model Inherent Variable Importance for Stable Online Feature Selection

Feature selection can be a crucial factor in obtaining robust and accurate predictions. Online feature selection models, however, operate under considerable restrictions; they need to efficiently extract salient input features based on a bounded set of observations, while enabling robust and accurate predictions. In this work, we introduce FIRES, a novel framework for online feature selection. The proposed feature weighting mechanism leverages the importance information inherent in the parameters of a predictive model. By treating model parameters as random variables, we can penalize features with high uncertainty and thus generate more stable feature sets. Our framework is generic in that it leaves the choice of the underlying model to the user. Strikingly, experiments suggest that the model complexity has only a minor effect on the discriminative power and stability of the selected feature sets. In fact, using a simple linear model, FIRES obtains feature sets that compete with state-of-the-art methods, while dramatically reducing computation time. In addition, experiments show that the proposed framework is clearly superior in terms of feature selection stability.

Multi-level Graph Convolutional Networks for Cross-platform Anchor Link Prediction

Cross-platform account matching plays a significant role in social network analytics, and is beneficial for a wide range of applications. However, existing methods either rely heavily on high-quality user-generated content (including user profiles) or suffer from a data insufficiency problem if they focus only on network topology, which leaves researchers with an insoluble dilemma of model selection. In this paper, to address this problem, we propose a novel framework that considers multi-level graph convolutions on both local network structure and hypergraph structure in a unified manner. The proposed method overcomes the data insufficiency problem of existing work and does not necessarily rely on user demographic information. Moreover, to make the proposed method capable of handling large-scale social networks, we propose a two-phase space reconciliation mechanism to align the embedding spaces in both network partitioning based parallel training and account matching across different social networks. Extensive experiments have been conducted on two large-scale real-life social networks. The experimental results demonstrate that the proposed method outperforms the state-of-the-art models by a large margin.

Evaluating Conversational Recommender Systems via User Simulation

Conversational information access is an emerging research area. Currently, human evaluation is used for end-to-end system evaluation, which is both time- and resource-intensive at scale, and thus becomes a bottleneck to progress. As an alternative, we propose automated evaluation by means of simulating users. Our user simulator aims to generate responses that a real human would give by considering both individual preferences and the general flow of interaction with the system. We evaluate our simulation approach on an item recommendation task by comparing three existing conversational recommender systems. We show that preference modeling and task-specific interaction models both contribute to more realistic simulations, and can help achieve high correlation between automatic evaluation measures and manual human assessments.

Measuring Model Complexity of Neural Networks with Curve Activation Functions

Measuring the model complexity of deep neural networks is a fundamental problem. A good model complexity measure can help to tackle many challenging problems, such as overfitting detection, model selection, and performance improvement. The existing literature on model complexity mainly focuses on neural networks with piecewise linear activation functions. Model complexity of neural networks with general curve activation functions remains an open problem. To tackle the challenge, in this paper, we first propose the linear approximation neural network (LANN for short), a piecewise linear framework to approximate a given deep model with curve activation functions. LANN constructs an individual piecewise linear approximation for the activation function of each neuron, and minimizes the number of linear regions needed to satisfy a required approximation degree. Then, we analyze the upper bound of the number of linear regions formed by LANNs, and derive the complexity measure based on the upper bound. To examine the usefulness of the complexity measure, we experimentally explore the training process of neural networks and detect overfitting. Our results demonstrate that the occurrence of overfitting is positively correlated with the increase of model complexity during training. We find that the L1 and L2 regularizations suppress the increase of model complexity. Finally, we propose two approaches to prevent overfitting by directly constraining model complexity, namely neuron pruning and customized L1 regularization.

Diverse Rule Sets

While machine-learning models are flourishing and transforming many aspects of everyday life, the inability of humans to understand complex models poses difficulties for these models to be fully trusted and embraced. Thus, interpretability of models has been recognized as an equally important quality as their predictive power. In particular, rule-based systems are experiencing a renaissance owing to their intuitive if-then representation.

However, simply being rule-based does not ensure interpretability. For example, overlapped rules spawn ambiguity and hinder interpretation. Here we propose a novel approach of inferring diverse rule sets, by optimizing small overlap among decision rules with a 2-approximation guarantee under the framework of Max-Sum diversification. We formulate the problem as maximizing a weighted sum of discriminative quality and diversity of a rule set.

In order to overcome an exponential-size search space of association rules, we investigate several natural options for a small candidate set of high-quality rules, including frequent and accurate rules, and examine their hardness. Leveraging the special structure in our formulation, we then devise an efficient randomized algorithm, which samples rules that are highly discriminative and have small overlap. The proposed sampling algorithm analytically targets a distribution of rules that is tailored to our objective.

We demonstrate the superior predictive power and interpretability of our model with a comprehensive empirical study against strong baselines.

Vamsa: Automated Provenance Tracking in Data Science Scripts

There has recently been a lot of ongoing research in the areas of fairness, bias and explainability of machine learning (ML) models due to the self-evident or regulatory requirements of various ML applications. We make the following observation: All of these approaches require a robust understanding of the relationship between ML models and the data used to train them. In this work, we introduce the ML provenance tracking problem: the fundamental idea is to automatically track which columns in a dataset have been used to derive the features/labels of an ML model. We discuss the challenges in capturing such information in the context of Python, the most common language used by data scientists.

We then present Vamsa, a modular system that extracts provenance from Python scripts without requiring any changes to the users' code. Using 26K real data science scripts, we verify the effectiveness of Vamsa in terms of coverage and performance. We also evaluate Vamsa's accuracy on a smaller subset of manually labeled data. Our analysis shows that Vamsa's precision and recall range from 90.4% to 99.1% and its latency is on the order of milliseconds for average-size scripts. Drawing from our experience in deploying ML models in production, we also present an example in which Vamsa helps automatically identify models that are affected by data corruption issues.
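To give a flavor of what static provenance extraction from a Python script involves, here is a deliberately simplified sketch built on the standard `ast` module. It is not Vamsa's implementation: it only collects string keys used to subscript one named variable, whereas a real system must also follow data flow through assignments and library calls. The function name, the script, and the column names are hypothetical.

```python
import ast


def columns_read(script, frame_name):
    """Collect string keys used to subscript `frame_name`, e.g. df["age"]."""
    columns = set()
    for node in ast.walk(ast.parse(script)):
        if (isinstance(node, ast.Subscript)
                and isinstance(node.value, ast.Name)
                and node.value.id == frame_name):
            # Walk the subscript expression so that both df["a"] and
            # df[["a", "b"]] contribute their string constants.
            for sub in ast.walk(node.slice):
                if isinstance(sub, ast.Constant) and isinstance(sub.value, str):
                    columns.add(sub.value)
    return columns


# The script is only parsed, never executed, so pandas is not required.
script = '''
import pandas as pd
df = pd.read_csv("people.csv")
X = df[["age", "income"]]
y = df["defaulted"]
'''
used = columns_read(script, "df")
```

Working on the syntax tree rather than executing the script is what allows this kind of analysis to run without any change to the user's code.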

Deep State-Space Generative Model For Correlated Time-to-Event Predictions

Capturing the inter-dependencies among multiple types of clinically-critical events is critical not only to accurate future event prediction, but also to better treatment planning. In this work, we propose a deep latent state-space generative model to capture the interactions among different types of correlated clinical events (e.g., kidney failure, mortality) by explicitly modeling the temporal dynamics of patients' latent states. Based on these learned patient states, we further develop a new general discrete-time formulation of the hazard rate function to estimate the survival distribution of patients with significantly improved accuracy. Extensive evaluations over real EMR data show that our proposed model compares favorably to various state-of-the-art baselines. Furthermore, our method also uncovers meaningful insights about the latent correlations among mortality and different types of organ failures.
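As background for the discrete-time hazard formulation mentioned above, the standard relation between per-interval hazards and the survival distribution can be sketched as follows. This is the textbook relation, not the paper's specific formulation, and the hazard values are hypothetical placeholders for what a model would produce from latent patient states.

```python
def survival_curve(hazards):
    """S(t) = product over k <= t of (1 - h_k) for discrete-time hazards."""
    survival = []
    alive = 1.0
    for h in hazards:
        if not 0.0 <= h <= 1.0:
            raise ValueError("each hazard must lie in [0, 1]")
        alive *= 1.0 - h
        survival.append(alive)
    return survival


def event_probabilities(hazards):
    """P(event occurs in interval t) = h_t * S(t - 1), with S(0) = 1."""
    probabilities = []
    alive = 1.0
    for h in hazards:
        probabilities.append(alive * h)
        alive *= 1.0 - h
    return probabilities


hazards = [0.1, 0.2, 0.5]  # hypothetical hazards for three intervals
survival = survival_curve(hazards)
events = event_probabilities(hazards)
```

The event probabilities and the final survival probability sum to one, which is a quick sanity check for any discrete-time hazard model.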

Meta-learning on Heterogeneous Information Networks for Cold-start Recommendation

Cold-start recommendation has been a challenging problem due to sparse user-item interactions for new users or items. Existing efforts have alleviated the cold-start issue to some extent, most of which approach the problem at the data level. Earlier methods often incorporate auxiliary data as user or item features, while more recent methods leverage heterogeneous information networks (HIN) to capture richer semantics via higher-order graph structures. On the other hand, the recent meta-learning paradigm sheds light on addressing cold-start recommendation at the model level, given its ability to rapidly adapt to new tasks with scarce labeled data, or in the context of cold-start recommendation, new users and items with very few interactions. Thus, we are inspired to develop a novel meta-learning approach named MetaHIN to address cold-start recommendation on HINs, to exploit the power of meta-learning at the model level and HINs at the data level simultaneously. The solution is non-trivial, for how to capture HIN-based semantics in the meta-learning setting, and how to learn the general knowledge that can be easily adapted to multifaceted semantics, remain open questions. In MetaHIN, we propose a novel semantic-enhanced task constructor and a co-adaptation meta-learner to address the two questions. Extensive experiments demonstrate that MetaHIN significantly outperforms the state of the art in various cold-start scenarios. (Code and dataset are available at https://github.com/rootlu/MetaHIN.)

WavingSketch: An Unbiased and Generic Sketch for Finding Top-k Items in Data Streams

Finding top-k items in data streams is a fundamental problem in data mining. Existing algorithms that can achieve unbiased estimation suffer from poor accuracy. In this paper, we propose a new sketch, WavingSketch, which is much more accurate than existing unbiased algorithms. WavingSketch is generic, and we show how it can be applied to four applications: finding top-k frequent items, finding top-k heavy changes, finding top-k persistent items, and finding top-k Super-Spreaders. We theoretically prove that WavingSketch can provide unbiased estimation, and then give an error bound of our algorithm. Our experimental results show that, compared with the state of the art, WavingSketch has 4.50 times higher insertion speed and up to 9 × 10^6 times (2 × 10^4 times on average) lower error rate in finding frequent items when memory size is tight. For other applications, WavingSketch can also achieve up to 286 times lower error rate. All related code is open-sourced and available on GitHub anonymously.

Dynamic Knowledge Graph based Multi-Event Forecasting

Modeling concurrent events of multiple types and their involved actors from open-source social sensors is an important task for many domains such as health care, disaster relief, and financial analysis. Forecasting events in the future can help human analysts better understand global social dynamics and make quick and accurate decisions. Anticipating participants or actors who may be involved in these activities can also help stakeholders to better respond to unexpected events. However, achieving these goals is challenging due to several factors: (i) it is hard to filter relevant information from large-scale input, (ii) the input data is usually high-dimensional, unstructured, and non-IID (not independent and identically distributed), and (iii) associated text features are dynamic and vary over time. Recently, graph neural networks have demonstrated strengths in learning complex and relational data. In this paper, we study a temporal graph learning method with heterogeneous data fusion for predicting concurrent events of multiple types and inferring multiple candidate actors simultaneously. In order to capture temporal information from historical data, we propose Glean, a graph learning framework based on event knowledge graphs to incorporate both relational and word contexts. We present a context-aware embedding fusion module to enrich hidden features for event actors. We conducted extensive experiments on multiple real-world datasets and show that the proposed method is competitive against various state-of-the-art methods for social event prediction and also provides much-needed interpretation capabilities.

A Geometric Approach to Predicting Bounds of Downstream Model Performance

This paper presents the motivation and methodology for including model application criteria in baseline analysis. We focus on detailing the interplay between the common measures of mean square error (MSE) and accuracy as it relates to perceived model performance. MSE is a common aggregate measure for the performance of predictive regression models. The advantages are numerous. MSE is agnostic to the choice of model given that the set of possible outcome values is defined on the appropriate metric space. In practice, decisions on how to subsequently use a trained model are based on predictive performance relative to a baseline where input features are not used, colloquially a "random model". However, the relative performance gains of a model over the baseline in terms of MSE do not guarantee commensurate gains when deployed in downstream applications, systems, or processes. This paper demonstrates one derivation of a distribution to qualify MSE performance for multi-class decision making systems desiring a certain level of accuracy. The model error is qualified through comparison to relevant baselines tied to the application suited to evaluating individual outcome performance criteria.
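To make the gap between MSE gains and accuracy gains concrete, the following minimal sketch compares a feature-free mean baseline with a hypothetical regression model on four made-up labels. The data, the function names, and the rule of rounding a regression output to the nearest class are assumptions for illustration, not the derivation used in the paper.

```python
def mse(y_true, y_pred):
    """Mean square error between true values and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Accuracy after rounding each regression output to the nearest class."""
    return sum(t == round(p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical class labels drawn from {0, 1, 2}.
y_true = [0, 0, 2, 2]

# Feature-free baseline: always predict the mean of the labels.
baseline = [sum(y_true) / len(y_true)] * len(y_true)

# Hypothetical model that moves every prediction toward the right label.
model = [0.6, 0.6, 1.4, 1.4]

print("baseline MSE:", mse(y_true, baseline), "accuracy:", accuracy(y_true, baseline))
print("model MSE:   ", mse(y_true, model), "accuracy:", accuracy(y_true, model))
```

In this toy case the model cuts MSE by well over half, yet every rounded prediction is still class 1, so accuracy does not move at all.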

Context-to-Session Matching: Utilizing Whole Session for Response Selection in Information-Seeking Dialogue Systems

We study retrieval-based multi-turn information-seeking dialogue systems, which are widely used in many scenarios. Most previous works select the response according to the matching degree between the query's context and the candidate responses. Though great progress has been made, existing works ignore the contexts of the responses, which could provide rich information for selecting the most appropriate response. The more similar the query's context and a given response's context are, the more likely they are to indicate the same question, and thus, the more likely this response is to answer the query. In this paper, we consider the response and its context as a whole session and explore the task of matching the query's context with the sessions. More specifically, we propose to match the query's context against the response's context and integrate this context-to-context matching with context-to-response matching. Experimental results show that our proposed context-to-session method outperforms the strong baselines significantly.

HOLMES: Health OnLine Model Ensemble Serving for Deep Learning Models in Intensive Care Units

Deep learning models have achieved expert-level performance in healthcare with an exclusive focus on training accurate models. However, in many clinical environments such as the intensive care unit (ICU), real-time model serving is equally if not more important than accuracy, because in the ICU patient care is simultaneously more urgent and more expensive. Clinical decisions and their timeliness, therefore, directly affect both the patient outcome and the cost of care. To make timely decisions, we argue the underlying serving system must be latency-aware. To compound the challenge, health analytic applications often require a combination of models instead of a single model, to better specialize individual models for different targets, multi-modal data, different prediction windows, and potentially personalized predictions. To address these challenges, we propose HOLMES, an online model ensemble serving framework for healthcare applications. HOLMES dynamically identifies the best performing set of models to ensemble for highest accuracy, while also satisfying sub-second latency constraints on end-to-end prediction. We demonstrate that HOLMES is able to navigate the accuracy/latency tradeoff efficiently, compose the ensemble, and serve the model ensemble pipeline, scaling to simultaneously streaming data from 100 patients, each producing waveform data at 250 Hz. HOLMES outperforms conventional offline batch-processed inference for the same clinical task in terms of accuracy and latency (by an order of magnitude). HOLMES is tested on a risk prediction task on pediatric cardio ICU data with above 95% prediction accuracy and sub-second latency on a 64-bed simulation.

LogPar: Logistic PARAFAC2 Factorization for Temporal Binary Data with Missing Values

Binary data with one-class missing values are ubiquitous in real-world applications. They can be represented by irregular tensors with varying sizes in one dimension, where value one means presence of a feature while zero means unknown (i.e., either presence or absence of a feature). Learning accurate low-rank approximations from such binary irregular tensors is a challenging task. However, none of the existing models developed for factorizing irregular tensors take the missing values into account, and they assume Gaussian distributions, resulting in a distribution mismatch when applied to binary data. In this paper, we propose Logistic PARAFAC2 (LogPar) by modeling the binary irregular tensor with a Bernoulli distribution parameterized by an underlying real-valued tensor. Then we approximate the underlying tensor with a positive-unlabeled learning loss function to account for the missing values. We also incorporate uniqueness and temporal smoothness regularization to enhance the interpretability. Extensive experiments using large-scale real-world datasets show that LogPar outperforms all baselines in both irregular tensor completion and downstream predictive tasks. For irregular tensor completion, LogPar achieves up to 26% relative improvement compared to the best baseline. In addition, LogPar obtains a relative improvement of 13.2% for heart failure prediction and 14% for mortality prediction on average compared to the state-of-the-art PARAFAC2 models.

RECORD: Resource Constrained Semi-Supervised Learning under Distribution Shift

Semi-supervised learning (SSL) tries to improve performance by using massive unlabeled data, and typically works in an offline manner under two assumptions: (i) the data distribution is static; (ii) data storage overhead is unlimited. In many online tasks, however, neither assumption is valid. For example, in online image classification, the number of unlabeled images grows sharply, which makes it difficult to store them in full; meanwhile, the content of unlabeled images changes constantly, and it is no longer suitable to assume a fixed distribution. We call this novel setting Resource Constrained SSL under Distribution Shift (or Record for short) and, to the best of our knowledge, it has not been thoroughly studied yet. This paper presents a systematic solution, Record, consisting of three sub-steps: distribution tracking, sample selection, and model updating. Specifically, we propose an effective method to track the distribution changes and locate distribution shifted samples. A novel influence-based approach is used to select the most influential samples for the distribution change based on resource constraints. Finally, we free up memory to put the latest unlabeled data with its pseudo-label for the next distribution tracking. Extensive empirical results confirm the effectiveness of our scheme. In the case of diverse and unknown distribution shifts, our solution is consistently and clearly better than many baseline and SOTA methods along with the memory budget, and in some cases it can even approximate the performance of the oracle.

Statistically Significant Pattern Mining with Ordinal Utility

Statistically significant pattern mining (SSPM) is an essential and challenging data mining task in the field of knowledge discovery in databases (KDD), in which each pattern is evaluated via a hypothesis test. Our study aims to introduce a preference relation into patterns and to discover the most preferred patterns under the constraint of statistical significance, which has never been considered in existing SSPM problems. We propose an iterative multiple testing procedure that can alternately reject a hypothesis and safely ignore the hypotheses that are less useful than the rejected hypothesis. One advantage of filtering out patterns with low utility is that it avoids consumption of the significance budget by rejection of useless (that is, uninteresting) patterns. This allows the significance budget to be focused on useful patterns, leading to more useful discoveries. We show that the proposed method can control the familywise error rate (FWER) under certain assumptions, which can be satisfied by a realistic problem class in SSPM. We also show that the proposed method always discovers a set of patterns that is at least as useful as those discovered using the standard Tarone-Bonferroni method for SSPM. Finally, we conducted several experiments with both synthetic and real-world data to evaluate the performance of our method. In the experiments with real-world datasets, the proposed method discovered a larger number of more useful patterns than the existing method for all five conducted tasks.

Certifiable Robustness of Graph Convolutional Networks under Structure Perturbations

Recent works show that message-passing neural networks (MPNNs) can be fooled by adversarial attacks on both the node attributes and the graph structure. Since MPNNs are currently being rapidly adopted in real-world applications, it is crucial to improve their reliability and robustness. While there has been progress on robustness certification of MPNNs under perturbation of the node attributes, no existing method can handle structural perturbations. These perturbations are especially challenging because they alter the message passing scheme itself. In this work we close this gap and propose the first method to certify robustness of Graph Convolutional Networks (GCNs) under perturbations of the graph structure. We show how this problem can be expressed as a jointly constrained bilinear program - a challenging, yet well-studied class of problems - and propose a novel branch-and-bound algorithm to obtain lower bounds on the global optimum. These lower bounds are significantly tighter and can certify up to twice as many nodes compared to a standard linear relaxation.

Understanding Negative Sampling in Graph Representation Learning

Graph representation learning has been extensively studied in recent years, and sampling is a critical part of it. Prior work usually focuses on sampling positive node pairs, while the strategy for negative sampling is left insufficiently explored. To bridge the gap, we systematically analyze the role of negative sampling from the perspectives of both objective and risk, theoretically demonstrating that negative sampling is as important as positive sampling in determining the optimization objective and the resulting variance. To the best of our knowledge, we are the first to derive the theory and quantify that a good negative sampling distribution is $p_n(u|v) \propto p_d(u|v)^{\alpha}$, $0 < \alpha < 1$. With the guidance of the theory, we propose MCNS, approximating the positive distribution with self-contrast approximation and accelerating negative sampling by Metropolis-Hastings. We evaluate our method on 5 datasets that cover extensive downstream graph learning tasks, including link prediction, node classification and recommendation, on a total of 19 experimental settings. These relatively comprehensive experimental results demonstrate its robustness and superiority.
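As a small illustration of the sub-linear relation between the negative and positive distributions stated above, the sketch below raises a toy positive distribution to the power α and renormalizes. The toy probabilities, the function name `negative_sampling_distribution`, and the value α = 0.75 are assumptions for illustration. MCNS itself estimates the positive distribution by self-contrast approximation and samples with Metropolis-Hastings, neither of which is shown here.

```python
def negative_sampling_distribution(p_d, alpha=0.75):
    """Return p_n with p_n[u] proportional to p_d[u] ** alpha, 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    weights = {u: p ** alpha for u, p in p_d.items()}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}


# Hypothetical positive distribution p_d(u | v) for one anchor node v.
p_d = {"a": 0.7, "b": 0.2, "c": 0.1}
p_n = negative_sampling_distribution(p_d, alpha=0.75)

for node in p_d:
    print(node, round(p_d[node], 3), round(p_n[node], 3))
```

Because the exponent is below 1, the resulting distribution keeps the ordering of the positive distribution but is flatter: likely neighbors are sampled somewhat less often and unlikely ones somewhat more often.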

Aligning Superhuman AI with Human Behavior: Chess as a Model System

As artificial intelligence becomes increasingly intelligent, in some cases achieving superhuman performance, there is growing potential for humans to learn from and collaborate with algorithms. However, the ways in which AI systems approach problems are often different from the ways people do, and thus may be uninterpretable and hard to learn from. A crucial step in bridging this gap between human and artificial intelligence is modeling the granular actions that constitute human behavior, rather than simply matching aggregate human performance. We pursue this goal in a model system with a long history in artificial intelligence: chess. The aggregate performance of a chess player unfolds as they make decisions over the course of a game. The hundreds of millions of games played online by players at every skill level form a rich source of data in which these decisions, and their exact context, are recorded in minute detail. Applying existing chess engines to this data, including an open-source implementation of AlphaZero, we find that they do not predict human moves well. We develop and introduce Maia, a customized version of AlphaZero trained on human chess games, that predicts human moves at a much higher accuracy than existing engines, and can achieve maximum accuracy when predicting decisions made by players at a specific skill level in a tuneable way. For a dual task of predicting whether a human will make a large mistake on the next move, we develop a deep neural network that significantly outperforms competitive baselines. Taken together, our results suggest that there is substantial promise in designing artificial intelligence systems with human collaboration in mind by first accurately modeling granular human decision-making.

Heidegger: Interpretable Temporal Causal Discovery

Temporal causal discovery aims to find cause-effect relationships between time-series. However, none of the existing techniques is able to identify the causal profile, the temporal pattern that the causal variable needs to follow in order to trigger the most significant change in the outcome. Toward a new horizon, this study introduces the novel problem of Causal Profile Discovery, which is crucial for many applications such as adverse drug reaction and cyber-attack detection. This work correspondingly proposes Heidegger to discover causal profiles, comprised of a flexible randomized block design for hypothesis evaluation and an efficient profile search via on-the-fly graph construction and entropy-based pruning. Heidegger's performance is evaluated extensively on both synthetic and real-world data. The experimental results show the proposed method is robust to noise and flexible at detecting complex patterns.

Interpretable Deep Graph Generation with Node-edge Co-disentanglement

Disentangled representation learning has recently attracted a significant amount of attention, particularly in the field of image representation learning. However, learning the disentangled representations behind a graph remains largely unexplored, especially for attributed graphs with both node and edge features. Disentanglement learning for graph generation poses substantial new challenges, including 1) the lack of graph deconvolution operations to jointly decode node and edge attributes; and 2) the difficulty in enforcing the disentanglement among latent factors that respectively influence: i) only nodes, ii) only edges, and iii) joint patterns between them. To address these challenges, we propose a new disentanglement enhancement framework for deep generative models for attributed graphs. In particular, a novel variational objective is proposed to disentangle the above three types of latent factors, with a novel architecture for node and edge deconvolutions. Qualitative and quantitative experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed model and its extensions.

Minimizing Localized Ratio Cut Objectives in Hypergraphs

Hypergraphs are a useful abstraction for modeling multiway relationships in data, and hypergraph clustering is the task of detecting groups of closely related nodes in such data. Graph clustering has been studied extensively, and there are numerous methods for detecting small, localized clusters without having to explore an entire input graph. However, there are only a few specialized approaches for localized clustering in hypergraphs. Here we present a framework for local hypergraph clustering based on minimizing localized ratio cut objectives. Our framework takes an input set of reference nodes in a hypergraph and solves a sequence of hypergraph minimum s-t cut problems in order to identify a nearby well-connected cluster of nodes that overlaps substantially with the input set.

Our methods extend graph-based techniques but are significantly more general and have new output quality guarantees. First, our methods can minimize new generalized notions of hypergraph cuts, which depend on specific configurations of nodes within each hyperedge, rather than just on the number of cut hyperedges. Second, our framework has several attractive theoretical properties in terms of output cluster quality. Most importantly, our algorithm is strongly-local, meaning that its runtime depends only on the size of the input set, and does not need to explore the entire hypergraph to find good local clusters. We use our methodology to effectively identify clusters in hypergraphs of real-world data with millions of nodes, millions of hyperedges, and large average hyperedge size with runtimes ranging between a few seconds and a few minutes.

RECIPTOR: An Effective Pretrained Model for Recipe Representation Learning

Recipe representation plays an important role in food computing for perception, recognition, recommendation and other applications. Learning pretrained recipe embeddings is a challenging task, as there is a lack of high quality annotated food datasets. In this paper, we provide a joint approach for learning effective pretrained recipe embeddings using both the ingredients and cooking instructions. We present RECIPTOR, a novel set transformer-based joint model to learn recipe representations, that preserves permutation-invariance for the ingredient set and uses a novel knowledge graph (KG) derived triplet sampling approach to optimize the learned embeddings so that related recipes are closer in the latent semantic space. The embeddings are further jointly optimized by combining similarity among cooking instructions with a KG based triplet loss. We experimentally show that RECIPTOR's recipe embeddings outperform state-of-the-art baselines on two newly designed downstream classification tasks by a wide margin.

Hyperbolic Distance Matrices

Hyperbolic space is a natural setting for mining and visualizing data with hierarchical structure. In order to compute a hyperbolic embedding from comparison or similarity information, one has to solve a hyperbolic distance geometry problem. In this paper, we propose a unified framework to compute hyperbolic embeddings from an arbitrary mix of noisy metric and non-metric data. Our algorithms are based on semidefinite programming and the notion of a hyperbolic distance matrix, in many ways parallel to its famous Euclidean counterpart. A central ingredient we put forward is a semidefinite characterization of the hyperbolic Gramian, a matrix of Lorentzian inner products. This characterization allows us to formulate a semidefinite relaxation to efficiently compute hyperbolic embeddings in two stages: first, we complete and denoise the observed hyperbolic distance matrix; second, we propose a spectral factorization method to estimate the embedded points from the hyperbolic distance matrix. We show through numerical experiments how the flexibility to mix metric and non-metric constraints allows us to efficiently compute embeddings from arbitrary data.
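The two objects named above, the hyperbolic Gramian and the hyperbolic distance matrix, can be sketched for known points as follows. The sketch assumes the hyperboloid model with curvature -1, where the distance between two points is the inverse hyperbolic cosine of their negated Lorentzian inner product. The helper names and the three sample points are illustrative assumptions, and the semidefinite relaxation and spectral factorization of the paper are not shown.

```python
import math


def lift(v):
    """Lift a point v in R^d onto the upper hyperboloid sheet in R^(d+1)."""
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def hyperbolic_gramian(points):
    """Matrix of pairwise Lorentzian inner products."""
    return [[lorentz_inner(x, y) for y in points] for x in points]


def hyperbolic_distance_matrix(points):
    """Entrywise acosh(-G); values are clamped at 1 to absorb rounding error."""
    gram = hyperbolic_gramian(points)
    return [[math.acosh(max(1.0, -g)) for g in row] for row in gram]


# Three hypothetical points, given by their coordinates in R^2 before lifting.
points = [lift(v) for v in ([0.0, 0.0], [1.0, 0.0], [0.0, 2.0])]
D = hyperbolic_distance_matrix(points)

for row in D:
    print([round(d, 4) for d in row])
```

Every diagonal entry of the Gramian equals -1 for points on the hyperboloid, which is why the diagonal of the distance matrix is zero up to rounding.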

RayS: A Ray Searching Method for Hard-label Adversarial Attack

Deep neural networks are vulnerable to adversarial attacks. Among different attack settings, the most challenging yet the most practical one is the hard-label setting where the attacker only has access to the hard-label output (prediction label) of the target model. Previous attempts are neither effective enough in terms of attack success rate nor efficient enough in terms of query complexity under the widely used $L_\infty$ norm threat model. In this paper, we present the Ray Searching attack (RayS), which greatly improves the hard-label attack effectiveness as well as efficiency. Unlike previous works, we reformulate the continuous problem of finding the closest decision boundary into a discrete problem that does not require any zeroth-order gradient estimation. In the meantime, all unnecessary searches are eliminated via a fast check step. This significantly reduces the number of queries needed for our hard-label attack. Moreover, interestingly, we found that the proposed RayS attack can also be used as a sanity check for possible "falsely robust" models. On several recently proposed defenses that claim to achieve the state-of-the-art robust accuracy, our attack method demonstrates that the current white-box/black-box attacks could still give a false sense of security and the robust accuracy drop between the most popular PGD attack and RayS attack could be as large as 28%. We believe that our proposed RayS attack could help identify falsely robust models that beat most white-box/black-box attacks.
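To illustrate what label-only access allows, the sketch below locates a decision boundary along one fixed ray by binary search, using nothing but predicted labels. The toy linear classifier, the starting point, the direction with entries in {-1, +1}, and the tolerance are all assumptions for illustration. This is not the RayS algorithm itself, whose search over directions and fast check step are not reproduced here.

```python
def hard_label(x):
    """Toy target model that exposes only its predicted label."""
    return 1 if x[0] + 2.0 * x[1] > 1.0 else 0


def boundary_radius(x, direction, true_label, r_max=10.0, tol=1e-6):
    """Binary-search the radius at which the label changes along one ray.

    Assumes the label changes at most once along the ray. With direction
    entries in {-1, +1}, the radius equals the L-infinity norm of the
    perturbation. Returns None if the label is unchanged at r_max.
    """
    def perturbed(r):
        return [xi + r * di for xi, di in zip(x, direction)]

    if hard_label(perturbed(r_max)) == true_label:
        return None
    lo, hi = 0.0, r_max
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if hard_label(perturbed(mid)) == true_label:
            lo = mid   # still correctly classified, boundary is further out
        else:
            hi = mid   # already misclassified, boundary is closer
    return hi


x0 = [0.0, 0.0]
radius = boundary_radius(x0, [1.0, 1.0], true_label=hard_label(x0))
no_crossing = boundary_radius(x0, [-1.0, -1.0], true_label=hard_label(x0))
print(radius, no_crossing)
```

Each loop iteration costs one model query, so the number of queries per ray grows only logarithmically with the required precision.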

On Sampled Metrics for Item Recommendation

The task of item recommendation requires ranking a large catalogue of items given a context. Item recommendation algorithms are evaluated using ranking metrics that depend on the positions of relevant items. To speed up the computation of metrics, recent work often uses sampled metrics where only a smaller set of random items and the relevant items are ranked. This paper investigates sampled metrics in more detail and shows that they are inconsistent with their exact version, in the sense that they do not preserve relative statements, e.g., recommender A is better than B, not even in expectation. Moreover, the smaller the sampling size, the less difference there is between metrics, and for very small sampling size, all metrics collapse to the AUC metric. We show that it is possible to improve the quality of the sampled metrics by applying a correction, obtained by minimizing different criteria such as bias or mean squared error. We conclude with an empirical evaluation of the naive sampled metrics and their corrected variants. To summarize, our work suggests that sampling should be avoided for metric calculation; however, if an experimental study needs to sample, the proposed corrections can improve the quality of the estimate.
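The collapse to AUC can be checked exactly on a toy case. The sketch below enumerates every possible sample of non-relevant items for a single relevant item and averages the sampled Recall@k. The catalogue size, the rank of the relevant item, uniform sampling without replacement, and the function names are assumptions for illustration; the corrections proposed in the paper are not shown.

```python
from fractions import Fraction
from itertools import combinations


def auc(rank, n_items):
    """AUC for one relevant item at 1-based `rank` among `n_items` items."""
    return Fraction(n_items - rank, n_items - 1)


def expected_sampled_recall(rank, n_items, k, m):
    """Exact expectation of sampled Recall@k over all samples of m negatives."""
    negatives = [pos for pos in range(1, n_items + 1) if pos != rank]
    hits = 0
    total = 0
    for sample in combinations(negatives, m):
        # Rank of the relevant item among itself and the sampled negatives.
        sampled_rank = 1 + sum(1 for pos in sample if pos < rank)
        hits += sampled_rank <= k
        total += 1
    return Fraction(hits, total)


n_items, rank = 10, 4
exact_recall_at_1 = int(rank <= 1)
one_negative = expected_sampled_recall(rank, n_items, k=1, m=1)
all_negatives = expected_sampled_recall(rank, n_items, k=1, m=n_items - 1)

print("exact Recall@1:", exact_recall_at_1)
print("sampled Recall@1, one negative:", one_negative, "AUC:", auc(rank, n_items))
print("sampled Recall@1, all negatives:", all_negatives)
```

With a single sampled negative the expected Recall@1 equals the AUC of the relevant item, while sampling every negative recovers the exact metric.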

ALO-NMF: Accelerated Locality-Optimized Non-negative Matrix Factorization

Non-negative Matrix Factorization (NMF) is a key kernel for unsupervised dimension reduction used in a wide range of applications, including graph mining, recommender systems and natural language processing. Due to the compute-intensive nature of applications that must perform repeated NMF, several parallel implementations have been developed. However, existing parallel NMF algorithms have not addressed data locality optimizations, which are critical for high performance since data movement costs greatly exceed the cost of arithmetic/logic operations on current computer systems. In this paper, we present a novel optimization method for a parallel NMF algorithm based on the HALS (Hierarchical Alternating Least Squares) scheme that incorporates algorithmic transformations to enhance data locality. Efficient realizations of the algorithm on multi-core CPUs and GPUs are developed, demonstrating a new Accelerated Locality-Optimized NMF (ALO-NMF) that obtains up to 2.29x lower data movement cost and up to 4.45x speedup over existing state-of-the-art parallel NMF algorithms.

Multi-Source Deep Domain Adaptation with Weak Supervision for Time-Series Sensor Data

Domain adaptation (DA) offers a valuable means to reuse data and models for new problem domains. However, robust techniques have not yet been considered for time series data with varying amounts of data availability. In this paper, we make three main contributions to fill this gap. First, we propose a novel Convolutional deep Domain Adaptation model for Time Series data (CoDATS) that significantly improves accuracy and training time over state-of-the-art DA strategies on real-world sensor data benchmarks. By utilizing data from multiple source domains, we increase the usefulness of CoDATS to further improve accuracy over prior single-source methods, particularly on complex time series datasets that have high variability between domains. Second, we propose a novel Domain Adaptation with Weak Supervision (DA-WS) method by utilizing weak supervision in the form of target-domain label distributions, which may be easier to collect than additional data labels. Third, we perform comprehensive experiments on diverse real-world datasets to evaluate the effectiveness of our domain adaptation and weak supervision methods. Results show that CoDATS for single-source DA significantly improves over the state-of-the-art methods, and we achieve additional improvements in accuracy using data from multiple source domains and weakly supervised signals.

Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions

Users of music streaming, video streaming, news recommendation, and e-commerce services often engage with content in a sequential manner. Providing and evaluating good sequences of recommendations is therefore a central problem for these services. Prior reweighting-based counterfactual evaluation methods either suffer from high variance or make strong independence assumptions about rewards. We propose a new counterfactual estimator that allows for sequential interactions in the rewards with lower variance in an asymptotically unbiased manner. Our method uses graphical assumptions about the causal relationships of the slate to reweight the rewards in the logging policy in a way that approximates the expected sum of rewards under the target policy. Extensive experiments in simulation and on a live recommender system show that our approach outperforms existing methods in terms of bias and data efficiency for the sequential track recommendations problem.
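For context on the reweighting-based methods mentioned above, the sketch below implements plain inverse propensity scoring for a single recommended item per interaction. It is a standard textbook estimator shown only as background, not the estimator proposed in this paper, and the logged data, policy probabilities, and function name are made up for illustration.

```python
def ips_estimate(logs, target_policy):
    """Inverse propensity scoring estimate of the target policy's mean reward.

    logs: list of (action, reward, logging_probability) tuples.
    target_policy: dict mapping action to its probability under the target.
    """
    total = 0.0
    for action, reward, logging_prob in logs:
        weight = target_policy[action] / logging_prob  # importance weight
        total += weight * reward
    return total / len(logs)


# Hypothetical logs collected by a uniform logging policy over two tracks.
logs = [
    ("track_a", 1.0, 0.5),
    ("track_a", 0.0, 0.5),
    ("track_b", 1.0, 0.5),
    ("track_b", 1.0, 0.5),
]
target_policy = {"track_a": 0.2, "track_b": 0.8}

estimate = ips_estimate(logs, target_policy)
print(estimate)
```

For a slate of several items, the importance weight becomes a product of per-position ratios unless independence is assumed, which is one source of the high variance noted above.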

TAdaNet: Task-Adaptive Network for Graph-Enriched Meta-Learning

Annotated data samples in real-world applications are often limited. Meta-learning, which utilizes prior knowledge learned from related tasks and generalizes to new tasks of limited supervised experience, is an effective approach for few-shot learning. However, standard meta-learning with globally shared knowledge cannot handle the task heterogeneity problem well, i.e., tasks lie in different distributions. Recent advances have explored several ways to trigger task-dependent initial parameters or metrics, in order to customize task-specific information. These approaches learn task contextual information from data, but ignore external domain knowledge that can help in the learning process. In this paper, we propose a task-adaptive network (TAdaNet) that makes use of a domain-knowledge graph to enrich data representations and provide task-specific customization. Specifically, we learn a task embedding that characterizes task relationships and tailors task-specific parameters, resulting in a task-adaptive metric space for classification. Experimental results on a few-shot image classification problem show the effectiveness of the proposed method. We also apply it on a real-world disease classification problem, and show promising results for clinical decision support.

Unsupervised Paraphrasing via Deep Reinforcement Learning

Paraphrasing is expressing the meaning of an input sentence in different wording while maintaining fluency (i.e., grammatical and syntactical correctness). Most existing work on paraphrasing uses supervised models that are limited to specific domains (e.g., image captions). Such models can neither be straightforwardly transferred to other domains nor generalize well, and creating labeled training data for new domains is expensive and laborious. The need for paraphrasing across different domains and the scarcity of labeled training data in many such domains call for exploring unsupervised paraphrase generation methods. We propose Progressive Unsupervised Paraphrasing (PUP): a novel unsupervised paraphrase generation method based on deep reinforcement learning (DRL). PUP uses a variational autoencoder (trained using a non-parallel corpus) to generate a seed paraphrase that warm-starts the DRL model. Then, PUP progressively tunes the seed paraphrase guided by our novel reward function which combines semantic adequacy, language fluency, and expression diversity measures to quantify the quality of the generated paraphrases in each iteration without needing parallel sentences. Our extensive experimental evaluation shows that PUP outperforms unsupervised state-of-the-art paraphrasing techniques in terms of both automatic metrics and user studies on four real datasets. We also show that PUP outperforms domain-adapted supervised algorithms on several datasets. Our evaluation also shows that PUP achieves a great trade-off between semantic similarity and diversity of expression.

CICLAD: A Fast and Memory-efficient Closed Itemset Miner for Streams

Mining association rules from data streams is a challenging task due to the (typically) limited resources available versus the large size of the result. Frequent closed itemsets (FCI) enable an efficient first step, yet current FCI stream miners are not optimal in resource consumption, e.g., they store a large number of extra itemsets at an additional cost. In a search for a better storage-efficiency trade-off, we designed Ciclad, an intersection-based sliding-window FCI miner. Leveraging in-depth insights into FCI evolution, it combines minimal storage with quick access. Experimental results indicate that Ciclad's memory footprint is much lower and its performance overall better than that of competing methods.
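The link between closed itemsets and intersections can be sketched naively: every closed itemset with non-zero support is the intersection of some set of transactions, so closing the current window under intersection enumerates them. The sketch below recomputes everything from scratch for each window, which is exactly the cost an incremental miner avoids. It is not Ciclad's algorithm, and the stream, window size, and function name are assumptions for illustration.

```python
from collections import deque


def closed_itemsets(transactions, min_support=1):
    """Return {closed itemset: support} for the given transactions."""
    transactions = [frozenset(t) for t in transactions]
    closed = set()
    for t in transactions:
        # Intersect the new transaction with every itemset found so far.
        closed |= {c & t for c in closed} | {t}
    closed.discard(frozenset())
    result = {}
    for c in closed:
        support = sum(1 for t in transactions if c <= t)
        if support >= min_support:
            result[c] = support
    return result


# Hypothetical stream processed with a sliding window of three transactions.
stream = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
window = deque(maxlen=3)
snapshots = []
for transaction in stream:
    window.append(transaction)
    if len(window) == window.maxlen:
        snapshots.append(closed_itemsets(list(window), min_support=2))

for snapshot in snapshots:
    print({"".join(sorted(c)): s for c, s in snapshot.items()})
```

When the oldest transaction leaves the window, the frequent closed itemsets change from {ab, ac, a} to {a, b, c}, which shows how much the result can shift after a single slide.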

Graph Attention Networks over Edge Content-Based Channels

Edges play a crucial role in passing information on a graph, especially when they carry textual content reflecting the semantics behind how nodes are linked and interact with each other. In this paper, we propose a channel-aware attention mechanism enabled by edge text content when aggregating information from neighboring nodes, and we realize this mechanism in a graph autoencoder framework. Edge text content is encoded as low-dimensional mixtures of latent topics, which serve as semantic channels for topic-level information passing on edges. We embed nodes and topics in the same latent space to capture their mutual dependency when decoding the structural and textual information on the graph. We evaluated the proposed model on a Yelp user-item bipartite graph and a StackOverflow user-user interaction graph. The proposed model outperformed a set of baselines on link prediction and content prediction tasks. Qualitative evaluations also demonstrated the descriptive power of the learnt node embeddings, showing their potential as an interpretable representation of graphs.

Multimodal Learning with Incomplete Modalities by Knowledge Distillation

Multimodal learning aims at utilizing information from a variety of data modalities to improve the generalization performance. One common approach is to seek the common information that is shared among different modalities for learning, whereas we can also fuse the supplementary information to leverage modality-specific information. Though the supplementary information is often desired, most existing multimodal approaches can only learn from samples with complete modalities, which wastes a considerable amount of collected data. Otherwise, model-based imputation needs to be used to complete the missing values, and yet it may introduce undesired noise, especially when the sample size is limited. In this paper, we propose a framework based on knowledge distillation, utilizing the supplementary information from all modalities, and avoiding imputation and the noise associated with it. Specifically, we first train models on each modality independently using all the available data. Then the trained models are used as teachers to teach the student model, which is trained with the samples having complete modalities. We demonstrate the effectiveness of the proposed method in extensive empirical studies on both synthetic datasets and real-world datasets.
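The teacher-student step described above can be sketched with a generic distillation loss: the cross-entropy between softened teacher outputs and softened student outputs. Averaging the teachers' soft labels with equal weight, the temperature of 2.0, and all logits below are assumptions for illustration, since the abstract does not specify how the teachers are combined.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the maximum for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Cross-entropy between averaged teacher soft labels and the student."""
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    n_classes = len(student_logits)
    target = [
        sum(p[c] for p in teacher_probs) / len(teacher_probs)
        for c in range(n_classes)
    ]
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(target, student))


# Hypothetical logits from two single-modality teachers for one sample.
teachers = [[2.0, 0.0, -1.0], [1.5, 0.5, -1.0]]
loss_agreeing = distillation_loss([1.75, 0.25, -1.0], teachers)
loss_disagreeing = distillation_loss([-1.0, 0.0, 2.0], teachers)
print(loss_agreeing, loss_disagreeing)
```

A student whose predictions agree with the teachers incurs a lower loss than one that contradicts them, which is the signal that transfers modality-specific knowledge to the student.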

Estimating the Percolation Centrality of Large Networks through Pseudo-dimension Theory

In this work we investigate the problem of estimating the percolation centrality of every vertex in a graph. This centrality measure quantifies the importance of each vertex in a graph going through a contagious process. It is an open problem whether the percolation centrality can be computed in O(n^{3-c}) time, for any constant c > 0. In this paper we present an Õ(m) randomized approximation algorithm for the percolation centrality of every vertex of a graph G, generalizing techniques developed by Riondato, Upfal and Kornaropoulos. The estimation obtained by the algorithm is within ε of the exact value with probability 1 - δ, for fixed constants 0 < ε, δ < 1. In fact, we show in our experimental analysis that in the case of real-world complex networks, the output produced by our algorithm is significantly closer to the exact values than its theoretical worst-case guarantee.

TinyGNN: Learning Efficient Graph Neural Networks

Recently, Graph Neural Networks (GNNs) have aroused a lot of research interest and achieved great success in dealing with graph-based data. The basic idea of GNNs is to aggregate neighbor information iteratively. After k iterations, a k-layer GNN can capture nodes' k-hop local structure. In this way, a deeper GNN can access much more neighbor information, leading to better performance. However, when a GNN goes deeper, the exponential expansion of neighborhoods incurs expensive computations in batched training and inference. This keeps the deeper GNN away from many applications, e.g., real-time systems. In this paper, we try to learn a small GNN (called TinyGNN), which can achieve high performance and infer the node representation in a short time. However, since a small GNN cannot explore as much local structure as a deeper GNN does, there exists a neighbor information gap between the deeper GNN and the small GNN. To address this problem, we leverage peer node information to model the local structure explicitly and adopt a neighbor distillation strategy to learn local structure knowledge from a deeper GNN implicitly. Extensive experimental results demonstrate that TinyGNN is empirically effective and achieves similar or even better performance compared with the deeper GNNs. Meanwhile, TinyGNN gains a 7.73x--126.59x speed-up on inference over all data sets.
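
The neighborhood expansion mentioned above can be made concrete with a small sketch that counts how many nodes fall within k hops of a node, i.e. the set of nodes a k-layer GNN would have to touch for that node. This is only an illustration of the cost argument, not part of TinyGNN; the helper names and the binary-tree test graph are assumptions chosen to keep the example self-contained.

```python
from collections import deque


def k_hop_neighborhood(adj, source, k):
    """Return the set of nodes within k hops of source (source included), via BFS."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def complete_binary_tree(depth):
    """Adjacency lists of a complete binary tree with the root at node 0."""
    n = 2 ** (depth + 1) - 1
    adj = {v: [] for v in range(n)}
    for v in range(1, n):
        parent = (v - 1) // 2
        adj[v].append(parent)
        adj[parent].append(v)
    return adj


adj = complete_binary_tree(depth=6)
# Receptive-field size at the root for k = 1, 2, 3, 4 layers.
sizes = [len(k_hop_neighborhood(adj, 0, k)) for k in range(1, 5)]
```

On this tree the receptive field roughly doubles with every extra layer, which is the growth pattern that makes deep GNN inference expensive.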

GPT-GNN: Generative Pre-Training of Graph Neural Networks

Graph neural networks (GNNs) have been demonstrated to be powerful in modeling graph-structured data. However, training GNNs requires abundant task-specific labeled data, which is often arduously expensive to obtain. One effective way to reduce the labeling effort is to pre-train an expressive GNN model on unlabelled data with self-supervision and then transfer the learned model to downstream tasks with only a few labels. In this paper, we present the GPT-GNN framework to initialize GNNs by generative pre-training. GPT-GNN introduces a self-supervised attributed graph generation task to pre-train a GNN so that it can capture the structural and semantic properties of the graph. We factorize the likelihood of graph generation into two components: 1) attribute generation and 2) edge generation. By modeling both components, GPT-GNN captures the inherent dependency between node attributes and graph structure during the generative process. Comprehensive experiments on the billion-scale open academic graph and Amazon recommendation data demonstrate that GPT-GNN significantly outperforms state-of-the-art GNN models without pre-training by up to 9.1% across various downstream tasks.

Parameterized Correlation Clustering in Hypergraphs and Bipartite Graphs

Motivated by applications in community detection and dense subgraph discovery, we consider new clustering objectives in hypergraphs and bipartite graphs. These objectives are parameterized by one or more resolution parameters in order to enable diverse knowledge discovery in complex data.

For both hypergraph and bipartite objectives, we identify relevant parameter regimes that are equivalent to existing objectives and share their (polynomial-time) approximation algorithms. We first show that our parameterized hypergraph correlation clustering objective is related to higher-order notions of normalized cut and modularity in hypergraphs. It is further amenable to approximation algorithms via hyperedge expansion techniques.

Our parameterized bipartite correlation clustering objective generalizes standard unweighted bipartite correlation clustering, as well as the bicluster deletion problem. For a certain choice of parameters it is also related to our hypergraph objective. Although in general it is NP-hard, we highlight a parameter regime for the bipartite objective where the problem reduces to the bipartite matching problem and thus can be solved in polynomial time. For other parameter settings, we present several approximation algorithms using linear program rounding techniques. These results allow us to introduce the first constant-factor approximation for bicluster deletion, the task of removing a minimum number of edges to partition a bipartite graph into disjoint bi-cliques.

In several experimental results, we highlight the flexibility of our framework and the diversity of results that can be obtained in different parameter settings. This includes clustering bipartite graphs across a range of parameters, detecting motif-rich clusters in an email network and a food web, and forming clusters of retail products in a product review hypergraph, that are highly correlated with known product categories.

Prioritized Restreaming Algorithms for Balanced Graph Partitioning

Balanced graph partitioning is a critical step for many large-scale distributed computations with relational data. As graph datasets have grown in size and density, a range of highly-scalable balanced partitioning algorithms have appeared to meet varied demands across different domains. As the starting point for the present work, we observe that two recently introduced families of iterative partitioners, those based on restreaming and those based on balanced label propagation (including Facebook's Social Hash Partitioner), can be viewed through a common modular framework of design decisions. With the help of this modular perspective, we find that a key combination of design decisions leads to a novel family of algorithms with notably better empirical performance than any existing highly-scalable algorithm on a broad range of real-world graphs. The resulting prioritized restreaming algorithms employ a constraint management strategy based on multiplicative weights, borrowed from the restreaming literature, while adopting notions of priority from balanced label propagation to optimize the ordering of the streaming process. Our experimental results consider a range of stream orders, where a dynamic ordering based on what we call ambivalence is broadly the best-performing in terms of the cut quality of the resulting balanced partitions, with a static ordering based on degree being nearly as good.

A Non-Iterative Quantile Change Detection Method in Mixture Model with Heavy-Tailed Components

Estimating the parameters of a mixture model has wide applications, ranging from classification problems to the estimation of complex distributions. Most of the current literature on estimating the parameters of mixture densities is based on iterative Expectation Maximization (EM) type algorithms, which require either taking expectations over the latent label variables or generating samples from the conditional distribution of such latent labels using the Bayes rule. Moreover, when the number of components is unknown, the problem becomes computationally more demanding due to well-known label switching issues [28]. In this paper, we propose a robust and quick approach based on change-point methods to determine the number of mixture components that works for almost any location-scale family, even when the components are heavy tailed (e.g., Cauchy). We present several numerical illustrations by comparing our method with some of the popular methods available in the literature using simulated data and real case studies. The proposed method is shown to be as much as 500 times faster than some of the competing methods and is also shown to be more accurate in estimating the mixture distributions by goodness-of-fit tests.

AdvMind: Inferring Adversary Intent of Black-Box Attacks

Deep neural networks (DNNs) are inherently susceptible to adversarial attacks even under black-box settings, in which the adversary only has query access to the target models. In practice, while it may be possible to effectively detect such attacks (e.g., observing massive similar but non-identical queries), it is often challenging to exactly infer the adversary intent (e.g., the target class of the adversarial example the adversary attempts to craft) especially during early stages of the attacks, which is crucial for performing effective deterrence and remediation of the threats in many scenarios.

In this paper, we present AdvMind, a new class of estimation models that infer the adversary intent of black-box adversarial attacks in a robust and prompt manner. Specifically, to achieve robust detection, AdvMind accounts for the adversary adaptiveness such that her attempt to conceal the target will significantly increase the attack cost (e.g., in terms of the number of queries); to achieve prompt detection, AdvMind proactively synthesizes plausible query results to solicit subsequent queries from the adversary that maximally expose her intent. Through extensive empirical evaluation on benchmark datasets and state-of-the-art black-box attacks, we demonstrate that on average AdvMind detects the adversary intent with over 75% accuracy after observing less than 3 query batches and meanwhile increases the cost of adaptive attacks by over 60%. We further discuss the possible synergy between AdvMind and other defense methods against black-box adversarial attacks, pointing to several promising research directions.

Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding

Mining a set of meaningful topics organized into a hierarchy is intuitively appealing since topic correlations are ubiquitous in massive text corpora. To account for potential hierarchical topic structures, hierarchical topic models generalize flat topic models by incorporating latent topic hierarchies into their generative modeling process. However, due to their purely unsupervised nature, the learned topic hierarchy often deviates from users' particular needs or interests. To guide the hierarchical topic discovery process with minimal user supervision, we propose a new task, Hierarchical Topic Mining, which takes a category tree described by category names only, and aims to mine a set of representative terms for each category from a text corpus to help a user comprehend his/her interested topics. We develop a novel joint tree and text embedding method along with a principled optimization procedure that allows simultaneous modeling of the category tree structure and the corpus generative process in the spherical space for effective category-representative term discovery. Our comprehensive experiments show that our model, named JoSH, mines a high-quality set of hierarchical topics with high efficiency and benefits weakly-supervised hierarchical text classification tasks.

Combinatorial Black-Box Optimization with Expert Advice

We consider the problem of black-box function optimization over the Boolean hypercube. Despite the vast literature on black-box function optimization over continuous domains, not much attention has been paid to learning models for optimization over combinatorial domains until recently. However, the computational complexity of the recently devised algorithms is prohibitive even for moderate numbers of variables; drawing one sample using the existing algorithms is more expensive than a function evaluation for many black-box functions of interest. To address this problem, we propose a computationally efficient model learning algorithm based on multilinear polynomials and exponential weight updates. In the proposed algorithm, we alternate between simulated annealing with respect to the current polynomial representation and updating the weights using monomial experts' advice. Numerical experiments on various datasets in both unconstrained and sum-constrained Boolean optimization indicate the competitive performance of the proposed algorithm, while improving the computational time by up to several orders of magnitude compared to state-of-the-art algorithms in the literature.

CoRel: Seed-Guided Topical Taxonomy Construction by Concept Learning and Relation Transferring

Taxonomy is not only a fundamental form of knowledge representation, but also crucial to vast knowledge-rich applications, such as question answering and web search. Most existing taxonomy construction methods extract hypernym-hyponym entity pairs to organize a "universal" taxonomy. However, these generic taxonomies cannot satisfy user's specific interest in certain areas and relations. Moreover, the nature of instance taxonomy treats each node as a single word, which has low semantic coverage for people to fully understand. In this paper, we propose a method for seed-guided topical taxonomy construction, which takes a corpus and a seed taxonomy described by concept names as input, and constructs a more complete taxonomy based on user's interest, wherein each node is represented by a cluster of coherent terms. Our framework, CoRel, has two modules to fulfill this goal. A relation transferring module learns and transfers the user's interested relation along multiple paths to expand the seed taxonomy structure in width and depth. A concept learning module enriches the semantics of each concept node by jointly embedding the taxonomy and text. Comprehensive experiments conducted on real-world datasets show that CoRel generates high-quality topical taxonomies and outperforms all the baselines significantly.

Treatment Policy Learning in Multiobjective Settings with Fully Observed Outcomes

In several medical decision-making problems, such as antibiotic prescription, laboratory testing can provide precise indications for how a patient will respond to different treatment options. This enables us to "fully observe" all potential treatment outcomes, but while present in historical data, these results are infeasible to produce in real-time at the point of the initial treatment decision. Moreover, treatment policies in these settings often need to trade off between multiple competing objectives, such as effectiveness of treatment and harmful side effects. We present, compare, and evaluate three approaches for learning individualized treatment policies in this setting: First, we consider two indirect approaches, which use predictive models of treatment response to construct policies optimal for different trade-offs between objectives. Second, we consider a direct approach that constructs such a set of policies without intermediate models of outcomes. Using a medical dataset of Urinary Tract Infection (UTI) patients, we show that all approaches learn policies that achieve strictly better performance on all outcomes than clinicians, while also trading off between different objectives. We demonstrate additional benefits of the direct approach, including flexibly incorporating other goals such as deferral to physicians on simple cases.

List-wise Fairness Criterion for Point Processes

Many types of event sequence data exhibit triggering and clustering properties in space and time. Point processes are widely used in modeling such event data with applications such as predictive policing and disaster event forecasting. Although current algorithms can achieve significant event prediction accuracy, the historic data or the self-excitation property can introduce biased prediction. For example, hotspots ranked by event hazard rates can make the visibility of a disadvantaged group (e.g., racial minorities or the communities of lower social economic status) more apparent. Existing methods have explored ways to achieve parity between the groups by penalizing the objective function with several group fairness metrics. However, these metrics fail to measure the fairness on every prefix of the ranking. In this paper, we propose a novel list-wise fairness criterion for point processes, which can efficiently evaluate the ranking fairness in event prediction. We also present a strict definition of the unfairness consistency property of a fairness metric and prove that our list-wise fairness criterion satisfies this property. Experiments on several real-world spatial-temporal sequence datasets demonstrate the effectiveness of our list-wise fairness criterion.

Neural Subgraph Isomorphism Counting

In this paper, we study a new graph learning problem: learning to count subgraph isomorphisms. Different from other traditional graph learning problems such as node classification and link prediction, subgraph isomorphism counting is NP-complete and requires more global inference to oversee the whole graph. To make it scalable for large-scale graphs and patterns, we propose a learning framework that augments different representation learning architectures and iteratively attends pattern and target data graphs to memorize intermediate states of subgraph isomorphism searching for global counting. We develop both small graphs (<= 1,024 subgraph isomorphisms in each) and large graphs (<= 4,096 subgraph isomorphisms in each) sets to evaluate different representation and interaction modules. A mutagenic compound dataset, MUTAG, is also used to evaluate neural models and demonstrate the success of transfer learning. While the learning based approach is inexact, we are able to generalize to count large patterns and data graphs in linear time compared to the exponential time of the original NP-complete problem. Experimental results show that learning based subgraph isomorphism counting can speed up the traditional algorithm, VF2, 10-1,000 times with acceptable errors. Domain adaptation based on fine-tuning also shows the usefulness of our approach in real-world applications.
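
To make the counting target and its cost concrete, the following brute-force sketch enumerates every injective assignment of pattern vertices to graph vertices and keeps those that preserve all pattern edges. It is an illustrative exact baseline, not the neural model and not VF2. It assumes undirected, unlabeled graphs and counts non-induced edge-preserving mappings, so symmetric mappings of the same subgraph are counted separately; the abstract does not state which convention the paper adopts.

```python
from itertools import permutations


def count_subgraph_isomorphisms(pattern_edges, n_pattern, graph_edges, n_graph):
    """Count injective vertex mappings that send every pattern edge to a graph edge."""
    # Store both orientations so that undirected edge lookups are O(1).
    graph = set()
    for u, v in graph_edges:
        graph.add((u, v))
        graph.add((v, u))
    count = 0
    # Enumerating all ordered selections of graph vertices is the exponential part.
    for image in permutations(range(n_graph), n_pattern):
        if all((image[u], image[v]) in graph for u, v in pattern_edges):
            count += 1
    return count


triangle = [(0, 1), (1, 2), (0, 2)]
complete_k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
four_cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]

in_k4 = count_subgraph_isomorphisms(triangle, 3, complete_k4, 4)
in_cycle = count_subgraph_isomorphisms(triangle, 3, four_cycle, 4)
```

The number of candidate mappings grows as n_graph! / (n_graph - n_pattern)!, which is why exact enumeration quickly becomes infeasible and an approximate learned counter is attractive.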

Hypergraph Clustering Based on PageRank

A hypergraph is a useful combinatorial object to model ternary or higher-order relations among entities. Clustering hypergraphs is a fundamental task in network analysis. In this study, we develop two clustering algorithms based on personalized PageRank on hypergraphs. The first one is local in the sense that its goal is to find a tightly connected vertex set with a bounded volume including a specified vertex. The second one is global in the sense that its goal is to find a tightly connected vertex set. For both algorithms, we discuss theoretical guarantees on the conductance of the output vertex set. Also, we experimentally demonstrate that our clustering algorithms outperform existing methods in terms of both the solution quality and running time. To the best of our knowledge, ours are the first practical algorithms for hypergraphs with theoretical guarantees on the conductance of the output set.

DeepSinger: Singing Voice Synthesis with Data Mined From the Web

In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system, which is built from scratch using singing training data mined from music websites. The pipeline of DeepSinger consists of several steps, including data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing alignment model to automatically extract the duration of each phoneme in lyrics, starting from the coarse-grained sentence level and moving to the fine-grained phoneme level, and further design a multi-lingual multi-singer singing model based on a feed-forward Transformer to directly generate linear-spectrograms from lyrics, and synthesize voices using Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites, 2) the lyrics-to-singing alignment model further avoids any human effort for alignment labeling and greatly reduces labeling cost, 3) the singing model based on a feed-forward Transformer is simple and efficient, by removing the complicated acoustic feature modeling in parametric synthesis and leveraging a reference encoder to capture the timbre of a singer from noisy singing data, and 4) it can synthesize singing voices in multiple languages and for multiple singers. We evaluate DeepSinger on our mined singing dataset that consists of about 92 hours of data from 89 singers in three languages (Chinese, Cantonese and English). The results demonstrate that with the singing data purely mined from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness. Our audio samples are shown in https://speechresearch.github.io/deepsinger/.

Scaling Choice Models of Relational Social Data

Many prediction problems on social networks, from recommendations to anomaly detection, can be approached by modeling network data as a sequence of relational events and then leveraging the resulting model for prediction. Conditional logit models of discrete choice are a natural approach to modeling relational events as "choices" in a framework that envelops and extends many long-studied models of network formation. The conditional logit model is simplistic, but it is particularly attractive because it allows for efficient consistent likelihood maximization via negative sampling, something that isn't true for mixed logit and many other richer models. The value of negative sampling is particularly pronounced because choice sets in relational data are often enormous. Given the importance of negative sampling, in this work we introduce a model simplification technique for mixed logit models that we call "de-mixing", whereby standard mixture models of network formation (particularly models that mix local and global link formation) are reformulated to operate their modes over disjoint choice sets. This reformulation reduces mixed logit models to conditional logit models, opening the door to negative sampling while also circumventing other standard challenges with maximizing mixture model likelihoods. To further improve scalability, we also study importance sampling for more efficiently selecting negative samples, finding that it can greatly speed up inference in both standard and de-mixed models. Together, these steps make it possible to much more realistically model network formation in very large graphs. We illustrate the relative gains of our improvements on synthetic datasets with known ground truth as well as a large-scale dataset of public transactions on the Venmo platform.
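
As a rough illustration of why negative sampling matters here, the sketch below evaluates a conditional logit choice probability once over a full choice set and once over a reduced set made of the chosen item plus a few uniformly sampled negatives. The utilities are random placeholders and the helper names are hypothetical; a real model would compute utilities from node features and fitted parameters, and the de-mixing and importance-sampling steps of the paper are not shown.

```python
import math
import random


def choice_log_prob(utilities, chosen, choice_set):
    """log P(chosen | choice_set) under a conditional logit (softmax over utilities)."""
    # Subtract the maximum utility before exponentiating, for numerical stability.
    top = max(utilities[j] for j in choice_set)
    log_norm = top + math.log(sum(math.exp(utilities[j] - top) for j in choice_set))
    return utilities[chosen] - log_norm


def sampled_choice_set(items, chosen, n_negatives, rng):
    """The chosen item plus n_negatives alternatives drawn uniformly without replacement."""
    negatives = rng.sample([j for j in items if j != chosen], n_negatives)
    return [chosen] + negatives


rng = random.Random(0)
items = list(range(1000))  # e.g. all candidate target nodes for a new edge
utilities = {j: rng.gauss(0.0, 1.0) for j in items}  # placeholder utilities
chosen = 42

full = choice_log_prob(utilities, chosen, items)  # normalizer sums over 1000 terms
reduced_set = sampled_choice_set(items, chosen, 20, rng)
reduced = choice_log_prob(utilities, chosen, reduced_set)  # normalizer sums over 21 terms
```

Each likelihood term on the reduced set costs time proportional to the number of sampled negatives rather than to the size of the full choice set, which is the saving that makes very large graphs tractable.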

Deep Exogenous and Endogenous Influence Combination for Social Chatter Intensity Prediction

Modeling user engagement dynamics on social media has compelling applications in market trend analysis, user-persona detection, and political discourse mining. Most existing approaches depend heavily on knowledge of the underlying user network. However, a large number of discussions happen on platforms that either lack any reliable social network (news portal, blogs, Buzzfeed) or reveal only partially the inter-user ties (Reddit, Stackoverflow). Many approaches require observing a discussion for some considerable period before they can make useful predictions. In real-time streaming scenarios, observations incur costs. Lastly, most models do not capture complex interactions between exogenous events (such as news articles published externally) and in-network effects (such as follow-up discussions on Reddit) to determine engagement levels. To address the three limitations noted above, we propose a novel framework, ChatterNet, which, to our knowledge, is the first that can model and predict user engagement without considering the underlying user network. Given streams of timestamped news articles and discussions, the task is to observe the streams for a short period leading up to a time horizon, then predict chatter: the volume of discussions through a specified period after the horizon. ChatterNet processes text from news and discussions using a novel time-evolving recurrent network architecture that captures both temporal properties within news and discussions, as well as influence of news on discussions. We report on extensive experiments using a two-month-long discussion corpus of Reddit, and a contemporaneous corpus of online news articles from the Common Crawl. ChatterNet shows considerable improvements beyond recent state-of-the-art models of engagement prediction. Detailed studies controlling observation and prediction windows, over 43 different subreddits, yield further useful insights.

Geography-Aware Sequential Location Recommendation

Sequential location recommendation plays an important role in many applications such as mobility prediction, route planning and location-based advertisements. In spite of evolving from tensor factorization to RNN-based neural networks, existing methods did not make effective use of geographical information and suffered from the sparsity issue. To this end, we propose a Geography-aware sequential recommender based on the Self-Attention Network (GeoSAN for short) for location recommendation. On the one hand, we propose a new loss function based on importance sampling for optimization, to address the sparsity issue by emphasizing the use of informative negative samples. On the other hand, to make better use of geographical information, GeoSAN represents the hierarchical gridding of each GPS point with a self-attention based geography encoder. Moreover, we put forward geography-aware negative samplers to promote the informativeness of negative samples. We evaluate the proposed algorithm with three real-world LBSN datasets, and show that GeoSAN outperforms the state-of-the-art sequential location recommenders by 34.9%. The experimental results further verify significant effectiveness of the new loss function, geography encoder, and geography-aware negative samplers.
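
One widely used way to obtain a hierarchical gridding of a GPS point is the map-tile quadkey, where each additional character refines the grid cell by a factor of four. The sketch below is an assumption for illustration only: the abstract does not say which gridding scheme GeoSAN uses, and the function name and example coordinates are hypothetical. It follows the standard Web Mercator tile arithmetic.

```python
import math


def quadkey(lat, lon, level):
    """Quadkey of the Web Mercator tile containing (lat, lon) at the given zoom level."""
    # Web Mercator is only defined up to about +/-85.05 degrees of latitude.
    lat = max(min(lat, 85.05112878), -85.05112878)
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    n = 1 << level  # number of tiles along each axis at this level
    tile_x = min(max(int(x * n), 0), n - 1)
    tile_y = min(max(int(y * n), 0), n - 1)
    digits = []
    # Interleave the bits of tile_x and tile_y, most significant bit first.
    for i in range(level, 0, -1):
        mask = 1 << (i - 1)
        digit = (1 if tile_x & mask else 0) + (2 if tile_y & mask else 0)
        digits.append(str(digit))
    return "".join(digits)


coarse = quadkey(39.9, 116.4, 8)   # a point in Beijing, coarse cell
fine = quadkey(39.9, 116.4, 17)    # the same point, much finer cell
```

Because the quadkey of a fine cell always starts with the quadkey of the coarser cell that contains it, nearby locations share long prefixes, which is the kind of hierarchy a sequence encoder can exploit.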

Dual Channel Hypergraph Collaborative Filtering

Collaborative filtering (CF) is one of the most popular and important recommendation methodologies in the heart of numerous recommender systems today. Although widely adopted, existing CF-based methods, ranging from matrix factorization to the emerging graph-based methods, suffer inferior performance especially when the data for training are very limited. In this paper, we first pinpoint the root causes of such deficiency and observe two main disadvantages that stem from the inherent designs of existing CF-based methods, i.e., 1) inflexible modeling of users and items and 2) insufficient modeling of high-order correlations among the subjects. Under such circumstances, we propose a dual channel hypergraph collaborative filtering (DHCF) framework to tackle the above issues. First, a dual channel learning strategy, which holistically leverages the divide-and-conquer strategy, is introduced to learn the representation of users and items so that these two types of data can be elegantly interconnected while still maintaining their specific properties. Second, the hypergraph structure is employed for modeling users and items with explicit hybrid high-order correlations. The jump hypergraph convolution (JHConv) method is proposed to support the explicit and efficient embedding propagation of high-order correlations. Comprehensive experiments on two public benchmarks and two new real-world datasets demonstrate that DHCF can achieve significant and consistent improvements against other state-of-the-art methods.

A Framework for Recommending Accurate and Diverse Items Using Bayesian Graph Convolutional Neural Networks

Personalized recommender systems are playing an increasingly important role for online consumption platforms. Because of the multitude of relationships existing in recommender systems, Graph Neural Networks (GNNs) based approaches have been proposed to better characterize the various relationships between a user and items while modeling a user's preferences. Previous graph-based recommendation approaches process the observed user-item interaction graph as a ground-truth depiction of the relationships between users and items. However, especially in the implicit recommendation setting, all the unobserved user-item interactions are usually assumed to be negative samples. There are missing links that represent a user's future actions. In addition, there may be spurious or misleading positive interactions. To alleviate the above issue, in this work, we take a first step to introduce a principled way to model the uncertainty in the user-item interaction graph using the Bayesian Graph Convolutional Neural Network framework. We discuss how inference can be performed under our framework and provide a concrete formulation using the Bayesian Probabilistic Ranking training loss. We demonstrate the effectiveness of our proposed framework on four benchmark recommendation datasets. The proposed method outperforms state-of-the-art graph-based recommendation models. Furthermore, we conducted an offline evaluation on one industrial large-scale dataset. It shows that our proposed method outperforms the baselines, with the potential gain being more significant for cold-start users. This illustrates the potential practical benefit in real-world recommender systems.

Learning Based Distributed Tracking

Inspired by the great success of machine learning in the past decade, people have been thinking about the possibility of improving theoretical results by exploring the data distribution. In this paper, we revisit a fundamental problem called Distributed Tracking (DT) under the assumption that the data follows a certain (known or unknown) distribution, and propose a number of data-dependent algorithms with improved theoretical bounds. Informally, in the DT problem, there is a coordinator and k players, where the coordinator holds a threshold N and each player has a counter. At each time stamp, at most one counter can be increased by one. The job of the coordinator is to capture the exact moment when the sum of all these k counters reaches N. The goal is to minimise the communication cost. While our first type of algorithm assumes the concrete data distribution is known in advance, our second type of algorithm can learn the distribution on the fly. Both types of algorithms achieve a communication cost bounded by O(k log log N) with high probability, improving the state-of-the-art data-independent bound O(k log(N/k)). We further propose a number of implementation optimisation heuristics to improve both the efficiency and the robustness of the algorithms. Finally, we conduct extensive experiments on three real datasets and four synthetic datasets. The experimental results show that the communication cost of our algorithms is as little as 20% of that of the state-of-the-art algorithms.
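
The DT setting can be simulated with a simple round-based scheme in the spirit of the data-independent baseline: in each round every player signals after a fixed number of local increments, and after k signals the coordinator polls all counters and shrinks the step. This is a sketch under stated assumptions, not the paper's data-dependent algorithms; the function name, the message accounting (one message per signal, 2k per poll), and the test streams are all hypothetical choices.

```python
def track_threshold(stream, k, n):
    """Simulate round-based distributed tracking.

    stream yields the index of the player whose counter increases at each step.
    Returns (step at which the coordinator detects that the sum reached n, messages sent),
    or (None, messages) if the stream ends first.
    """
    counters = [0] * k      # true local counters, visible only to the players
    unreported = [0] * k    # increments each player has not yet signalled
    known = 0               # total the coordinator knows exactly
    step = max(1, (n - known) // (2 * k))
    signals = 0
    messages = 0
    for t, player in enumerate(stream, start=1):
        counters[player] += 1
        unreported[player] += 1
        if unreported[player] < step:
            continue
        unreported[player] = 0
        signals += 1
        messages += 1  # the player sends one signal to the coordinator
        if step == 1:
            # Final phase: every increment is reported, so the total is exact.
            known += 1
            if known == n:
                return t, messages
        elif signals == k:
            # End of round: the coordinator polls all k players (k requests, k replies).
            messages += 2 * k
            known = sum(counters)
            unreported = [0] * k
            signals = 0
            step = max(1, (n - known) // (2 * k))
    return None, messages


round_robin = [t % 4 for t in range(2000)]
hit, msgs = track_threshold(round_robin, k=4, n=1000)
```

Within a round the true sum stays below known + 2k * step, which never exceeds N, so the threshold cannot be crossed unnoticed before the final phase; the remaining slack shrinks geometrically from round to round, so only logarithmically many rounds are needed.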

Tight Sensitivity Bounds For Smaller Coresets

An ε-coreset for the dimensionality reduction problem for a (possibly very large) matrix A ∈ R^{n×d} is a small scaled subset of its n rows that approximates their sum of squared distances to every affine k-dimensional subspace of R^d, up to a factor of 1±ε. Such a coreset is useful for boosting the running time of computing a low-rank approximation (k-SVD/k-PCA) while using small memory. Coresets are also useful for handling streaming, dynamic and distributed data in parallel. With high probability, non-uniform sampling based on the so-called leverage score or sensitivity of each row in A yields a coreset. The size of the (sampled) coreset is then near-linear in the total sum of these sensitivity bounds. We provide algorithms that compute provably tight bounds for the sensitivity of each input row. They are based on two ingredients: (i) an iterative algorithm that computes the exact sensitivity of each row up to arbitrarily small precision for (non-affine) k-subspaces, and (ii) a general reduction for computing a coreset for affine subspaces, given a coreset for (non-affine) subspaces in R^d. Experimental results on real-world datasets, including the English Wikipedia document-term matrix, show that our bounds provide significantly smaller and data-dependent coresets also in practice. Full open source code is also provided.

GHashing: Semantic Graph Hashing for Approximate Similarity Search in Graph Databases

Graph similarity search aims to find the most similar graphs to a query in a graph database in terms of a given proximity measure, say Graph Edit Distance (GED). It is a widely studied yet still challenging problem. Most of the studies are based on the pruning-verification framework, which first prunes non-promising graphs and then conducts verification on the small candidate set. Existing methods are capable of managing databases with thousands or tens of thousands of graphs, but fail to scale to even larger databases, due to their exact pruning strategy. Inspired by the recent success of deep-learning-based semantic hashing in image and document retrieval, we propose a novel graph neural network (GNN) based semantic hashing, i.e., GHashing, for approximate pruning. We first train a GNN with ground-truth GED results so that it learns to generate embeddings and hash codes that preserve GED between graphs. Then a hash index is built to enable graph lookup in constant time. To answer a query, we use the hash codes and the continuous embeddings as two-level pruning to retrieve the most promising candidates, which are sent to the exact solver for final verification. Due to the approximate pruning strategy leveraged by our graph hashing technique, our approach achieves significantly faster query time compared to state-of-the-art methods while maintaining a high recall. Experiments show that our approach is on average 20x faster than the only baseline that works on million-scale databases, which demonstrates that GHashing successfully provides a new direction in addressing the graph search problem for large-scale graph databases.

Interactive Path Reasoning on Graph for Conversational Recommendation

Traditional recommendation systems estimate user preference on items from past interaction history, and thus struggle to obtain fine-grained and dynamic user preference. Conversational recommendation systems (CRS) address those limitations by enabling the system to directly ask users about their preferred attributes on items. However, existing CRS methods do not make full use of this advantage: they only use the attribute feedback in rather implicit ways, such as updating the latent user representation. In this paper, we propose Conversational Path Reasoning (CPR), a generic framework that models conversational recommendation as an interactive path reasoning problem on a graph. It walks through the attribute vertices by following user feedback, utilizing the user-preferred attributes in an explicit way. By leveraging the graph structure, CPR is able to prune off many irrelevant candidate attributes, leading to a better chance of hitting user-preferred attributes. To demonstrate how CPR works, we propose a simple yet effective instantiation named SCPR (Simple CPR). We perform empirical studies on the multi-round conversational recommendation scenario, the most realistic CRS setting so far that considers multiple rounds of asking attributes and recommending items. Through extensive experiments on two datasets, Yelp and LastFM, we validate the effectiveness of our SCPR, which significantly outperforms the state-of-the-art CRS methods EAR and CRM. In particular, we find that the more attributes there are, the more advantages our method can achieve.

Algorithmic Aspects of Temporal Betweenness

The betweenness centrality of a graph vertex measures how often this vertex is visited on shortest paths between other vertices of the graph. In the analysis of many real-world graphs or networks, betweenness centrality of a vertex is used as an indicator for its relative importance in the network. In recent years, a growing number of real-world networks are modeled as temporal graphs instead of conventional (static) graphs. In a temporal graph, we have a fixed set of vertices and a finite discrete set of time steps, and every edge might be present only at some time steps. While shortest paths are straightforward to define in static graphs, temporal paths can be considered "optimal" with respect to many different criteria, including length, arrival time, and overall travel time (shortest, foremost, and fastest paths). This leads to different concepts of temporal betweenness centrality, posing new challenges on the algorithmic side. We provide a systematic study of temporal betweenness variants based on various concepts of optimal temporal paths both on a theoretical and empirical level.
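
As a small illustration of one of the optimality criteria named above, the sketch below computes earliest arrival times (the arrival times of foremost paths) from a source in a temporal graph. It does not compute betweenness itself. The edge encoding `(u, v, t)`, the strict-increase rule for time labels, and the function name `earliest_arrival` are assumptions made for this example.

```python
import math


def earliest_arrival(temporal_edges, source):
    """Earliest arrival time at each reachable vertex.

    temporal_edges: iterable of directed edges (u, v, t), meaning the edge
    u -> v is present at time step t (assumed to be an integer >= 1).
    A temporal path must use strictly increasing time steps, and taking
    an edge at time t means arriving at time t.
    """
    arrival = {source: 0}  # the source is available from time 0
    # One pass over the edges in chronological order is enough.
    for u, v, t in sorted(temporal_edges, key=lambda e: e[2]):
        if arrival.get(u, math.inf) < t and t < arrival.get(v, math.inf):
            arrival[v] = t
    return arrival


# Hypothetical toy graph.
edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 2), ("c", "d", 6)]
print(earliest_arrival(edges, "a"))
```

In this toy graph the edge c -> d at time 2 cannot be used, because c is itself only reached at time 2. The vertex d is reached at time 6 both by the three-hop path via b and by the two-hop path a -> c -> d, which shows how "shortest" and "foremost" can single out different paths.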

Non-Linear Mining of Social Activities in Tensor Streams

Given a large time-evolving event series such as Google web-search logs, which are collected according to various aspects, i.e., timestamps, locations and keywords, how accurately can we forecast their future activities? How can we reveal significant patterns that allow us to long-term forecast from such complex tensor streams? In this paper, we propose a streaming method, namely, CubeCast, that is designed to capture basic trends and seasonality in tensor streams and extract temporal and multi-dimensional relationships between such dynamics. Our proposed method has the following properties: (a) it is effective: it finds both trends and seasonality and summarizes their dynamics into simultaneous non-linear latent space. (b) it is automatic: it automatically recognizes and models such structural patterns without any parameter tuning or prior information. (c) it is scalable: it incrementally and adaptively detects shifting points of patterns for a semi-infinite collection of tensor streams. Extensive experiments that we conducted on real datasets demonstrate that our algorithm can effectively and efficiently find meaningful patterns for generating future values, and outperforms the state-of-the-art algorithms for time series forecasting in terms of forecasting accuracy and computational time.

DeepLine: AutoML Tool for Pipelines Generation using Deep Reinforcement Learning and Hierarchical Actions Filtering

Automatic Machine Learning (AutoML) is an area of research aimed at automating Machine Learning (ML) activities that currently require the involvement of human experts. One of the most challenging tasks in this field is the automatic generation of end-to-end ML pipelines: combining multiple types of ML algorithms into a single architecture used for analysis of previously unseen data. This task has two challenging aspects: the first is the need to explore a large search space of algorithms and pipeline architectures. The second challenge is the computational cost of training and evaluating multiple pipelines. In this study we present DeepLine, a reinforcement learning-based approach for automatic pipeline generation. Our proposed approach utilizes an efficient representation of the search space together with a novel method for operating in environments with large and dynamic action spaces. By leveraging past knowledge gained from previously analyzed datasets, our approach only needs to generate and evaluate a few dozen pipelines to reach comparable or better performance than current state-of-the-art AutoML systems that evaluate hundreds and even thousands of pipelines in their optimization process. Evaluation on 56 classification datasets demonstrates the merits of our approach.

On Sampling Top-K Recommendation Evaluation

Recently, Rendle has warned that the use of sampling-based top-k metrics might not suffice. This throws a number of recent studies on deep learning-based recommendation algorithms, and classic non-deep-learning algorithms using such a metric, into jeopardy. In this work, we thoroughly investigate the relationship between the sampling and global top-K Hit-Ratio (HR, or Recall), originally proposed by Koren [2] and extensively used by others. By formulating the problem of aligning sampling top-k (SHR@k) and global top-K (HR@K) Hit-Ratios through a mapping function f, so that SHR@k ≈ HR@f(k), we demonstrate both theoretically and experimentally that the sampling top-k Hit-Ratio provides an accurate approximation of its global (exact) counterpart, and can consistently predict the correct winners (the same as indicated by their corresponding global Hit-Ratios).
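
The two quantities being compared can be sketched as follows. This is a minimal illustration on synthetic scores, assuming one held-out target item per user and uniform sampling of negatives; it does not implement the mapping function f studied in the paper. The function names `global_hr` and `sampled_hr` and the score layout are hypothetical.

```python
import random


def rank_of(target_score, other_scores):
    """1-based rank of the target among the given competitor scores."""
    return 1 + sum(s > target_score for s in other_scores)


def global_hr(scores, targets, K):
    """HR@K: the target is ranked against all other items."""
    hits = 0
    for user_scores, target in zip(scores, targets):
        others = [s for i, s in enumerate(user_scores) if i != target]
        hits += rank_of(user_scores[target], others) <= K
    return hits / len(targets)


def sampled_hr(scores, targets, k, n_neg, rng):
    """SHR@k: the target is ranked against n_neg sampled negatives only."""
    hits = 0
    for user_scores, target in zip(scores, targets):
        candidates = [i for i in range(len(user_scores)) if i != target]
        negatives = rng.sample(candidates, n_neg)
        sampled = [user_scores[i] for i in negatives]
        hits += rank_of(user_scores[target], sampled) <= k
    return hits / len(targets)


# Synthetic data: random scores with a boost for each user's target item.
rng = random.Random(0)
n_users, n_items = 200, 100
scores = [[rng.random() for _ in range(n_items)] for _ in range(n_users)]
targets = [rng.randrange(n_items) for _ in range(n_users)]
for user_scores, target in zip(scores, targets):
    user_scores[target] += 0.3

print(global_hr(scores, targets, K=5))
print(sampled_hr(scores, targets, k=5, n_neg=20, rng=rng))
```

Because a target's rank among a subset of items can never be worse than its rank among all items, SHR@k is at least HR@k for the same cutoff, which is why a mapping between the two cutoffs is needed before the values can be compared.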

Algorithmic Decision Making with Conditional Fairness

Nowadays, fairness issues have raised great concerns in decision-making systems. Various fairness notions have been proposed to measure the degree to which an algorithm is unfair. In practice, there frequently exists a certain set of variables that we term fair variables, which are pre-decision covariates such as users' choices. The effects of fair variables are irrelevant in assessing the fairness of the decision support algorithm. We thus define conditional fairness as a sounder fairness metric by conditioning on the fair variables. Given different prior knowledge of fair variables, we demonstrate that traditional fairness notions, such as demographic parity and equalized odds, are special cases of our conditional fairness notion. Moreover, we propose a Derivable Conditional Fairness Regularizer (DCFR), which can be integrated into any decision-making model, to track the trade-off between precision and fairness of algorithmic decision making. Specifically, an adversarial representation based conditional independence loss is proposed in our DCFR to measure the degree of unfairness. With extensive experiments on three real-world datasets, we demonstrate the advantages of our conditional fairness notion and DCFR.
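
One simple way to see why conditioning on a fair variable matters is to compare an unconditional demographic parity gap with the same gap computed inside each stratum of the fair variable. The sketch below does this on made-up admission-style records. It illustrates the idea of conditioning only and is not the DCFR regularizer; the record layout `(group, fair_value, decision)` and the function names are hypothetical.

```python
from collections import defaultdict


def positive_rate(records):
    """Fraction of records with a positive decision."""
    return sum(decision for _, _, decision in records) / len(records)


def parity_gap(records):
    """Absolute difference in positive rates between groups 0 and 1."""
    group0 = [r for r in records if r[0] == 0]
    group1 = [r for r in records if r[0] == 1]
    return abs(positive_rate(group0) - positive_rate(group1))


def conditional_parity_gap(records):
    """Largest parity gap over the strata of the fair variable.

    Assumes both groups appear in every stratum.
    """
    strata = defaultdict(list)
    for record in records:
        strata[record[1]].append(record)
    return max(parity_gap(stratum) for stratum in strata.values())


def make(group, fair_value, n, accepted):
    """Build n records for one group and stratum, `accepted` of them positive."""
    return [(group, fair_value, 1)] * accepted + [(group, fair_value, 0)] * (n - accepted)


# Made-up data: the fair variable is the department a user chose to apply to.
records = (
    make(0, "X", 80, 48) + make(1, "X", 20, 12)    # 60% accepted in both groups
    + make(0, "Y", 20, 4) + make(1, "Y", 80, 16)   # 20% accepted in both groups
)
print(parity_gap(records), conditional_parity_gap(records))
```

Here the unconditional gap is about 0.24 although acceptance rates are identical within each department, so the disparity is fully explained by the users' own choice of department.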

Semi-supervised Collaborative Filtering by Text-enhanced Domain Adaptation

Data sparsity is an inherent challenge in recommender systems, where most of the data is collected from the implicit feedback of users. This causes two difficulties in designing effective algorithms: first, the majority of users only have a few interactions with the system and there is not enough data for learning; second, there are no negative samples in the implicit feedback and it is common practice to perform negative sampling to generate negative samples. However, this means that many potential positive samples are mislabeled as negative ones, and data sparsity exacerbates the mislabeling problem.

To solve these difficulties, we regard the problem of recommendation on sparse implicit feedback as a semi-supervised learning task, and explore domain adaptation to solve it. We transfer the knowledge learned from dense data to sparse data, and we focus on the most challenging case, in which there is no user or item overlap.

In this extreme case, aligning the embeddings of two datasets directly is rather sub-optimal since the two latent spaces encode very different information. As such, we adopt domain-invariant textual features as the anchor points to align the latent spaces. To align the embeddings, we extract the textual features for each user and item and feed them into a domain classifier with the embeddings of users and items. The embeddings are trained to confuse the classifier, and the textual features are fixed as anchor points. By domain adaptation, the distribution pattern in the source domain is transferred to the target domain. As the target part can be supervised by domain adaptation, we abandon negative sampling in the target dataset to avoid label noise.

We adopt three pairs of real-world datasets to validate the effectiveness of our transfer strategy. Results show that our models outperform existing models significantly.

Rich Information is Affordable: A Systematic Performance Analysis of Second-order Optimization Using K-FAC

Rich information matrices from first and second-order derivatives have many potential applications in both theoretical and practical problems in deep learning. However, computing these information matrices is extremely expensive, and this enormous cost is currently limiting their application to important problems regarding generalization, hyperparameter tuning, and optimization of deep neural networks. One of the most challenging use cases of information matrices is their use as a preconditioner for the optimizers, since the information matrices need to be updated every step. In this work, we conduct a step-by-step performance analysis when computing the Fisher information matrix during training of ResNet-50 on ImageNet, and show that the overhead can be reduced to the same amount as the cost of performing a single SGD step. We also show that the resulting Fisher preconditioned optimizer can converge in 1/3 the number of epochs compared to SGD, while achieving the same Top-1 validation accuracy. This is the first work to achieve such accuracy with K-FAC while reducing the training time to match that of SGD.

Voronoi Graph Traversal in High Dimensions with Applications to Topological Data Analysis and Piecewise Linear Interpolation

Voronoi diagrams and their dual, the Delaunay complex, are two fundamental geometric concepts that lie at the foundation of many machine learning algorithms and play a role in particular in classical piecewise linear interpolation and regression methods. More recently, they are also crucial for the construction of a common class of simplicial complexes such as Alpha and Delaunay-Čech complexes in topological data analysis. We propose a randomized approximation approach that mitigates the prohibitive cost of exact computation of Voronoi diagrams in high dimensions for machine learning applications. In experiments with data in up to 50 dimensions, we show that this allows us to significantly extend the use of Voronoi-based simplicial complexes in Topological Data Analysis (TDA) to higher dimensions. We confirm prior TDA results on image patches that previously had to rely on sub-sampled data with increased resolution and demonstrate the scalability of our approach by performing a TDA analysis on synthetic data as well as on filters of a ResNet neural network architecture. Secondly, we propose an application of our approach to piecewise linear interpolation of high dimensional data that avoids explicit complete computation of an associated Delaunay triangulation.

MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, thus it can be used to find both statistically-significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This feature is a strong improvement over previously proposed solutions that could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm TFP-R for the task of True Frequent Pattern (TFP) mining. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state-of-the-art for their respective tasks.
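
For reference, the quantity being computed can be written down directly. The sketch below evaluates the n-trial MCERA of a small finite family by brute force: for each of n random sign vectors it takes the supremum over the family of the signed sample mean, then averages over the trials. It performs none of the poset-based pruning that MCRapper contributes, and the function name `mcera`, the toy transactions and the patterns are assumptions made for illustration.

```python
import random


def mcera(functions, sample, n_trials, rng):
    """Brute-force Monte-Carlo Empirical Rademacher Average.

    functions: finite list of callables mapping a sample point to a number.
    sample: list of sample points.
    n_trials: number of independent Rademacher vectors to draw.
    """
    m = len(sample)
    # Evaluate every function on every sample point once.
    values = [[f(x) for x in sample] for f in functions]
    total = 0.0
    for _ in range(n_trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        # Supremum over the family of the sigma-weighted sample mean.
        total += max(
            sum(s * v for s, v in zip(sigma, row)) / m for row in values
        )
    return total / n_trials


# Toy pattern-mining instance: each function indicates whether an itemset
# (pattern) is contained in a transaction.
transactions = [frozenset("abc"), frozenset("ab"), frozenset("bc"),
                frozenset("a"), frozenset("c")]
patterns = [frozenset("a"), frozenset("ab"), frozenset("z")]
functions = [lambda t, p=p: int(p <= t) for p in patterns]

print(mcera(functions, transactions, n_trials=200, rng=random.Random(0)))
```

The cost of this direct evaluation grows with the size of the family times the sample size for every trial, which is what makes pruning the search space attractive when the family is a large pattern lattice.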

REA: Robust Cross-lingual Entity Alignment Between Knowledge Graphs

Cross-lingual entity alignment aims at associating semantically similar entities in knowledge graphs in different languages. It has been an essential research problem for knowledge integration and knowledge graph connection, and has been studied with supervised or semi-supervised machine learning methods under the assumption of clean labeled data. However, labels from human annotations often include errors, which can largely affect the alignment results. We thus aim to formulate and explore the robust entity alignment problem, which is non-trivial, due to the deficiency of noisy labels. Our proposed method named REA (Robust Entity Alignment) consists of two components: noise detection and noise-aware entity alignment. The noise detection is designed by following the adversarial training principle. The noise-aware entity alignment is devised by leveraging a graph neural network based knowledge graph encoder as the core. In order to mutually boost the performance of the two components, we propose a unified reinforced training strategy to combine them. To evaluate our REA method, we conduct extensive experiments on several real-world datasets. The experimental results demonstrate the effectiveness of our proposed method and also show that our model consistently outperforms the state-of-the-art methods with significant improvement on alignment accuracy in the noise-involved scenario.

Stable Learning via Differentiated Variable Decorrelation

Recently, as the applications of artificial intelligence have gradually seeped into risk-sensitive areas such as justice, healthcare and autonomous driving, an upsurge of research interest in model stability and robustness has arisen in the field of machine learning. Rather than purely fitting the observed training data, stable learning tries to learn a model with uniformly good performance under non-stationary and agnostic testing data. The key challenge of stable learning in practice is that we do not have any knowledge about the true model and test data distribution a priori. Under such conditions, we cannot expect a faithful estimation of model parameters or stability over wildly changing environments. Previous methods resort to a reweighting scheme to remove the correlations between all the variables through a set of new sample weights. However, we argue that such aggressive decorrelation between all the variables may cause an over-reduced sample size, which leads to variance inflation and possible underperformance. In this paper, we incorporate the unlabeled data from multiple environments into the variable decorrelation framework and propose a Differentiated Variable Decorrelation (DVD) algorithm based on the clustering of variables. Specifically, the variables are clustered according to the stability of their correlations, and the variable decorrelation module learns a set of sample weights to remove the correlations merely between the variables of different clusters. Empirical studies on both synthetic and real-world datasets clearly demonstrate the efficacy of our DVD algorithm in improving the model parameter estimation and the prediction stability over changing distributions.

Learning Stable Graphs from Multiple Environments with Selection Bias

Nowadays, graphs have become a general and powerful representation to describe the rich relationships among different kinds of entities via the underlying patterns encoded in their structure. The knowledge (more generally) accumulated in a graph is expected to transfer from one population to another and from the past to the future. However, the data collection process of graph generation is full of known or unknown sample selection biases, leading to spurious correlations among entities, especially in non-stationary and heterogeneous environments. In this paper, we target the problem of learning stable graphs from multiple environments with selection bias. We propose a Stable Graph Learning (SGL) framework to learn, in an unsupervised way, a graph that captures general relational patterns that are independent of the selection bias. Extensive experimental results from both simulation and real data demonstrate that our method could significantly benefit the generalization capacity of graph structure.

Fast RobustSTL: Efficient and Robust Seasonal-Trend Decomposition for Time Series with Complex Patterns

Many real-world time series data exhibit complex patterns with trend, seasonality, outlier and noise. Robustly and accurately decomposing these components would greatly facilitate time series tasks including anomaly detection, forecasting and classification. RobustSTL is an effective seasonal-trend decomposition for time series data with complicated patterns. However, it cannot handle multiple seasonal components properly. It also suffers from high computational complexity, which limits its usage in practice. In this paper, we extend RobustSTL to handle multiple seasonality. To speed up the computation, we propose a special generalized ADMM algorithm to perform the decomposition efficiently. We rigorously prove that the proposed algorithm converges approximately as standard ADMM while reducing the complexity from O(N^2) to O(N log N) for each iteration. We empirically compare our proposed algorithm with other state-of-the-art seasonal-trend decomposition methods, including MSTL, STR, TBATS, on both synthetic and real-world datasets with single and multiple seasonality. The experimental results demonstrate the superior performance of our decomposition algorithm in terms of both effectiveness and efficiency.

CurvaNet: Geometric Deep Learning based on Directional Curvature for 3D Shape Analysis

Over the last decade, deep learning research has achieved tremendous success in computer vision and natural language processing. The current widely successful deep learning models are largely based on convolution and pooling operations on a Euclidean plane with a regular grid (e.g., image and video data) and thus cannot be directly applied to the non-Euclidean surface. Geometric deep learning aims to fill the gap by generalizing deep learning models from a 2D Euclidean plane to a 3D geometric surface. The problem has important applications in human-computer interaction, biochemistry, and mechanical engineering, but is uniquely challenging due to the lack of a regular grid framework and the difficulties in learning geometric features on a non-Euclidean manifold. Existing works focus on generalizing deep learning models from 2D image to graphs (e.g., graph neural networks) or 3D mesh surfaces but without fully learning geometric features from a differential geometry perspective. In contrast, this paper proposes a novel geometric deep learning model called CurvaNet that integrates differential geometry with graph neural networks. The key idea is to learn direction sensitive 3D shape features through directional curvature filters. We design a U-Net like architecture with downsampling and upsampling paths based on mesh pooling and unpooling operations. Evaluation on real-world datasets shows that the proposed model outperforms several baseline methods in classification accuracy.

Attentional Multi-graph Convolutional Network for Regional Economy Prediction with Open Migration Data

We study the problem of predicting the regional economy of U.S. counties with open migration data collected from U.S. Internal Revenue Service (IRS) records. To capture the complicated correlations between them, we design a novel Attentional Multi-graph Convolutional Network (AMCN), which models the migration behavior as a multi-graph with different types of edges denoting the migration flows collected from heterogeneous sources of different years and different demographics. AMCN extracts high-quality features from the migration multi-graph by first applying customized aggregator functions on the induced subgraphs, and then fusing the aggregated features with a higher-order attentional aggregator function. In addition, we address the data sparsity problem with an important neighbor discovery algorithm that can automatically supplement important neighbors that are absent in the empirical data. Experimental results show that our AMCN model significantly outperforms all baselines, reducing the relative mean square error by 43.8% against the classic regression model and by 12.7% against the state-of-the-art deep learning baselines. In-depth model analysis shows that our proposed AMCN model reveals insightful correlations between regional economy and migration data.

SESSION: Applied Data Science Track Papers

Octet: Online Catalog Taxonomy Enrichment with Self-Supervision

Taxonomies have found wide applications in various domains, especially online for item categorization, browsing, and search. Despite the prevalent use of online catalog taxonomies, most of them in practice are maintained by humans, which is labor-intensive and difficult to scale. While taxonomy construction from scratch is considerably studied in the literature, how to effectively enrich existing incomplete taxonomies remains an open yet important research question. Taxonomy enrichment not only requires the robustness to deal with emerging terms but also the consistency between existing taxonomy structure and new term attachment. In this paper, we present a self-supervised end-to-end framework, Octet, for Online Catalog Taxonomy EnrichmenT. Octet leverages heterogeneous information unique to online catalog taxonomies such as user queries, items, and their relations to the taxonomy nodes while requiring no other supervision than the existing taxonomies. We propose to distantly train a sequence labeling model for term extraction and employ graph neural networks (GNNs) to capture the taxonomy structure as well as the query-item-taxonomy interactions for term attachment. Extensive experiments in different online domains demonstrate the superiority of Octet over state-of-the-art methods via both automatic and human evaluations. Notably, Octet enriches an online catalog taxonomy in production to 2 times larger in the open-world evaluation.

TIMME: Twitter Ideology-detection via Multi-task Multi-relational Embedding

We aim at solving the problem of predicting people's ideology, or political tendency. We estimate it using Twitter data, and formalize it as a classification problem. Ideology detection has long been a challenging yet important problem. Certain groups, such as policy makers, rely on it to make wise decisions. Back in the old days, when labor-intensive survey studies were needed to collect public opinions, analyzing ordinary citizens' political tendencies was not easy. The rise of social media, such as Twitter, has enabled us to gather ordinary citizens' data easily. However, the incompleteness of the labels and the features in social network datasets is tricky, not to mention the enormous data size and the heterogeneity. The data differ dramatically from many commonly used datasets, and thus bring unique challenges. In our work, we first built our own datasets from Twitter. Next, we proposed TIMME, a multi-task multi-relational embedding model that works efficiently on sparsely labeled heterogeneous real-world datasets. It can also handle the incompleteness of the input features. Experimental results showed that TIMME is overall better than the state-of-the-art models for ideology detection on Twitter. Our findings include: links can lead to good classification outcomes without text; conservative voice is under-represented on Twitter; follow is the most important relation for predicting ideology; retweet and mention raise the chance of like, etc. Last but not least, TIMME could be extended to other datasets and tasks in theory.

Knowing your FATE: Friendship, Action and Temporal Explanations for User Engagement Prediction on Social Apps

With the rapid growth and prevalence of social network applications (Apps) in recent years, understanding user engagement has become increasingly important, to provide useful insights for future App design and development. While several promising neural modeling approaches were recently pioneered for accurate user engagement prediction, their black-box designs are unfortunately limited in model explainability. In this paper, we study a novel problem of explainable user engagement prediction for social network Apps. First, we propose a flexible definition of user engagement for various business scenarios, based on future metric expectations. Next, we design an end-to-end neural framework, FATE, which incorporates three key factors that we identify to influence user engagement, namely friendships, user actions, and temporal dynamics to achieve explainable engagement predictions. FATE is based on a tensor-based graph neural network (GNN), LSTM and a mixture attention mechanism, which allows for (a) predictive explanations based on learned weights across different feature categories, (b) reduced network complexity, and (c) improved performance in both prediction accuracy and training/inference time. We conduct extensive experiments on two large-scale datasets from Snapchat, where FATE outperforms state-of-the-art approaches by 10% error and 20% runtime reduction. We also evaluate explanations from FATE, showing strong quantitative and qualitative performance.

Sub-Matrix Factorization for Real-Time Vote Prediction

We address the problem of predicting aggregate vote outcomes (e.g., national) from partial outcomes (e.g., regional) that are revealed sequentially. We combine matrix factorization techniques and generalized linear models (GLMs) to obtain a flexible, efficient, and accurate algorithm. This algorithm works in two stages: First, it learns representations of the regions from high-dimensional historical data. Second, it uses these representations to fit a GLM to the partially observed results and to predict unobserved results. We show experimentally that our algorithm is able to accurately predict the outcomes of Swiss referenda, U.S. presidential elections, and German legislative elections. We also explore the regional representations in terms of ideological and cultural patterns. Finally, we deploy an online Web platform (www.predikon.ch) to provide real-time vote predictions in Switzerland and a data visualization tool to explore voting behavior. A by-product is a dataset of sequential vote results for 330 referenda and 2196 Swiss municipalities.

Temporal-Contextual Recommendation in Real-Time

Personalized real-time recommendation has had a profound impact on retail, media, entertainment and other industries. However, developing recommender systems for every use case is costly, time consuming and resource-intensive. To fill this gap, we present a black-box recommender system that can adapt to a diverse set of scenarios without the need for manual tuning. We build on techniques that go beyond simple matrix factorization to incorporate important new sources of information: the temporal order of events [Hidasi et al., 2015], contextual information to bootstrap cold-start users, metadata information about items [Rendle 2012] and the additional information surrounding each event. Additionally, we address two fundamental challenges when putting recommender systems into the real world: how to efficiently train them with even millions of unique items and how to cope with changing item popularity trends [Wu et al., 2017]. We introduce a compact model, which we call hierarchical recurrent network with meta data (HRNN-meta), to address the real-time and diverse metadata needs; we further provide efficient training techniques via importance sampling that can scale to millions of items with little loss in performance. We report significant improvements on a wide range of real-world datasets and provide intuition into model capabilities with synthetic experiments. Parts of HRNN-meta have been deployed in production at scale for customers to use at Amazon Web Services and serve as the underlying recommender engine for thousands of websites.

OptMatch: Optimized Matchmaking via Modeling the High-Order Interactions on the Arena

Matchmaking is a core problem for the e-sports and online games, which determines the player satisfaction and further influences the life cycle of the gaming products. Most of matchmaking systems take the form of grouping the queuing players into two opposing teams by following certain rules. The design and implementation of matchmaking systems are usually product-specific and labor-intensive.

This paper proposes a two-stage data-driven matchmaking framework, OptMatch, which is applicable to most gaming products and requires minimal product knowledge. OptMatch contains an offline learning stage and an online planning stage. The offline learning stage includes (1) relationship mining modules that learn low-dimensional representations of individuals by capturing the high-order inter-personal interactions, and (2) a neural network that incorporates the team-up effect and predicts the match outcomes. The online planning stage optimizes the gross player utilities (i.e., satisfaction) during the matchmaking process by leveraging the learned representations and predictive model.

Quantitative evaluations on four real-world datasets and an online experiment on the Fever Basketball game empirically demonstrate the effectiveness of OptMatch.

PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest

Latent user representations are widely adopted in the tech industry for powering personalized recommender systems. Most prior work infers a single high dimensional embedding to represent a user, which is a good starting point but falls short in delivering a full understanding of the user's interests. In this work, we introduce PinnerSage, an end-to-end recommender system that represents each user via multi-modal embeddings and leverages this rich representation of users to provide high quality personalized recommendations. PinnerSage achieves this by clustering users' actions into conceptually coherent clusters with the help of a hierarchical clustering method (Ward) and summarizes the clusters via representative pins (Medoids) for efficiency and interpretability. PinnerSage is deployed in production at Pinterest and we outline the design decisions that make it run seamlessly at a very large scale. We conduct several offline and online A/B experiments to show that our method significantly outperforms single embedding methods.

Polestar: An Intelligent, Efficient and National-Wide Public Transportation Routing Engine

Public transportation plays a critical role in people's daily life. It has been proven that public transportation is more environmentally sustainable, efficient, and economical than any other form of travel. However, due to the increasing expansion of transportation networks and more complex travel situations, people are having difficulties in efficiently finding the most preferred route from one place to another through public transportation systems. To this end, in this paper, we present Polestar, a data-driven engine for intelligent and efficient public transportation routing. Specifically, we first propose a novel Public Transportation Graph (PTG) to model the public transportation system in terms of various travel costs, such as time or distance. Then, we introduce a general route search algorithm coupled with an efficient station binding method for efficient route candidate generation. After that, we propose a two-pass route candidate ranking module to capture user preferences under dynamic travel situations. Finally, experiments on two real-world data sets demonstrate the advantages of Polestar in terms of both efficiency and effectiveness. Indeed, in early 2019, Polestar was deployed on Baidu Maps, one of the world's largest map services. To date, Polestar serves over 330 cities, answers over a hundred million queries each day, and achieves a substantial improvement in user click ratio.

Context-Aware Attentive Knowledge Tracing

Knowledge tracing (KT) refers to the problem of predicting future learner performance given their past performance in educational applications. Recent developments in KT using flexible deep neural network-based models excel at this task. However, these models often offer limited interpretability, thus making them insufficient for personalized learning, which requires using interpretable feedback and actionable recommendations to help learners achieve better learning outcomes. In this paper, we propose attentive knowledge tracing (AKT), which couples flexible attention-based neural network models with a series of novel, interpretable model components inspired by cognitive and psychometric models. AKT uses a novel monotonic attention mechanism that relates a learner's future responses to assessment questions to their past responses; attention weights are computed using exponential decay and a context-aware relative distance measure, in addition to the similarity between questions. Moreover, we use the Rasch model to regularize the concept and question embeddings; these embeddings are able to capture individual differences among questions on the same concept without using an excessive number of parameters. We conduct experiments on several real-world benchmark datasets and show that AKT outperforms existing KT methods (by up to 6% in AUC in some cases) on predicting future learner responses. We also conduct several case studies and show that AKT exhibits excellent interpretability and thus has potential for automated feedback and personalization in real-world educational settings.
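The decay-weighted attention described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the `theta` decay rate, and the use of plain positional distance in place of the context-aware distance measure are all assumptions made for illustration.

```python
import math


def decayed_attention(query, keys, theta=0.5):
    """Attention weights over past positions with exponential distance decay.

    query: vector for the current question.
    keys: vectors for past questions, ordered oldest to newest.
    theta: hypothetical non-negative decay rate (an assumption, not from the abstract).
    """
    n = len(keys)
    scale = math.sqrt(len(query))
    scores = []
    for i, key in enumerate(keys):
        similarity = sum(q * k for q, k in zip(query, key)) / scale
        # Plain positional distance; a stand-in for the context-aware distance.
        distance = n - i
        scores.append(math.exp(-theta * distance) * similarity)
    # Numerically stable softmax over past positions only (no look-ahead).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With identical past questions and a positive `theta`, more recent responses receive larger weights; with `theta=0` the weights reduce to ordinary similarity-based attention.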

Improving Movement Predictions of Traffic Actors in Bird's-Eye View Models using GANs and Differentiable Trajectory Rasterization

One of the most critical pieces of the self-driving puzzle is the task of predicting future movement of surrounding traffic actors, which allows the autonomous vehicle to safely and effectively plan its future route in a complex world. Recently, a number of algorithms have been proposed to address this important problem, spurred by a growing interest of researchers from both industry and academia. Methods based on top-down scene rasterization on one side and Generative Adversarial Networks (GANs) on the other have been shown to be particularly successful, obtaining state-of-the-art accuracies on the task of traffic movement prediction. In this paper we build upon these two directions and propose a raster-based conditional GAN architecture, powered by a novel differentiable rasterizer module at the input of the conditional discriminator that maps generated trajectories into the raster space in a differentiable manner. This simplifies the task for the discriminator, as trajectories that are not scene-compliant are easier to discern, and allows the gradients to flow back, forcing the generator to output better, more realistic trajectories. We evaluated the proposed method on a large-scale, real-world data set, showing that it outperforms state-of-the-art GAN-based baselines.

M2GRL: A Multi-task Multi-view Graph Representation Learning Framework for Web-scale Recommender Systems

Combining graph representation learning with multi-view data (side information) for recommendation is a trend in industry. Most existing methods can be categorized as multi-view representation fusion; they first build one graph and then integrate multi-view data into a single compact representation for each node in the graph. However, these methods raise both engineering and algorithmic concerns: 1) multi-view data are abundant and informative in industry and may exceed the capacity of one single vector, and 2) inductive bias may be introduced as multi-view data are often from different distributions. In this paper, we use a multi-view representation alignment approach to address this issue. In particular, we propose a multi-task multi-view graph representation learning framework (M2GRL) to learn node representations from multi-view graphs for web-scale recommender systems. M2GRL constructs one graph for each single view of the data, learns multiple separate representations from multiple graphs, and performs alignment to model cross-view relations. M2GRL chooses a multi-task learning paradigm to learn intra-view representations and cross-view relations jointly. In addition, M2GRL applies homoscedastic uncertainty to adaptively tune the loss weights of tasks during training. We deploy M2GRL at Taobao and train it on 57 billion examples. According to offline metrics and online A/B tests, M2GRL significantly outperforms other state-of-the-art algorithms. Further exploration of diversity recommendation in Taobao shows the effectiveness of utilizing multiple representations produced by M2GRL, which we argue is a promising direction for various industrial recommendation tasks of different focus.
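Weighting task losses by homoscedastic uncertainty can be sketched as follows. The abstract does not give the exact loss, so this shows one commonly used formulation, in which each task has a learnable log-variance `s` and contributes `exp(-s) * loss + s`; the function name and argument names are hypothetical.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine per-task losses using learnable log-variances.

    Each task contributes exp(-s) * loss + s. A large s down-weights a noisy
    task, while the additive s term keeps s from growing without bound.
    This is an assumed, commonly used form, not necessarily the one in M2GRL.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_variances):
        total += math.exp(-s) * loss + s
    return total
```

For a fixed task loss L, the term `exp(-s) * L + s` is minimized at `s = log(L)`, so during training the weights adapt to the scale of each task's loss rather than being tuned by hand.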

Attribute-based Propensity for Unbiased Learning in Recommender Systems: Algorithm and Case Studies

Many modern recommender systems train their models based on a large amount of implicit user feedback data. Due to the inherent bias in this data (e.g., position bias), learning from it directly can lead to suboptimal models. Recently, unbiased learning was proposed to address such problems by leveraging counterfactual techniques like inverse propensity weighting (IPW). In these methods, propensity score estimation is usually limited to an item's display position in a single user interface (UI).

In this paper, we generalize the traditional position bias model to an attribute-based propensity framework. Our methods estimate propensity scores based on offline data and allow propensity estimation across a broad range of implicit feedback scenarios, e.g., feedback beyond recommender system UI. We demonstrate this by applying this framework to three real-world large-scale recommender systems in Google Drive that serve millions of users. For each system, we conduct both offline and online evaluation. Our results show that the proposed framework is able to significantly improve upon strong production baselines across a diverse range of recommendation item types (documents, people-document pairs, and queries), UI layouts (horizontal, vertical, and grid layouts), and underlying learning algorithms (gradient boosted decision trees and neural networks), all without the need to intervene and degrade the user experience. The proposed models have been deployed in the production systems with ease since no serving infrastructure change is needed.
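As a minimal sketch of inverse propensity weighting with attribute-keyed propensities, the example below computes a self-normalized, IPW-weighted log loss over clicked items. The attribute tuples, the `floor` clipping value, and the function name are assumptions for illustration; the paper's actual propensity estimator is not shown.

```python
import math


def ipw_log_loss(clicked_examples, propensity, floor=0.01):
    """Self-normalized inverse-propensity-weighted log loss.

    clicked_examples: iterable of (attributes, predicted_probability) for
        clicked items; attributes is any hashable key, e.g. (layout, position).
    propensity: dict mapping attributes to an estimated observation probability.
    floor: hypothetical lower clip on propensities to limit variance.
    """
    total = 0.0
    weight_sum = 0.0
    for attributes, predicted in clicked_examples:
        p = max(propensity[attributes], floor)
        weight = 1.0 / p  # rarely observed contexts count for more
        total += weight * -math.log(predicted)
        weight_sum += weight
    return total / weight_sum
```

Keying the propensity table on a tuple of attributes rather than on display position alone is what lets the same weighting apply across different layouts or feedback sources.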

Predicting Individual Treatment Effects of Large-scale Team Competitions in a Ride-sharing Economy

Millions of drivers worldwide have enjoyed financial benefits and work schedule flexibility through a ride-sharing economy, but meanwhile they have suffered from the lack of a sense of identity and career achievement. Grounded in social identity and contest theories, financially incentivized team competitions have been an effective instrument to increase drivers' productivity, job satisfaction, and retention, and to improve revenue over cost for ride-sharing platforms. While these competitions are overall effective, the decisive factors behind the treatment effects and how they affect the outcomes of individual drivers have been largely mysterious. In this study, we analyze data collected from more than 500 large-scale team competitions organized by a leading ride-sharing platform, building machine learning models to predict individual treatment effects. Through a careful investigation of features and predictors, we are able to reduce out-of-sample prediction error by more than 24%. Through interpreting the best-performing models, we discover many novel and actionable insights regarding how to optimize the design and the execution of team competitions on ride-sharing platforms. A simulated analysis demonstrates that by simply changing a few contest design options, the average treatment effect of a real competition is expected to increase by as much as 26%. Our procedure and findings shed light on how to analyze and optimize large-scale online field experiments in general.

Cellular Network Radio Propagation Modeling with Deep Convolutional Neural Networks

Radio propagation modeling and prediction is fundamental for modern cellular network planning and optimization. Conventional radio propagation models fall into two categories. Empirical models, based on coarse statistics, are simple and computationally efficient, but are inaccurate due to oversimplification. Deterministic models, such as ray tracing based on physical laws of wave propagation, are more accurate and site specific, but they have higher computational complexity and cannot flexibly utilize site information other than traditional geographic information system (GIS) maps.

In this article we present a novel method to model radio propagation using deep convolutional neural networks and report significantly improved performance compared to conventional models. We also lay down the framework for data-driven modeling of radio propagation and enable future research to utilize rich and unconventional information of the site, e.g. satellite photos, to provide more accurate and flexible models.

Neural Input Search for Large Scale Recommendation Models

Recommendation problems with large numbers of discrete items, such as products, webpages, or videos, are ubiquitous in the technology industry. Deep neural networks are being increasingly used for these recommendation problems. These models use embeddings to represent discrete items as continuous vectors, and the vocabulary sizes and embedding dimensions, despite their heavy influence on the model's accuracy, are often manually selected in a heuristic manner.

We present Neural Input Search (NIS), a technique for learning the optimal vocabulary sizes and embedding dimensions for categorical features. The goal is to maximize prediction accuracy subject to a constraint on the total memory used by all embeddings. Moreover, we argue that the traditional Single-size Embedding (SE), which uses the same embedding dimension for all values of a feature, suffers from inefficient usage of model capacity and training data. We propose a novel type of embedding, namely Multi-size Embedding (ME), which allows the embedding dimension to vary for different values of the feature. During training we use reinforcement learning to find the optimal vocabulary size for each feature and embedding dimension for each value of the feature. Experimentation on two public recommendation datasets shows that NIS can find significantly better models with far fewer embedding parameters. We also deployed NIS in production to a real-world large-scale App ranking model in our company's App store, Google Play, resulting in +1.02% App Install with 30% smaller model size.

Easy Perturbation EEG Algorithm for Spectral Importance (easyPEASI): A Simple Method to Identify Important Spectral Features of EEG in Deep Learning Models

Understanding neurological differences between populations is an active area of research. Deep learning has recently shown promising results using EEG as input to distinguish recordings of subjects based on neurological activity. However, only about one quarter of these studies investigate the underlying neurophysiological implications. This work proposes and validates a method to investigate frequency bands important to EEG-driven deep learning models. Easy perturbation EEG algorithm for spectral importance (easyPEASI) is simpler than previous methods and requires only perturbations to input data. We validate easyPEASI on EEG pathology classification using the Temple University Health EEG Corpus. easyPEASI is further applied to characterize the effects of patients' medications on brain rhythms. We investigate classifications of patients taking one of two anticonvulsant medications, Dilantin (phenytoin) and Keppra (levetiracetam), and subjects taking no medications. We find that, for recordings of subjects with clinically determined normal EEG, these medications affect the Theta and Alpha bands most significantly. For recordings with clinically determined abnormal EEG, these medications affect the Delta, Theta, and Alpha bands most significantly. We also find the Beta band to be affected differently by the two medications. Results found here show promise for a method of obtaining explainable artificial intelligence and interpretable models from EEG-driven deep learning through a simpler, more accessible method that perturbs only input data. Overall, this work provides a fast, easy, and reproducible method to automatically determine salient spectral features of neural activity that have been learned by machine learning models, such as deep learning.

Building Continuous Integration Services for Machine Learning

Continuous integration (CI) has been a de facto standard for building industrial-strength software. Yet little attention was paid to applying CI to the development of machine learning (ML) applications until very recent efforts on the theoretical side. In this paper, we take a step forward to bring the theory into practice.

We develop the first CI system for ML, to the best of our knowledge, that integrates seamlessly with existing ML development tools. We present its design and implementation details.

Learning to Cluster Documents into Workspaces Using Large Scale Activity Logs

Google Drive is widely used for managing personal and work-related documents in the cloud. To help users organize their documents in Google Drive, we develop a new feature to allow users to create a set of working files for ongoing easy access, called workspace. A workspace is a cluster of documents, but unlike a typical document cluster, it contains documents that are not only topically coherent, but are also useful in the ongoing user tasks.

To alleviate the burden of creating workspaces manually, we automatically cluster documents into suggested workspaces. We go beyond the textual similarity-based unsupervised clustering paradigm and instead directly learn from users' activity for document clustering. More specifically, we extract co-access signals (i.e., whether a user accessed two documents around the same time) to measure document relatedness. We then use a neural document similarity model that incorporates text, metadata, as well as co-access features. Since human labels are often difficult or expensive to collect, we extract weak labels based on co-access data at large scale for model training. Our offline and online experiments based on Google Drive show that (a) co-access features are very effective for document clustering; (b) our weakly supervised clustering achieves comparable or even better performance compared to the models trained with human labels; and (c) the weakly supervised method leads to better workspace suggestions that the users accept more often in the production system than baseline approaches.

What is that Building?: An End-to-end System for Building Recognition from Streetside Images

The paper describes the Streetside Building Search-Retrieve System (SBSRS), a system for recognizing buildings from streetside images. SBSRS powers several distinct applications: 1) it improves map-search by enriching its streetview service with semantic information, such as location, business name, open hours, etc.; 2) it enables search by image and location, a novel form of visual image search where both visual and location signals are used to identify the most relevant result to a query image of a building.

SBSRS works in an entirely unsupervised way. It has an offline component, which generates an index of building objects by scraping streetview images, segmenting and conflating them with business information. An online component can then search over the index, utilizing location information to retrieve a small set of geo-relevant results, which are then re-ranked using novel highly accurate visual descriptors. To evaluate the system, we generate a dataset of over 23K unique business buildings from four major US cities. This significantly exceeds the number of landmarks in datasets previously used by similar systems. We compare our system with a state-of-the-art baseline and show its accuracy on our new buildings dataset as well as on two popular landmark datasets.

MultiSage: Empowering GCN with Contextualized Multi-Embeddings on Web-Scale Multipartite Networks

Graph convolutional networks (GCNs) are a powerful class of graph neural networks. Trained in a semi-supervised end-to-end fashion, GCNs can learn to integrate node features and graph structures to generate high-quality embeddings that can be used for various downstream tasks like search and recommendation. However, existing GCNs mostly work on homogeneous graphs and consider a single embedding for each node, which does not sufficiently model the multi-faceted nature and complex interaction of nodes in real-world networks. Here, we present a contextualized GCN engine by modeling the multipartite networks of target nodes and their intermediate context nodes that specify the contexts of their interactions. For the neighborhood aggregation process, we devise a contextual masking operation at the feature level and a contextual attention mechanism at the node level to achieve interaction contextualization by treating neighboring target nodes based on intermediate context nodes. Consequently, we compute multiple embeddings for target nodes that capture their diverse facets and different interactions during graph convolution, which is useful for fine-grained downstream applications. To enable efficient web-scale training, we build a parallel random walk engine to pre-sample contextualized neighbors, and a Hadoop2-based data provider pipeline to pre-join training data, dynamically reduce multi-GPU training time, and avoid high memory cost. Extensive experiments on the bipartite Pinterest graph and tripartite OAG graph corroborate the advantage of the proposed system.

HetETA: Heterogeneous Information Network Embedding for Estimating Time of Arrival

Estimating the time of arrival (ETA) is a critical task in intelligent transportation systems and involves spatiotemporal data. Although significant prior effort has gone into designing efficient and accurate systems for the ETA task, few of them take structural graph data into account, much less the heterogeneous information network. In this paper, we propose HetETA to leverage the heterogeneous information graph in the ETA task. Specifically, we translate the road map into a multi-relational network and introduce a vehicle-trajectory-based network to jointly consider the traffic behavior pattern. Moreover, we employ three components to model temporal information from recent periods, daily periods and weekly periods respectively. Each component comprises temporal convolutions and graph convolutions to learn representations of the spatiotemporal heterogeneous information for the ETA task. Experiments on large-scale datasets illustrate the effectiveness of the proposed HetETA beyond the state-of-the-art methods, and show the importance of representation learning of heterogeneous information networks for the ETA task.

Hubble: An Industrial System for Audience Expansion in Mobile Marketing

Recently, to seize early opportunities in the mobile economy, Internet companies conduct thousands of marketing campaigns every day to promote their mobile products and services. In the mobile marketing scenario, one of the fundamental issues is the audience expansion task for marketing campaigns. Given a set of seed users, audience expansion aims to find more users (audiences) who are similar to the seeds and will complete the business goal of the targeted campaign (i.e., convert). However, the problem is challenging in three aspects. First, a company will run hundreds of campaigns to serve massive numbers of users every day. The requirements of scalability and timeliness make training a model for each campaign extremely resource-consuming and thus impractical. Therefore, we propose to solve the problem in a two-stage manner, in which the offline stage employs heavyweight user representation learning and the online stage performs embedding-based lightweight audience expansion. Second, conventional two-stage audience expansion systems neglect the high-order user-campaign interactions and usually generate entangled user embeddings, and thus fail to achieve high-quality user representation. Third, the seeds, which are usually provided by experts or collected from user feedback, can be noisy and cannot cover the entire actual audience, thus introducing coverage bias. Unfortunately, to the best of our knowledge, none of the related literature tackles this crucial issue of audience expansion.

Addressing the above challenges, in this paper, we present the Hubble system, an industrial solution for audience expansion in the mobile marketing scenario. The Hubble system follows the hybrid online-offline architecture to satisfy the requirements of scalability and timeliness. Specifically, in the offline stage, we propose a novel adaptive and disentangled graph neural network (called AD-GNN) to adaptively explore the user-campaign graph and generate comprehensive user embeddings in a disentangled manner. In the online stage, tackling the coverage bias issue, we develop a novel audience expansion model with a knowledge distillation mechanism (called KD-AE) to absorb knowledge from the offline AD-GNN and alleviate the coverage bias. Finally, extensive offline experiments and online A/B testing demonstrate the superior performance of the proposed Hubble system compared with other state-of-the-art methods.

Scaling Graph Neural Networks with Approximate PageRank

Graph neural networks (GNNs) have emerged as a powerful approach for solving many network mining tasks. However, learning on large graphs remains a challenge -- many recently proposed scalable GNN approaches rely on an expensive message-passing procedure to propagate information through the graph. We present the PPRGo model which utilizes an efficient approximation of information diffusion in GNNs resulting in significant speed gains while maintaining state-of-the-art prediction performance. In addition to being faster, PPRGo is inherently scalable, and can be trivially parallelized for large datasets like those found in industry settings.

We demonstrate that PPRGo outperforms baselines in both distributed and single-machine training environments on a number of commonly used academic graphs. To better analyze the scalability of large-scale graph learning methods, we introduce a novel benchmark graph with 12.4 million nodes, 173 million edges, and 2.8 million node features. We show that training PPRGo from scratch and predicting labels for all nodes in this graph takes under 2 minutes on a single machine, far outpacing other baselines on the same graph. We discuss the practical application of PPRGo to solve large-scale node classification problems at Google.
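To make the idea of approximate personalized PageRank concrete, here is a minimal sketch of a standard push-style approximation on an adjacency-list graph. It is an illustration under assumptions, not the PPRGo implementation: the function name, the `alpha` teleport probability, and the `epsilon` tolerance are hypothetical choices.

```python
def approximate_ppr(graph, source, alpha=0.15, epsilon=1e-4):
    """Push-based approximation of personalized PageRank from one source.

    graph: dict mapping each node to a list of its neighbors.
    alpha: teleport probability (assumed value).
    epsilon: residual tolerance per unit of degree (assumed value).
    Returns a sparse dict of approximate PPR scores.
    """
    estimate = {}
    residual = {source: 1.0}
    pending = [source]
    while pending:
        node = pending.pop()
        neighbors = graph[node]
        degree = len(neighbors)
        mass = residual.get(node, 0.0)
        if degree == 0 or mass < epsilon * degree:
            continue  # nothing worth pushing from this node
        # Keep an alpha share locally, spread the rest over the neighbors.
        estimate[node] = estimate.get(node, 0.0) + alpha * mass
        residual[node] = 0.0
        share = (1.0 - alpha) * mass / degree
        for neighbor in neighbors:
            residual[neighbor] = residual.get(neighbor, 0.0) + share
            if residual[neighbor] >= epsilon * len(graph[neighbor]):
                pending.append(neighbor)
    return estimate
```

Because each push only touches a node's direct neighbors and stops once residuals are small, the work depends on the tolerance rather than on the size of the whole graph, and separate sources can be processed independently in parallel.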

Combo-Attention Network for Baidu Video Advertising

With the progress of communication technology and the popularity of the smartphone, video has grown to be the largest medium. Since videos can grab a customer's attention quickly and leave a big impression, video ads can gain more trust than traditional ads. Thus advertisers have started to pour more resources into making creative video ads to build connections with potential customers. Baidu, as the leading search engine company in China, receives billions of search queries per day. In this paper, we introduce a technique used in Baidu video advertising for feeding relevant video ads according to the user's query. Note that retrieving relevant videos using a text query is a cross-modal problem. Due to the modality gap, text-to-video search is more challenging than the well-studied text-to-text and image-to-image search. To tackle this challenge, we propose a Combo-Attention Network (CAN) and launch it in Baidu video advertising. In the proposed CAN model, we represent a video as a set of bounding-box features and a sentence as a set of word features, and formulate the sentence-to-video search as a set-to-set matching problem. CAN is built upon the proposed combo-attention module, which exploits cross-modal attention in addition to self-attention to effectively capture the relevance between words and bounding boxes. To verify the effectiveness of CAN offline, we built the Daily700K dataset, collected from the HaoKan app. Systematic experiments on Daily700K as well as a public dataset, VATEX, demonstrate the effectiveness of CAN. After launching CAN in Baidu's dynamic video advertising (DVA), we achieve a 5.47% increase in Conversion Rate (CVR) and an 11.69% increase in advertisement impression rate.

Federated Doubly Stochastic Kernel Learning for Vertically Partitioned Data

In many real-world data mining and machine learning applications, data are provided by multiple providers, each of which maintains private records of different feature sets about common entities. It is challenging for traditional data mining and machine learning algorithms to train on such vertically partitioned data effectively and efficiently while preserving data privacy. In this paper, we focus on nonlinear learning with kernels, and propose a federated doubly stochastic kernel learning (FDSKL) algorithm for vertically partitioned data. Specifically, we use random features to approximate the kernel mapping function and use doubly stochastic gradients to update the solutions, which are all computed federatedly without the disclosure of data. Importantly, we prove that FDSKL has a sublinear convergence rate, and can guarantee data security under the semi-honest assumption. Extensive experimental results on a variety of benchmark datasets show that FDSKL is significantly faster than state-of-the-art federated learning methods when dealing with kernels, while retaining similar generalization performance.
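The random-feature approximation of a kernel mentioned above can be sketched with random Fourier features for the RBF kernel. This is a generic, single-machine illustration and not the FDSKL protocol: the function name, the `gamma` bandwidth, and the feature count are assumptions, and no federated or privacy-preserving step is shown.

```python
import math
import random


def make_random_features(dim, num_features, gamma, seed=0):
    """Build a feature map z with z(x) . z(y) ~= exp(-gamma * ||x - y||^2).

    Frequencies are drawn from a Gaussian with variance 2 * gamma per
    coordinate and phases uniformly from [0, 2 * pi).
    """
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map
```

Each feature depends on the input only through the inner product of a random frequency vector with `x`, and that inner product is a sum over feature columns. When columns are held by different parties, the sum splits into per-party partial products, which is one reason random features are a natural fit for vertically partitioned data.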

To Tune or Not to Tune?: In Search of Optimal Configurations for Data Analytics

This experimental study presents a number of issues that pose a challenge for practical configuration tuning and its deployment in data analytics frameworks. These issues include: 1) the assumption of a static workload or environment, ignoring the dynamic characteristics of the analytics environment (e.g., increase in input data size, changes in allocation of resources); 2) the amortization of tuning costs and how this influences what workloads can be tuned in practice in a cost-effective manner; and 3) the need for a comprehensive incremental tuning solution for a diverse set of workloads. We adapt different ML techniques in order to obtain efficient incremental tuning in our problem domain, and propose Tuneful, a configuration tuning framework. We show how it is designed to overcome the above issues and illustrate its applicability by running a wide array of experiments in cloud environments provided by two different service providers.

Reconstruction and Decomposition of High-Dimensional Landscapes via Unsupervised Learning

Uncovering the organization of a landscape that encapsulates all states of a dynamic system is a central task in many domains, as it promises to reveal, in an unsupervised manner, a system's inner workings. One domain where this task is crucial is bioinformatics, where the energy landscape that organizes three-dimensional structures of a molecule by their energetics is a powerful construct. The landscape can be leveraged, among other things, to reveal macrostates where a molecule is biologically active. This is a daunting task, as landscapes of complex actuated systems, such as molecules, are inherently high-dimensional. Nonetheless, our laboratories have made some progress via topological and statistical analysis of spatial data in recent years. We have proposed what is essentially a dichotomy: methods that are more pertinent for visualization-driven discovery, and methods that are more pertinent for discovery of the biologically active macrostates but not amenable to visualization. In this paper, we present a novel, hybrid method that combines strengths of these methods, allowing both visualization of the landscape and discovery of macrostates. We demonstrate what the method is capable of uncovering in comparison with existing methods over structure spaces sampled with conformational sampling algorithms. Though the direct evaluation in this paper is on protein energy landscapes, the proposed method is of broad interest in cross-cutting problems that necessitate characterization of fitness and optimization landscapes.

Map Generation from Large Scale Incomplete and Inaccurate Data Labels

Accurately and globally mapping human infrastructure is an important and challenging task with applications in routing, regulation compliance monitoring, natural disaster response management, and more. In this paper we present progress in developing an algorithmic pipeline and distributed compute system that automates the process of map creation using high resolution aerial images. Unlike previous studies, most of which use datasets that are available only in a few cities across the world, we utilize publicly available imagery and map data, both of which cover the contiguous United States (CONUS). We approach the technical challenge of inaccurate and incomplete training data by adopting state-of-the-art convolutional neural network architectures such as the U-Net and the CycleGAN to incrementally generate maps with increasingly more accurate and more complete labels of man-made infrastructure such as roads and houses. Since scaling the mapping task to CONUS calls for parallelization, we then adopt an asynchronous distributed stochastic parallel gradient descent training scheme to distribute the computational workload onto a cluster of GPUs with nearly linear speed-up.

Grale: Designing Networks for Graph Learning

How can we find the right graph for semi-supervised learning? In real world applications, the choice of which edges to use for computation is the first step in any graph learning process. Interestingly, there are often many types of similarity available to choose as the edges between nodes, and the choice of edges can drastically affect the performance of downstream semi-supervised learning systems. However, despite the importance of graph design, most of the literature assumes that the graph is static.

In this work, we present Grale, a scalable method we have developed to address the problem of graph design for graphs with billions of nodes. Grale operates by fusing together different measures of (potentially weak) similarity to create a graph which exhibits high task-specific homophily between its nodes. Grale is designed for running on large datasets. We have deployed Grale in more than 20 different industrial settings at Google, including datasets which have tens of billions of nodes, and hundreds of trillions of potential edges to score. By employing locality sensitive hashing techniques, we greatly reduce the number of pairs that need to be scored, allowing us to learn a task-specific model and build the associated nearest neighbor graph for such datasets in hours, rather than the days or even weeks that might be required otherwise.
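The abstract does not say which locality sensitive hashing scheme Grale uses. As a minimal sketch, assuming random-hyperplane (sign) hashing over dense feature vectors, the idea of scoring only pairs that share a bucket might look like the following. The function name and its parameters are hypothetical and are not taken from Grale.

```python
import random
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(vectors, num_planes=8, seed=0):
    """Return pairs of item ids that fall into the same LSH bucket.

    Assumption: random-hyperplane hashing; the scheme used in Grale
    is not stated in the abstract.
    """
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    # One random hyperplane per signature bit.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

    buckets = defaultdict(list)
    for item_id, vec in vectors.items():
        # Each bit records on which side of a hyperplane the vector lies.
        signature = tuple(
            sum(p * x for p, x in zip(plane, vec)) >= 0.0 for plane in planes
        )
        buckets[signature].append(item_id)

    # Only pairs inside the same bucket become candidates for scoring.
    pairs = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [-1.0, 0.0]}
candidates = lsh_candidate_pairs(vectors)
```

Items with identical vectors always share a bucket, while items pointing in opposite directions do not, so the number of pairs passed to the (more expensive) similarity model shrinks well below the full quadratic count.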

We illustrate this through a case study where we examine the application of Grale to an abuse classification problem on YouTube with hundreds of millions of items. In this application, we find that Grale detects a large number of malicious actors on top of hard-coded rules and content classifiers, increasing the total recall by 89% over those approaches alone.

Automatic Validation of Textual Attribute Values in E-commerce Catalog by Learning with Limited Labeled Data

Product catalogs are valuable resources for eCommerce websites. In the catalog, a product is associated with multiple attributes whose values are short texts, such as product name, brand, functionality and flavor. Usually individual retailers self-report these key values, and thus the catalog information unavoidably contains noisy facts. It is very important to validate the correctness of these values in order to improve shopper experiences and enable more effective product recommendation. Due to the huge volume of products, an effective automatic validation approach is needed. In this paper, we propose an automatic validation approach that verifies the correctness of textual attribute values for products. This can be formulated as the task of cross-checking a textual attribute value against the product profile, which is a short textual description of the product on an eCommerce website. Although existing deep neural network models have shown success in cross-checking two pieces of text, their success depends on a large set of quality labeled data, which is hard to obtain in this validation task: products span a variety of categories. Due to the category difference, annotation has to be done on all the categories, which is impossible to achieve in practice.

To address the aforementioned challenges, we propose a novel meta-learning latent variable approach, called MetaBridge, which can learn transferable knowledge from a subset of categories with limited labeled data and capture the uncertainty of never-seen categories with unlabeled data. More specifically, we make the following contributions. (1) We formalize the problem of validating the textual attribute values of products from a variety of categories as a natural language inference task in the few-shot learning setting, and propose a meta-learning latent variable model to jointly process the signals obtained from product profiles and textual attribute values. (2) We propose to integrate meta-learning and latent variables in a unified model to effectively capture the uncertainty of various categories. With this model, annotation costs can be significantly reduced as we make the best use of labeled data from limited categories. (3) We propose a novel objective function based on the latent variable model in the few-shot learning setting, which ensures distribution consistency between unlabeled and labeled data and prevents overfitting by sampling different records from the learned distribution. Extensive experiments on real eCommerce datasets from hundreds of categories demonstrate the effectiveness of MetaBridge on textual attribute validation and its outstanding performance compared with state-of-the-art approaches.

CLARA: Confidence of Labels and Raters

Large online services employ thousands of people to label content for applications such as video understanding, natural language processing, and content policy enforcement. While labelers typically reach their decisions by following a well-defined "protocol", humans may still make mistakes. A common countermeasure is to have multiple people review the same content; however, this process is often time-intensive and requires accurate aggregation of potentially noisy decisions.

In this paper, we present CLARA (Confidence of Labels and Raters), a system developed and deployed at Facebook for aggregating reviewer decisions and estimating their uncertainty. We perform extensive validations and describe the deployment of CLARA for measuring the base rate of policy violations, quantifying reviewers' performance, and improving their efficiency. In our experiments, we found that CLARA (a) provides an unbiased estimator of violation rates that is robust to changes in reviewer quality, with accurate confidence intervals, (b) provides an accurate assessment of reviewers' performance, and (c) improves efficiency by reducing the number of reviews based on the review certainty, and enables the operational selection of a threshold on the cost/accuracy efficiency frontier.

Embedding-based Retrieval in Facebook Search

Search in social networks such as Facebook poses different challenges than classical web search: besides the query text, it is important to take into account the searcher's context to provide relevant results. Their social graph is an integral part of this context and is a unique aspect of Facebook search. While embedding-based retrieval (EBR) has been applied in web search engines for years, Facebook search was still mainly based on a Boolean matching model. In this paper, we discuss the techniques for applying EBR to a Facebook Search system. We introduce the unified embedding framework developed to model semantic embeddings for personalized search, and the system to serve embedding-based retrieval in a typical search system based on an inverted index. We discuss various tricks and experiences on end-to-end optimization of the whole system, including ANN parameter tuning and full-stack optimization. Finally, we present our progress on two selected advanced topics about modeling. We evaluated EBR on verticals for Facebook Search, with significant metric gains observed in online A/B experiments. We believe this paper will provide useful insights and experiences to help people develop embedding-based retrieval systems in search engines.

Lumos: A Library for Diagnosing Metric Regressions in Web-Scale Applications

Web-scale applications can ship code on a daily to weekly cadence. These applications rely on online metrics to monitor the health of new releases. Regressions in metric values need to be detected and diagnosed as early as possible to reduce the disruption to users and product owners. Regressions in metrics can surface due to a variety of reasons: genuine product regressions, changes in user population and bias due to telemetry loss (or processing) are among the common causes. Diagnosing the cause of these metric regressions is costly for engineering teams as they need to invest time in finding the root cause of the issue as soon as possible. We present Lumos, a Python library built using the principles of A/B testing to systematically diagnose metric regressions to automate such analysis. Lumos has been deployed across the component teams in Microsoft's Real-Time Communication (RTC) applications Skype and Microsoft Teams. It has enabled engineering teams to detect 100s of real changes in metrics and reject 1000s of false alarms detected by anomaly detectors. The application of Lumos has resulted in freeing up as much as 95% of the time allocated to metric-based investigations. In this work, we open source Lumos and present our results from applying it to two different components within the RTC group over millions of sessions. This general library can be coupled with any production system to manage the volume of alerting efficiently.

Order Fulfillment Cycle Time Estimation for On-Demand Food Delivery

By providing customers with conveniences such as easy access to an extensive variety of restaurants, effortless food ordering and fast delivery, on-demand food delivery (OFD) platforms have achieved explosive growth in recent years. A crucial machine learning task performed at OFD platforms is prediction of the Order Fulfillment Cycle Time (OFCT), which refers to the amount of time elapsed between the moment a customer places an order and the moment he/she receives the meal. The accuracy of predicted OFCT is important for customer satisfaction, as it needs to be communicated to a customer before he/she places the order, and is considered a service promise that should be fulfilled as well as possible. As a result, the estimated OFCT also heavily influences planning decisions such as dispatching and routing.

In this paper, we present the OFCT prediction model that is currently deployed at Ele.me, which is one of the world's largest OFD platforms and delivers over 10 million meals in more than 200 Chinese cities every day. By dissecting the order fulfillment cycle of a meal order, we identify key factors behind OFCT, and capture them with numerous features constructed using a wide range of data sources. These features are fed into a deep neural network (DNN), which further incorporates representations of couriers, restaurants and delivery destinations to enhance prediction efficacy. Finally, a novel post-processing layer is introduced to improve convergence speed by better accounting for the distributional mismatch between the true OFCT values and those predicted by the model at initialization. Extensive offline and online experiments demonstrate the effectiveness of our approach.

Calendar Graph Neural Networks for Modeling Time Structures in Spatiotemporal User Behaviors

User behavior modeling is important for industrial applications such as demographic attribute prediction, content recommendation, and targeted advertising. Existing methods represent a behavior log as a sequence of adopted items and find sequential patterns; however, concrete location and time information in the behavior log, reflecting dynamic and periodic patterns jointly with the spatial dimension, can be useful for modeling users and predicting their characteristics. In this work, we propose a novel model based on graph neural networks for learning user representations from spatiotemporal behavior data. Our model's architecture incorporates two networked structures. One is a tripartite network of items, sessions, and locations. The other is a hierarchical calendar network of hour, week, and weekday nodes. It first aggregates embeddings of locations and items into session embeddings via the tripartite network, and then generates user embeddings from the session embeddings via the calendar structure. The user embeddings preserve spatial patterns and temporal patterns of varying periodicity (e.g., hourly, weekly, and weekday patterns). It adopts the attention mechanism to model complex interactions among the multiple patterns in user behaviors. Experiments on real datasets (i.e., clicks on news articles in a mobile app) show our approach outperforms strong baselines for predicting missing demographic attributes.

Privileged Features Distillation at Taobao Recommendations

Features play an important role in the prediction tasks of e-commerce recommendations. To guarantee the consistency of off-line training and on-line serving, we usually utilize only the features that are available in both. However, this consistency in turn neglects some discriminative features. For example, when estimating the conversion rate (CVR), i.e., the probability that a user would purchase the item if she clicked it, features like dwell time on the item detail page are informative. However, CVR prediction should be conducted for on-line ranking before the click happens. Thus we cannot get such post-event features during serving.

We define the features that are discriminative but only available during training as the privileged features. Inspired by the distillation techniques which bridge the gap between training and inference, in this work, we propose privileged features distillation (PFD). We train two models, i.e., a student model that is the same as the original one and a teacher model that additionally utilizes the privileged features. Knowledge distilled from the more accurate teacher is transferred to the student, which helps to improve its prediction accuracy. During serving, only the student part is extracted and it relies on no privileged features. We conduct experiments on two fundamental prediction tasks at Taobao recommendations, i.e., click-through rate (CTR) at coarse-grained ranking and CVR at fine-grained ranking. By distilling the interacted features that are prohibited during serving for CTR and the post-event features for CVR, we achieve significant improvements over their strong baselines. During the on-line A/B tests, the click metric is improved by +5.0% in the CTR task, and the conversion metric is improved by +2.3% in the CVR task. Besides, by addressing several issues of training PFD, we obtain training speed comparable to that of the baselines without any distillation.
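The abstract does not give the exact training objective. As a minimal sketch, assuming the common distillation form in which the student's loss combines a hard-label term with a term that pulls the student toward the teacher's predicted probability, a per-example loss for a binary task such as CTR or CVR might look like this. The function names and the `distill_weight` parameter are hypothetical.

```python
import math


def binary_cross_entropy(target, prob, eps=1e-12):
    """Cross-entropy between a (possibly soft) target in [0, 1] and a probability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def student_loss(label, student_prob, teacher_prob, distill_weight=0.5):
    """Hard-label loss plus a weighted distillation term.

    Assumption: the teacher sees privileged features at training time and
    supplies teacher_prob; the student never receives those features.
    """
    hard = binary_cross_entropy(label, student_prob)
    soft = binary_cross_entropy(teacher_prob, student_prob)
    return hard + distill_weight * soft


# A student that agrees with the teacher is penalized less than one that does not.
agree = student_loss(1.0, 0.8, 0.8)
disagree = student_loss(1.0, 0.8, 0.2)
```

At serving time only the student's forward pass is needed, so the distillation term affects training cost but not inference latency.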

Cracking Tabular Presentation Diversity for Automatic Cross-Checking over Numerical Facts

Tabular forms of numerical facts are widespread in the disclosure documents of vertical domains, especially in finance. It is also quite common for the same fact to be mentioned multiple times in different tables with diverse tabular presentations. A firm's disclosure documents are the main source of accounting information for individual investors, and their authenticity is crucial for both the firm's development and investors' investment decisions. However, due to large volumes of tables, frequent updates during editing, and limited time for manual cross-checking, these facts might be inconsistent with each other even after official publishing. Such errors may bring about huge reputational risk and even economic losses, even if the mistakes are unintentional rather than deliberate. This creates an opportunity for Automatic Numerical Cross-Checking over Tables. This paper introduces the key module of such a system, which aims to identify whether a pair of table cells are semantically equivalent, namely referring to the same fact. We observed that, due to tabular presentation diversity, the facts in tabular forms are difficult to parse into relational tuples. Thus, we present an end-to-end solution of binary classification over each pair of table cells, which does not involve explicit semantic parsing over tables. We also discuss the design of this neural model to balance prediction accuracy against inference time for a large number of table cell pairs, and propose some practical techniques to address the issue of extreme class imbalance among pairs. Experiments show that our model achieves macro F1 = 0.8297 in linking semantically equivalent table cells from IPO prospectuses. Finally, an auditing tool is built to support guided cross-checking over financial documents, reducing work hours by 52% to 68%. This system has received wide recognition in the Chinese financial community.
Nine of the top ten Chinese security brokers have adopted this system to support their business of investment banking.
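For readers unfamiliar with the metric reported above, the following is a minimal sketch of macro F1 for a binary pair classifier, using the standard definition (the unweighted mean of per-class F1 scores). The toy labels are invented for illustration and are not data from the paper.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Imbalanced toy data: 8 non-equivalent pairs, 2 equivalent pairs.
y_true = [0] * 8 + [1] * 2
always_negative = [0] * 10
score = macro_f1(y_true, always_negative)
```

A classifier that always predicts the majority class reaches 80% accuracy on this toy data but a macro F1 of only about 0.44, which is why macro F1 is a more informative summary when equivalent pairs are rare.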

GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce

In this paper, we present GrokNet, a deployed image recognition system for commerce applications. GrokNet leverages a multi-task learning approach to train a single computer vision trunk. We achieve a 2.1x improvement in exact product match accuracy when compared to the previous state-of-the-art Facebook product recognition system. We achieve this by training on 7 datasets across several commerce verticals, using 80 categorical loss functions and 3 embedding losses. We share our experience of combining diverse sources with wide-ranging label semantics and image statistics, including learning from human annotations, user-generated tags, and noisy search engine interaction data. GrokNet has demonstrated gains in production applications and operates at Facebook scale.

Learning Instrument Invariant Characteristics for Generating High-resolution Global Coral Reef Maps

Coral reefs are one of the most biologically complex and diverse ecosystems within the shallow marine environment. Unfortunately, these underwater ecosystems are threatened by a number of anthropogenic challenges, including ocean acidification and warming, overfishing, and the continued increase of marine debris in oceans. This requires a comprehensive assessment of the world's coastal environments, including a quantitative analysis on the health and extent of coral reefs and other associated marine species, as a vital Earth Science measurement. However, limitations in observational and technological capabilities inhibit global sustained imaging of the marine environment. Harmonizing multimodal data sets acquired using different remote sensing instruments presents additional challenges, thereby limiting the availability of good quality labeled data for analysis. In this work, we develop a deep learning model for extracting domain invariant features from multimodal remote sensing imagery and creating high-resolution global maps of coral reefs by combining various sources of imagery and limited hand-labeled data available for certain regions. This framework allows us to generate, for the first time, coral reef segmentation maps at 2-meter resolution, which is a significant improvement over the kilometer-scale state-of-the-art maps. Additionally, this framework doubles accuracy and IoU metrics over baselines that do not account for domain invariance.
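The IoU metric mentioned above is the standard intersection-over-union score for segmentation. A minimal sketch for binary masks, with invented toy masks and a hypothetical function name, is:

```python
def intersection_over_union(pred_mask, true_mask):
    """IoU of two binary masks given as equally sized nested lists."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention chosen here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [[1, 1], [0, 0]]
true = [[1, 0], [1, 0]]
iou = intersection_over_union(pred, true)
```

Here one pixel is shared and three pixels are marked in at least one mask, so the IoU is one third.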

Causal Meta-Mediation Analysis: Inferring Dose-Response Function From Summary Statistics of Many Randomized Experiments

It is common in the internet industry to use offline-developed algorithms to power online products that contribute to the success of a business. Offline-developed algorithms are guided by offline evaluation metrics, which are often different from online business key performance indicators (KPIs). To maximize business KPIs, it is important to pick a north star among all available offline evaluation metrics. By noting that online products can be measured by online evaluation metrics, the online counterparts of offline evaluation metrics, we decompose the problem into two parts. Since the offline A/B test literature addresses the first part (counterfactual estimators of offline evaluation metrics that move the same way as their online counterparts), we focus on the second part: the causal effects of online evaluation metrics on business KPIs. The north star of offline evaluation metrics should be the one whose online counterpart causes the most significant lift in the business KPI. We model the online evaluation metric as a mediator and formalize its causality with the business KPI as a dose-response function (DRF). Our novel approach, causal meta-mediation analysis, leverages summary statistics of many existing randomized experiments to identify, estimate, and test the mediator DRF. It is easy to implement and to scale up, and has many advantages over the literature of mediation analysis and meta-analysis. We demonstrate its effectiveness by simulation and implementation on real data.

AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction

Learning feature interactions is crucial for click-through rate (CTR) prediction in recommender systems. In most existing deep learning models, feature interactions are either manually designed or simply enumerated. However, enumerating all feature interactions incurs large memory and computation costs. Even worse, useless interactions may introduce noise and complicate the training process. In this work, we propose a two-stage algorithm called Automatic Feature Interaction Selection (AutoFIS). AutoFIS can automatically identify important feature interactions for factorization models with a computational cost equivalent to training the target model to convergence. In the search stage, instead of searching over a discrete set of candidate feature interactions, we relax the choices to be continuous by introducing architecture parameters. By implementing a regularized optimizer over the architecture parameters, the model can automatically identify and remove the redundant feature interactions during the training process. In the re-train stage, we keep the architecture parameters, which serve as an attention unit to further boost the performance. Offline experiments on three large-scale datasets (two public benchmarks, one private) demonstrate that AutoFIS can significantly improve various FM-based models. AutoFIS has been deployed onto the training platform of the Huawei App Store recommendation service, where a 10-day online A/B test demonstrated that AutoFIS improved the DeepFM model by 20.3% and 20.1% in terms of CTR and CVR respectively.
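As a rough illustration of the continuous relaxation described above, the sketch below attaches one scalar gate (an architecture parameter) to each pairwise interaction of a factorization-machine-style model, so that an interaction whose gate has been driven to zero drops out of the score. This is an assumption-laden sketch of the general idea, not the AutoFIS implementation; the function name and the dictionary-based gate layout are hypothetical, and the regularized optimizer that would learn the gates is omitted.

```python
from itertools import combinations


def gated_interaction_score(embeddings, gates):
    """Sum of gated pairwise inner products between feature embeddings.

    embeddings: list of equal-length float lists, one per feature field.
    gates: dict mapping (i, j) with i < j to a scalar gate; missing pairs
           are treated as pruned (gate 0). Hypothetical layout.
    """
    score = 0.0
    for i, j in combinations(range(len(embeddings)), 2):
        gate = gates.get((i, j), 0.0)
        if gate == 0.0:
            continue  # pruned interaction: no computation is spent on it
        inner = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
        score += gate * inner
    return score


embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
all_on = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
selected = {(0, 1): 1.0, (1, 2): 0.5}
full_score = gated_interaction_score(embeddings, all_on)
pruned_score = gated_interaction_score(embeddings, selected)
```

With all gates set to one the function reduces to the plain second-order interaction term; keeping non-zero gates as multiplicative weights corresponds loosely to the attention-like role the abstract describes for the re-train stage.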

City Metro Network Expansion with Reinforcement Learning

City metro network expansion, a part of transportation network design, aims to design new lines based on the existing metro network. Existing methods in the field of transportation network design either (i) can hardly formulate this problem efficiently, (ii) depend on expert guidance to produce solutions, or (iii) appeal to problem-specific heuristics which are difficult to design. To address these limitations, we propose a reinforcement learning based method for the city metro network expansion problem. In this method, we formulate the metro line expansion as a Markov decision process (MDP), which characterizes the problem as a process of sequential station selection. Then, we train an actor-critic model to design the next metro line on the basis of the existing metro network. The actor is an encoder-decoder network with an attention mechanism to generate the parameterized policy which is used to select the stations. The critic estimates the expected cumulative reward to assist the training of the actor by reducing training variance. The proposed method does not require expert guidance during design, since the learning procedure only relies on the reward calculation to tune the policy for better station selection. It also avoids the difficulty of designing heuristics, because the learned policy formalizes the station selection. Considering origin-destination (OD) trips and social equity, we expand the current metro network in Xi'an, China, based on the real mobility information of 24,770,715 mobile phone users in the whole city. The results demonstrate the advantages of our method compared with existing approaches.

Game Action Modeling for Fine Grained Analyses of Player Behavior in Multi-player Card Games (Rummy as Case Study)

We present a deep learning framework for game action modeling, which enables fine-grained analyses of player behavior. We develop CNN-based supervised models that effectively learn the critical game play decisions from skilled players, and use these models to assess player characteristics in the system, such as their retention, engagement, deposit buckets, etc. We show that with a carefully constructed input format that efficiently represents the game state and history as a multi-dimensional image, along with a custom architecture, the model learns the strategies of the game accurately. It is further enhanced with look-ahead achieved by self-play simulation to better estimate the game state, and this information is used in a new loss function. Next, we show that analyzing the players with these models as reference has immense benefit in understanding player potential in terms of engagement and revenue. We also use the model to understand the various contexts under which players tend to make mistakes, and use these insights to up-skill players.

Cascade-LSTM: A Tree-Structured Neural Classifier for Detecting Misinformation Cascades

Misinformation in social media - such as fake news, rumors, or other forms of deceptive content - poses a significant threat to society and, hence, scalable strategies for an early detection of online cascades with misinformation are in dire need. The prominent approach in detecting online cascades with misinformation builds upon neural networks based on sequences of simple structural features of the propagation dynamics (e.g., cascade size, average retweeting time). However, these structural features neglect large parts of the information in the cascade. As a remedy, we propose a novel tree-structured neural network named Cascade-LSTM.

Our Cascade-LSTM draws upon a tree-structured long short-term memory network that is carefully engineered to the structure of online information cascades. Specifically, we suggest a novel bi-directional encoding similar to the information flow, extend inner nodes with further covariates from retweets, and fuse the network with global information from the root. As a result, our Cascade-LSTM overcomes inherent limitations from feature engineering, since it learns propagation features along the complete cascade. The effectiveness of our Cascade-LSTM is demonstrated based on a classification task to predict the veracity of 2,156 Twitter cascades. We improve the detection of misinformation in terms of AUC over the status quo with cascade features by 2.8%.

Altogether, our Cascade-LSTM entails important implications: (1) It presents the first neural classifier that learns the complete cascade. (2) It demonstrates a promising approach to practitioners for detecting misinformation through mining retweet behavior. (3) The model is fairly general, which ensures widespread applicability for inferences from online cascades.

Personalized Prefix Embedding for POI Auto-Completion in the Search Engine of Baidu Maps

Point of interest auto-completion (POI-AC) is a featured function in the search engine of many Web mapping services. This function keeps suggesting a dynamic list of POIs as a user types each character, and it can dramatically save the effort of typing, which is quite useful on mobile devices. Existing approaches to POI-AC for industrial use mainly adopt various learning to rank (LTR) models with handcrafted features, and even take historically clicked POIs into account for personalization. However, these prior approaches tend to reach performance bottlenecks, as neither heuristic features nor the search history of users can directly model personal input habits. In this paper, we present an end-to-end neural-based framework for POI-AC, which has been recently deployed in the search engine of Baidu Maps, one of the largest Web mapping applications with hundreds of millions of monthly active users worldwide. In order to establish connections among users, their personal input habits, and correspondingly interested POIs, the proposed framework (abbr. P3AC) is composed of three components, i.e., a multi-layer Bi-LSTM network to adapt to personalized prefixes, a CNN-based network to model multi-sourced information on POIs, and a triplet ranking loss function to optimize both personalized prefix embeddings and distributed representations of POIs. We first use large-scale real-world search logs of Baidu Maps to assess the performance of P3AC offline, measured by multiple metrics, including Mean Reciprocal Rank (MRR), Success Rate (SR), and normalized Discounted Cumulative Gain (nDCG). Extensive experimental results demonstrate that it can achieve substantial improvements. We then launched it online and observed that some other critical indicators of user satisfaction, such as the average number of keystrokes and the average typing speed at keystrokes in a POI-AC session, decrease significantly as well.
In addition, we have released both the source codes of P3AC and the experimental data to the public for reproducibility tests.
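Two of the offline metrics named above can be sketched in a few lines. The sketch assumes one clicked POI per auto-completion session and takes Success Rate at k to mean the fraction of sessions whose clicked POI appears among the top k suggestions; the abstract does not spell out its exact definition, and the toy sessions below are invented.

```python
def mean_reciprocal_rank(ranked_lists, clicked):
    """Mean of 1/rank of the clicked item; sessions that miss it contribute 0."""
    total = 0.0
    for ranking, target in zip(ranked_lists, clicked):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(ranked_lists)


def success_rate_at_k(ranked_lists, clicked, k):
    """Fraction of sessions whose clicked item is within the top k suggestions."""
    hits = sum(
        1 for ranking, target in zip(ranked_lists, clicked) if target in ranking[:k]
    )
    return hits / len(ranked_lists)


# Three toy sessions: suggestion lists and the POI the user finally clicked.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
clicked = ["a", "a", "x"]
mrr = mean_reciprocal_rank(rankings, clicked)
sr1 = success_rate_at_k(rankings, clicked, 1)
sr2 = success_rate_at_k(rankings, clicked, 2)
```

In the toy data the clicked POI is ranked first, second, and not at all, giving reciprocal ranks of 1, 0.5, and 0, so the MRR is 0.5.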

Category-Specific CNN for Visual-aware CTR Prediction at JD.com

As one of the largest B2C e-commerce platforms in China, JD.com also powers a leading advertising system, serving millions of advertisers with fingertip connection to hundreds of millions of customers. In our system, as in most e-commerce scenarios, ads are displayed with images. This makes visual-aware Click Through Rate (CTR) prediction of crucial importance to both business effectiveness and user experience. Existing algorithms usually extract visual features using off-the-shelf Convolutional Neural Networks (CNNs) and fuse the visual and non-visual features late to produce the final CTR prediction. Despite being extensively studied, this field still faces two key challenges. First, although encouraging progress has been made in offline studies, applying CNNs in real systems remains non-trivial, due to the strict requirements for efficient end-to-end training and low-latency online serving. Second, the off-the-shelf CNNs and late fusion architectures are suboptimal. Specifically, off-the-shelf CNNs were designed for classification and thus never take categories as input features, whereas in e-commerce, categories are precisely labeled and contain abundant visual priors that help the visual modeling. Unaware of the ad category, these CNNs may extract some unnecessary category-unrelated features, wasting the CNN's limited expressive capacity. To overcome the two challenges, we propose Category-specific CNN (CSCNN) specifically for CTR prediction. CSCNN incorporates the category knowledge early, with a lightweight attention module on each convolutional layer. This enables CSCNN to extract expressive category-specific visual patterns that benefit the CTR prediction. Offline experiments on a benchmark and a 10 billion scale real production dataset from JD, together with an online A/B test, show that CSCNN outperforms all compared state-of-the-art algorithms.
We also build a highly efficient infrastructure to accomplish end-to-end training with CNN on the 10 billion scale real production dataset within 24 hours, and meet the low latency requirements of online system (20ms on CPU). CSCNN is now deployed in the search advertising system of JD, serving the main traffic of hundreds of millions of active users.

ConSTGAT: Contextual Spatial-Temporal Graph Attention Network for Travel Time Estimation at Baidu Maps

The task of travel time estimation (TTE), which estimates the travel time for a given route and departure time, plays an important role in intelligent transportation systems such as navigation, route planning, and ride-hailing services. This task is challenging because of many essential aspects, such as traffic prediction and contextual information. First, the accuracy of traffic prediction is strongly correlated with the traffic speed of the road segments in a route. Existing work mainly adopts spatial-temporal graph neural networks to improve the accuracy of traffic prediction, where spatial and temporal information is used separately. However, one drawback is that the spatial and temporal correlations are not fully exploited to obtain better accuracy. Second, contextual information of a route, i.e., the connections of adjacent road segments in the route, is an essential factor that impacts the driving speed. Previous work mainly uses sequential encoding models to address this issue. However, it is difficult to scale up sequential models to large-scale real-world services. In this paper, we propose an end-to-end neural framework named ConSTGAT, which integrates traffic prediction and contextual information to address these two problems. Specifically, we first propose a spatial-temporal graph neural network that adopts a novel graph attention mechanism, which is designed to fully exploit the joint relations of spatial and temporal information. Then, in order to efficiently take advantage of the contextual information, we design a computationally efficient model that applies convolutions over local windows to capture a route's contextual information and further employs multi-task learning to improve the performance. In this way, the travel time of each road segment can be computed in parallel and in advance. Extensive experiments conducted on large-scale real-world datasets demonstrate the superiority of ConSTGAT. 
In addition, ConSTGAT has already been deployed in production at Baidu Maps, and it successfully keeps serving tens of billions of requests every day. This confirms that ConSTGAT is a practical and robust solution for large-scale real-world TTE services.

Faster Secure Data Mining via Distributed Homomorphic Encryption

Due to the rising demand for privacy in data mining, Homomorphic Encryption (HE) has recently been receiving increasing attention for its ability to perform computations over the encrypted domain. By using HE, it is possible to securely outsource model learning to powerful but not fully trusted public cloud computing environments. However, HE-based training scales poorly because of its high computational complexity, and it remains an open problem whether HE can be applied to large-scale problems. In this paper, we propose a novel, general, distributed HE-based data mining framework as one step towards solving the scaling problem. The main idea of our approach is to accept slightly more communication overhead in exchange for a shallower computational circuit in HE, so as to reduce the overall complexity. We verify the efficiency and effectiveness of our new framework by testing it on various data mining algorithms and benchmark datasets. For example, we successfully train a logistic regression model to recognize the digits 3 and 8 in around 5 minutes, while a centralized counterpart needs almost 2 hours.
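To illustrate what "computation over the encrypted domain" means, the following is a minimal sketch of an additively homomorphic scheme. The abstract does not name the HE scheme used by the framework, so Paillier is swapped in here purely as a well-known example; the tiny primes, and the function names `keygen`, `encrypt`, `decrypt`, `add_encrypted`, and `scale_encrypted`, are assumptions for illustration and offer no security.

```python
import math
import random


def keygen(p: int, q: int):
    """Build a toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+); valid for g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt plaintext m (0 <= m < n) with generator g = n + 1."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:  # the random blinding factor must be a unit mod n
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, priv, c: int) -> int:
    """Recover the plaintext from ciphertext c."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def scale_encrypted(n: int, c: int, k: int) -> int:
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    return pow(c, k, n * n)


# Hypothetical toy parameters: 101 and 113 are small primes chosen for readability.
n, priv = keygen(101, 113)
c_sum = add_encrypted(n, encrypt(n, 3), encrypt(n, 8))
c_scaled = scale_encrypted(n, encrypt(n, 3), 5)
print(decrypt(n, priv, c_sum), decrypt(n, priv, c_scaled))
```

The party holding only ciphertexts can compute sums and scalar multiples without seeing the inputs. Deeper computations (for example, the nonlinearity in logistic regression) require richer schemes with deeper circuits, which is the cost the abstract proposes to trade against communication.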

Contagious Chain Risk Rating for Networked-guarantee Loans

Small and medium-sized enterprises (SMEs) are allowed to guarantee each other and form complex loan networks to receive loans from banks during the economic expansion stage. However, external shocks may weaken the robustness of these networks, and an accidental default may spread across the network and lead to large-scale defaults, or even a systemic crisis. Thus, predicting and rating the default contagion chains in the guarantee network, in order to reduce or prevent potential systemic financial risk, is of grave concern to the regulatory authority and the banks. Existing credit risk models in the banking industry use machine learning methods to generate a credit score for each customer. Such approaches ignore the contagion risk from guarantee chains and need extensive feature engineering with deep domain expertise. To this end, we propose a novel approach that rates the risk of contagion chains in the banking industry with a deep neural network. We employed a temporal inter-chain attention network on graph-structured loan behavior data to compute risk scores for the contagion chains. We show that our approach is significantly better than the state-of-the-art baselines on a dataset from a major financial institution in Asia. In addition, we conducted empirical studies on the real-world loan dataset for risk assessment. The proposed approach enabled loan managers to monitor risks from a broader view and avoid significant financial losses for the financial institution.

AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types

Can one build a knowledge graph (KG) for all products in the world? Knowledge graphs have firmly established themselves as valuable sources of information for search and question answering, and it is natural to wonder if a KG can contain information about products offered at online retail sites. There have been several successful examples of generic KGs, but organizing information about products poses many additional challenges, including sparsity and noise of structured data for products, complexity of the domain with millions of product types and thousands of attributes, heterogeneity across large number of categories, as well as large and constantly growing number of products.

We describe AutoKnow, our automatic (self-driving) system that addresses these challenges. The system includes a suite of novel techniques for taxonomy construction, product property identification, knowledge extraction, anomaly detection, and synonym discovery. AutoKnow is (a) automatic, requiring little human intervention, (b) multi-scalable, scalable in multiple dimensions (many domains, many products, and many attributes), and (c) integrative, exploiting rich customer behavior logs. AutoKnow has been operational in collecting product knowledge for over 11K product types.

Personalized Image Retrieval with Sparse Graph Representation Learning

Personalization is essential for enhancing the customer experience in retrieval tasks. In this paper, we develop a novel method, CA-GCN, for personalized image retrieval in the Adobe Stock image system. CA-GCN leverages user behavior data in a Graph Convolutional Neural Network (GCN) model to learn user and image embeddings simultaneously. Standard GCN performs poorly on sparse user-image interaction graphs due to the limited knowledge gain from less representative neighbors. To address this challenge, we propose to augment the sparse user-image interaction data by considering the similarities among images. Specifically, we detect clusters of similar images and introduce a set of hidden super-nodes in the graph to represent clusters. We show that such an augmented graph structure can significantly improve the retrieval performance on real-world data collected from the Adobe Stock service. In particular, when testing the proposed method on real users' stock image retrieval sessions, the average click position improves from 70 to 51.

Comprehensive Information Integration Modeling Framework for Video Titling

In e-commerce, consumer-generated videos, which generally convey consumers' individual preferences for different aspects of certain products, are massive in volume. To recommend these videos to potential consumers more effectively, diverse and catchy video titles are critical. However, consumer-generated videos are seldom accompanied by appropriate titles. To bridge this gap, we integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework. Although automatic video titling is very useful and in demand, it is much less addressed than video captioning. The latter focuses on generating sentences that describe videos as a whole, while our task requires product-aware multi-grained video analysis. To tackle this issue, the proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization. Specifically, the granular-level interaction modeling first utilizes temporal-spatial landmark cues, descriptive words, and abstractive attributes to build three individual graphs and recognizes the intra-actions in each graph through Graph Neural Networks (GNN). Then the global-local aggregation module is proposed to model inter-actions across graphs and aggregate heterogeneous graphs into a holistic graph representation. The abstraction-level story-line summarization further considers both frame-level video features and the holistic graph to utilize the interactions between products and backgrounds, and generate the story-line topic of the video. We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform, and will make the desensitized version publicly available to nourish further development of the research community.
Relatively extensive experiments on various datasets demonstrate the efficacy of the proposed method.

Acoustic Measures for Real-Time Voice Coaching

Our voices can convey many different types of thoughts and intent; how our voices carry them is often not consciously controlled and as a consequence, unintended effects may arise that negatively impact our relationships. How we say things is as important as what we say. This paper presents methodologies for computing a set of physical properties from sound waves of a speaker's voice directly, referred to as acoustic measures. Experiments are designed and conducted to establish the correlations between physical properties and auditory measures for human perception of sound waves. Based on these correlations, a voice coaching app can guide users, in real-time or deferred retrospective, to modify their speech's auditory measures, such as rate of speech, energy level, and intonation, to achieve their intended communication goals.

Geodemographic Influence Maximization

Given a set of locations in a city, on which ones should we place ads so as to reach as many people as possible within a limited budget? Past research has addressed this question under the assumption that dense trajectory data are available to determine the reach of each ad. However, the data that are available in most industrial settings do not consist of dense, long-range trajectories; instead, they consist of statistics on people's short-range point-to-point movements. In this paper, we address the natural problem that arises from such data: given a distribution of population and point-to-point movement statistics over a network, find a set of locations within a budget that achieves maximum expected reach. We call this problem geodemographic influence maximization (GIM). We show that the problem is NP-hard, but its objective function is monotone and submodular, and thus admits a greedy algorithm with a $\frac{1}{2}\left(1-\frac{1}{e}\right)$ approximation ratio. Still, this algorithm is inapplicable to large-scale data for high-frequency digital signage ads. We develop an efficient deterministic algorithm, Lazy-Sower, exploiting a novel, tight double-bounding scheme of marginal influence gain as well as the locality properties of the problem; a learning-based variant, NN-Sower, utilizes randomization and deep learning to further improve efficiency, with a slight loss of quality. Our exhaustive experimental study on two real-world urban datasets demonstrates the efficacy and efficiency of our solutions compared to baselines.
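The greedy baseline mentioned above can be sketched as follows. This is not Lazy-Sower or NN-Sower; it is a minimal illustration of the standard budgeted greedy construction usually associated with the approximation ratio quoted in the abstract: run a cost-benefit greedy pass, then return the better of that result and the best single affordable location. The reach model here (each location covers a fixed set of people, all weighted equally) and the names `budgeted_greedy`, `covers`, and `costs` are simplifying assumptions; the paper's objective is an expected reach over movement statistics.

```python
from typing import Dict, List, Set


def budgeted_greedy(covers: Dict[str, Set[int]],
                    costs: Dict[str, float],
                    budget: float) -> List[str]:
    """Pick locations under a budget to maximize the number of people covered."""
    chosen: List[str] = []
    covered: Set[int] = set()
    spent = 0.0
    remaining = set(covers)

    while remaining:
        best, best_ratio = None, 0.0
        # Sorted iteration keeps tie-breaking deterministic.
        for loc in sorted(remaining):
            if spent + costs[loc] > budget:
                continue  # skip locations that no longer fit the budget
            gain = len(covers[loc] - covered)  # marginal reach of this location
            ratio = gain / costs[loc]
            if ratio > best_ratio:
                best, best_ratio = loc, ratio
        if best is None:
            break  # nothing affordable adds any reach
        chosen.append(best)
        covered |= covers[best]
        spent += costs[best]
        remaining.discard(best)

    # Compare against the best single affordable location.
    affordable = [loc for loc in sorted(covers) if costs[loc] <= budget]
    single = max(affordable, key=lambda loc: len(covers[loc]), default=None)
    if single is not None and len(covers[single]) > len(covered):
        return [single]
    return chosen


# Hypothetical toy instance: three candidate ad locations.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}
costs = {"a": 2.0, "b": 1.0, "c": 1.0}
print(budgeted_greedy(covers, costs, budget=3.0))
```

Each greedy step re-evaluates every remaining location, which is what makes the plain algorithm expensive on large inputs and motivates bounding schemes that avoid recomputing marginal gains.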

A Self-Evolving Mutually-Operative Recurrent Network-based Model for Online Tool Condition Monitoring in Delay Scenario

With the increasing demand for product supply, manufacturers urgently need online tool condition monitoring (TCM) that does not compromise on maintenance cost in terms of time and manpower. However, the existing machine learning models for TCM are mostly offline and not suitable for the non-stationary environment of machining settings. Moreover, access to the ground truth always requires a shutdown of the machining process, and the existing models are severely affected by the resulting delay in receiving labelled samples. To tackle these issues, we propose SERMON, a novel learning model based on a pair of self-evolving mutually-operative recurrent neural networks. SERMON is well-equipped with features for automated and real-time monitoring of machine fault status even in the finite/infinite label delay scenario. The experimental evaluation of SERMON using a real-world dataset from a 3D-printing process demonstrates its effectiveness in online fault detection under both the non-stationary and the delayed-label context of the machining process. An additional comparative study on large-scale benchmark streaming datasets further demonstrates the scalability of SERMON.

Maximizing Cumulative User Engagement in Sequential Recommendation: An Online Optimization Perspective

To maximize cumulative user engagement (e.g., cumulative clicks) in sequential recommendation, it is often necessary to trade off two potentially conflicting objectives: pursuing higher immediate user engagement (e.g., click-through rate) and encouraging user browsing (i.e., more items exposed). Existing works often study these two tasks separately, and thus tend to produce sub-optimal results. In this paper, we study this problem from an online optimization perspective, and propose a flexible and practical framework to explicitly trade off longer user browsing length and high immediate user engagement. Specifically, by considering items as actions, user's requests as states and user leaving as an absorbing state, we formulate each user's behavior as a personalized Markov decision process (MDP), and the problem of maximizing cumulative user engagement is reduced to a stochastic shortest path (SSP) problem. Meanwhile, with immediate user engagement and quit probability estimation, it is shown that the SSP problem can be efficiently solved via dynamic programming. Experiments on real-world datasets demonstrate the effectiveness of the proposed approach. Moreover, this approach is deployed at a large e-commerce platform, achieving over 7% improvement in cumulative clicks.
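The trade-off described above can be made concrete with a small dynamic program. This is a simplified finite-horizon sketch, not the paper's formulation: it assumes that each request step `t` offers a list of candidate items, each with an estimated click probability and quit probability, and that the user either quits (the absorbing state) or moves on to step `t + 1`. The function name `plan_sequence` and the input layout are hypothetical.

```python
from typing import List, Tuple


def plan_sequence(candidates: List[List[Tuple[float, float]]]):
    """Backward induction over request steps.

    candidates[t] holds (click_prob, quit_prob) pairs for the items available
    at step t. Returns the expected cumulative clicks and the index of the
    item chosen at each step.
    """
    horizon = len(candidates)
    future_value = 0.0  # expected clicks after the final step
    policy = [0] * horizon

    for t in range(horizon - 1, -1, -1):
        best_idx, best_val = 0, float("-inf")
        for i, (click_prob, quit_prob) in enumerate(candidates[t]):
            # Immediate engagement plus the value of continuing to browse.
            val = click_prob + (1.0 - quit_prob) * future_value
            if val > best_val:
                best_idx, best_val = i, val
        policy[t] = best_idx
        future_value = best_val

    return future_value, policy


# Hypothetical estimates: at step 0 the first item has the higher click
# probability but is far more likely to end the session.
steps = [
    [(0.5, 0.6), (0.4, 0.1)],
    [(0.4, 0.2)],
    [(0.4, 0.2)],
]
value, policy = plan_sequence(steps)
print(value, policy)
```

In this toy instance the item with the lower click probability is preferred at the first step, because keeping the user in the session is worth more than the immediate click.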

Domain Specific Knowledge Graphs as a Service to the Public: Powering Social-Impact Funding in the US

Web and mobile technologies enable ubiquitous access to information. Yet, it is getting harder, even for subject matter experts, to quickly identify quality, trustworthy, and reliable content available online through search engines powered by advanced knowledge graphs. This paper explores the practical applications of Domain Specific Knowledge Graphs that allow for the extraction of information from trusted published and unpublished sources, to map the extracted information to an ontology defined in collaboration with sector experts, and to enable the public to go from single queries into ongoing conversations meeting their knowledge needs reliably. We focused on Social-Impact Funding, an area of need for over one million nonprofit organizations, foundations, government entities, social entrepreneurs, impact investors, and academic institutions in the US.

LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition

Speech synthesis (text to speech, TTS) and recognition (automatic speech recognition, ASR) are important speech tasks, and require a large amount of text and speech pairs for model training. However, there are more than 6,000 languages in the world and most languages lack speech training data, which poses significant challenges when building TTS and ASR systems for extremely low-resource languages. In this paper, we develop LRSpeech, a TTS and ASR system under the extremely low-resource setting, which can support rare languages with low data cost. LRSpeech consists of three key techniques: 1) pre-training on rich-resource languages and fine-tuning on low-resource languages; 2) dual transformation between TTS and ASR to iteratively boost the accuracy of each other; 3) knowledge distillation to customize the TTS model on a high-quality target-speaker voice and improve the ASR model on multiple voices. We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech. Experimental results show that LRSpeech 1) achieves high quality for TTS in terms of both intelligibility (more than 98% intelligibility rate) and naturalness (above 3.5 mean opinion score (MOS)) of the synthesized speech, which satisfies the requirements for industrial deployment, 2) achieves promising recognition accuracy for ASR, and 3) last but not least, uses extremely low-resource training data. We also conduct comprehensive analyses on LRSpeech with different amounts of data resources, and provide valuable insights and guidance for industrial deployment. We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.

Doing in One Go: Delivery Time Inference Based on Couriers' Trajectories

The rapid development of e-commerce requires efficient and reliable logistics services. Nowadays, couriers are still the main solution to the "last mile" problem in logistics. They are usually required to record the accurate delivery time of each parcel manually, which provides vital information for applications like delivery insurance, delivery performance evaluation, and customer available time discovery. Couriers' trajectories generated by their PDAs provide a chance to infer the delivery time automatically and ease the burden on the couriers. However, directly using the nearest stay point to infer the delivery time is unsatisfactory due to two challenges: 1) inaccurate delivery locations, and 2) various stay scenarios. To this end, we propose Delivery Time Inference (DTInf), to automatically infer the delivery time of waybills based on couriers' trajectories. Our solution is composed of three steps: 1) Data Pre-processing, which detects stay points from trajectories, and separates stay points and waybills by delivery trips, 2) Delivery Location Correction, which infers true delivery locations of waybills by mining historical deliveries, and 3) Delivery Event-based Matching, which selects the best-matched stay point for waybills in the same delivery location to infer the delivery time. Extensive experiments and case studies based on large-scale real-world waybill and trajectory data from JD Logistics confirm the effectiveness of our approach. Finally, we introduce a system based on DTInf, which is deployed and used internally in JD Logistics.
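The stay-point detection in the pre-processing step can be pictured with a common distance-and-duration heuristic. The abstract does not say which detector DTInf uses, so the sketch below is an assumption: a stay point is a run of consecutive GPS fixes that remain within `max_dist_m` of the first fix for at least `min_duration_s`. The thresholds and the function names `haversine_m` and `detect_stay_points` are illustrative only.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # (latitude, longitude, timestamp in seconds)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude pairs."""
    radius = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def detect_stay_points(traj: List[Point],
                       max_dist_m: float = 20.0,
                       min_duration_s: float = 60.0):
    """Return (mean_lat, mean_lon, arrive_time, leave_time) for each stay."""
    stays = []
    i, n = 0, len(traj)
    while i < n:
        j = i + 1
        # Extend the run while fixes stay close to the anchor fix i.
        while j < n and haversine_m(traj[i][0], traj[i][1],
                                    traj[j][0], traj[j][1]) <= max_dist_m:
            j += 1
        if traj[j - 1][2] - traj[i][2] >= min_duration_s:
            segment = traj[i:j]
            mean_lat = sum(p[0] for p in segment) / len(segment)
            mean_lon = sum(p[1] for p in segment) / len(segment)
            stays.append((mean_lat, mean_lon, segment[0][2], segment[-1][2]))
            i = j  # continue after the stay
        else:
            i += 1  # too short to count as a stay; slide the anchor forward
    return stays


# Hypothetical trajectory: 90 seconds at one spot, then movement.
trajectory = [
    (39.90, 116.40, 0.0),
    (39.90, 116.40, 30.0),
    (39.90, 116.40, 90.0),
    (39.91, 116.40, 120.0),
    (39.92, 116.40, 150.0),
]
print(detect_stay_points(trajectory))
```

A nearest-stay-point rule would then assign each waybill the time of the closest detected stay, which is the naive approach the abstract describes as unsatisfactory when recorded delivery locations are inaccurate.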

Improving Deep Learning for Airbnb Search

The application of deep learning to search ranking was one of the most impactful product improvements at Airbnb. But what comes next after you launch a deep learning model? In this paper we describe the journey beyond, discussing what we refer to as the ABCs of improving search: A for architecture, B for bias and C for cold start. For architecture, we describe a new ranking neural network, focusing on the process that evolved our existing DNN beyond a fully connected two layer network. On handling positional bias in ranking, we describe a novel approach that led to one of the most significant improvements in tackling inventory that the DNN historically found challenging. To solve cold start, we describe our perspective on the problem and changes we made to improve the treatment of new listings on the platform. We hope ranking teams transitioning to deep learning will find this a practical case study of how to iterate on DNNs.

General-Purpose User Embeddings based on Mobile App Usage

In this paper, we report our recent practice at Tencent for user modeling based on mobile app usage. User behaviors on mobile app usage, including retention, installation, and uninstallation, can be a good indicator of both the long-term and short-term interests of users. For example, if a user has recently installed Snapseed, she might have a growing interest in photography. Such information is valuable for numerous downstream applications, including advertising and recommendations. Traditionally, user modeling from mobile app usage relies heavily on handcrafted feature engineering, which requires onerous human work for different downstream applications, and could be sub-optimal without domain experts. However, automatic user modeling based on mobile app usage faces unique challenges: (1) retention, installation, and uninstallation are heterogeneous but need to be modeled collectively, (2) user behaviors are distributed unevenly over time, and (3) many long-tailed apps suffer from serious sparsity. In this paper, we present a tailored Auto Encoder-coupled Transformer Network (AETN), by which we overcome these challenges and achieve the goals of reducing manual effort and boosting performance. We have deployed the model at Tencent, and both online and offline experiments from multiple domains of downstream applications have demonstrated the effectiveness of the output user embeddings.

Unsupervised Translation via Hierarchical Anchoring: Functional Mapping of Places across Cities

Unsupervised translation has become a popular task in natural language processing (NLP) due to difficulties in collecting large-scale parallel datasets. In the urban computing field, place embeddings generated using human mobility patterns via recurrent neural networks are used to understand the functionality of urban areas. Translating place embeddings across cities allows us to transfer knowledge across cities, which may be used for various downstream tasks such as planning new store locations. Despite such advances, current methods fail to translate place embeddings across domains with different scales (e.g., Tokyo to Niigata), due to the straightforward adoption of neural machine translation (NMT) methods from NLP, where vocabulary sizes are similar across languages. We refer to this issue as the domain imbalance problem in unsupervised translation tasks. We address this problem by proposing an unsupervised translation method that translates embeddings by exploiting common hierarchical structures that exist across imbalanced domains. The effectiveness of our method is tested using place embeddings generated from mobile phone data in 6 Japanese cities of heterogeneous sizes. Validation using land-use data shows that using hierarchical anchors improves the translation accuracy across imbalanced domains. Our method is agnostic to input data type, and thus could be applied to unsupervised translation tasks in various fields in addition to linguistics and urban computing.

Debiasing Grid-based Product Search in E-commerce

The widespread usage of e-commerce websites in daily life and the resulting wealth of implicit feedback data form the foundation for systems that train and test e-commerce search ranking algorithms. While convenient to collect, implicit feedback data inherently suffers from various types of bias since user feedback is limited to products they are exposed to by existing search ranking algorithms and impacted by how the products are displayed. In the literature, a vast majority of existing methods have been proposed towards unbiased learning to rank for list-based web search scenarios. However, such methods cannot be directly adopted by e-commerce websites mainly for two reasons. First, in e-commerce websites, search engine results pages (SERPs) are displayed in 2-dimensional grids. The existing methods have not considered the difference in user behavior between list-based web search and grid-based product search. Second, there can be multiple types of implicit feedback (e.g., clicks and purchases) on e-commerce websites. We aim to utilize all types of implicit feedback as the supervision signals. In this work, we extend unbiased learning to rank to the world of e-commerce search via considering a grid-based product search scenario. We propose a novel framework which (1) forms the theoretical foundations to allow multiple types of implicit feedback in unbiased learning to rank and (2) incorporates the row skipping and slower decay click models to capture unique user behavior patterns in grid-based product search for inverse propensity scoring. Through extensive experiments on real-world e-commerce search log datasets across browsing devices and product taxonomies, we show that the proposed framework outperforms the state of the art unbiased learning to rank algorithms. These results also reveal important insights on how user behavior patterns vary in e-commerce SERPs across browsing devices and product taxonomies.

Forecasting the Evolution of Hydropower Generation

Hydropower is the largest renewable energy source for electricity generation in the world, with numerous benefits in terms of environmental protection (near-zero air pollution and climate impact), cost-effectiveness (long-term use, without significant impacts of market fluctuation), and reliability (quick response to surges in demand). However, the effectiveness of hydropower plants is affected by multiple factors such as reservoir capacity, rainfall, temperature and fluctuating electricity demand, and particularly by their complicated relationships, which makes the prediction/recommendation of station operational output a difficult challenge. In this paper, we present DeepHydro, a novel stochastic method for modeling multivariate time series (e.g., water inflow/outflow and temperature) and forecasting power generation of hydropower stations. DeepHydro captures temporal dependencies in co-evolving time series with a new conditioned latent recurrent neural network, which not only considers the hidden states of observations but also preserves the uncertainty of latent variables. We introduce a generative network parameterized on a continuous normalizing flow to approximate the complex posterior distribution of multivariate time series data, and further use neural ordinary differential equations to estimate the continuous-time dynamics of the latent variables constituting the observable data. This allows our model to deal with discrete observations in the context of continuous dynamic systems, while being robust to noise. We conduct extensive experiments on real-world datasets from a large power generation company consisting of cascade hydropower stations. The experimental results demonstrate that the proposed method can effectively predict the power production and significantly outperform the candidate baseline approaches.

Salience and Market-aware Skill Extraction for Job Targeting

At LinkedIn, we want to create economic opportunity for everyone in the global workforce. To make this happen, LinkedIn offers a reactive Job Search system, and a proactive Jobs You May Be Interested In (JYMBII) system to match the best candidates with their dream jobs. One of the most challenging tasks in developing these systems is to properly extract important skill entities from job postings and then target members with matched attributes. In this work, we show that the commonly used text-based, salience- and market-agnostic skill extraction approach is sub-optimal because it only considers skill mentions and ignores the salience level of a skill and its market dynamics, i.e., the influence of market supply and demand on the importance of skills. To address the above drawbacks, we present Job2Skills, our deployed salience and market-aware skill extraction system. The proposed Job2Skills shows promising results in improving the online performance of job recommendation (JYMBII) (+1.92% job apply) and skill suggestions for job posters (-37% suggestion rejection rate). Lastly, we present case studies to show interesting insights that contrast traditional skill recognition methods with the proposed Job2Skills at the occupation, industry, country, and individual skill levels. Based on the above promising results, we deployed Job2Skills online to extract job targeting skills for all 20M job postings served at LinkedIn.

DATE: Dual Attentive Tree-aware Embedding for Customs Fraud Detection

Intentional manipulation of invoices that leads to undervaluation of trade goods is the most common type of customs fraud, used to avoid ad valorem duties and taxes. To secure government revenue without interrupting legitimate trade flows, customs administrations around the world strive to develop ways to detect illicit trades. This paper proposes DATE, a model of Dual-task Attentive Tree-aware Embedding, to classify and rank illegal trade flows that contribute the most to the overall customs revenue when caught. The strength of DATE comes from combining a tree-based model for interpretability and transaction-level embeddings with dual attention mechanisms. To accurately identify illicit transactions and predict tax revenue, DATE learns simultaneously from the illicitness and surtax of each transaction. With five years of customs import data and a test illicit ratio of 2.24%, DATE shows a remarkable precision of 92.7% on illegal cases and a recall of 49.3% on revenue after inspecting only 1% of all trade flows. We also discuss issues in deploying DATE in the Nigeria Customs Service, in collaboration with the World Customs Organization.

User Sentiment as a Success Metric: Persistent Biases Under Full Randomization

We study user sentiment (reported via optional surveys) as a metric for fully randomized A/B tests. Both user-level covariates and treatment assignment can impact response propensity. We show that a simple mean comparison produces biased population level estimates and propose a set of consistent estimators for the average and local treatment effects on treated and respondent users. We show that our problem can be mapped onto the intersection of the missing data problem and observational causal inference, and we identify conditions under which consistent estimators exist. Finally, we evaluate the performance of estimators and find that more complicated models do not necessarily provide superior performance as long as models satisfy consistency criteria.
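The bias of a simple mean comparison can be seen in a small worked example. The sketch below uses inverse response-propensity weighting, a standard estimator from the missing-data literature; the abstract does not state which estimators the paper proposes, so this is an illustration of the problem rather than the authors' method. It assumes the response propensity depends only on an observed covariate and is known, and the names `naive_mean` and `ipw_mean` are hypothetical.

```python
from typing import List, Tuple

# Each respondent is (sentiment_score, response_propensity).
Response = Tuple[float, float]


def naive_mean(responses: List[Response]) -> float:
    """Plain average over users who answered the survey."""
    return sum(score for score, _ in responses) / len(responses)


def ipw_mean(responses: List[Response]) -> float:
    """Self-normalized inverse-propensity-weighted average.

    Each respondent stands in for 1 / propensity users like them.
    """
    weighted_sum = sum(score / prop for score, prop in responses)
    total_weight = sum(1.0 / prop for _, prop in responses)
    return weighted_sum / total_weight


# Hypothetical population: 4 users in group A (sentiment 1.0, respond with
# probability 0.5) and 4 users in group B (sentiment 0.0, respond with
# probability 0.25), so the true population mean is 0.5.
# Expected respondents: 2 from group A and 1 from group B.
respondents = [(1.0, 0.5), (1.0, 0.5), (0.0, 0.25)]
print(naive_mean(respondents), ipw_mean(respondents))
```

The naive mean overstates sentiment because the happier group answers more often. If a treatment also shifts who responds, the same distortion carries over to the treatment-versus-control comparison even under full randomization.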

Improving Recommendation Quality in Google Drive

Quick Access is a machine-learned system in Google Drive that predicts which files a user wants to open. Adding Quick Access recommendations to the Drive homepage cut the amount of time that users spend locating their files in half. Aggregated over the ~1 billion users of Drive, the time saved adds up to ~1000 work weeks every day. In this paper, we discuss both the challenges of iteratively improving the quality of a personal recommendation system and the variety of approaches that we took to improve this feature. We explored different deep network architectures, novel modeling techniques, additional data sources, and the effects of latency and biases in the UX. We share both pitfalls and successes in our attempts to improve this product, and also discuss how we scaled and managed the complexity of the system. We believe that these insights will be especially useful to those who are working with private corpora as well as those who are building a large-scale production recommendation system.

Large-Scale Training System for 100-Million Classification at Alibaba

In the last decades, extreme classification has become an essential topic for deep learning. It has achieved great success in many areas, especially in computer vision and natural language processing (NLP). However, it is very challenging to train a deep model with millions of classes due to the memory and computation explosion in the last output layer. In this paper, we propose a large-scale training system to address these challenges. First, we build a hybrid parallel training framework to make the training process feasible. Second, we propose a novel softmax variation named KNN softmax, which reduces both the GPU memory consumption and computation costs and improves the throughput of training. Then, to eliminate the communication overhead, we propose a new overlapping pipeline and a gradient sparsification method. Furthermore, we design a fast continuous convergence strategy to reduce total training iterations by adaptively adjusting the learning rate and updating model parameters. With the help of all the proposed methods, we gain 3.9× throughput of our training system and reduce almost 60% of training iterations. The experimental results show that, using an in-house 256-GPU cluster, we could train a classifier of 100 million classes on the Alibaba Retail Product Dataset in about five days while achieving accuracy comparable to the naive softmax training process.

Mining Implicit Relevance Feedback from User Behavior for Web Question Answering

Training and refreshing a web-scale Question Answering (QA) system for a multi-lingual commercial search engine often requires a huge amount of training examples. One principled idea is to mine implicit relevance feedback from user behavior recorded in search engine logs. All previous works on mining implicit relevance feedback target the relevance of web documents rather than passages. Due to several unique characteristics of QA tasks, the existing user behavior models for web documents cannot be applied to infer passage relevance. In this paper, we make the first study to explore the correlation between user behavior and passage relevance, and propose a novel approach for mining training data for Web QA. We conduct extensive experiments on four test datasets and the results show our approach significantly improves the accuracy of passage ranking without extra human labeled data. In practice, this work has proved effective in substantially reducing the human labeling cost for the QA service in a global commercial search engine, especially for languages with low resources. Our techniques have been deployed in multi-language services.

Controllable Multi-Interest Framework for Recommendation

Recently, neural networks have been widely used in e-commerce recommender systems, owing to the rapid development of deep learning. We formalize the recommender system as a sequential recommendation problem, intending to predict the next items that the user might interact with. Recent works usually give an overall embedding from a user's behavior sequence. However, a unified user embedding cannot reflect the user's multiple interests during a period. In this paper, we propose a novel controllable multi-interest framework for sequential recommendation, called ComiRec. Our multi-interest module captures multiple interests from user behavior sequences, which can be exploited for retrieving candidate items from the large-scale item pool. These items are then fed into an aggregation module to obtain the overall recommendation. The aggregation module leverages a controllable factor to balance the recommendation accuracy and diversity. We conduct experiments for sequential recommendation on two real-world datasets, Amazon and Taobao. Experimental results demonstrate that our framework achieves significant improvements over state-of-the-art models. Our framework has also been successfully deployed on the offline Alibaba distributed cloud platform.
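One way to picture an aggregation step with a controllable accuracy/diversity factor is a greedy selection in which each candidate's relevance score is boosted by the number of already-selected items from a different category. The sketch below is only an assumed reading of such a module: the function name `greedy_aggregate`, the category-based diversity bonus, and the parameter `lam` are hypothetical and are not taken from the abstract.

```python
def greedy_aggregate(candidates, k, lam):
    """Greedily pick up to k items, trading relevance against diversity.

    candidates: list of (item_id, relevance_score, category) tuples.
    lam: controllable factor; 0 means pure relevance, larger values
         reward items whose category differs from those already chosen.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def gain(candidate):
            _, score, category = candidate
            # Count already-selected items from a different category.
            diversity = sum(1 for _, _, other in selected if other != category)
            return score + lam * diversity

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return [item_id for item_id, _, _ in selected]


if __name__ == "__main__":
    pool = [("a", 0.9, "shoes"), ("b", 0.8, "shoes"), ("c", 0.5, "books")]
    print(greedy_aggregate(pool, k=2, lam=0.0))  # relevance only
    print(greedy_aggregate(pool, k=2, lam=1.0))  # favors a second category
```

Setting `lam` to zero reduces the procedure to plain top-k by score, which makes the effect of the factor easy to inspect in isolation.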

Managing Diversity in Airbnb Search

One of the long-standing questions in search systems is the role of diversity in results. From a product perspective, showing diverse results provides the user with more choice and should lead to an improved experience. However, this intuition is at odds with common machine learning approaches to ranking which directly optimize the relevance of each individual item without a holistic view of the result set. In this paper, we describe our journey in tackling the problem of diversity for Airbnb search, starting from heuristic based approaches and concluding with a novel deep learning solution that produces an embedding of the entire query context by leveraging Recurrent Neural Networks (RNNs). We hope our lessons learned will prove useful to others and motivate further research in this area.

Molecular Inverse-Design Platform for Material Industries

The discovery of new materials has been the essential force which brings a discontinuous improvement to industrial products' performance. However, the extra-vast combinatorial design space of material structures exceeds human experts' capability to explore it fully, thereby hampering material development. In this paper, we present a material industry-oriented web platform of an AI-driven molecular inverse-design system, which automatically designs brand new molecular structures rapidly and diversely. Different from existing inverse-design solutions, in this system, the combination of substructure-based feature encoding and molecular graph generation algorithms allows a user to gain a high-speed, interpretable, and customizable design process. Also, a hierarchical data structure and user-oriented UI provide a flexible and intuitive workflow. The system is deployed on IBM's and our client's cloud servers and has been used by 5 partner companies. To illustrate actual industrial use cases, we exhibit inverse-design of sugar and dye molecules, which was carried out by experimental chemists in those client companies. Compared to a general human chemist's standard performance, the molecular design speed was accelerated more than 10 times, and greatly increased variety was observed in the inverse-designed molecules without loss of chemical realism.

Learning to Score Economic Development from Satellite Imagery

Reliable and timely measurements of economic activities are fundamental for understanding economic development and designing government policies. However, many developing countries still lack reliable data. In this paper, we introduce a novel approach for measuring economic development from high-resolution satellite images in the absence of ground truth statistics. Our method consists of three steps. First, we run a clustering algorithm on satellite images that distinguishes artifacts from nature (siCluster). Second, we generate a partial order graph of the identified clusters based on the level of economic development, either by human guidance or by low-resolution statistics (siPog). Third, we use a CNN-based sorter that assigns differentiable scores to each satellite grid based on the relative ranks of clusters (siScore). The novelty of our method is that we break down a computationally hard problem into sub-tasks, which involves a human-in-the-loop solution. With the combination of unsupervised learning and the partial orders of dozens of urban vs. rural clusters, our method can estimate the economic development scores of over 10,000 satellite grids consistently with other baseline development proxies (Spearman correlation of 0.851). This efficient method is interpretable and robust; we demonstrate how to apply our method to both developed (e.g., South Korea) and developing economies (e.g., Vietnam and Malawi).
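For readers unfamiliar with the evaluation metric quoted above, the following is a minimal pure-Python sketch of the Spearman rank correlation, computed as the Pearson correlation of average ranks. The helper names and the sample values are illustrative assumptions; the paper's own evaluation code is not described in the abstract.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    predicted_scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical grid scores
    reference_proxy = [10.0, 30.0, 45.0, 90.0]  # hypothetical baseline proxy
    print(spearman(predicted_scores, reference_proxy))
```

Because only ranks enter the computation, the metric rewards getting the ordering of grids right rather than matching the scale of the reference proxy.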

A Request-level Guaranteed Delivery Advertising Planning: Forecasting and Allocation

The guaranteed delivery model is widely used in online advertising. The publisher sells impressions in advance by promising to serve each advertiser an agreed-upon number of target impressions that satisfy specific attribute requirements over a fixed time period. Previous efforts usually model the service as a crowd-level or user-level supply allocation problem and focus on searching for the optimal allocation for online serving, assuming that forecasts of supply are available and contracts are already signed. Existing techniques are not sufficient to meet the needs of today's industry trends: 1) advertisers pursue more precise targeting, which requires not only user-level attributes but also request-level attributes; 2) users prefer more friendly ad serving, which imposes more diverse serving constraints; 3) the bottleneck of the publisher's revenue growth lies in not only the ad serving, but also the forecast accuracy and sales strategy. These issues are non-trivial to address, since the scale of the request-level model is orders of magnitude larger than that of the crowd-level or user-level models. Facing these challenges, we present a holistic design of a request-level guaranteed delivery advertising planning system with careful optimization for all three critical components: impression forecasting, selling and serving. Our system has been deployed in the Tencent online guaranteed delivery advertising system serving billion-level users for nearly one year. Evaluations on large-scale real data and the performance of the deployed system both demonstrate that our design can significantly increase the request-level impression forecast accuracy and delivery rate.

Two Sides of the Same Coin: White-box and Black-box Attacks for Transfer Learning

Transfer learning has become a common practice for training deep learning models with limited labeled data in a target domain. On the other hand, deep models are vulnerable to adversarial attacks. Though transfer learning has been widely applied, its effect on model robustness is unclear. To investigate this question, we conduct extensive empirical evaluations and show that fine-tuning effectively enhances model robustness under white-box FGSM attacks. We also propose a black-box attack method for transfer learning models which attacks the target model with the adversarial examples produced by its source model. To systematically measure the effect of both white-box and black-box attacks, we propose a new metric to evaluate how transferable the adversarial examples produced by a source model are to a target model. Empirical results show that the adversarial examples are more transferable when fine-tuning is used than they are when the two networks are trained independently.
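The white-box FGSM attack mentioned above perturbs an input by a small step in the direction of the sign of the loss gradient with respect to that input. The sketch below applies this to a tiny logistic-regression classifier so that the gradient can be written in closed form; the model, weights, and step size are illustrative assumptions and have nothing to do with the networks evaluated in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss(w, b, x, y):
    """Binary cross-entropy for a single example with label y in {0, 1}."""
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Fast Gradient Sign Method: x_adv = x + eps * sign(dLoss/dx).

    For logistic regression with cross-entropy loss, dLoss/dx = (p - y) * w.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


if __name__ == "__main__":
    w, b = [2.0, -1.0], 0.0   # hypothetical model parameters
    x, y = [1.0, 1.0], 1      # a correctly classified example
    x_adv = fgsm(w, b, x, y, eps=0.5)
    print(x_adv, predict(w, b, x), predict(w, b, x_adv))
```

A black-box transfer attack in the spirit of the abstract would compute `x_adv` with a source model's parameters and then evaluate it against a different target model.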

Learning to Generate Personalized Query Auto-Completions via a Multi-View Multi-Task Attentive Approach

In this paper, we study the task of Query Auto-Completion (QAC), which is a very significant feature of modern search engines. In real industrial applications, there are two major problems with QAC: weak personalization and unseen queries. To address these problems, we propose M2A, a multi-view multi-task attentive framework to learn personalized query auto-completion models. We propose a new Transformer-based hierarchical encoder to model different kinds of sequential behaviors, which can be seen as multiple distinct views of the user's searching history, and then a prefix-to-history attention mechanism is used to select the most relevant information to compose the final intention representation. To learn more informative representations, we propose to incorporate multi-task learning into the model training. Two different kinds of supervisory information provided by query logs are utilized at the same time by jointly training a CTR prediction model and a query generation model.

To bridge the gap between the setting of research work and the real scenario, we release a new large-scale query log dataset, TaobaoQAC, which contains rich real prefix-to-query click behaviors. We conduct experiments on TaobaoQAC to demonstrate the effectiveness of our approach, and results show that M2A achieves superior performance compared with several strong baselines in both candidate ranking and query generation. We also conduct online A/B testing, and our approach has been deployed online.

A Sleeping, Recovering Bandit Algorithm for Optimizing Recurring Notifications

Many online and mobile applications rely on daily emails and push notifications to increase and maintain user engagement. The multi-armed bandit approach provides a useful framework for optimizing the content of these notifications, but a number of complications (such as novelty effects and conditional eligibility) make conventional bandit algorithms unsuitable in practice. In this paper, we introduce the Recovering Difference Softmax Algorithm to address the particular challenges of this problem domain, and use it to successfully optimize millions of daily reminders for the online language-learning app Duolingo. This led to a 0.5% increase in total daily active users (DAUs) and a 2% increase in new user retention over a strong baseline. We provide technical details of its design and deployment, and demonstrate its efficacy through both offline and online evaluation experiments.
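To make the two complications concrete, the sketch below shows one plausible way a softmax bandit could handle them: arms that are not eligible are simply excluded ("sleeping"), and an arm that was shown recently has its score temporarily reduced by a penalty that decays with a half-life ("recovering"). This is an assumed illustration only; the function name, the penalty form, and all parameter values are hypothetical and should not be read as the algorithm's actual definition.

```python
import math


def selection_probabilities(scores, eligible, days_since_used,
                            temperature=1.0, penalty=1.0, half_life=7.0):
    """Softmax selection probabilities over the eligible arms.

    scores: dict arm -> estimated score (higher is better).
    eligible: non-empty list of arms that may be shown right now.
    days_since_used: dict arm -> days since this user last saw the arm;
        arms absent from the dict are treated as never shown.
    """
    adjusted = {}
    for arm in eligible:
        days = days_since_used.get(arm)
        # Recency penalty halves every `half_life` days (assumed form).
        decay = 0.0 if days is None else penalty * 0.5 ** (days / half_life)
        adjusted[arm] = scores[arm] - decay

    # Subtract the maximum before exponentiating for numerical stability.
    top = max(adjusted.values())
    exps = {arm: math.exp((s - top) / temperature) for arm, s in adjusted.items()}
    total = sum(exps.values())
    return {arm: e / total for arm, e in exps.items()}


if __name__ == "__main__":
    scores = {"streak": 0.2, "encourage": 0.2, "friends": 0.9}
    probs = selection_probabilities(
        scores,
        eligible=["streak", "encourage"],  # "friends" is not eligible today
        days_since_used={"streak": 0},     # "streak" was sent today
    )
    print(probs)
```

An arm would then be drawn at random according to these probabilities, for example with `random.choices`.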

Multi-objective Optimization for Guaranteed Delivery in Video Service Platform

Guaranteed-Delivery (GD) is one of the important display strategies for IP videos on video service platforms. Different from the traditional recommendation strategy, GD requires the delivery system to guarantee the exposure amount (also called impressions in some works) for the content, where the amount generally comes from the purchase contract or business consideration of the platform. In this paper, we study the problem of how to maximize certain gains, such as video view (VV) or fairness of different contents (CTR variations between contents) under the GD constraints. We formulate such a problem as a constrained nonlinear programming problem, in which the objectives are to maximize the total VVs of contents and the exposure fairness between contents. In order to capture the trends of VV versus the impression number (page views, PV) for each video content, we propose a parameterized ordinary differential equation (ODE) model, and the parameters of the ODE are fitted by the video's historical PV and CLICK data. To solve the constrained nonlinear programming problem, we use a genetic algorithm (GA) with a specific design of coding scheme considering the ODE constraints. The empirical study based on real-world data and an online test on Youku.com verifies the effectiveness and superiority of our approach compared with the state of the art in industry practice.

Delivery Scope: A New Way of Restaurant Retrieval for On-demand Food Delivery Service

Recently, on-demand food delivery service has become very popular in China. More than 30 million orders are placed by eaters of Meituan-Dianping every day. Delicacies are delivered to eaters in 30 minutes on average. To fully leverage the ability of our couriers and restaurants, delivery scope is proposed as an infrastructure product for the on-demand food delivery area. A delivery scope based retrieval system is designed and built on our platform. In order to draw suitable delivery scopes for millions of restaurant partners, we propose a pioneering delivery scope generation framework. In our framework, a single delivery scope generation algorithm is proposed by using spatial computational techniques and data mining techniques. Moreover, a scope scoring algorithm and decision algorithm are proposed by utilizing machine learning models and combinatorial optimization techniques. Specifically, we propose a novel delivery scope sample generation method and use the scope related features to estimate order numbers and average delivery time in a period of time for each delivery scope. Then we formalize the candidate scope selection process as a binary integer programming problem. Both a branch-and-bound algorithm and a heuristic search algorithm are integrated in our system. Results of online experiments show that scopes generated by our new algorithm significantly outperform manually generated ones. Our algorithm brings more orders without hurting users' experience. After being deployed online, our system has saved thousands of hours for operation staff, and it is considered to be one of the most useful operation tools to balance demand of eaters and supply of restaurants and couriers.

Fraud Transactions Detection via Behavior Tree with Local Intention Calibration

Fraud transactions illegally obtain the rights and interests of e-commerce platforms, and have become an emerging threat to the healthy development of these platforms. Recently, user behavioral data is extensively exploited to detect fraud transactions, and it is usually processed as a sequence consisting of individual actions. However, such sequence-like user behaviors have logical patterns associated with user intentions, which motivates a fine-grained management strategy that binds and cuts off these actions into intention-related segments. In this paper, we devise a tree-like structure named behavior tree to reorganize the user behavioral data, in which a group of successive sequential actions denoting a specific user intention are represented as a branch on the tree. We then propose a novel neural method coined LIC Tree-LSTM (Local Intention Calibrated Tree-LSTM) to utilize the behavior tree for fraud transaction detection. In our LIC Tree-LSTM, the global user intention is captured by an attentional method applied on different branches. Then, we calibrate the entire tree by attentions within tree branches to pinpoint the balance between global and local user intentions. We investigate the effectiveness of LIC Tree-LSTM on a real-world dataset from the Alibaba platform, and the experimental results show that our proposed algorithm outperforms state-of-the-art methods in both offline and online modes. Furthermore, our model provides good interpretability, which helps us better understand user behaviors.

Balanced Order Batching with Task-Oriented Graph Clustering

The balanced order batching problem (BOBP) arises from the process of warehouse picking in Cainiao, the largest logistics platform in China. Batching orders together in the picking process to form a single picking route reduces travel distance. The reason for its importance is that order picking is a labor intensive process and, by using good batching methods, substantial savings can be obtained. The BOBP is an NP-hard combinatorial optimization problem and designing a good problem-specific heuristic under the quasi-real-time system response requirement is non-trivial. In this paper, rather than designing heuristics, we propose an end-to-end learning and optimization framework named Balanced Task-orientated Graph Clustering Network (BTOGCN) to solve the BOBP by reducing it to a balanced graph clustering optimization problem. In BTOGCN, a task-oriented estimator network is introduced to guide the type-aware heterogeneous graph clustering networks to find a better clustering result related to the BOBP objective. Through comprehensive experiments on single-graph and multi-graph settings, we show: 1) our balanced task-oriented graph clustering network can directly utilize the guidance of the target signal and outperforms the two-stage deep embedding and deep clustering method; 2) our method achieves average picking distance reductions of 4.57m and 0.13m over the expert-designed algorithm on the single-graph and multi-graph sets, respectively, and generalizes well to practical scenarios.

Efficiently Solving the Practical Vehicle Routing Problem: A Novel Joint Learning Approach

Our model is based on the graph convolutional network (GCN), with node features (coordinates and demand) and edge features (the real distance between nodes) taken as input and embedded. Separate decoders are proposed to decode the representations of these two features. The output of one decoder is the supervision of the other decoder. We propose a strategy that combines reinforcement learning with supervised learning to train the model. Through comprehensive experiments on real-world data, we show that 1) it is important to explicitly consider the edge feature in the model; 2) the joint learning strategy can accelerate the convergence of the training and improve the solution quality; 3) our model significantly outperforms several well-known algorithms in the literature, especially when the problem size is large; 4) our method generalizes beyond the size of the problem instances it was trained on.

Meta-Learning for Query Conceptualization at Web Scale

Concepts naturally constitute an abstraction for fine-grained entities and knowledge in the open domain. They enable search engines and recommendation systems to enhance user experience by discovering high-level abstraction of a search query and the user intent behind it. In this paper, we study the problem of query conceptualization, which is to find the most appropriate matching concepts for any given search query from a large pool of pre-defined concepts. We propose a coarse-to-fine approach to first reduce the search space for each query through a shortlisting scheme and then identify the matching concepts using pre-trained language models, which are meta-tuned to our query-concept matching task. Our shortlisting scheme involves using a GRU-based Relevant Words Generator (RWG) to first expand and complete the context of the given query and then shortlisting the candidate concepts through a scoring mechanism based on word overlaps. To accurately identify the most appropriate matching concepts for a query, even when the concepts may have zero verbatim overlaps with the query, we meta-fine-tune a BERT pairwise text-matching model under the Reptile meta-learning algorithm, which achieves zero-shot transfer learning on the conceptualization problem. Our two-stage framework can be trained with data completely derived from a search click graph, without requiring any human labelling efforts. For evaluation, we have constructed a large click graph based on more than 7 million instances of the click history recorded in Tencent QQ browser and performed the query conceptualization task based on a large ontology with 159,148 unique concepts. Results from a range of evaluation methods, including an offline evaluation procedure on the click graph, human evaluation, online A/B testing and case studies, have demonstrated the superiority of our approach over a number of competitive pre-trained language models and fine-tuned neural network baselines.

Hybrid Spatio-Temporal Graph Convolutional Network: Improving Traffic Prediction with Navigation Data

Traffic forecasting has recently attracted increasing interest due to the popularity of online navigation services, ridesharing and smart city projects. Owing to the non-stationary nature of road traffic, forecasting accuracy is fundamentally limited by the lack of contextual information. To address this issue, we propose the Hybrid Spatio-Temporal Graph Convolutional Network (H-STGCN), which is able to "deduce" future travel time by exploiting the data of upcoming traffic volume. Specifically, we propose an algorithm to acquire the upcoming traffic volume from an online navigation engine. Taking advantage of the piecewise-linear flow-density relationship, a novel transformer structure converts the upcoming volume into its equivalent in travel time. We combine this signal with the commonly-utilized travel-time signal, and then apply graph convolution to capture the spatial dependency. Particularly, we construct a compound adjacency matrix which reflects the innate traffic proximity. We conduct extensive experiments on real-world datasets. The results show that H-STGCN remarkably outperforms state-of-the-art methods in various metrics, especially for the prediction of non-recurring congestion.

Multitask Mixture of Sequential Experts for User Activity Streams

It is often desirable to model multiple objectives in real-world web applications, such as user satisfaction and user engagement in recommender systems. Multi-task learning has become the standard approach for such applications recently.

While most of the multi-task recommendation model architectures proposed to date focus on using non-sequential input features (e.g., query and context), input data is often sequential in real-world web application scenarios. For example, user behavior streams, such as user search logs in search systems, are naturally a temporal sequence. Modeling user sequential behaviors as explicit sequential representations can empower the multi-task model to incorporate temporal dependencies, thus predicting future user behavior more accurately. Furthermore, user activity streams can come from heterogeneous data sources, such as user search logs and user browsing logs. They typically possess very different properties such as data sparsity and thus need careful treatment when being modeled jointly.

In this work, we study the challenging problem of how to model sequential user behavior in neural multi-task learning settings. Our major contribution is a novel framework, Mixture of Sequential Experts (MoSE). It explicitly models sequential user behavior using Long Short-Term Memory (LSTM) in the state-of-the-art Multi-gate Mixture-of-Expert multi-task modeling framework. In experiments, we show the effectiveness of the MoSE architecture over seven alternative architectures on both synthetic and noisy real-world user data in G Suite. We also demonstrate the effectiveness and flexibility of the MoSE architecture in a real-world decision making engine in Gmail that involves millions of users, balancing between search quality and resource costs.

Identifying Homeless Youth At-Risk of Substance Use Disorder: Data-Driven Insights for Policymakers

Substance Use Disorder (SUD) is a devastating disease that leads to significant mental and behavioral impairments. Its negative effects damage the homeless youth population more severely (as compared to stably housed counterparts) because of their high-risk behaviors. To assist policymakers in devising effective and accurate long-term strategies to mitigate SUD, it is necessary to critically analyze environmental, psychological, and other factors associated with SUD among homeless youth. Unfortunately, there is no definitive data-driven study on analyzing factors associated with SUD among homeless youth. While there have been a few prior studies in the past, they (i) do not analyze variation in the associated factors for SUD with geographical heterogeneity in their studies; and (ii) only consider a few contributing factors to SUD in relatively small samples. This work aims to fill this gap by making the following three contributions: (i) we use a real-world dataset collected from ~1,400 homeless youth (across six American states) to build accurate Machine Learning (ML) models for predicting the susceptibility of homeless youth to SUD; (ii) we find a representative set of factors associated with SUD among this population by analyzing feature importance values associated with our ML models; and (iii) we investigate the effect of geographical heterogeneity on the factors associated with SUD. Our results show that our system using adaptively boosted decision trees achieves the best predictive accuracy out of several algorithms on the SUD prediction task, achieving an Area Under the ROC Curve of 0.85. Further, among other things, we also find that both Post-Traumatic Stress Disorder (PTSD) and depression are very strongly associated with SUD among homeless youth because of their propensity to self-medicate to alleviate stress. This work is done in collaboration with social work scientists, who are currently evaluating the results for potential future deployment.

Interleaved Sequence RNNs for Fraud Detection

Payment card fraud causes multibillion dollar losses for banks and merchants worldwide, often fueling complex criminal activities. To address this, many real-time fraud detection systems use tree-based models, demanding complex feature engineering systems to efficiently enrich transactions with historical data while complying with millisecond-level latencies. In this work, we do not require those expensive features by using recurrent neural networks and treating payments as an interleaved sequence, where the history of each card is an unbounded, irregular sub-sequence. We present a complete RNN framework to detect fraud in real-time, proposing an efficient ML pipeline from preprocessing to deployment. We show that these feature-free, multi-sequence RNNs outperform state-of-the-art models, saving millions of dollars in fraud detection and using fewer computational resources.

Attention based Multi-Modal New Product Sales Time-series Forecasting

Trend-driven retail industries, such as fashion, launch a substantial number of new products every season. In such a scenario, an accurate demand forecast for these newly launched products is vital for efficient downstream supply chain planning like assortment planning and stock allocation. While classical time-series forecasting algorithms can be used for existing products to forecast the sales, new products do not have any historical time-series data to base the forecast on. In this paper, we propose and empirically evaluate several novel attention-based multi-modal encoder-decoder models to forecast the sales for a new product purely based on product images, any available product attributes and also external factors like holidays, events, weather, and discount. We experimentally validate our approaches on a large fashion dataset and report the improvements in achieved accuracy and enhanced model interpretability as compared to existing k-nearest neighbor based baseline approaches.

Pest Management In Cotton Farms: An AI-System Case Study from the Global South

Nearly 100 million families across the world rely on cotton farming for their livelihood. Cotton is particularly vulnerable to pest attacks, leading to overuse of pesticides, lost income for farmers, and in some cases farmer suicides. We address this problem by presenting a new solution for pesticide management that uses deep learning, smartphone cameras, inexpensive pest traps, existing digital pipelines, and agricultural extension-worker programs. Although generic, the platform is specifically designed to assist smallholder farmers in the developing world. In addition to outlining the solution, we consider the set of unique constraints this context places on it: data diversity, annotation challenges, shortcomings with traditional evaluation metrics, computing on low-resource devices, and deployment through intermediaries. This paper summarizes key lessons learned while developing and deploying the proposed solution. Such lessons may be applicable to other teams interested in building AI solutions for global development.

TIES: Temporal Interaction Embeddings for Enhancing Social Media Integrity at Facebook

Since its inception, Facebook has become an integral part of the online social community. People rely on Facebook to connect with others and build communities. As a result, it is paramount to protect the integrity of such a large network in a fast and scalable manner. In this paper, we present our efforts to protect various social media entities at Facebook from people who try to abuse our platform. We present a novel Temporal Interaction EmbeddingS (TIES) model that is designed to capture rogue social interactions and flag them for further suitable actions. TIES is a supervised, deep learning, production ready model at Facebook-scale networks. Prior works on integrity problems are mostly focused on capturing either only static or certain dynamic features of social entities. In contrast, TIES can capture both these variant behaviors in a unified model owing to the recent strides made in the domains of graph embedding and deep sequential pattern learning. To show the real-world impact of TIES, we present a few applications especially for preventing spread of misinformation, fake account detection, and reducing ads payment risks in order to enhance Facebook platform's integrity.

Price Investment using Prescriptive Analytics and Optimization in Retail

As the world's largest retailer, Walmart's core mission is to save people money so they can live better. We call the strategy we use to accomplish this goal our Every Day Low Price strategy. By keeping operational expenses as low as possible, we can continually apply a downward pressure on our prices, in turn increasing the amount of traffic, and ultimately, sales within our stores. In this paper, we apply Machine Learning (ML) algorithms and Operations Research techniques for forecasting and optimization to build a new price recommendation system, which improves our ability to generate price recommendations accurately and automatically. Comprised of a demand forecasting step, two optimizations, and causal inference analysis, our system was evaluated in the form of forecast backtests and live pricing experiments, both of which suggested that our approach was more effective than the current rule-based pricing system.

Climate Downscaling Using YNet: A Deep Convolutional Network with Skip Connections and Fusion

Climate change is one of the major challenges facing humanity in our time. It brings many unexpected disasters that cause drastic losses of life and property. To better understand climate change, scientists have developed various Global Climate Models (GCMs) to simulate the global climate and make projections of future climate values. These global climate models have coarse grids (i.e., low resolution in both space and time) due to limitations of computing power and simulation time. Although they are helpful in predicting large-scale, long-term trends in climate, they are too coarse for impact analysis at smaller scales, such as the regional or local scale. Climate conditions at the regional or local scale, however, are very important for decisions about infrastructure, transportation, and evacuation, which depend heavily on small-scale climate conditions. In this paper, we propose YNet, a novel deep convolutional neural network (CNN) with skip connections and fusion capabilities that performs downscaling of climate variables on multiple GCMs directly rather than on reanalysis data. We analyze and compare our proposed method with four other methods on datasets of three climate variables: mean precipitation and two extreme values (maximum temperature and minimum temperature). The results show the effectiveness of the proposed method.

Cracking the Black Box: Distilling Deep Sports Analytics

This paper addresses the trade-off between Accuracy and Transparency for deep learning applied to sports analytics. Neural nets achieve great predictive accuracy through deep learning, and are popular in sports analytics. But it is hard to interpret a neural net model and harder still to extract actionable insights from the knowledge implicit in it. Therefore, we built a simple and transparent model that mimics the output of the original deep learning model and represents the learned knowledge in an explicit interpretable way. Our mimic model is a linear model tree, which combines a collection of linear models with a regression-tree structure. The tree version of a neural network achieves high fidelity, explains itself, and produces insights for expert stakeholders such as athletes and coaches. We propose and compare several scalable model tree learning heuristics to address the computational challenge from datasets with millions of data points.

Taming Pretrained Transformers for Extreme Multi-label Text Classification

We consider the extreme multi-label text classification (XMC) problem: given an input text, return the most relevant labels from a large label collection. For example, the input text could be a product description on Amazon.com and the labels could be product categories. XMC is an important yet challenging problem in the NLP community. Recently, deep pretrained transformer models have achieved state-of-the-art performance on many NLP tasks including sentence classification, albeit with small label sets. However, naively applying deep transformer models to the XMC problem leads to sub-optimal performance due to the large output space and the label sparsity issue. In this paper, we propose X-Transformer, the first scalable approach to fine-tuning deep transformer models for the XMC problem. The proposed method achieves new state-of-the-art results on four XMC benchmark datasets. In particular, on a Wiki dataset with around 0.5 million labels, the prec@1 of X-Transformer is 77.28%, a substantial improvement over state-of-the-art XMC approaches Parabel (linear) and AttentionXML (neural), which achieve 68.70% and 76.95% prec@1, respectively. We further apply X-Transformer to a product2query dataset from Amazon and obtain a 10.7% relative improvement in prec@1 over Parabel.
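
The prec@1 figures quoted above are instances of precision at k, a standard ranking metric. As a minimal sketch (the function names and data layout below are illustrative assumptions, not taken from the paper), it can be computed as follows:

```python
def precision_at_k(ranked_labels, relevant_labels, k=1):
    """Fraction of the top-k predicted labels that are truly relevant.

    ranked_labels: predicted labels, best first (hypothetical layout).
    relevant_labels: iterable of ground-truth labels for the same input.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_labels)
    hits = sum(1 for label in ranked_labels[:k] if label in relevant)
    return hits / k


def mean_precision_at_k(all_ranked, all_relevant, k=1):
    """Average precision@k over a collection of inputs."""
    scores = [precision_at_k(r, t, k) for r, t in zip(all_ranked, all_relevant)]
    return sum(scores) / len(scores)
```

With k=1 the metric reduces to the fraction of inputs whose top-ranked label is correct, which is why it remains meaningful even when the label collection is very large.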

Prediction of Hourly Earnings and Completion Time on a Crowdsourcing Platform

We study the problem of predicting future hourly earnings and task completion time for a crowdsourcing platform user who sees the list of available tasks and wants to select one of them to execute. Namely, for each task shown in the list, one needs an estimate of the performance (i.e., hourly earnings and completion time) the user would achieve if she selected this task. We address this problem on real crowd tasks completed on one of the global crowdsourcing marketplaces through (1) a survey and an A/B test on real users, whose results confirm the dominance of monetary incentives and the importance to users of knowing their hourly earnings; (2) an in-depth analysis of user behavior showing that the prediction problem is challenging: (a) users and projects are highly heterogeneous, and (b) there exists a so-called "learning effect" when a user selects a new task; and (3) a solution to the problem of predicting user performance that improves prediction quality by up to 25% for hourly earnings and up to 32% for completion time w.r.t. a naive baseline based solely on the historical performance of users on tasks. In our experiments, we use data on 18 million real crowdsourcing tasks performed by 161 thousand users on the crowd platform; we publish this dataset. The hourly earnings prediction has been deployed in Yandex.Toloka.

SimClusters: Community-Based Representations for Heterogeneous Recommendations at Twitter

Personalized recommendation products at Twitter target a multitude of heterogeneous items: Tweets, Events, Topics, Hashtags, and users. Each of these targets varies in its cardinality (which affects the scale of the problem) and its "shelf life" (which constrains the latency of generating the recommendations). Although Twitter has built a variety of recommendation systems over the past decade, the broader problem has mostly been tackled piecemeal. In this paper, we present SimClusters, a general-purpose representation layer based on overlapping communities into which users as well as heterogeneous content can be captured as sparse, interpretable vectors to support a multitude of recommendation tasks. We propose a novel algorithm for community discovery based on Metropolis-Hastings sampling, which is both more accurate and significantly faster than off-the-shelf alternatives. SimClusters scales to networks with billions of users and has been effective across a variety of deployed applications at Twitter.

Time-Aware User Embeddings as a Service

Digital media companies typically collect rich data in the form of sequences of online user activities. Such data is used in various applications, involving tasks ranging from click or conversion prediction to recommendation or user segmentation. Nonetheless, each application depends upon specialized feature engineering that requires a lot of effort and typically disregards the time-varying nature of the online user behavior. Learning time-preserving vector representations of users (user embeddings), irrespective of a specific task, would save redundant effort and potentially lead to higher embedding quality. To that end, we address the limitations of the current state-of-the-art self-supervised methods for task-independent (unsupervised) sequence embedding, and propose a novel Time-Aware Sequential Autoencoder (TASA) that accounts for the temporal aspects of sequences of activities. The generated embeddings are intended to be readily accessible for many problem formulations and seamlessly applicable to desired tasks, thus sidestepping the burden of task-driven feature engineering. The proposed TASA shows improvements over alternative self-supervised models in terms of sequence reconstruction. Moreover, the embeddings generated by TASA yield increases in predictive performance on both proprietary and public data. It also achieves comparable results to supervised approaches that are trained on individual tasks separately and require substantially more computational effort. TASA has been incorporated within a pipeline designed to provide time-aware user embeddings as a service, and the use of its embeddings exhibited lifts in conversion prediction AUC on four audiences.

Shop The Look: Building a Large Scale Visual Shopping System at Pinterest

As online content becomes ever more visual, the demand for searching by visual queries grows correspondingly stronger. Shop The Look is an online shopping discovery service at Pinterest, leveraging visual search to enable users to find and buy products within an image. In this work, we provide a holistic view of how we built Shop The Look, a shopping oriented visual search system, along with lessons learned from addressing shopping needs. We discuss topics including core technology across object detection and visual embeddings, serving infrastructure for realtime inference, and data labeling methodology for training/evaluation data collection and human evaluation. The user-facing impacts of our system design choices are measured through offline evaluations, human relevance judgements, and online A/B experiments. The collective improvements amount to cumulative relative gains of over 160% in end-to-end human relevance judgements and over 80% in engagement. Shop The Look is deployed in production at Pinterest.

Dynamic Heterogeneous Graph Neural Network for Real-time Event Prediction

Customer response prediction is critical in many industrial applications such as online advertising and recommendations. The challenge is greater for ride-hailing platforms such as Uber and DiDi, because the response prediction models need to consider historical and real-time event information in the physical environment, such as surrounding traffic and supply and demand conditions. In this paper, we propose to use a dynamically constructed heterogeneous graph for each ongoing event to encode the attributes of the event and its surroundings. In addition, we propose a multi-layer graph neural network model to learn the impact of historical actions and the surrounding environment on the current events, and to generate an effective event representation that improves the accuracy of the response model. We apply this framework to two practical applications on the DiDi platform. Offline and online experiments show that the framework can significantly improve prediction performance. The framework has been deployed in the online production environment and serves tens of millions of event prediction requests every day.

Bandit based Optimization of Multiple Objectives on a Music Streaming Platform

Recommender systems powering online multi-stakeholder platforms often face the challenge of jointly optimizing multiple objectives in an attempt to efficiently match suppliers and consumers. Examples of such objectives include user behavioral metrics (e.g., clicks, streams, dwell time), supplier exposure objectives (e.g., diversity), and platform-centric objectives (e.g., promotions). Jointly optimizing multiple metrics in online recommender systems remains a challenging task. Recent work has demonstrated the prowess of contextual bandits in powering recommendation systems that serve recommendations of interest to users. This paper aims to extend contextual bandits to the multi-objective setting so as to power recommendations on multi-stakeholder platforms.

Specifically, in a contextual bandit setting, we learn a recommendation policy that can optimize multiple objectives simultaneously in a fair way. This multi-objective online optimization problem is formalized by using the Generalized Gini index (GGI) aggregation function, which combines and balances multiple objectives together. We propose an online gradient ascent learning algorithm to maximise the long-term vectorial rewards for different objectives scalarised using the GGI function. Through extensive experiments on simulated data and large scale music recommendation data from Spotify, a streaming platform, we show that the proposed algorithm learns a superior policy among the disparate objectives compared with other state-of-the-art approaches.
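
To make the aggregation step concrete, the following is a minimal sketch of one common form of the Generalized Gini index for rewards: the objective values are sorted in increasing order and combined with non-increasing positive weights, so the worst-performing objective counts the most. The default weights 1, 1/2, 1/4, ... and the function name are assumptions chosen for illustration, not details confirmed by the abstract.

```python
def ggi(rewards, weights=None):
    """Generalized Gini index of a reward vector (illustrative sketch).

    rewards: one value per objective, higher is better.
    weights: non-increasing positive weights; defaults to 1, 1/2, 1/4, ...
             (an assumed choice for this example).
    """
    d = len(rewards)
    if weights is None:
        weights = [1.0 / (2 ** i) for i in range(d)]
    if len(weights) != d:
        raise ValueError("need exactly one weight per objective")
    if any(weights[i] < weights[i + 1] for i in range(d - 1)):
        raise ValueError("weights must be non-increasing")
    # Ascending sort pairs the largest weight with the smallest reward.
    ordered = sorted(rewards)
    return sum(w * r for w, r in zip(weights, ordered))
```

Under this form, a balanced reward vector such as (0.5, 0.5, 0.5) scores higher than an unbalanced one with the same total such as (0.9, 0.1, 0.5), which is the sense in which the aggregation trades off objectives fairly.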

Multimodal Deep Learning Based Crop Classification Using Multispectral and Multitemporal Satellite Imagery

The Food and Agriculture Organization (FAO) of the United Nations predicts that in order to meet the needs of the expected 3 billion population growth by 2050, food production has to increase by 60%. Therefore, monitoring and mapping crops accurately is essential for estimating food production during each crop growing season across the globe. Traditionally, multispectral remote sensing imagery has been widely used for mapping crops worldwide. However, single-date imagery does not capture temporal characteristics (phenology) of growing crops, leading to imprecise crop maps and food estimates. On the other hand, purely temporal classification approaches also produce inaccurate crop maps as they do not account for spatial autocorrelations. In this paper, we present a multimodal deep learning solution that jointly exploits spatial-spectral and phenological properties to identify major crop types. Using a two-stream architecture, spatial characteristics are captured via a spatial stream of very high resolution images (single date, 1m, 3 spectral bands, USDA NAIP) with a CNN, and phenological characteristics via a temporal stream of images (biweekly, 250m, MODIS NDVI) with an LSTM. Experimental results show that the proposed multimodal solution reduces prediction error by 60%.

BusTr: Predicting Bus Travel Times from Real-Time Traffic

We present BusTr, a machine-learned model for translating road traffic forecasts into predictions of bus delays, used by Google Maps to serve the majority of the world's public transit systems where no official real-time bus tracking is provided. We demonstrate that our neural sequence model improves over DeepTTE, the state-of-the-art baseline, both in performance (-30% MAPE) and training stability. We also demonstrate significant generalization gains over simpler models, evaluated on longitudinal data to cope with a constantly evolving world.

Characterizing and Learning Representation on Customer Contact Journeys in Cellular Services

Corporations spend billions of dollars annually caring for customers across multiple contact channels. A customer journey is the complete sequence of contacts that a given customer has with a company across multiple channels of communication. While each contact is important and contains rich information, studying customer journeys provides a better context to understand customers' behavior in order to improve customer satisfaction and loyalty, and to reduce care costs. However, journey sequences have a complex format due to the heterogeneity of user behavior: they are variable-length, multi-attribute, and exhibit a large cardinality in categories (e.g. contact reasons). The question of how to characterize and learn representations of customer journeys has not been studied in the literature. We propose to learn journey embeddings using a sequence-to-sequence framework that converts each customer journey into a fixed-length latent embedding. In order to improve the disentanglement and distributional properties of embeddings, the model is further modified by incorporating a Wasserstein autoencoder inspired regularization on the distribution of embeddings. Experiments conducted on an enterprise-scale dataset demonstrate the effectiveness of the proposed model and reveal significant improvements due to the regularization in both distinguishing journey pattern characteristics and predicting future customer engagement.

CrowdQuake: A Networked System of Low-Cost Sensors for Earthquake Detection via Deep Learning

Recently, low-cost acceleration sensors have been widely used to detect earthquakes due to the significant development of MEMS technologies. Fully harnessing such low-cost sensors, however, still requires a high-density network, especially for real-time earthquake detection. The design of a high-performance and scalable networked system thus becomes essential for processing the large amount of data from hundreds to thousands of sensors. An efficient and accurate earthquake-detection algorithm is also necessary to distinguish earthquake waveforms from various kinds of non-earthquake ones within this large volume of data in real time. In this paper, we present CrowdQuake, a networked system based on low-cost acceleration sensors that monitors ground motions and detects earthquakes using a convolutional-recurrent neural network model. This model ensures high detection performance while keeping false alarms at a negligible level. We also provide detailed case studies on two of the few small earthquakes that CrowdQuake has detected during its past year of operation.

An Empirical Analysis of Backward Compatibility in Machine Learning Systems

In many applications of machine learning (ML), updates are performed with the goal of enhancing model performance. However, current practices for updating models rely solely on isolated, aggregate performance analyses, overlooking important dependencies, expectations, and needs in real-world deployments. We consider how updates, intended to improve ML models, can introduce new errors that can significantly affect downstream systems and users. For example, updates in models used in cloud-based classification services, such as image recognition, can cause unexpected erroneous behavior in systems that make calls to the services. Prior work has shown the importance of "backward compatibility" for maintaining human trust. We study challenges with backward compatibility across different ML architectures and datasets, focusing on common settings including data shifts with structured noise and ML employed in inferential pipelines. Our results show that (i) compatibility issues arise even without data shift due to optimization stochasticity, (ii) training on large-scale noisy datasets often results in significant decreases in backward compatibility even when model accuracy increases, and (iii) distributions of incompatible points align with noise bias, motivating the need for compatibility aware de-noising and robustness methods.

DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services

As cloud services grow and generate high revenues, downtime in these services is becoming increasingly expensive. To reduce loss and service downtime, a critical primary step is to execute incident triage, the process of assigning a service incident to the correct responsible team, in a timely manner. An incorrect assignment risks additional incident reroutings and increases the time to mitigate by 10x. However, automated incident triage in large cloud services faces many challenges: (1) a highly imbalanced incident distribution from a large number of teams, (2) wide variety in formats of input data or data sources, (3) scaling to meet production-grade requirements, and (4) gaining engineers' trust in using machine learning recommendations. To address these challenges, we introduce DeepTriage, an intelligent incident transfer service combining multiple machine learning techniques (gradient boosted classifiers, clustering methods, and deep neural networks) in an ensemble to recommend the responsible team to triage an incident. Experimental results on real incidents in Microsoft Azure show that our service achieves an F1 score of 82.9%. For highly impacted incidents, DeepTriage achieves F1 scores ranging from 76.3% to 91.3%. We have applied best practices and state-of-the-art frameworks to scale DeepTriage to handle incident routing for all cloud services. DeepTriage has been deployed in Azure since October 2017 and is used by thousands of teams daily.

An Automatic Approach for Generating Rich, Linked Geo-Metadata from Historical Map Images

Historical maps contain detailed geographic information that is difficult to find elsewhere and covers long periods of time (e.g., 125 years for the historical topographic maps in the US). However, these maps typically exist as scanned images without searchable metadata. Existing approaches to making historical maps searchable rely on tedious manual work (including crowd-sourcing) to generate the metadata (e.g., geolocations and keywords). Optical character recognition (OCR) software could alleviate the required manual work, but the recognition results are individual words instead of location phrases (e.g., "Black" and "Mountain" vs. "Black Mountain"). This paper presents an end-to-end approach to address the real-world problem of finding and indexing historical map images. This approach automatically processes historical map images to extract their text content and generates a set of metadata that is linked to large external geospatial knowledge bases. The linked metadata in the RDF (Resource Description Framework) format support complex queries for finding and indexing historical maps, such as retrieving all historical maps covering mountain peaks higher than 1,000 meters in California. We have implemented the approach in a system called mapKurator. We have evaluated mapKurator using historical maps from several sources with various map styles, scales, and coverage. Our results show significant improvement over the state-of-the-art methods. The code has been made publicly available as modules of the Kartta Labs project at https://github.com/kartta-labs/Project.

Bootstrapping Complete The Look at Pinterest

Putting together an ideal outfit is a process that involves creativity and style intuition. This makes it a particularly difficult task to automate. Existing styling products generally involve human specialists and a highly curated set of fashion items. In this paper, we will describe how we bootstrapped the Complete The Look (CTL) system at Pinterest. This is a technology that aims to learn the subjective task of "style compatibility" in order to recommend complementary items that complete an outfit. In particular, we want to show recommendations from other categories that are compatible with an item of interest. For example, what are some heels that go well with this cocktail dress? We will introduce our outfit dataset of over 1 million outfits and 4 million objects, a subset of which we will make available to the research community, and describe the pipeline used to obtain and refresh this dataset. Furthermore, we will describe how we evaluate this subjective task and compare model performance across multiple training methods. Lastly, we will share our lessons going from experimentation to working prototype, and how to mitigate failure modes in the production environment. Our work represents one of the first examples of an industrial-scale solution for compatibility-based fashion recommendation.

Explainable Classification of Brain Networks via Contrast Subgraphs

Mining human-brain networks to discover patterns that can be used to discriminate between healthy individuals and patients affected by some neurological disorder is a fundamental task in neuroscience. Learning simple and interpretable models is as important as mere classification accuracy. In this paper we introduce a novel approach for classifying brain networks based on extracting contrast subgraphs, i.e., sets of vertices whose induced subgraphs are dense in one class of graphs and sparse in the other. We formally define the problem and present an algorithmic solution for extracting contrast subgraphs. We then apply our method to a brain-network dataset consisting of children affected by Autism Spectrum Disorder and typically developed children. Our analysis confirms the interestingness of the discovered patterns, which match background knowledge in the neuroscience literature. Further analysis on other classification tasks confirms the simplicity, soundness, and high explainability of our proposal, which also exhibits superior classification accuracy compared to more complex state-of-the-art methods.
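
As a rough illustration of what "dense in one class and sparse in the other" can mean, the sketch below scores a candidate vertex set by the difference in average edge frequency between two classes of graphs. The edge-set representation, the per-class averaging, and the optional `alpha` penalty are assumptions made for this example; the abstract does not specify the exact objective, and the search for a high-scoring set is not shown.

```python
from itertools import combinations


def summary_weights(graphs):
    """Fraction of graphs in a class that contain each undirected edge.

    graphs: list of edge collections, each edge a pair (u, v) of
            comparable vertex ids (hypothetical input format).
    """
    counts = {}
    for edges in graphs:
        # Canonicalize so (u, v) and (v, u) count once per graph.
        for key in {(min(u, v), max(u, v)) for u, v in edges}:
            counts[key] = counts.get(key, 0) + 1
    return {edge: c / len(graphs) for edge, c in counts.items()}


def contrast_score(vertices, weights_a, weights_b, alpha=0.0):
    """Sum over vertex pairs of (frequency in A) - (frequency in B) - alpha."""
    total = 0.0
    for u, v in combinations(sorted(vertices), 2):
        total += weights_a.get((u, v), 0.0) - weights_b.get((u, v), 0.0) - alpha
    return total
```

A vertex set with a large positive score is one whose induced subgraph tends to be densely connected in class A graphs and sparsely connected in class B graphs; a positive `alpha` discourages large sets that add pairs with little contrast.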

Jointly Learning to Recommend and Advertise

Online recommendation and advertising are two major income channels for online recommendation platforms (e.g., e-commerce and news feed sites). However, on most platforms the recommending and advertising strategies are optimized separately by different teams using different techniques, which may lead to suboptimal overall performance. To this end, we propose a novel two-level reinforcement learning framework to jointly optimize the recommending and advertising strategies, where the first level generates a list of recommendations to optimize user experience in the long run, and the second level inserts ads into the recommendation list in a way that balances the immediate advertising revenue from advertisers against the negative influence of ads on long-term user experience. Specifically, the first level tackles the highly combinatorial action space problem of selecting a subset of items from the large item space, while the second level handles three internally related tasks: (i) whether to insert an ad, and if so, (ii) the optimal ad and (iii) the optimal location to insert it. The experimental results based on real-world data demonstrate the effectiveness of the proposed framework. We have released the implementation code to ease reproducibility.

Fitbit for Chickens?: Time Series Data Mining Can Increase the Productivity of Poultry Farms

Chickens are the most important poultry species in the world. Globally, industrial-scale production systems account for most of the poultry meat and eggs produced. The welfare of these birds matters for both ethical and economic reasons. From an ethical perspective, poultry have a sufficient degree of awareness to suffer pain if their health is poor, or deprivation if poorly housed. From an economic viewpoint, consumers increasingly value poultry welfare, so better market access can be obtained by producers who demonstrate concern for their flocks. Recent advances in sensor technology have made it possible to record behavioral patterns in chickens, and several research groups have shown that such data can be exploited to enhance chicken welfare. However, classifying chicken behaviors poses several unique challenges which are not observed in the UCR archive or other classic benchmark collections. In particular, some behaviors are manifested in the shape of the subsequences, whereas others appear only in more abstract features. Most algorithms work well for only one such modality. In addition, our data of interest has classes that differ greatly in duration and are only weakly labeled, again defying the assumptions of the classic benchmark datasets. In this work, we propose a general-purpose framework to robustly learn and classify from datasets exhibiting these issues. While our experience is with fowl, the lessons we have learned may be more generally applicable to real-world datasets in other domains including manufacturing and human health.

CompactETA: A Fast Inference System for Travel Time Prediction

Computing estimated time of arrival (ETA) is one of the most important services for online ride-hailing platforms like DiDi and Uber. With billions of service queries per day on such platforms, a fast ETA inference module keeps the overall decision system efficient, supporting a satisfactory user experience and saving significant operating cost. In this paper, we develop a novel ETA learning system named CompactETA, which provides accurate online travel time inference within 100 microseconds. In the proposed method, we encode high-order spatial and temporal dependencies into sophisticated representations by applying a graph attention network on a spatiotemporal weighted road network graph. We further encode the sequential information of the travel route by positional encoding to avoid a recurrent network structure. The properly learnt representations enable us to apply a very simple multi-layer perceptron model for online real-time inference. Offline experiments and online A/B testing verify that CompactETA reduces the inference latency by more than 100 times compared to a state-of-the-art system, while maintaining competitive prediction accuracy.

Intelligent Exploration for User Interface Modules of Mobile App with Collective Learning

A mobile app interface usually consists of a set of user interface modules. Designing these modules properly is vital to achieving user satisfaction for a mobile app. However, there are few methods to determine design variables for user interface modules other than relying on the judgment of designers. Usually, a laborious post-processing step is necessary to verify the key change of each design variable. Therefore, only a very limited number of design solutions can be tested. Because there are many modules, finding the best design solutions is time-consuming and almost impossible. To this end, we introduce FEELER, a framework to quickly and intelligently explore design solutions of user interface modules with a collective machine learning approach. FEELER can help designers quantitatively measure the preference score of different design solutions, so that they can adjust user interface modules conveniently and quickly. We conducted extensive experimental evaluations on two real-life datasets to demonstrate its applicability in real-life cases of user interface module design in the Baidu App, which is one of the most popular mobile apps in China.

Gemini: A Novel and Universal Heterogeneous Graph Information Fusing Framework for Online Recommendations

Recently, network embedding has been successfully used in recommendation systems. Researchers have made efforts to utilize additional auxiliary information (e.g., social relations of users) to improve performance. However, such auxiliary information is not available in all recommendation scenarios, so it is difficult to apply in industrial scenarios where generality is required. Moreover, the heterogeneous nature of users and items aggravates the difficulty of network information fusion. Many works have tried to transform the user-item heterogeneous network into two homogeneous graphs (i.e., user-user and item-item) and then fuse information separately. This may limit the representation power of the learned embedding because it ignores the adjacency relationships in the original graph. In addition, the sparsity of user-item interactions is an urgent problem that needs to be solved. To solve the above problems, we propose a universal and effective framework named Gemini, which relies only on the common interaction logs, avoiding the dependence on auxiliary information and ensuring better generality. To preserve the original adjacency relationships, Gemini transforms the original user-item heterogeneous graph into two semi-homogeneous graphs from the perspective of users and items respectively. The transformed graphs consist of two types of nodes: network nodes derived from homogeneous nodes and attribute nodes derived from heterogeneous nodes. Then, the node representation is learned in a homogeneous way while also taking edge embeddings into account. At the same time, the interaction sparsity problem is alleviated to some extent because the transformed graphs contain the original second-order neighbors. For efficient training, we also propose an iterative training algorithm to reduce computational complexity. Experimental results on five datasets and online A/B tests in recommendations at DiDiChuXing show that Gemini outperforms state-of-the-art algorithms.

Hypergraph Convolutional Recurrent Neural Network

In this study, we present a hypergraph convolutional recurrent neural network (HGC-RNN), which is a prediction model for structured time-series sensor network data. Representing sensor networks in a graph structure is useful for expressing structural relationships among sensors. A conventional graph structure, however, is limited in its ability to represent complex structures found in real-world applications, such as connections shared among multiple nodes. We use a hypergraph, which is capable of modeling complicated structures, for structural representation. HGC-RNN performs a hypergraph convolution operation on the input data represented in the hypergraph to extract hidden representations of the input, while considering the structural dependency of the data. HGC-RNN employs a recurrent neural network structure to learn temporal dependency from the data sequence. We conduct experiments to forecast taxi demand in NYC, traffic flow in the overhead hoist transfer system, and gas pressure in a gas regulator. We compare the performance of our method with that of existing methods, and the results show that HGC-RNN has strengths over baseline models.
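To illustrate how a hyperedge captures a connection shared among several nodes, here is a minimal sketch of one round of node-to-hyperedge-to-node aggregation. It is a simplified, unweighted mean with no learned parameters and is not the HGC-RNN operator itself; the function name, the sensor readings, and the hyperedge layout are all hypothetical.

```python
def hypergraph_mean_aggregate(node_values, hyperedges):
    """One round of node -> hyperedge -> node mean aggregation (illustrative only)."""
    # Step 1: each hyperedge summarizes the nodes it contains.
    edge_values = [sum(node_values[n] for n in edge) / len(edge) for edge in hyperedges]
    # Step 2: each node averages the summaries of the hyperedges it belongs to.
    new_values = []
    for node in range(len(node_values)):
        incident = [edge_values[i] for i, edge in enumerate(hyperedges) if node in edge]
        # A node that belongs to no hyperedge keeps its own value.
        new_values.append(sum(incident) / len(incident) if incident else node_values[node])
    return new_values


# Hypothetical readings: sensors 0, 1, 2 share one connection; sensors 2 and 3 share another.
readings = [1.0, 2.0, 3.0, 7.0]
hyperedges = [[0, 1, 2], [2, 3]]
print(hypergraph_mean_aggregate(readings, hyperedges))
```

A single hyperedge such as `[0, 1, 2]` expresses a three-way relationship directly, which an ordinary graph would have to approximate with three pairwise edges.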

Towards Building an Intelligent Chatbot for Customer Service: Learning to Respond at the Appropriate Time

In recent years, intelligent chatbots have been widely used in the field of customer service. One of the key challenges for chatbots in maintaining fluent dialogues with customers is how to respond at the appropriate time. However, most state-of-the-art chatbots follow a turn-by-turn interaction scheme. Such chatbots respond each time a customer sends an utterance, which in some cases leads to inappropriate responses and derails the dialogue. In this paper, we propose a multi-turn response triggering model (MRTM) to address this problem. MRTM is learned from large-scale human-human dialogues between customers and agents with a self-supervised learning scheme. It leverages the semantic matching relationships between the context and the response to train a semantic matching model and obtains the weights of the co-occurring utterances in the context through an asymmetrical self-attention mechanism. The weights are then used to determine whether the given context should be responded to. We conduct extensive experiments on two dialogue datasets collected from real-world online customer service systems. Results show that MRTM outperforms the baselines by a large margin. Furthermore, we incorporate MRTM into DiDi's customer service chatbot. With the ability to identify the appropriate time to respond, the chatbot can incrementally aggregate information across multiple utterances and make more intelligent responses at the appropriate time.

Ads Allocation in Feed via Constrained Optimization

Social networks and content publishing platforms have newsfeed applications, which show both organic content to drive engagement, and ads to drive revenue. This paper focuses on the problem of ads allocation in a newsfeed to achieve an optimal balance of revenue and engagement. To the best of our knowledge, we are the first to report practical solutions to this business-critical and popular problem in industry.

The paper describes how a large-scale recommender system such as feed ranking works, and why it is useful to treat ads allocation as a post-operation once the ranking of organic items and (separately) the ranking of ads are done. A set of computationally lightweight algorithms is proposed based on various sets of assumptions in the context of ads on the LinkedIn newsfeed. The benefits of the proposed solutions are demonstrated through both offline simulation and online A/B tests. The best-performing algorithm is currently fully deployed on the LinkedIn newsfeed and is serving all live traffic.

USAD: UnSupervised Anomaly Detection on Multivariate Time Series

The automatic supervision of IT systems is a current challenge at Orange. Given the size and complexity reached by its IT operations, the number of sensors needed to obtain measurements over time, used to infer normal and abnormal behaviors, has increased dramatically, making traditional expert-based supervision methods slow or prone to errors. In this paper, we propose a fast and stable method called UnSupervised Anomaly Detection for multivariate time series (USAD) based on adversarially trained autoencoders. Its autoencoder architecture makes it capable of learning in an unsupervised way. The use of adversarial training and its architecture allows it to isolate anomalies while providing fast training. We study the properties of our method through experiments on five public datasets, demonstrating its robustness, training speed and high anomaly detection performance. Through a feasibility study using Orange's proprietary data, we have been able to validate Orange's requirements on scalability, stability, robustness, training speed and high performance.

A Dual Heterogeneous Graph Attention Network to Improve Long-Tail Performance for Shop Search in E-Commerce

Shop search has become an increasingly important service provided by Taobao, China's largest e-commerce platform. By using shop search, a user can easily identify the desired shop that provides a full range of relevant items matching their information need. With the tremendous growth of users and shops, shop search faces several unique challenges: 1) many shop names do not fully express what they sell, i.e., there is a semantic gap between user query and shop name; 2) due to the lack of user interactions, it is difficult to deliver good search results for long-tail queries and to retrieve long-tail shops that are highly relevant to a query.

To address these two key challenges, we resort to graph neural networks (GNNs), which have been applied successfully to arbitrarily structured graph data. Specifically, we propose a dual heterogeneous graph attention network (DHGAT) integrated with the two-tower architecture, using the user interaction data from both shop search and product search. First, we build a heterogeneous graph in the context of shop search by exploiting both the first-order and second-order proximity from user search behaviors, user click-through behaviors and user purchase records. Then, DHGAT is devised to attentively aggregate heterogeneous and homogeneous neighbors of each query and shop to enhance their representations, which can help relieve the long-tail phenomenon. In addition, DHGAT enriches the semantics of query text and shop name by composing the titles of the relevant items to alleviate the semantic gap. Moreover, to enhance the graph representation learning, we augment DHGAT with a regularized neighbor proximity loss (NPL) to explicitly learn the graph topological structure and train the whole framework in an end-to-end fashion. Compelling results from both offline evaluation and online A/B tests demonstrate the superiority of DHGAT over state-of-the-art methods, especially for long-tail queries and shops.

Learning with Limited Labels via Momentum Damped & Differentially Weighted Optimization

As deep learning-based models are deployed more widely in search & recommender systems, system designers often face the issue of gathering large amounts of well-annotated data to train such neural models. While most user-centric systems rely on interaction signals as implicit feedback to train models, such signals are often weak proxies of user satisfaction, as compared to (say) explicit judgments from users, which are prohibitively expensive to collect. In this paper, we consider the task of learning from limited labeled data, wherein we aim at jointly leveraging strong supervision data (e.g. explicit judgments) along with weak supervision data (e.g. implicit feedback or labels from the related task) to train neural models.

We present data mixing strategies based on submodular subset selection and, additionally, propose adaptive optimization techniques to enable the model to differentiate between a strong label data point and a weak supervision data point. Finally, we present two different case studies, (i) user satisfaction prediction with music recommendation and (ii) question-based video comprehension, and demonstrate that the proposed adaptive learning strategies are better at learning from limited labels. Our techniques and findings provide practitioners with ways of leveraging external labeled data.

SESSION: Health Day Papers

Learning to Simulate Human Mobility

Realistic simulation of a massive amount of human mobility data is of great use in epidemic spreading modeling and related health policy-making. Existing solutions for mobility simulation can be classified into two categories: model-based methods and model-free methods, which are both limited in generating high-quality mobility data due to the complicated transitions and complex regularities in human mobility. To solve this problem, we propose a model-free generative adversarial framework, which effectively integrates the domain knowledge of human mobility regularity utilized in the model-based methods. In the proposed framework, we design a novel self-attention based sequential modeling network as the generator to capture the complicated temporal transitions in human mobility. To augment the learning power of the generator with the advantages of model-based methods, we design an attention-based region network to introduce the prior knowledge of urban structure to generate a meaningful trajectory. As for the discriminator, we design a mobility regularity-aware loss to distinguish the generated trajectory. Finally, we utilize the mobility regularities of spatial continuity and temporal periodicity to pre-train the generator and discriminator to further accelerate the learning procedure. Extensive experiments on two real-life mobility datasets demonstrate that our framework outperforms seven state-of-the-art baselines significantly in terms of improving the quality of simulated mobility data by 35%. Furthermore, in the simulated spreading of COVID-19, synthetic data from our framework reduces MAPE from 5% ~ 10% (baseline performance) to 2%.
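MAPE in the result above is the mean absolute percentage error. A minimal sketch of the metric follows; the case counts are made up purely for illustration and are not taken from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is the absolute error relative to the true value.
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical daily case counts (illustrative only).
actual = [100.0, 200.0, 400.0]
predicted = [98.0, 204.0, 392.0]
print(round(mape(actual, predicted), 2))
```

Because every error is divided by the true value, MAPE is undefined when any true value is zero, which is worth checking before applying it to sparse case counts.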

Data-driven Simulation and Optimization for Covid-19 Exit Strategies

The rapid spread of the SARS-CoV-2 coronavirus is a major challenge that led almost all governments worldwide to take drastic measures to respond to the tragedy. Chief among those measures is the massive lockdown of entire countries and cities, which beyond its global economic impact has created deep social and psychological tensions within populations. While the adopted mitigation measures (including the lockdown) have generally proven useful, policymakers are now facing a critical question: how and when to lift the mitigation measures? A carefully planned exit strategy is indeed necessary to recover from the pandemic without risking a new outbreak. Classically, exit strategies rely on mathematical modeling to predict the effect of public health interventions. Such models are unfortunately known to be sensitive to some key parameters, which are usually set based on rules of thumb.

In this paper, we propose to augment epidemiological forecasting with actual data-driven models that will learn to fine-tune predictions for different contexts (e.g., per country). We have therefore built a pandemic simulation and forecasting toolkit that combines a deep learning estimation of the epidemiological parameters of the disease in order to predict the cases and deaths, and a genetic algorithm component searching for optimal trade-offs/policies between constraints and objectives set by decision-makers.

Replaying pandemic evolution in various countries, we experimentally show that our approach yields predictions with much lower error rates than pure epidemiological models in 75% of the cases and achieves a 95% R² score when the learning is transferred and tested on unseen countries. When used for forecasting, this approach provides actionable insights into the impact of individual measures and strategies.

Understanding the Impact of the COVID-19 Pandemic on Transportation-related Behaviors with Human Mobility Data

The constrained outbreak of COVID-19 in Mainland China has recently been regarded as a successful example of fighting this highly contagious virus. Both the short period (about three months) of transmission and the sub-exponential increase of confirmed cases in Mainland China indicate that the Chinese authorities took effective epidemic prevention measures, such as case isolation, travel restrictions, closing recreational venues, and banning public gatherings. These measures can effectively control the spread of the COVID-19 pandemic. Meanwhile, they may dramatically change human mobility patterns, such as the daily transportation-related behaviors of the public. To better understand the impact of COVID-19 on transportation-related behaviors and to provide more targeted anti-epidemic measures, we use the huge amount of human mobility data collected from Baidu Maps, a widely used Web mapping service in China, to look into the detailed reactions of people there during the pandemic. Specifically, we conduct data-driven analysis of transportation-related behaviors during the pandemic from the perspectives of 1) means of transportation, 2) type of visited venues, 3) check-in time of venues, 4) preference on "origin-destination" distance, and 5) "origin-transportation-destination" patterns. For each topic, we also give our specific insights and policy-making suggestions. Given that the COVID-19 pandemic is still spreading in more than 200 overseas countries, infecting millions of people worldwide, the insights and suggestions provided here may help fight COVID-19.

Simulating the Impact of Hospital Capacity and Social Isolation to Minimize the Propagation of Infectious Diseases

Infectious diseases can spread from an infected person to a susceptible person through direct or indirect physical contact, so controlling such spread is difficult. However, a proper decision at the initial stage can help control the disease's propagation before it turns into a pandemic. Social distancing and hospital capacity are considered among the most critical parameters for managing these conditions. In this paper, we used artificial agent-based simulation modeling to identify the importance of social distancing and hospital capacity, in terms of the number of beds, in shortening the length of an outbreak and reducing the total number of infections and deaths during an epidemic. After simulating the model under different scenarios in a small artificial society, we learned that a shorter delay in activating social isolation has a higher impact on reducing the catastrophe. Increasing the hospitals' treatment capacity, i.e., the number of isolation beds, can be useful when social isolation cannot be activated quickly. The model can be considered a prototype for choosing proper steps toward controlling an epidemic, based on simulations with different parameter settings.

Effective Transfer Learning for Identifying Similar Questions: Matching User Questions to COVID-19 FAQs

People increasingly search online for answers to their medical questions but the rate at which medical questions are asked online significantly exceeds the capacity of qualified people to answer them. This leaves many questions unanswered or inadequately answered. Many of these questions are not unique, and reliable identification of similar questions would enable more efficient and effective question answering schema. COVID-19 has only exacerbated this problem. Almost every government agency and healthcare organization has tried to meet the informational need of users by building online FAQs, but there is no way for people to ask their question and know if it is answered on one of these pages. While many research efforts have focused on the problem of general question similarity, these approaches do not generalize well to domains that require expert knowledge to determine semantic similarity, such as the medical domain. In this paper, we show how a double fine-tuning approach of pretraining a neural network on medical question-answer pairs followed by fine-tuning on medical question-question pairs is a particularly useful intermediate task for the ultimate goal of determining medical question similarity. While other pretraining tasks yield an accuracy below 78.7% on this task, our model achieves an accuracy of 82.6% with the same number of training examples, an accuracy of 80.0% with a much smaller training set, and an accuracy of 84.5% when the full corpus of medical question-answer data is used. We also describe a currently live system that uses the trained model to match user questions to COVID-related FAQs.

Hi-COVIDNet: Deep Learning Approach to Predict Inbound COVID-19 Patients and Case Study in South Korea

The escalating crisis of COVID-19 has put people all over the world in danger. Owing to the high contagion rate of the virus, COVID-19 cases continue to increase globally. To further suppress the threat of the COVID-19 pandemic and minimize its damage, it is imperative that each country monitors inbound travelers. Moreover, given that resources for quarantine are often limited, they must be carefully allocated. In this paper, to aid in such allocation by predicting the number of inbound COVID-19 cases, we propose Hi-COVIDNet, which takes advantage of the geographic hierarchy. Hi-COVIDNet is based on a neural network with two-level components, namely, country-level and continent-level encoders, which understand the complex relationships among foreign countries and derive their respective contagion risk to the destination country. An in-depth case study in South Korea with real-world COVID-19 datasets confirmed the effectiveness and practicality of Hi-COVIDNet.

Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data

Audio signals generated by the human body (e.g., sighs, breathing, heart, digestion, vibration sounds) have routinely been used by clinicians as indicators to diagnose disease or assess disease progression. Until recently, such signals were usually collected through manual auscultation at scheduled visits. Research has now started to use digital technology to gather bodily sounds (e.g., from digital stethoscopes) for cardiovascular or respiratory examination, which could then be used for automatic analysis. Some initial work shows promise in detecting diagnostic signals of COVID-19 from voice and coughs. In this paper we describe our data analysis over a large-scale crowdsourced dataset of respiratory sounds collected to aid diagnosis of COVID-19. We use coughs and breathing to understand how discernible COVID-19 sounds are from those in asthma or healthy controls. Our results show that even a simple binary machine learning classifier is able to correctly classify healthy and COVID-19 sounds. We also show how we distinguish a user who tested positive for COVID-19 and has a cough from a healthy user with a cough, and users who tested positive for COVID-19 and have a cough from users with asthma and a cough. Our models achieve an AUC of above 80% across all tasks. These results are preliminary and only scratch the surface of the potential of this type of data and audio-based machine learning. This work opens the door to further investigation of how automatically analysed respiratory patterns could be used as pre-screening signals to aid COVID-19 diagnosis.

Understanding the Urban Pandemic Spreading of COVID-19 with Real World Mobility Data

Facing the rapid worldwide spread of the COVID-19 pandemic, we need to understand its diffusion in urban environments with heterogeneous population distribution and mobility. However, challenges exist in the choice of proper spatial resolution, the integration of mobility data into epidemic modelling, and the incorporation of unique characteristics of COVID-19.

To address these challenges, we build a data-driven epidemic simulator with COVID-19 specific features, which incorporates real-world mobility data capturing the heterogeneity in urban environments. Based on the simulator, we conduct two series of experiments to: (1) estimate the efficacy of different mobility control policies in containing the epidemic; and (2) study how the heterogeneity of urban mobility affects the spreading process. Extensive results not only highlight the effectiveness of fine-grained targeted mobility control policies, but also uncover different levels of impact of population density and mobility strength on the spreading process. With such capability and demonstrations, our open simulator contributes to a better understanding of the complex spreading process and to smarter policies to prevent another pandemic.

SESSION: Panel

Fighting a Pandemic: Convergence of Expertise, Data Science and Policy

This panel will address the challenges and opportunities of using data science to fight a pandemic. Of particular interest are real-world cases where using data science helped the fight against the pandemic and cautionary tales of when it hindered that fight.

SESSION: Tutorial Abstracts

From Zero to AI Hero with Automated Machine Learning

Automated ML is an emerging field in machine learning that helps developers and new data scientists with little data science knowledge build machine learning models and solutions without having to master the complexity of learning algorithm selection and hyperparameter tuning. With Azure Machine Learning's automated machine learning capability, given a dataset and a few configuration parameters, you will get a trained, high-quality machine learning model for the dataset that you can use for predictions. In this session, you will learn how to use Automated ML for productivity gains, empowering domain experts to build ML-based solutions and scale to build several models with Azure Machine Learning's Automated ML.

Put Deep Learning to Work: Accelerate Deep Learning through Amazon SageMaker and ML Services

Deep learning (DL) projects are becoming increasingly pervasive at enterprises and startups alike. At Amazon, Machine Learning University (MLU)-trained engineers are taking DL to every aspect of Amazon's businesses, beyond just Amazon Go, Alexa, and Robotics.

In this workshop, Wenming Ye (AWS), Rachel Hu (AWS), and Miro Enev (Nvidia) offer a practical next step in DL learning with instructions, and hands-on labs using the latest Nvidia GPUs and AWS Inferentia. You will explore the current trends powering AI/DL adoption, powerful new GPU/AWS Inferentia accelerator instances, distributed training and inference optimization in neural networks.

Building Forecasting Solutions Using Open-Source and Azure Machine Learning

Time series forecasting is one of the most important topics in data science. Almost every business needs to predict the future in order to make better decisions and allocate resources more effectively. Examples of time series forecasting use cases are financial forecasting, demand forecasting in logistics for operational planning of assets, demand forecasting for Azure resources, and energy demand forecasting for campus buildings and data centers. The goal of this tutorial is to demonstrate state-of-the-art forecasting approaches to problems in retail and introduce a new repository focusing on best practices in the forecasting domain, along with a library of forecasting utilities [1].

The tutorial will start with a quick overview of time series forecasting and traditional time series models to provide the audience with a clear background on the kind of problems that we aim to solve. We will also briefly explore the dataset to be used in all exercises.

Next, we will run through several exercises to solve a forecasting problem in retail. We will start with a traditional statistical approach, e.g. ARIMA, using an auto-arima function in Python [2]. Next, we will cover machine-learning-based approaches to forecasting and various ways to featurize the time series dataset, then train a LightGBM model [6]. Finally, we will describe a deep-neural-net-based approach, namely Dilated CNN, and train a Dilated CNN model on our data [7-8]. Because LightGBM and Dilated CNN are two efficient, state-of-the-art models, we can train them quickly and achieve very high forecasting accuracy.

In the last part of the tutorial, we will cover an example of hyper-parameter tuning in forecasting, and use HyperDrive in Azure Machine Learning service to achieve the task [3-5]. As a part of this exercise, we will also demonstrate how to deploy the trained model to Azure Container Instance (ACI) and test the deployed service.

The repository also contains best-practice implementations in the R language. Time permitting, we will cover common approaches to solving forecasting problems in R, ranging from simple regression models to more complex ones, such as the Prophet package.

How to Calibrate your Neural Network Classifier: Getting True Probabilities from a Classification Model

Research in Machine Learning (ML) for classification tasks has been primarily guided by metrics that derive from a confusion matrix (e.g. accuracy, precision and recall). Several works have highlighted that this has led to training practices that produce over-confident models and void the assumption that the model learns a probability distribution over the classification targets; this is referred to as miscalibration. Consequently, modern ML architectures struggle to perform in applications where a probabilistic forecaster is needed. Research efforts on calibration techniques have explored the possibility of recovering probability distributions from traditional architectures. This tutorial covers the key concepts required to understand the motivations behind calibration and aims to provide participants with the tools they need to assess the calibration of ML models and calibrate them when required.
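One common way to assess calibration is the expected calibration error (ECE), which bins predictions by confidence and compares each bin's average confidence with its accuracy. The abstract does not name the metrics the tutorial uses, so ECE is an assumption here; the function name and the example predictions are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between confidence and accuracy over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's gap by the fraction of samples it holds.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical over-confident model: claims 0.95 confidence but is right half the time.
confidences = [0.95, 0.95, 0.95, 0.95]
correct = [1, 0, 1, 0]
print(expected_calibration_error(confidences, correct))
```

A well-calibrated model yields an ECE near zero; the over-confident example above yields a gap of roughly 0.45 between stated confidence and observed accuracy.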

Neural Structured Learning: Training Neural Networks with Structured Signals

We present Neural Structured Learning (NSL) in TensorFlow [2], a new learning paradigm to train neural networks by leveraging structured signals in addition to feature inputs. Structure can be explicit as represented by a graph, or implicit, either induced by adversarial perturbation or inferred using techniques like embedding learning. NSL is open-sourced as part of the TensorFlow [3] ecosystem and is widely used in Google across many products and services. In this tutorial, we provide an overview of the NSL framework including various libraries, tools, and APIs as well as demonstrate the practical use of NSL in different applications. The NSL website is hosted at www.tensorflow.org/neural_structured_learning, which includes details about the theoretical foundations of the technology, extensive API documentation, and hands-on tutorials.

Accelerating and Expanding End-to-End Data Science Workflows with DL/ML Interoperability Using RAPIDS

The lines between data science (DS), machine learning (ML), deep learning (DL), and data mining continue to be blurred and removed. This is great as it ushers in vast amounts of capabilities, but it brings increased complexity and a vast number of tools/techniques. It's not uncommon for DL engineers to use one set of tools for data extraction/cleaning and then pivot to another library for training their models. After training and inference, it's common to then move data yet again by another set of tools for post-processing. The RAPIDS suite of open source libraries not only provides a method to execute and accelerate these tasks using GPUs with familiar APIs, but it also provides interoperability with the broader open source community and DL tools while removing unnecessary serializations that slow down workflows. GPUs provide massive parallelization that DL has leveraged for some time, and RAPIDS provides the missing pieces that extend this computing power to more traditional yet important DS and ML tasks (e.g., ETL, modeling). Complete pipelines can be built that encompass everything, including ETL, feature engineering, ML/DL modeling, inference, and visualization, all while removing typical serialization costs and affording seamless interoperability between libraries. All experiments using RAPIDS can effortlessly be scheduled, logged and reviewed using existing public cloud options. Join our engineers and data scientists as they walk through a collection of DS and ML/DL engineering problems that show how RAPIDS running on Azure ML can be used for end-to-end, entirely GPU pipelines. This tutorial includes specifics on how to use RAPIDS for feature engineering, interoperability with common ML/DL packages, and creating GPU native visualizations using cuxfilter. 
The use cases presented here give attendees a hands-on approach to using RAPIDS components as part of a larger workflow, seamlessly integrating with other libraries (e.g., TensorFlow) and visualization packages.

DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters

Explore new techniques in Microsoft's open source library called DeepSpeed, which advances large model training by improving scale, speed, cost, and usability, unlocking the ability to train 100-billion-parameter models. DeepSpeed is compatible with PyTorch. One piece of our library, called ZeRO, is a new parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers have used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), which at the time of its release was the largest publicly known language model at 17 billion parameters. In addition, we will go over our latest transformer kernel advancements that led the DeepSpeed team to achieve the world's fastest BERT pretraining record.

The Zero Redundancy Optimizer (ZeRO) is a novel memory optimization technology for large-scale distributed deep learning. ZeRO can train deep learning models with over 100 billion parameters on the current generation of GPU clusters at three to five times the throughput of the current best system. It also presents a clear path to training models with trillions of parameters, demonstrating an unprecedented leap in deep learning system technology.
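To give a rough sense of why partitioning model states helps, the sketch below estimates per-GPU memory for model states only. The abstract gives no accounting, so this assumes the accounting described in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states) and ignores activations and temporary buffers. The function name and the 7.5-billion-parameter, 64-GPU setting are illustrative choices, not DeepSpeed API.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state memory per GPU in bytes (activations excluded).

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    params = 2 * num_params   # fp16 parameters
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 parameter copy, momentum, variance (Adam)
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


# Illustrative setting: 7.5 billion parameters on 64 GPUs.
for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions the replicated baseline needs about 120 GB per GPU, while full partitioning divides that figure by the number of GPUs.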

DeepSpeed brings state-of-the-art training techniques, such as ZeRO, optimized kernels, distributed training, mixed precision, and checkpointing, through lightweight APIs compatible with PyTorch. With just a few lines of code changes to your PyTorch model, you can leverage DeepSpeed to address underlying performance challenges and boost the speed and scale of your training.

Robust Deep Learning Methods for Anomaly Detection

Anomaly detection is an important problem that has been well-studied within diverse research areas and application domains. A robust anomaly detection system identifies rare events and patterns in the absence of labelled data. The identified patterns provide crucial insights about both the fidelity of the data and deviations in the underlying data-generating process. For example, a surveillance system designed to monitor the emergence of new epidemics will use robust anomaly detection methods to separate spurious associations from genuine indicators of an epidemic with minimal lag time.

The key concept in anomaly detection is the notion of "robustness", i.e., designing models and representations that are less sensitive to small changes in the underlying data distribution. The canonical example is that the median is more robust than the mean as an estimator. The tutorial will primarily help researchers and developers design deep learning architectures and loss functions where the learnt representation behaves more like the "median" than the "mean". The tutorial will revisit well-known unsupervised learning techniques in deep learning, including autoencoders and generative adversarial networks (GANs), from the perspective of anomaly detection. This in turn will give the audience a more grounded perspective on unsupervised deep learning methods. All the methods will be introduced in a hands-on manner to demonstrate how high-level ideas and concepts are translated into practical code.
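The median-versus-mean example can be seen in a few lines. The readings below are made up for illustration; the point is that one gross outlier shifts the mean substantially while leaving the median unchanged. The link to loss functions is that the mean minimizes squared error while the median minimizes absolute error.

```python
import statistics

# Hypothetical sensor readings clustered around 10.
readings = [10.0, 11.0, 9.0, 10.0, 10.0]
# The same readings with a single gross outlier appended.
corrupted = readings + [1000.0]

print("clean:    ", statistics.mean(readings), statistics.median(readings))
print("corrupted:", statistics.mean(corrupted), statistics.median(corrupted))
```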

Faster, Simpler, More Accurate: Practical Automated Machine Learning with Tabular, Text, and Image Data

Automated machine learning (AutoML) offers the promise of translating raw data into accurate predictions with just a few lines of code. Rather than relying on human time/effort and manual experimentation, models can be improved by simply letting the AutoML system run for more time. In this hands-on tutorial, we demonstrate fundamental techniques that enable powerful AutoML. We consider standard supervised learning tasks on various types of data including tables, text, images, as well as multi-modal data composed of multiple types. Rather than technical descriptions of how individual ML models work, we emphasize how to best use models within an overall ML pipeline that takes in raw training data and outputs predictions for test data. A major focus of our tutorial is on automating deep learning, a class of powerful techniques that are cumbersome to manage manually. Despite this, hardly any educational material describes their successful automation. Each topic covered in the tutorial is accompanied by a hands-on Jupyter notebook that implements best practices (which will be available on GitHub before and after the tutorial). Most of this code is adapted from AutoGluon (autogluon.mxnet.io), a recent AutoML toolkit for automated deep learning that is both state-of-the-art and easy-to-use.

Intelligible and Explainable Machine Learning: Best Practices and Practical Challenges

Learning methods such as boosting and deep learning have made ML models harder to understand and interpret. This puts data scientists and ML developers in the position of often having to make a tradeoff between accuracy and intelligibility. Research in IML (Interpretable Machine Learning) and XAI (Explainable AI) focuses on minimizing this trade-off by developing more accurate interpretable models and by developing new techniques to explain black-box models. Such models and techniques make it easier for data scientists, engineers and model users to debug models and achieve important objectives such as ensuring the fairness of ML decisions and the reliability and safety of AI systems. In this tutorial, we present an overview of various interpretability methods and provide a framework for thinking about how to choose the right explanation method for different real-world scenarios. We will focus on the application of XAI in practice through a variety of case studies from domains such as healthcare, finance, and bias and fairness. Finally, we will present open problems and research directions for the data mining and machine learning community. What the audience will learn: when and how to use a variety of machine learning interpretability methods through case studies of real-world situations; the difference between glass-box and black-box explanation methods and when to use them; and how to use open source interpretability toolkits that are now available.

Dealing with Bias and Fairness in Data Science Systems: A Practical Hands-on Tutorial

Tackling issues of bias and fairness when building and deploying data science systems has received increased attention from the research community in recent years, yet a lot of the research has focused on theoretical aspects and a very limited set of application areas and data sets. There is a lack of 1) practical training materials, 2) methodologies, and 3) tools for researchers and developers working on real-world algorithmic decision-making systems to deal with issues of bias and fairness. Today, treating bias and fairness as primary metrics of interest, and building, selecting, and validating models using those metrics, is not standard practice for data scientists. In this hands-on tutorial we will try to bridge the gap between research and practice by deep diving into algorithmic fairness, from metrics and definitions to practical case studies, including bias audits using the Aequitas toolkit (http://github.com/dssg/aequitas). By the end of this hands-on tutorial, the audience will be familiar with bias mitigation frameworks and tools that help them make decisions during a project based on the intervention and deployment contexts in which their system will be used.
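To give a flavour of what a bias audit measures, the sketch below computes a per-group false positive rate and its ratio to a reference group. It is plain Python written for illustration and does not use or reproduce the Aequitas API; the function names, the choice of metric, and the toy records are all assumptions.

```python
from collections import defaultdict

def false_positive_rates(records):
    """Per-group false positive rate, FP / (FP + TN).

    records: iterable of (group, true_label, prediction) with 0/1 labels.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, label, prediction in records:
        if label == 0:  # only true negatives can become false positives
            negatives[group] += 1
            false_positives[group] += prediction
    return {g: false_positives[g] / negatives[g] for g in negatives}

def disparity(rates, reference):
    """Ratio of each group's rate to the reference group's rate (must be non-zero)."""
    return {g: rate / rates[reference] for g, rate in rates.items()}

# Hypothetical audit data: (group, true label, model prediction).
records = [
    ("A", 0, 0), ("A", 0, 0), ("A", 0, 0), ("A", 0, 1), ("A", 1, 1),
    ("B", 0, 1), ("B", 0, 1), ("B", 0, 0), ("B", 0, 0), ("B", 1, 0),
]
rates = false_positive_rates(records)
ratios = disparity(rates, reference="A")
print(rates, ratios)
```

Which metric to audit (false positive rate, false negative rate, or another) depends on the intervention the model drives, so the metric here should be read as one possible choice rather than a recommendation.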

Deep Learning for Search and Recommender Systems in Practice

In this talk, we will go over the components of personalized search and recommender systems and demonstrate the applications of various deep learning techniques along the way.

Search and recommender systems are probably the most prevalent ML-powered applications across the industry. They share most of their components and both provide the user with a ranked list of items. The subtle difference is that a search system typically acts passively, responding to a clear user intention expressed as a query, whereas a recommender system acts more proactively.

Deep learning has been wildly successful in solving complex tasks such as image recognition, speech recognition, natural language processing and understanding, machine translation, etc. In the area of personalized recommender systems, deep learning has been showing tremendous impact in recent years.

Search and recommender systems can be staged roughly in three phases: 1. user and query understanding, where a query or a user profile is processed so that the system can use the processed information to 2. retrieve all the related items (high recall) and 3. rank the items in order of relevance to the user's intent (high precision). Each phase has its unique challenges, and deep learning has been pushing the limits in all of them.
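The three phases can be sketched as a toy pipeline. This is a minimal illustration under strong assumptions: token overlap stands in for the learned query, retrieval, and ranking models a production system would use, and the function names and catalog are hypothetical.

```python
def understand(query):
    """Phase 1: normalize the raw query into a set of lowercase tokens."""
    return set(query.lower().split())

def retrieve(tokens, catalog):
    """Phase 2 (recall-oriented): keep every item that shares at least one token."""
    return [item for item in catalog if tokens & set(item.lower().split())]

def rank(tokens, candidates):
    """Phase 3 (precision-oriented): order candidates by Jaccard overlap with the query."""
    def score(item):
        item_tokens = set(item.lower().split())
        return len(tokens & item_tokens) / len(tokens | item_tokens)
    return sorted(candidates, key=score, reverse=True)

# Hypothetical item catalog.
catalog = [
    "deep learning for search",
    "graph neural networks",
    "deep graph learning",
    "cooking recipes",
]
tokens = understand("Deep Learning Search")
candidates = retrieve(tokens, catalog)
ranked = rank(tokens, candidates)
print(ranked)
```

Keeping retrieval cheap and permissive while reserving the more expensive scoring for the ranking step mirrors the recall-then-precision split described above.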

By the end of the talk, we hope the audience will have gained some first-hand experience building a personalized search/recommender system using deep learning techniques.

Computer Vision: Deep Dive into Object Segmentation Approaches

Image segmentation is the task of associating pixels in an image with their respective object class labels. It has a wide range of applications in many industries including healthcare, transportation, robotics, fashion, home improvement, and tourism. Many deep learning-based approaches have been developed for image-level object recognition and pixel-level scene understanding, with the latter requiring a much denser annotation of scenes with a large set of objects. This tutorial provides an end-to-end pipeline for performing image segmentation using state-of-the-art deep learning approaches and public datasets. The hands-on session will provide instructions for dataset customization, transformation, and training, validating, and testing segmentation models. The goal of this tutorial is to provide participants with a strong understanding of building image segmentation models for downstream applications.

In Search for a Cure: Recommendation With Knowledge Graph on CORD-19

The whole world has mobilized to cope with the COVID-19 situation. This hands-on tutorial aims to provide a comprehensive and pragmatic end-to-end walk-through for building an academic research paper recommender for COVID-19-related studies, with the help of knowledge graph technology. The code examples that demonstrate the theories are reproducible and can hopefully provide value for researchers building tools that support research toward a cure for COVID-19.

Scalable Graph Neural Networks with Deep Graph Library

Learning from graph and relational data plays a major role in many applications including social network analysis, marketing, e-commerce, information retrieval, knowledge modeling, medical and biological sciences, engineering, and others. In the last few years, Graph Neural Networks (GNNs) have emerged as a promising new supervised learning framework capable of bringing the power of deep representation learning to graph and relational data. This ever-growing body of research has shown that GNNs achieve state-of-the-art performance for problems such as link prediction, fraud detection, target-ligand binding activity prediction, knowledge-graph completion, and product recommendations. In practice, many real-world graphs are very large, so scalable solutions for training GNNs efficiently on large graphs are urgently needed.

The objective of this tutorial is twofold. First, it will provide an overview of the theory behind GNNs, discuss the types of problems that GNNs are well suited for, and introduce some of the most widely used GNN model architectures and the problems/applications that they are designed to solve. Second, it will introduce the Deep Graph Library (DGL), a scalable GNN framework that simplifies the development of efficient GNN-based training and inference programs at a large scale. To make things concrete, the tutorial will cover state-of-the-art training methods to scale GNNs to large graphs and provide hands-on sessions to show how to use DGL to perform scalable training in different settings (multi-GPU training and distributed training). This hands-on part will start with basic graph applications (e.g., node classification and link prediction) to set up the context and move on to training GNNs on large graphs. It will provide tutorials to demonstrate how to apply the techniques in DGL to train GNNs for real-world applications.
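As a rough picture of the neighborhood aggregation idea that underlies many GNN architectures, the sketch below performs one round of mean aggregation in plain Python. It deliberately does not use the DGL API; the function name, the toy graph, and the choice of an unweighted mean with self-loops are assumptions, and a real GNN layer would additionally apply learned weights and a nonlinearity.

```python
def mean_aggregate(features, edges):
    """One round of message passing on an undirected graph.

    Each node's new feature vector is the mean of its own vector and
    those of its neighbors (a self-loop is included for every node).
    """
    neighbors = {node: [node] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    dim = len(next(iter(features.values())))
    return {
        node: [sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]
        for node, nbrs in neighbors.items()
    }

# Hypothetical three-node path graph a - b - c with 2-dimensional features.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
edges = [("a", "b"), ("b", "c")]
updated = mean_aggregate(features, edges)
print(updated)
```

Stacking several such rounds lets information travel across multiple hops, which is also why training cost grows quickly on large graphs.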

Introduction to Computer Vision and Real Time Deep Learning-based Object Detection

Computer vision (CV) is a field of artificial intelligence that trains computers to interpret and understand the visual world for a variety of exciting downstream tasks such as self-driving cars, checkout-less shopping, smart cities, cancer detection, and more. The field of CV has been revolutionized by deep learning over the last decade. This tutorial looks under the hood of modern-day CV systems, and builds out some of these tech pipelines in a Jupyter Notebook using Python, OpenCV, Keras and TensorFlow. While the primary focus is on digital images from cameras and videos, this tutorial will also introduce 3D point clouds, and classification and segmentation algorithms for processing them.

Building Recommender Systems with PyTorch

In this tutorial we show how to build deep learning recommendation systems and resolve the associated interpretability, integrity and privacy challenges. We start with an overview of the PyTorch framework, features that it offers and a brief review of the evolution of recommendation models. We delineate their typical components and build a proxy deep learning recommendation model (DLRM) in PyTorch. Then, we discuss how to interpret recommendation system results as well as how to address the corresponding integrity and quality challenges.

Causal Inference Meets Machine Learning

Causal inference has numerous real-world applications in many domains such as health care, marketing, political science and online advertising. Treatment effect estimation, a fundamental problem in causal inference, has been extensively studied in statistics for decades. However, traditional treatment effect estimation methods may not handle large-scale and high-dimensional heterogeneous data well. In recent years, an emerging research direction has attracted increasing attention in the broad artificial intelligence field, which combines the advantages of traditional treatment effect estimation approaches (e.g., matching estimators) and advanced representation learning approaches (e.g., deep neural networks). In this tutorial, we will introduce both traditional and state-of-the-art representation learning algorithms for treatment effect estimation. Background about causal inference, counterfactuals and matching estimators will be covered as well. We will also showcase promising applications of these methods in different application domains.
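For readers unfamiliar with matching estimators, the following is a minimal sketch of one-nearest-neighbor matching on a single covariate, estimating the average treatment effect on the treated (ATT). It is an illustration only: the function name, the single-covariate setting, matching with replacement, and the toy data are assumptions rather than the tutorial's method.

```python
def att_nearest_neighbor(units):
    """Estimate the ATT by matching each treated unit to its closest control.

    units: list of (covariate, treated, outcome) tuples.
    Matching is on the single covariate, with replacement.
    """
    treated = [(x, y) for x, t, y in units if t]
    control = [(x, y) for x, t, y in units if not t]
    effects = []
    for x_t, y_t in treated:
        # The matched control's outcome stands in for the unobserved counterfactual.
        _, y_c = min(control, key=lambda c: abs(c[0] - x_t))
        effects.append(y_t - y_c)
    return sum(effects) / len(effects)

# Hypothetical observational data: (covariate, treated?, outcome).
units = [
    (1.0, True, 5.0),
    (2.0, True, 7.0),
    (1.1, False, 3.0),
    (2.2, False, 5.0),
    (5.0, False, 9.0),
]
att = att_nearest_neighbor(units)

# Naive comparison of group means, which ignores the covariate entirely.
naive = (5.0 + 7.0) / 2 - (3.0 + 5.0 + 9.0) / 3
print(att, naive)
```

In this toy data the control unit with covariate 5.0 has no comparable treated unit, so the naive difference in means understates the effect, while matching compares like with like. With high-dimensional covariates, nearest neighbors become hard to define, which is one motivation for learning a representation before matching.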

Fairness in Machine Learning for Healthcare

The issue of bias and fairness in healthcare has been around for centuries. With the integration of AI in healthcare, the potential to discriminate and perpetuate unfair and biased practices in healthcare increases manyfold. The tutorial focuses on the challenges, requirements and opportunities in the area of fairness in healthcare AI and the various nuances associated with it. The problem of fairness in healthcare is presented as a multi-faceted, systems-level problem that necessitates careful mapping of different notions of fairness in healthcare to corresponding concepts in machine learning, and this is elucidated via different real-world examples.

Learning from All Types of Experiences: A Unifying Machine Learning Perspective

Contemporary Machine Learning and AI research has resulted in thousands of models (e.g., numerous deep networks, graphical models), learning paradigms (e.g., supervised, unsupervised, active, reinforcement, adversarial learning), optimization techniques (e.g., all kinds of optimization or stochastic sampling algorithms), not to mention countless approximation heuristics, tuning tricks, and black-box oracles, plus combinations of all of the above. While pushing the field forward rapidly, these results have also contributed to making ML/AI more like an alchemist's crafting workshop than a modern chemist's periodic table. This not only makes mastering existing ML techniques extremely difficult, but also makes standardized, reusable, repeatable, reliable, and explainable practice and further development of ML/AI products extremely costly, if possible at all.

This tutorial presents a systematic, unified blueprint of ML, for both a refreshing holistic understanding of the diverse ML paradigms/algorithms, and guidance of operationalizing ML for creating problem solutions in a composable manner.

The tutorial consists of three parts. The first part provides an overview of the current landscape of ML paradigms, with a focus on motivating a systematic perspective. The second part presents the blueprint from three aspects: objective function, optimization solver, and model architecture. We describe standardized formulations of the diverse objectives and algorithms, and a composable view of model structures. On this basis, the third part focuses on the operational side of ML. We describe a principled module abstraction of ML building blocks. We show that the abstraction enables efficient composition of ML solutions to problems in healthcare, manufacturing, and vision/text generation.

Advances in Recommender Systems: From Multi-stakeholder Marketplaces to Automated RecSys

The tutorial focuses on two major themes of recent advances in recommender systems: Part A: Recommendations in a Marketplace: Multi-sided marketplaces are steadily emerging as valuable ecosystems in many applications (e.g. Amazon, Airbnb, Uber), wherein the platforms have customers not only on the demand side (e.g. users), but also on the supply side (e.g. retailers). This tutorial focuses on designing search & recommendation frameworks that power such multi-stakeholder platforms. We discuss multi-objective ranking/recommendation techniques, discuss different ways in which stakeholders specify their objectives, highlight user-specific characteristics (e.g. user receptivity) which could be leveraged when developing joint optimization modules, and finally present a number of real-world case studies of such multi-stakeholder platforms.

Part B: Automated Recommendation System: As recommendation tasks are getting more diverse and recommendation models are growing more complicated, it is increasingly challenging to develop a proper recommendation system that can adapt well to a new recommendation task. In this tutorial, we focus on how automated machine learning (AutoML) techniques can benefit the design and usage of recommendation systems. Specifically, we start from a full scope describing what can be automated for recommendation systems. Then, we elaborate more on three important topics under such a scope, i.e., feature engineering, hyperparameter optimization/neural architecture search, and algorithm selection. The core issues and recent works under these topics will be introduced, summarized, and discussed. Finally, we close the tutorial with conclusions and some future directions.

Physics Inspired Models in Artificial Intelligence

Ideas originating in physics have informed progress in artificial intelligence and machine learning for many decades. However, the pedigree of many such ideas is oft neglected in the Computer Science community. The tutorial focuses on current and past ideas from physics that have helped in furthering AI and machine learning. Recent advances in physics-inspired ideas in AI are also explored, especially how insights from physics may hold the promise of opening the black box of deep learning. Lastly, current and future trends in this area and the outline of a research agenda on how physics-inspired models can benefit AI and machine learning are given.

Scientific Text Mining and Knowledge Graphs

Unstructured scientific text, in various forms of textual artifacts, including manuscripts, publications, patents, and proposals, is used to store the tremendous wealth of knowledge discovered after weeks, months, and years of developing hypotheses, working in the lab or clinic, and analyzing results. A grand challenge in data mining research is to develop effective methods for transforming the scientific text into well-structured forms (e.g., ontology, taxonomy, knowledge graphs), so that machine intelligent systems can build on them for hypothesis generation and validation. In this tutorial, we provide a comprehensive overview of recent research and development in this direction. First, we introduce a series of text mining methods that extract phrases, entities, scientific concepts, relations, claims, and experimental evidence. Then we discuss methods that construct and learn from scientific knowledge graphs for accurate search, document classification, and exploratory analysis. Specifically, we focus on scalable, effective, weakly supervised methods that work on text in sciences (e.g., chemistry, biology).

Learning with Small Data

In the era of big data, data-driven methods have become increasingly popular in various applications, such as image recognition, traffic signal control, and fake news detection. The superior performance of these data-driven approaches relies on large-scale labeled training data, which are often inaccessible in real-world applications, i.e., the "small (labeled) data" challenge. Examples include predicting emergent events in a city, detecting emerging fake news, and forecasting the progression of conditions for rare diseases. In most scenarios, people care about these small data cases most, and thus improving the learning effectiveness of machine learning algorithms with small labeled data has been a popular research topic.

In this tutorial, we will review the trending state-of-the-art machine learning techniques for learning with small (labeled) data. These techniques are organized from two aspects: (1) providing a comprehensive review of recent studies about knowledge generalization, transfer, and sharing, where transfer learning, multi-task learning, and meta-learning are discussed. In particular, we will focus more on meta-learning, which improves the model generalization ability and has recently proven to be an effective approach; (2) introducing the cutting-edge techniques which focus on incorporating domain knowledge into machine learning models. Different from model-based knowledge transfer techniques, in real-world applications, domain knowledge (e.g., physical laws) provides us with a new angle to deal with the small data challenge. Specifically, domain knowledge can be used to optimize learning strategies and/or guide the model design. In the data mining field, we believe that learning with small data is a trending topic with important social impact, which will attract both researchers and practitioners from academia and industry.

Adversarial Attacks and Defenses: Frontiers, Advances and Practice

Deep neural networks (DNNs) have achieved unprecedented success in numerous machine learning tasks in various domains. However, the existence of adversarial examples makes us hesitant to apply DNN models to safety-critical tasks such as autonomous vehicles and malware detection. These adversarial examples are intentionally crafted instances, appearing in either the training or the test phase, which can fool DNN models into making severe mistakes. Therefore, people are dedicated to devising more robust models to resist adversarial examples, but these models are usually broken by new, stronger attacks. This arms race between adversarial attacks and defenses has drawn increasing attention in recent years. In this tutorial, we provide a comprehensive overview of the frontiers and advances of adversarial attacks and their countermeasures. In particular, we give a detailed introduction to different types of attacks under different scenarios, including evasion and poisoning attacks, and white-box and black-box attacks. We will also discuss how defense strategies develop to compete against these attacks, and how new attacks emerge to break these defenses. Moreover, we will discuss the story of adversarial attacks and defenses in other data domains, especially in graph-structured data. Then, we introduce DeepRobust, a PyTorch adversarial learning library which aims to build a comprehensive and easy-to-use platform to foster this research field. Finally, we summarize the tutorial with discussions on open issues and challenges about adversarial attacks and defenses. Through our tutorial, the audience can grasp the main ideas and key approaches of the game between adversarial attacks and defenses.
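A white-box evasion attack can be illustrated on a model far simpler than a DNN. The sketch below applies the fast gradient sign method (FGSM), a well-known attack chosen here purely for illustration, to a two-feature logistic regression. It does not use the DeepRobust API; the model weights, input, and perturbation budget are hypothetical.

```python
import math

def predict(weights, bias, x):
    """Probability of class 1 under a logistic regression model."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def fgsm(weights, bias, x, label, epsilon):
    """Shift every input coordinate by epsilon in the direction that
    increases the cross-entropy loss (white-box: weights are known)."""
    p = predict(weights, bias, x)
    # For logistic regression, d(loss)/d(x_i) = (p - label) * w_i.
    grad = [(p - label) * w for w in weights]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * s for xi, s in zip(x, sign)]

# Hypothetical toy model and a correctly classified input of class 1.
weights, bias = [2.0, -1.0], 0.0
x, label = [0.5, 0.2], 1
x_adv = fgsm(weights, bias, x, label, epsilon=0.5)

print(predict(weights, bias, x))      # above 0.5: classified as class 1
print(predict(weights, bias, x_adv))  # below 0.5: the prediction flips
```

The attack needs only the sign of the input gradient, which is why access to model internals (the white-box setting) makes evasion so cheap; black-box attacks must estimate or transfer that direction instead.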

Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web

How do we surface the large amount of information present in HTML documents on the Web, from news articles to Rotten Tomatoes pages to tables of sports scores? Such information can enable a variety of applications including knowledge base construction, question answering, recommendation, and more. In this tutorial, we present approaches for information extraction (IE) from Web data that can be differentiated along two key dimensions: 1) the diversity in data modality that is leveraged, e.g. text, visual, XML/HTML, and 2) the thrust to develop scalable approaches with zero to limited human supervision.

Recent Advances on Graph Analytics and Its Applications in Healthcare

A graph is a natural representation encoding both the features of the data samples and the relationships among them. Analysis with graphs is a classic topic in data mining and many techniques have been proposed in the past. In recent years, because of the rapid development of data mining and knowledge discovery, many novel graph analytics algorithms have been proposed and successfully applied in a variety of areas. The goal of this tutorial is to summarize the graph analytics algorithms developed recently and how they have been applied in healthcare. In particular, our tutorial will cover both the technical advances and the application in healthcare. On the technical side, we will introduce deep network embedding techniques, graph neural networks, knowledge graph construction and inference, graph generative models and graph neural ordinary differential equation models. On the healthcare side, we will introduce how these methods can be applied in predictive modeling of clinical risks (e.g., chronic disease onset, in-hospital mortality, condition exacerbation, etc.) and disease subtyping with multi-modal patient data (e.g., electronic health records, medical images and multi-omics), knowledge discovery from biomedical literature and integration with data-driven models, as well as pharmaceutical research and development (e.g., de-novo chemical compound design and optimization, patient similarity for clinical trial recruitment and pharmacovigilance). We will conclude the whole tutorial with a set of potential issues and challenges such as interpretability, fairness and security. In particular, considering the global pandemic of COVID-19, we will also summarize the existing research that has already leveraged graph analytics to help with understanding the mechanism, transmission, treatment and prevention of COVID-19, as well as point out the available resources and potential opportunities for future research.

Tutorial on Human-Centered Explainability for Healthcare

In recent years, the rapid advances in Artificial Intelligence (AI) techniques along with an ever-increasing availability of healthcare data have made many novel analyses possible. Significant successes have been observed in a wide range of tasks such as next diagnosis prediction, AKI prediction, and adverse event prediction, including mortality and unexpected hospital re-admissions. However, there has been limited adoption and use of these methods in clinical practice due to their black-box nature. A significant amount of research is currently focused on making such methods more interpretable or making post-hoc explanations more accessible. However, most of this work is done at a very low level and, as a result, may not have a direct impact at the point of care. This tutorial will provide an overview of the landscape of different approaches that have been developed for explainability in healthcare. Specifically, we will present the problem of explainability as it pertains to various personas involved in healthcare, viz. data scientists, clinical researchers, and clinicians. We will chart out the requirements for such personas and present an overview of the different approaches that can address such needs. We will also walk through several use cases for such approaches. In this process, we will provide a brief introduction to explainability, charting its different dimensions as well as covering some relevant interpretability methods spanning such dimensions. We will touch upon some practical guides for explainability and provide a brief survey of open source tools such as the IBM AI Explainability 360 Open Source Toolkit.

Recent Advances in Multimodal Educational Data Mining in K-12 Education

Recently, we have seen a rapid rise in the amount of education data available through the digitization of education. This huge amount of education data usually takes the form of a mixture of images, videos, speech, texts, etc. It is crucial to consider data from different modalities to build successful applications in AI in education (AIED). This tutorial targets AI researchers and practitioners who are interested in applying state-of-the-art multimodal machine learning techniques to tackle some of the hard-core AIED tasks. These include tasks such as automatic short answer grading, student assessment, class quality assurance, knowledge tracing, etc.

In this tutorial, we will comprehensively review recent developments in applying multimodal learning approaches in AIED, with a focus on classroom multimodal data. Beyond introducing the recent advances of computer vision, speech, and natural language processing in education respectively, we will discuss how to combine data from different modalities and build AI-driven educational applications on top of these data. More specifically, we will talk about (1) representation learning; (2) algorithmic assessment & evaluation; and (3) personalized feedback. Participants will learn about recent trends and emerging challenges in this topic, representative tools and learning resources to obtain ready-to-use models, and how related models and techniques benefit real-world AIED applications.

Tutorial on Online User Engagement: Metrics and Optimization

User engagement plays a central role in companies operating online services, such as search engines, news portals, e-commerce sites, entertainment services, and social networks. A main challenge is to leverage collected knowledge about the daily online behavior of millions of users to understand what engages them in the short term and, more importantly, in the long term. Two critical steps in improving user engagement are defining metrics and optimizing them. The most common way that engagement is measured is through various online metrics, acting as proxy measures of user engagement. This tutorial will review these metrics, their advantages and drawbacks, and their appropriateness to various types of online services. Once metrics are defined, how to optimize them becomes the key issue. We will survey methodologies, including machine learning models and experimental designs, that are utilized to optimize these metrics directly or indirectly. As case studies, we will focus on four types of services: news, search, entertainment, and e-commerce.

Data Pricing -- From Economics to Data Science

Data are invaluable. How can we assess the value of data objectively and quantitatively? Pricing data, or information goods in general, has been studied and practiced in dispersed areas and principles, such as economics, data management, data mining, electronic commerce, and marketing. In this tutorial, we present a unified and comprehensive overview of this important direction. We examine various motivations behind data pricing, understand the economics of data pricing, review the development and evolution of pricing models, and compare the proposals of marketplaces of data. We cover both digital products, such as ebooks and MP3 music, and data products, such as data sets, data queries and machine learning models. We also connect data pricing with highly related areas, such as cloud service pricing, privacy pricing, and decentralized privacy-preserving infrastructure like blockchains.

Deep Graph Learning: Foundations, Advances and Applications

Many real-world data come in the form of non-grid objects, i.e. graphs, from social networks to molecules. Adaptation of deep learning from grid-like data (e.g. images) to graphs has recently received unprecedented attention from both machine learning and data mining communities, leading to a new cross-domain field---Deep Graph Learning (DGL). Instead of painstaking feature engineering, DGL aims to learn informative representations of graphs in an end-to-end manner. It has exhibited remarkable success in various tasks, such as node/graph classification, link prediction, etc.

In this tutorial, we aim to provide a comprehensive introduction to deep graph learning. We first introduce the theoretical foundations of deep graph learning with a focus on describing various Graph Neural Network Models (GNNs). We then cover the key achievements of DGL in recent years. Specifically, we discuss four topics: 1) training deep GNNs; 2) robustness of GNNs; 3) scalability of GNNs; and 4) self-supervised and unsupervised learning of GNNs. Finally, we will introduce the applications of DGL in various domains, including but not limited to drug discovery, computer vision, medical image analysis, social network analysis, natural language processing and recommendation.

Multi-modal Network Representation Learning

In today's information and computational society, complex systems are often modeled as multi-modal networks associated with heterogeneous structural relations, unstructured attributes/content, temporal context, or their combinations. The abundant information in multi-modal networks requires both domain understanding and a large exploratory search space when doing feature engineering for building customized intelligent solutions in response to different purposes. Therefore, automating feature discovery through representation learning in multi-modal networks has become essential for many applications. In this tutorial, we systematically review the area of multi-modal network representation learning, including a series of recent methods and applications. These methods will be categorized and introduced from the perspectives of unsupervised, semi-supervised and supervised learning, with corresponding real applications respectively. In the end, we conclude the tutorial and raise open questions for discussion. The authors of this tutorial are active and productive researchers in this area.

Data Science for the Real Estate Industry

The world's major industries, such as Financial Services, Telecom, Advertising, Healthcare, Education, etc., have attracted the attention of the KDD community for decades. Hundreds of KDD papers have been published on topics related to these industries and dozens of workshops organized---some of which have become an integral part of the conference agenda (e.g. the Health Day). Somewhat unexpectedly, the KDD conference has barely addressed the real estate industry, despite its enormous size and prominence. The reason for that apparent mismatch is two-fold: (a) until recently, the real estate industry did not appreciate the value data science methods could add (with some exceptions, such as econometric methods for creating real-estate price indices); (b) the Data Science community has not been aware of challenging real estate problems that are perfectly suited to its methods. This tutorial provides a step towards resolving this issue. We provide an introduction to real estate for data scientists, and outline a spectrum of data science problems, many of which are being tackled by new "prop-tech" companies, while some are yet to be approached. We present concrete examples from three of these companies (where the authors work): Airbnb -- the most popular short-term rental marketplace, Cherre -- a real estate data integration platform, and Compass -- the largest independent real estate brokerage in the U.S.

Overview and Importance of Data Quality for Machine Learning Tasks

It is well understood from the literature that the performance of a machine learning (ML) model is upper bounded by the quality of the data. While researchers and practitioners have focused on improving the quality of models (such as neural architecture search and automated feature selection), there are limited efforts towards improving the data quality. One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand; failure to do so can result in inaccurate analytics and unreliable decisions. Assessing the quality of the data across intelligently designed metrics and developing corresponding transformation operations to address the quality gaps helps to reduce the effort a data scientist spends on iterative debugging of the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for machine learning applications. It surveys the important data quality related approaches discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems. Finally we will discuss the interesting work IBM Research is doing in this space.

Interpreting and Explaining Deep Neural Networks: A Perspective on Time Series Data

Explainable and interpretable machine learning models and algorithms are important topics which have received growing attention from research, application and administration. Many complex Deep Neural Networks (DNNs) are often perceived as black-boxes. Researchers would like to be able to interpret what the DNN has learned in order to identify biases and failure modes and improve models. In this tutorial, we will provide a comprehensive overview of methods to analyze deep neural networks and insight into how those interpretable and explainable methods help us understand time series data.

Edge AI: Systems Design and ML for IoT Data Analytics

With the explosion in Big Data, it is often forgotten that much of the data nowadays is generated at the edge. Specifically, a major source of data is users' endpoint devices like phones, smart watches, etc., that are connected to the internet, also known as the Internet-of-Things (IoT). This "edge of data" faces several new challenges related to hardware-constraints, privacy-aware learning, and distributed learning (both training as well as inference). So what systems and machine learning algorithms can we use to generate or exploit data at the edge? Can network science help us solve machine learning (ML) problems? Can IoT-devices help people who live with some form of disability and many others benefit from health monitoring?

In this tutorial, we introduce the network science and ML techniques relevant to edge computing, discuss systems for ML (e.g., model compression, quantization, HW/SW co-design, etc.) and ML for systems design (e.g., run-time resource optimization, power management for training and inference on edge devices), and illustrate their impact in addressing concrete IoT applications.
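To make one of the "systems for ML" techniques named above concrete, here is a minimal sketch of symmetric 8-bit quantization in pure Python. It assumes a single per-tensor scale and round-to-nearest, which is one common textbook scheme and not necessarily the one covered in the tutorial; `quantize_int8` and `dequantize` are hypothetical helper names.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Each restored value differs from the original by at most about scale / 2.
```

Storing the integer codes plus one scale takes roughly a quarter of the memory of 32-bit floats, at the cost of a bounded rounding error per weight.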

Data Sketching for Real Time Analytics: Theory and Practice

Speed, cost, and scale. These are 3 of the biggest challenges in analyzing big data. While modern data systems continue to push the boundaries of scale, the problems of speed and cost are fundamentally tied to the size of data being scanned or processed. Processing thousands of queries that each access terabytes of data with sub-second latency remains infeasible. Data sketching techniques provide means to drastically reduce this size, allowing for real-time or interactive data analysis with reduced costs but with approximate answers.

This tutorial covers a number of useful data sketching and sampling methods and demonstrates their use with the Apache DataSketches project. We focus particularly on common problems in analytics such as counting distinct items, quantiles, histograms, heavy hitters, and aggregations with large group bys. For these, we cover algorithms, techniques, and theory that can aid both practitioners and theorists in constructing sketches and designing systems that achieve desired error guarantees. For practitioners and implementers, we show how some of these sketches can be easily instantiated using the Apache DataSketches project.
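The distinct-counting problem mentioned above can be illustrated with a K-Minimum-Values (KMV) sketch, one classical technique for trading a small, bounded memory footprint for an approximate answer. The code below is a from-scratch Python sketch under stated assumptions (SHA-256 as the hash, k = 256); it does not use or reproduce the Apache DataSketches API, and the names `hash_to_unit` and `KMVSketch` are hypothetical.

```python
import hashlib


def hash_to_unit(item):
    """Map a string to a pseudo-uniform value in (0, 1] (SHA-256 is an illustrative choice)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    return (value + 1) / 2**64


class KMVSketch:
    """Keep only the k smallest hash values seen; memory stays O(k)."""

    def __init__(self, k=256):
        self.k = k
        self.values = set()

    def update(self, item):
        h = hash_to_unit(item)
        if h in self.values:
            return  # duplicate item: same hash, nothing to do
        if len(self.values) < self.k:
            self.values.add(h)
            return
        largest = max(self.values)  # O(k) scan; a heap would be faster
        if h < largest:
            self.values.remove(largest)
            self.values.add(h)

    def estimate(self):
        if len(self.values) < self.k:
            return float(len(self.values))  # fewer than k distinct items: exact
        # If n uniform hashes fall in (0, 1], the k-th smallest is near k / n.
        return (self.k - 1) / max(self.values)


sketch = KMVSketch(k=256)
for i in range(10000):
    sketch.update(f"user-{i}")
    sketch.update(f"user-{i}")  # repeats do not change the sketch
approx_distinct = sketch.estimate()
```

With k retained values the relative standard error is roughly 1/sqrt(k), so k = 256 gives an error on the order of 6 percent while storing 256 numbers instead of every distinct item.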

Deep Learning for Anomaly Detection

Anomaly detection has been widely studied and used in diverse applications. Building an effective anomaly detection system requires researchers and developers to learn complex structure from noisy data, identify dynamic anomaly patterns, and detect anomalies with limited labels. Recent advancements in deep learning techniques have greatly improved anomaly detection performance, in comparison with classical approaches, and have extended anomaly detection to a wide variety of applications. This tutorial will help the audience gain a comprehensive understanding of deep learning based anomaly detection techniques in various application domains. First, we give an overview of the anomaly detection problem, introducing the approaches taken before the deep model era and listing the challenges they faced. Then we survey the state-of-the-art deep learning models that range from building block neural network structures such as MLP, CNN, and LSTM, to more complex structures such as autoencoders and generative models (VAE, GAN, Flow-based models), to deep one-class detection models, etc. In addition, we illustrate how techniques such as transfer learning and reinforcement learning can help mitigate the label sparsity issue in anomaly detection problems and how to collect and make the best use of user labels in practice. Second to last, we discuss real world use cases from inside and outside LinkedIn. The tutorial concludes with a discussion of future trends.

Deep Learning for Industrial AI: Challenges, New Methods and Best Practices

Industrial AI is concerned with the application of Artificial Intelligence (AI), Machine Learning (ML) and related technologies towards addressing real-world use cases in industrial and societal domains. These use cases can be broadly categorized into the horizontal areas of maintenance and repair, operations and supply chain, quality, safety, design, and end-to-end optimization - with applications in a variety of verticals. In the last few years, we have witnessed a growing interest in applying Deep Learning (DL) techniques to Industrial AI problems, ranging from using sequence models such as Long Short-Term Memory (LSTM) for predicting failures in equipment, to using Deep Reinforcement Learning (Deep RL) for scheduling and dispatching. Applying deep learning techniques to industrial applications imposes a set of unique challenges, which include, but are not limited to, (1) limited data, highly skewed class distribution and occurrence of rare classes such as failures, (2) multi-modal data (sensors, events, images, text, etc.) indexed over space and time, (3) the need for explainable decisions, (4) a need to attain consistency between different but "related" models and between multiple generations of the same model, and (5) decision making to optimize business outcomes where the cost of a mistake could be very high. This tutorial presents an overview of these challenges, along with new methods and best practices to address them. Examples of these methods include using sequence DL models and Functional Neural Networks (FNNs) for modeling sensor and spatiotemporal measurements; using multi-task learning, graph models and ensemble learning for improving consistency of DL models; using deep RL for health indicator learning and dynamic dispatching; cost-based decision making for prognostics; and using GANs for generating sensor data for prognostics. Finally, we will present some open problems in Industrial AI and how the research community can shape the future of the next industrial and societal revolution.

Embedding-Driven Multi-Dimensional Topic Mining and Text Analysis

People nowadays are immersed in a wealth of text data, ranging from news articles, to social media, academic publications, advertisements, and economic reports. A grand challenge of data mining is to develop effective, scalable and weakly-supervised methods for extracting actionable structures and knowledge from massive text data. Without requiring extensive and corpus-specific human annotations, these methods will satisfy people's diverse applications and needs for comprehending and making good use of large-scale corpora.

In this tutorial, we will introduce recent advances in text embeddings and their applications to a wide range of text mining tasks that facilitate multi-dimensional analysis of massive text corpora. Specifically, we first overview a set of recently developed unsupervised and weakly-supervised text embedding methods including state-of-the-art context-free embeddings and pre-trained language models that serve as the fundamentals for downstream tasks. We then present several embedding-driven text mining techniques that are weakly-supervised, domain-independent, language-agnostic, effective and scalable for mining and discovering structured knowledge, in the form of multi-dimensional topics and multi-faceted taxonomies, from large-scale text corpora. We finally show that the topics and taxonomies so discovered will naturally form a multi-dimensional TextCube structure, which greatly enhances text exploration and analysis for various important applications, including text classification, retrieval and summarization. We will demonstrate on the most recent real-world datasets (including political news articles as well as scientific publications related to the coronavirus) how multi-dimensional analysis of massive text corpora can be conducted with the introduced embedding-driven text mining techniques.

Learning by Exploration: New Challenges in Real-World Environments

Learning is a predominant theme for any intelligent system, human or machine. Moving beyond the classical paradigm of learning from past experience, e.g., offline supervised learning from given labels, a learner needs to actively collect exploratory feedback to learn from the unknowns, i.e., learning through exploration. This tutorial will introduce the learning by exploration paradigm, which is the key ingredient in many interactive online learning problems, including the multi-armed bandit and, more generally, reinforcement learning problems.

In this tutorial, we will first motivate the need for exploration in machine learning algorithms and highlight its importance in many real-world problems where online sequential decision making is involved. In real-world application scenarios, considerable challenges arise in such a learning problem, including sample complexity, costly and even outdated feedback, and ethical considerations of exploration (such as fairness and privacy). We will introduce several classical exploration strategies, then highlight these three fundamental challenges in the learning by exploration paradigm and introduce recent research developments addressing each of them.
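One classical exploration strategy for the multi-armed bandit setting is epsilon-greedy: with a small probability the learner tries a random arm, and otherwise it plays the arm with the best estimated reward so far. The sketch below simulates it on Bernoulli arms; the arm means, step count, epsilon, and the function name `epsilon_greedy` are illustrative assumptions, not material from the tutorial.

```python
import random


def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy on Bernoulli arms with the given success rates."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    n_arms = len(true_means)
    counts = [0] * n_arms       # number of pulls per arm
    estimates = [0.0] * n_arms  # running mean reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return counts, estimates, total_reward


counts, estimates, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
```

The exploration rate makes the sample-complexity trade-off visible: a larger epsilon finds the best arm sooner but keeps paying for random pulls afterwards, while a smaller epsilon risks staying on a suboptimal arm for longer.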

Image and Video Understanding for Recommendation and Spam Detection Systems

Image and video-based content has become ever present in a variety of domains like news, entertainment and education. Users typically discover and engage with content via search and recommendation systems. It is also important to serve high quality data to users by filtering out irrelevant or harmful content. Thus, there is an increasing need to leverage the rich information in image and video content in order to power systems for search and recommendation. At the same time, the effectiveness and efficiency of these systems have been accelerated by the availability of large-scale labeled datasets and sophisticated deep learning-based models.

This tutorial is aimed at providing an overview of image and video understanding, and their practical applications in the industry. We focus on deep learning-based state of the art techniques for image and video understanding. This includes tasks like image classification and segmentation, image-based content retrieval and video classification. We also focus on applications of these technologies to large-scale recommendation and low quality content detection systems. We present concrete examples from various LinkedIn production systems, and also discuss associated practical challenges.

Data-Driven Never-Ending Learning Question Answering Systems

This tutorial focuses on how to build Question Answering (QA) systems based on the Never-Ending Learning (NEL) approach. NEL systems can be roughly described as computer systems that learn over time to become better at solving a specific task. Different NEL approaches have been proposed and applied in different tasks and domains. Recent advances encourage us to keep addressing the problem of how to build computer systems that can take advantage of NEL principles. Considering that it is not always straightforward to apply NEL principles to ML models, this tutorial guides the audience (with hands-on examples and supporting theory, algorithms and models) on how to model a system in a NEL fashion and intends to help the KDD community become familiar with such approaches. Question Answering is chosen as the application domain mainly because of the relevance of the topic (QA) to the KDD and AI communities in general.

SESSION: Diversity and Inclusion Abstracts

How Can Computer Science Education Address Inequities

The 2019 global pandemic and the social protests in support of Black Lives Matter have made it clear that society has yet to eradicate systemic racism. Can computing education help? Academics, particularly in STEM fields, are often shielded from these conversations as we think they belong in a social science classroom. The effect of the pandemic and the protests has made it abundantly clear that we can no longer be apathetic about systemic issues that impact equity in society. Based on personal experience and on years of participation in Broadening Participation in Computing (BPC) efforts, the author suggests actions that CS departments can take to fight inequity. These can be classified into several broad categories: (a) provide more support and opportunities to students from underrepresented groups, (b) encourage faculty to become active participants in addressing inequity, (c) update the computing curriculum to be more inclusive and culturally responsive, and (d) evolve the departmental infrastructure to manage diversity, equity and inclusion.

Diversity and Inclusion, a Perspective from a Four Years MSI Faculty Member

This presentation provides information on the path taken by a faculty member at a Minority Serving Institution (MSI). She presents the difficulties and opportunities that she has encountered and how, as a role model, she has managed to improve the quality of life of her students.

Knowing the distribution of the population in computer areas (by race, gender, and other characteristics) is relevant to be able to introduce the concept of diversity, equality, and inclusion and to establish basic strategies to improve these characteristics in the work and study environments for minority populations.

On the other hand, it is important to know the types of bias and how this issue has been addressed (successfully and unsuccessfully) in companies related to computing and the impact it generates on diversity, inclusion, and equity.

Finally, we will briefly mention how the issue of diversity, inclusion, and bias impacts research results, particularly in data science.

CoRE Lab - An Effort to Engage College Hispanic Students in STEM

According to the PCAST 2012 report [1], from 2012 to 2021 the number of STEM graduates must be increased by 1 million in order to meet the nation's workforce needs. By 2018, the Hispanic community comprised only six percent of the US workforce in STEM [2]. This scenario presents a continuous challenge for Hispanic Serving Institutions (HSIs).

The Computing Research and Engineering Lab (CoRE Lab) was created as an initiative to address the lack of engagement of Computer Engineering students at Inter American University of Puerto Rico - Bayamon campus, an HSI.

We at CoRE Lab used the participation of students in STEM-related competitions as a hook to create a coworking space to engage students. The space turned challenges into opportunities to develop students' skills and increase their engagement. First, a relatively large number of students (25+) working in the same physical space was used to develop leadership skills; leaders were assigned to different sub-teams according to the requirements of the different projects. Second, students from different levels converging at CoRE Lab presented a diversity challenge. This was turned into an opportunity to practice mentoring between students. Third, engineering students working on one or two projects presented a time management challenge. This situation was used to develop time management skills. Students were required to create schedules and to generate periodic reports on their progress. In addition, and with the support of the Computing Alliance of Hispanic Serving Institutions (CAHSI), the CoRE Lab space was the ideal place to successfully implement an evidence-based practice, Peer-Led Team Learning (PLTL).

Finally, a key factor in the success of CoRE Lab was the dynamic of the interaction in the group. The space was recognized by the students as Their space, adding a sense of belonging to the space. It is recognized as a place where students can work on their projects, study, and share while improving their academic and professional skills.

The success of this effort can be seen through different factors three years after the first project: first, the number of students participating in extracurricular projects went from six to twenty-seven; second, the number of students participating in summer internships went from seven to seventeen; third, the full implementation of the PLTL program for CS1 was consolidated with eight volunteer students, impacting over fifty computer engineering freshmen in the last academic year.

Support for Diverse Students

This talk is intended to motivate diverse students in their study of computer science and engineering. Prof. Jiménez discusses the path he took to become a computer science professor. He describes his efforts promoting women and under-represented minorities in computer science and engineering. He ends with some advice to diverse students.

Broadening Participation in Technology Policy

Those who work in politics and policy are often unprepared to address the issues that current technology developments have created. A glaring example is the congressional hearings with Facebook, where members of Congress asked embarrassingly fundamental questions and were not able to get to the heart of the issue. Many members of Congress do not have a technology advisor on staff to assist in addressing technical issues. Not only does Congress lack technology experts on staff, but the few who are involved are not from underrepresented minority communities. This makes it even more difficult to adequately address issues that more directly impact communities of color, including predictive policing, voter suppression, and facial recognition.

The PhDx fellowship program was created by the Media Democracy Fund to bridge this gap. Underrepresented PhD students in STEM fields spend multiple summers working in various technology policy outlets including Upturn, Free Press, and the Leadership Conference on Civil Rights. In this talk, I will discuss my experience in the program, how it helped me to understand the various career paths I could take in the policy realm with a technical degree, and how I have been able to make a meaningful impact in the field for my community.

The Dark Side of Machine Learning Algorithms: How and Why They Can Leverage Bias, and What Can Be Done to Pursue Algorithmic Fairness

Machine learning and access to big data are revolutionizing the way many industries operate, providing analytics and automation to many aspects of real-world practical tasks that were previously thought to be necessarily manual. With the pervasiveness of artificial intelligence and machine learning over the past decade, and their epidemic spread in a variety of applications, algorithmic fairness has become a prominent open research problem. For instance, machine learning is used in courts to assess the probability that a defendant recommits a crime; in the medical domain to assist with diagnosis or predict predisposition to certain diseases; in social welfare systems; and in autonomous vehicles. The decision making processes in these real-world applications have a direct effect on people's lives, and can cause harm to society if the machine learning algorithms