KDD 2019 Proceedings
KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
SESSION: Keynote Talks
While the trend in machine learning has tended towards more complex hypothesis spaces, it is not clear that this extra complexity is always necessary or helpful for many domains. In particular, models and their predictions are often made easier to understand by adding interpretability constraints. These constraints shrink the hypothesis space; that is, they make the model simpler. Statistical learning theory suggests that generalization may be improved as a result as well. However, adding extra constraints can make optimization (exponentially) harder. For instance it is much easier in practice to create an accurate neural network than an accurate and sparse decision tree. We address the following question: Can we show that a simple-but-accurate machine learning model might exist for our problem, before actually finding it? If the answer is promising, it would then be worthwhile to solve the harder constrained optimization problem to find such a model. In this talk, I present an easy calculation to check for the possibility of a simpler model. This calculation indicates that simpler-but-accurate models do exist in practice more often than you might think. I then briefly overview several new methods for interpretable machine learning. These methods are for (i) sparse optimal decision trees, (ii) sparse linear integer models (also called medical scoring systems), and (iii) interpretable case-based reasoning in deep neural networks for computer vision.
Data and data analysis are widely assumed to be the key part of the solution to healthcare systems' problems. Indeed, there are countless ways in which data can be converted into better medical diagnostic tools, more effective therapeutics, and improved productivity for clinicians. But while there is clearly great potential, some big challenges remain to make this all a reality, including making access to health data easier, addressing privacy and ethics concerns, and ensuring the clinical safety of "learning" systems. This talk illustrates what is possible in healthcare technology, and details key challenges that currently prevent this from becoming a reality.
SESSION: Research Track Papers
We present a reformulation of the distance metric learning problem as a penalized optimization problem, with a penalty term corresponding to the von Neumann entropy of the distance metric. This formulation leads to a mapping to statistical mechanics such that the metric learning optimization problem becomes equivalent to free energy minimization. Correspondingly, our approach leads to an analytical solution of the optimization problem based on the Boltzmann distribution. The mapping established in this work suggests new approaches for dimensionality reduction and provides insights into the determination of optimal parameters for the penalty term. Furthermore, we demonstrate that the metric projects the data onto the direction of maximum dissimilarity with optimal and tunable separation between classes, and thus the transformation can be used for high-dimensional data visualization, classification, and clustering tasks. We benchmark our method against previous distance learning methods and provide an efficient implementation in an R package available to download at: https://github.com/kouroshz/fenn
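The link between an entropy penalty and the Boltzmann distribution can be illustrated with the standard free-energy variational problem. The abstract does not state the exact objective, so the symbols below are assumptions made for illustration only: a symmetric cost matrix $A$ with eigenpairs $(\lambda_i, u_i)$, a unit-trace positive semidefinite metric $\rho$, and a temperature-like penalty weight $T > 0$.

```latex
\begin{aligned}
\min_{\rho \succeq 0,\ \operatorname{Tr}\rho = 1}\; F(\rho)
  &= \operatorname{Tr}(\rho A) - T\,S(\rho),
  \qquad S(\rho) = -\operatorname{Tr}(\rho \ln \rho), \\
\rho^{*}
  &= \frac{e^{-A/T}}{\operatorname{Tr} e^{-A/T}}
   = \sum_i p_i\, u_i u_i^{\top},
  \qquad p_i = \frac{e^{-\lambda_i/T}}{\sum_j e^{-\lambda_j/T}} .
\end{aligned}
```

In this generic form, a small $T$ concentrates the weights $p_i$ on the eigenvectors with the smallest $\lambda_i$, while a large $T$ spreads them toward the uniform distribution, which is one way to read the role of a tunable penalty parameter.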
The understanding of job mobility can benefit talent management operations in a number of ways, such as talent recruitment, talent development, and talent retention. While there is extensive literature showing the predictability of the organization-level job mobility patterns (e.g., in terms of the employee turnover rate), there are no effective solutions for supporting the understanding of job mobility at an individual level. To this end, in this paper, we propose a hierarchical career-path-aware neural network for learning individual-level job mobility. Specifically, we aim at answering two questions related to individuals in their career paths: 1) who will be the next employer? 2) how long will the individual work in the new position? Specifically, our model exploits a hierarchical neural network structure with embedded attention mechanism for characterizing the internal and external job mobility. Also, it takes personal profile information into consideration in the learning process. Finally, the extensive results on real-world data show that the proposed model can lead to significant improvements in prediction accuracy for the two aforementioned prediction problems. Moreover, we show that the above two questions are well addressed by our model with a certain level of interpretability. For the case studies, we provide data-driven evidence showing interesting patterns associated with various factors (e.g., job duration, firm type, etc.) in the job mobility prediction process.
Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating Jaccard similarity of sets and has been successfully used for many applications such as similarity search and large-scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). Although MinHash can be applied to static sets as well as streaming sets, whose elements arrive in a streaming fashion and whose cardinality is unknown or even infinite, b-bit MinHash and Odd Sketch cannot handle streaming data. To solve this problem, we design a memory-efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller registers (each register consists of fewer than 7 bits) to build a compact sketch for each set. We also provide a simple yet accurate estimator for inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive formulas for bounding the estimation error and determine the smallest necessary memory usage (i.e., the number of registers used for a MaxLogHash sketch) for the desired accuracy. We conduct experiments on a variety of datasets, and experimental results show that our method MaxLogHash is about 5 times more memory efficient than MinHash with the same accuracy and computational cost for estimating high similarities.
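For readers unfamiliar with the baseline, here is a minimal sketch of plain MinHash, not of MaxLogHash. It assumes seeded BLAKE2b digests as a stand-in for a family of random hash functions; the function names and the toy sets are hypothetical.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    # Deterministic 64-bit hash of (seed, item); an assumed stand-in
    # for one member of a random hash family.
    digest = hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=256):
    """Keep the minimum hash value per hash function."""
    signature = [2**64] * num_hashes
    for item in items:  # single pass, so elements may arrive as a stream
        for k in range(num_hashes):
            h = _hash(k, item)
            if h < signature[k]:
                signature[k] = h
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of registers on which the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


set_a = {f"item{i}" for i in range(0, 100)}
set_b = {f"item{i}" for i in range(10, 110)}
true_jaccard = len(set_a & set_b) / len(set_a | set_b)  # 90 / 110
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(f"true={true_jaccard:.3f} estimate={estimate:.3f}")
```

Each register here stores a full 64-bit value, which is the memory cost that compressed sketches aim to reduce.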
Deep neural network based transfer learning has been widely used to leverage information from a domain with rich data to help a domain with insufficient data. When the source data distribution is different from the target data, transferring knowledge between these domains may lead to negative transfer. To mitigate this problem, a typical way is to select useful source domain data for transferring. However, limited studies focus on selecting high-quality source data to help neural network based transfer learning. To bridge this gap, we propose a general Minimax Game based model for selective Transfer Learning (MGTL). More specifically, we build a selector, a discriminator and a TL module in the proposed method. The discriminator aims to maximize the differences between selected source data and target data, while the selector acts as an attacker that selects source data close to the target data to minimize the differences. The TL module trains on the selected data and provides rewards to guide the selector. These three modules play a minimax game to help select useful source data for transferring. Our method is also shown to speed up the training process of the learning task in the target domain compared with traditional TL methods. To the best of our knowledge, this is the first work to build a minimax game based model for selective transfer learning. To examine the generality of our method, we evaluate it on two different tasks: item recommendation and text retrieval. Extensive experiments over both public and real-world datasets demonstrate that our model outperforms the competing methods by a large margin. Meanwhile, the quantitative evaluation shows that our model can select data which are close to the target data. Our model is also deployed in a real-world system, and significant improvement over the baselines is observed.
We consider the problem of localizing a submatrix with larger-than-usual entry values inside a data matrix, without the prior knowledge of the submatrix size. We establish an optimization framework based on a multiscale scan statistic, and develop algorithms in order to approach the optimizer. We also show that our estimator only requires a signal strength of the same order as the minimax estimator with oracle knowledge of the submatrix size, to exactly recover the anomaly with high probability. We perform some simulations that show that our estimator has superior performance compared to other estimators which do not require prior submatrix knowledge, while being comparatively faster to compute.
Machine learning applications are often plagued with confounders that can impact the generalizability of the learners. In clinical settings, demographic characteristics often play the role of confounders. Confounding is especially problematic in remote digital health studies where the participants self-select to enter the study, thereby making it difficult to balance the demographic characteristics of participants. One effective approach to combat confounding is to match samples with respect to the confounding variables in order to improve the balance of the data. This procedure, however, leads to smaller datasets and hence negatively impacts the inferences drawn from the learners. Alternatively, confounding adjustment methods that make more efficient use of the data (such as inverse probability weighting) usually rely on modeling assumptions, and it is unclear how robust these methods are to violations of these assumptions. Here, instead of proposing a new method to control for confounding, we develop novel permutation-based statistical tools to detect and quantify the influence of observed confounders, and estimate the unconfounded performance of the learner. Our tools can be used to evaluate the effectiveness of existing confounding adjustment methods. We evaluate the statistical properties of our methods in a simulation study, and illustrate their application using real-life data from a Parkinson's disease mobile health study collected in an uncontrolled environment.
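One generic way a permutation test can account for an observed confounder is to shuffle labels only within confounder strata, so the label-confounder association is kept while any remaining label-prediction association is broken. The sketch below illustrates that general idea under stated assumptions and is not necessarily the procedure developed in the paper; the function names and toy data are hypothetical.

```python
import random
from collections import defaultdict


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def restricted_shuffle(labels, strata, rng):
    """Shuffle labels only among samples that share a confounder value."""
    by_stratum = defaultdict(list)
    for idx, s in enumerate(strata):
        by_stratum[s].append(idx)
    shuffled = list(labels)
    for indices in by_stratum.values():
        values = [labels[i] for i in indices]
        rng.shuffle(values)
        for i, v in zip(indices, values):
            shuffled[i] = v
    return shuffled


def permutation_p_value(y_true, y_pred, strata, n_perm=1000, seed=0):
    """P-value for 'accuracy exceeds what the confounder alone explains'."""
    rng = random.Random(seed)
    observed = accuracy(y_true, y_pred)
    hits = 0
    for _ in range(n_perm):
        permuted = restricted_shuffle(y_true, strata, rng)
        if accuracy(permuted, y_pred) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: the "learner" predicts the label from the age group alone.
strata = ["young"] * 10 + ["old"] * 10
y_true = [1] * 8 + [0] * 2 + [0] * 8 + [1] * 2
y_pred = [1] * 10 + [0] * 10
p_value = permutation_p_value(y_true, y_pred, strata)
print(accuracy(y_true, y_pred), p_value)
```

In this toy case the accuracy is 0.8, yet every within-stratum permutation reproduces it, so the p-value is 1: the apparent performance is fully explained by the confounder.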
Representation learning on graphs, also called graph embedding, has demonstrated its significant impact on a series of machine learning applications such as classification, prediction and recommendation. However, existing work has largely ignored the rich information contained in the properties (or attributes) of both nodes and edges of graphs in modern applications, e.g., those represented by property graphs. To date, most existing graph embedding methods either focus on plain graphs with only the graph topology, or consider properties on nodes only. We propose PGE, a graph representation learning framework that incorporates both node and edge properties into the graph embedding procedure. PGE uses node clustering to assign biases to differentiate neighbors of a node and leverages multiple data-driven matrices to aggregate the property information of neighbors sampled based on a biased strategy. PGE adopts the popular inductive model for neighborhood aggregation. We provide detailed analyses on the efficacy of our method and validate the performance of PGE by showing how PGE achieves better embedding results than the state-of-the-art graph embedding methods on benchmark applications such as node classification and link prediction over real-world datasets.
Recent years have witnessed growing interest in developing deep models for incremental learning. However, existing approaches often rely on a fixed structure and online backpropagation for deep model optimization, which are difficult to apply in incremental data scenarios. Indeed, for streaming data, there are two main challenges for building deep incremental models. First, there is a requirement to develop deep incremental models with Capacity Scalability. In other words, the entire training data are not available before learning the task. It is a challenge to make the deep model structure scale with streaming data for flexible model evolution and faster convergence. Second, since the stream data distribution usually changes in nature (concept drift), there is a constraint for Capacity Sustainability. That is, the model must be updated while preserving previous knowledge in order to overcome catastrophic forgetting. To this end, in this paper, we develop an incremental adaptive deep model (IADM) for dealing with the above two capacity challenges in real-world incremental data scenarios. Specifically, IADM provides an extra attention model for the hidden layers, which aims to learn deep models with adaptive depth from streaming data and enables capacity scalability. Also, we address capacity sustainability by exploiting the attention-based Fisher information matrix, which in turn can prevent forgetting. Finally, we conduct extensive experiments on real-world data and show that IADM outperforms the state-of-the-art methods by a substantial margin. Moreover, we show that IADM has better capacity scalability and sustainability in incremental learning scenarios.
Partial label learning aims to induce a multi-class classifier from training examples, each of which is associated with a set of candidate labels, among which only one is the ground-truth label. The common strategy to train a predictive model is disambiguation, i.e., differentiating the modeling outputs of individual candidate labels so as to recover ground-truth labeling information. Recently, feature-aware disambiguation was proposed to generate different labeling confidences over the candidate label set by utilizing the graph structure of the feature space. However, the existence of noise and outliers in training data makes the similarity derived from original features less reliable. To this end, we propose a novel approach for partial label learning based on adaptive graph guided disambiguation (PL-AGGD). Compared with a fixed graph, an adaptive graph can reveal the intrinsic manifold structure within the data more robustly and accurately. Moreover, instead of the two-stage strategy in previous algorithms, our approach performs label disambiguation and predictive model training simultaneously. Specifically, we present a unified framework which jointly optimizes the ground-truth labeling confidences, similarity graph and model parameters to achieve strong generalization performance. Extensive experiments show that PL-AGGD performs favorably against state-of-the-art partial label learning approaches.
Attributed networks are pervasive in numerous high-impact domains. As opposed to conventional plain networks where only pairwise node dependencies are observed, both the network topology and node attribute information are readily available on attributed networks. More often than not, the nodal attributes are depicted in a high-dimensional feature space and are therefore notoriously difficult to tackle due to the curse of dimensionality. Additionally, features that are irrelevant to the network structure could hinder the discovery of actionable patterns from attributed networks. Hence, it is important to leverage feature selection to find a high-quality feature subset that is tightly correlated to the network structure. The few existing efforts either model the network structure at a macro-level by community analysis or directly make use of the binary relations. Consequently, they fail to exploit the finer-grained tie strength information for feature selection and may lead to suboptimal results. Motivated by findings in sociology, in this work, we investigate how to harness the tie strength information embedded in the network structure to facilitate the selection of relevant nodal attributes. Methodologically, we propose a principled unsupervised feature selection framework, ADAPT, to find informative features that can be used to regenerate the observed links and further characterize the adaptive neighborhood structure of the network. Meanwhile, an effective optimization algorithm for the proposed ADAPT framework is also presented. Extensive experimental studies on various real-world attributed networks validate the superiority of the proposed ADAPT framework.
Early classification of time series is the prediction of the class label of a time series before it is observed in its entirety. In time-sensitive domains where information is collected over time it is worth sacrificing some classification accuracy in favor of earlier predictions, ideally early enough for actions to be taken. However, since accuracy and earliness are contradictory objectives, a solution must address this challenge to discover task-dependent trade-offs. We design an early classification model, called EARLIEST, which tackles this multi-objective optimization problem, jointly learning (1) to classify time series and (2) at which timestep to halt and generate this prediction. By learning the objectives together, we achieve a user-controlled balance between these contradictory goals while capturing their natural relationship. Our model consists of the novel pairing of a recurrent discriminator network with a stochastic policy network, with the latter learning a halting-policy as a reinforcement learning task. The learned policy interprets representations generated by the recurrent model and controls its dynamics, sequentially deciding whether or not to request observations from future timesteps. For a rich variety of datasets (four synthetic and three real-world), we demonstrate that EARLIEST consistently out-performs state-of-the-art alternatives in accuracy and earliness while discovering signal locations without supervision.
Alternating Direction Method of Multipliers (ADMM) has been used successfully in many conventional machine learning applications and is considered to be a useful alternative to Stochastic Gradient Descent (SGD) as a deep learning optimizer. However, as an emerging domain, several challenges remain, including 1) The lack of global convergence guarantees, 2) Slow convergence towards solutions, and 3) Cubic time complexity with regard to feature dimensions. In this paper, we propose a novel optimization framework for deep learning via ADMM (dlADMM) to address these challenges simultaneously. The parameters in each layer are updated backward and then forward so that the parameter information in each layer is exchanged efficiently. The time complexity is reduced from cubic to quadratic in (latent) feature dimensions via a dedicated algorithm design for subproblems that enhances them utilizing iterative quadratic approximations and backtracking. Finally, we provide the first proof of global convergence for an ADMM-based method (dlADMM) in a deep neural network problem under mild conditions. Experiments on benchmark datasets demonstrated that our proposed dlADMM algorithm outperforms most of the comparison methods.
Network embedding, which aims to represent network data in a low-dimensional space, has been commonly adopted for analyzing heterogeneous information networks (HIN). Although existing HIN embedding methods have achieved performance improvement to some extent, they still face a few major weaknesses. Most importantly, they usually adopt negative sampling to randomly select nodes from the network, and they do not learn the underlying distribution for more robust embedding. Inspired by generative adversarial networks (GAN), we develop a novel framework HeGAN for HIN embedding, which trains both a discriminator and a generator in a minimax game. Compared to existing HIN embedding methods, our generator would learn the node distribution to generate better negative samples. Compared to GANs on homogeneous networks, our discriminator and generator are designed to be relation-aware in order to capture the rich semantics on HINs. Furthermore, towards more effective and efficient sampling, we propose a generalized generator, which samples "latent" nodes directly from a continuous distribution, not confined to the nodes in the original network as existing methods are. Finally, we conduct extensive experiments on four real-world datasets. Results show that we consistently and significantly outperform state-of-the-art baselines across all datasets and tasks.
Mobile user profiles are a summary of characteristics of user-specific mobile activities. Mobile user profiling is to extract a user's interest and behavioral patterns from mobile behavioral data. While some efforts have been made for mobile user profiling, existing methods can be improved via representation learning with awareness of substructures in users' behavioral graphs. Specifically, in this paper, we study the problem of mobile user profiling with POI check-in data. To this end, we first construct a graph, where a vertex is a POI category and an edge is the transition frequency of a user between two POI categories, to represent each user. We then formulate mobile user profiling as a task of representation learning from user behavioral graphs. We later develop a deep adversarial substructured learning framework for the task. This framework has two mutually-enhanced components. The first component is to preserve the structure of the entire graph, which is formulated as an encoding-decoding paradigm. In particular, the structure of the entire graph is preserved by minimizing reconstruction loss between an original graph and a reconstructed graph. The second component is to preserve the structure of subgraphs, which is formulated as a substructure detector based adversarial training paradigm. In particular, this paradigm includes a substructure detector and an adversarial trainer. Instead of using non-differentiable substructure detection algorithms, we pre-train a differentiable convolutional neural network as the detector to approximate these detection algorithms. The adversarial trainer is to match the detected substructure of the reconstructed graph to the detected substructure of the original graph. Also, we provide an effective solution for the optimization problems. Moreover, we exploit the learned representations of users for the next activity type prediction. Finally, we present extensive experimental results to demonstrate the improved performances of the proposed method.
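The per-user graph described above (vertices are POI categories, edge weights are transition frequencies) can be sketched in a few lines. This is a minimal illustration assuming transitions are directed and counted between consecutive check-ins; the function name and the category names are hypothetical.

```python
from collections import Counter


def build_behavior_graph(checkins):
    """Return (vertices, edge_weights) for one user's ordered POI categories.

    edge_weights maps (from_category, to_category) to how often the user
    moved directly from one category to the other.
    """
    vertices = set(checkins)
    edges = Counter(zip(checkins, checkins[1:]))
    return vertices, edges


checkins = ["home", "cafe", "office", "cafe", "office", "gym", "home"]
vertices, edges = build_behavior_graph(checkins)
for (src, dst), weight in sorted(edges.items()):
    print(f"{src} -> {dst}: {weight}")
```

A weighted adjacency matrix built from `edges` would be one natural input for an encoder-decoder that reconstructs the graph.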
Semi-supervised learning is sought for leveraging unlabelled data when labelled data is difficult or expensive to acquire. Deep generative models (e.g., Variational Autoencoder (VAE)) and semi-supervised Generative Adversarial Networks (GANs) have recently shown promising performance in semi-supervised classification owing to their excellent discriminative representation ability. However, the latent code learned by the traditional VAE is not exclusive (repeatable) for a specific input sample, which prevents it from achieving excellent classification performance. In particular, the learned latent representation depends on a non-exclusive component which is stochastically sampled from the prior distribution. Moreover, semi-supervised GAN models generate data from a pre-defined distribution (e.g., Gaussian noise) which is independent of the input data distribution; this may obstruct convergence and makes it difficult to control the distribution of the generated data. To address the aforementioned issues, we propose a novel Adversarial Variational Embedding (AVAE) framework for robust and effective semi-supervised learning that leverages the advantages of both GAN as a high-quality generative model and VAE as a posterior distribution learner. The proposed approach first produces an exclusive latent code by a model which we call VAE++, and meanwhile provides a meaningful prior distribution for the generator of the GAN. The proposed approach is evaluated over four different real-world applications, and we show that our method outperforms the state-of-the-art models, which confirms that the combination of VAE++ and GAN can provide significant improvements in semi-supervised classification.
We propose the first adversarially robust algorithm for monotone submodular maximization under single and multiple knapsack constraints with scalable implementations in distributed and streaming settings. For a single knapsack constraint, our algorithm outputs a robust summary of almost optimal (up to polylogarithmic factors) size, from which a constant-factor approximation to the optimal solution can be constructed. For multiple knapsack constraints, our approximation is within a constant-factor of the best known non-robust solution. We evaluate the performance of our algorithms by comparison to natural robustifications of existing non-robust algorithms under two objectives: 1) dominating set for large social network graphs from Facebook and Twitter collected by the Stanford Network Analysis Project (SNAP), 2) movie recommendations on a dataset from MovieLens. Experimental results show that our algorithms give the best objective for a majority of the inputs and show strong performance even compared to offline algorithms that are given the set of removals in advance.
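For context, the non-robust starting point for problems of this kind is often a cost-benefit greedy rule: repeatedly add the affordable item with the largest marginal gain per unit cost. The sketch below shows that heuristic on a toy coverage objective. It is not the robust algorithm from the paper, and it carries no approximation guarantee as written; the item names, costs, and budget are hypothetical.

```python
def coverage(selected, sets):
    """Monotone submodular objective: number of distinct elements covered."""
    covered = set()
    for name in selected:
        covered |= sets[name]
    return len(covered)


def density_greedy(sets, costs, budget):
    """Pick items by marginal gain per unit cost until nothing affordable helps."""
    chosen, spent = [], 0.0
    remaining = set(sets)
    while remaining:
        base = coverage(chosen, sets)
        best, best_ratio = None, 0.0
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + costs[name] > budget:
                continue  # would violate the knapsack constraint
            gain = coverage(chosen + [name], sets) - base
            ratio = gain / costs[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break
        chosen.append(best)
        spent += costs[best]
        remaining.discard(best)
    return chosen, spent


sets = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {6}, "d": {1, 2, 3, 4, 5, 6}}
costs = {"a": 2, "b": 1, "c": 1, "d": 5}
chosen, spent = density_greedy(sets, costs, budget=4)
print(chosen, spent, coverage(chosen, sets))
```

A robust variant would additionally need the returned summary to stay useful after an adversary removes some of its items, which this sketch does not attempt.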
Traditional recommender systems rely on user feedback such as ratings or clicks on items to analyze user interest and provide personalized recommendations. However, rating or click feedback is limited in that it does not tell exactly why users like or dislike an item. If a user does not like the recommendations and cannot effectively express the reasons via rating and clicking, the feedback from the user may be very sparse. These limitations lead to inefficient model learning of the recommender system. To address these limitations, more effective user feedback on the recommendations should be designed, so that the system can effectively understand a user's preference and improve the recommendations over time. In this paper, we propose a novel dialog-based recommender system to interactively recommend a list of items with visual appearance. At each round, the user receives a list of recommended items with visual appearance. The user can point to some items and describe their feedback in natural language, such as the features they want in the items. With this natural language based feedback, the recommender system updates and provides another list of items. To model the user behaviors of viewing, commenting and clicking on a list of items, we propose a visual dialog augmented cascade model. To efficiently understand the user preference and learn the model, exploration should be encouraged to provide more diverse recommendations and quickly collect user feedback on more attributes of the items. We propose a variant of the cascading bandits, where the neural representations of the item images and user feedback in natural language are utilized. In a task of recommending a list of footwear, we show that our visual dialog augmented interactive recommender needs around 41.03% of the rounds of recommendations required by the traditional interactive recommender that relies only on user click behavior.
We propose a model-based metric to estimate the factual accuracy of generated text that is complementary to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We introduce and release a new large-scale dataset based on Wikipedia and Wikidata to train relation classifiers and end-to-end fact extraction models. The end-to-end models are shown to be able to extract complete sets of facts from datasets with full pages of text. We then analyse multiple models that estimate factual accuracy on a Wikipedia text summarization task, and show their efficacy compared to ROUGE and other model-free variants by conducting a human evaluation study.
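To see how a fact-based score differs from n-gram overlap, consider comparing sets of extracted (subject, relation, object) triples instead of word sequences. The sketch below assumes the triples have already been extracted by some model and simply computes the share of generated facts that are supported by the reference. It is an illustration of the general idea, not the metric defined in the paper; the function name and example triples are hypothetical.

```python
def factual_precision(generated_facts, reference_facts):
    """Share of generated (subject, relation, object) triples found in the reference."""
    generated = set(generated_facts)
    if not generated:
        return 0.0  # convention chosen here: no extracted facts scores zero
    return len(generated & set(reference_facts)) / len(generated)


reference = {
    ("Paris", "capital_of", "France"),
    ("Paris", "located_in", "Europe"),
}
generated = {
    ("Paris", "capital_of", "France"),
    ("Paris", "capital_of", "Italy"),  # fluent but wrong
}
score = factual_precision(generated, reference)
print(score)
```

A sentence stating the wrong country could still share most of its n-grams with the reference, which is why a triple-level comparison can complement ROUGE or BLEU.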
Visualization of high-dimensional data is a fundamental yet challenging problem in data mining. Visualization techniques are commonly used to reveal patterns in high-dimensional data, such as clusters and the similarity among clusters. Recently, some successful visualization tools (e.g., BH-t-SNE and LargeVis) have been developed. However, they have two limitations: (1) They cannot capture the global data structure well. Thus, their visualization results are sensitive to initialization, which may cause confusion in data analysis. (2) They cannot scale to large-scale datasets. They are not suitable for implementation on the GPU platform because their complex algorithm logic, high memory cost, and random memory access mode lead to low hardware utilization. To address the aforementioned problems, we propose a novel visualization approach named Anchor-t-SNE (AtSNE), which provides an efficient GPU-based visualization solution for large-scale and high-dimensional data. Specifically, we generate a number of anchor points from the original data and regard them as the skeleton of the layout, which holds the global structure information. We propose a hierarchical optimization approach to optimize the positions of the anchor points and ordinary data points in the layout simultaneously. Our approach presents much better and more robust visual effects on 11 public datasets, and achieves a 5 to 28 times speed-up on different datasets, compared with the current state-of-the-art methods. In particular, we deliver a high-quality 2-D layout for a 20-million-point, 96-dimensional dataset within 5 hours, while the current methods fail to give results because they run out of memory.
Backbones refer to critical tree structures that span a set of nodes of interest in networks. This paper introduces a novel class of attributed backbones and detection algorithms in richly attributed networks. Unlike conventional models, attributed backbones capture dynamics in the edge cost model: the model specifies affinitive attributes for each edge, and the cost of each edge is dynamically determined by the selection of its associated affinitive attributes and the closeness of their values at its end nodes. Backbone discovery is to compute an attributed backbone that covers the nodes of interest with the smallest connection cost, dynamically determined by the selected affinitive attributes. While this problem is hard to approximate, we develop feasible algorithms within practical reach for large attributed networks. (1) We show that this problem is fixed-parameter approximable parameterized by the number of affinitive attributes, by providing a Lagrangean-preserving 2-approximation. (2) When the number of attributes is large and specifying a closeness function is difficult, we provide a fast heuristic, which learns an edge-generative model and applies the model to infer the best backbones, without the need to specify closeness functions. Using real-world networks, we verify the effectiveness and efficiency of our algorithms and show their applications in collaboration recommendation.
To help enforce data-protection regulations such as GDPR and detect unauthorized uses of personal data, we develop a new model auditing technique that helps users check if their data was used to train a machine learning model. We focus on auditing deep-learning models that generate natural-language text, including word prediction and dialog generation. These models are at the core of popular online services and are often trained on personal data such as users' messages, searches, chats, and comments. We design and evaluate a black-box auditing method that can detect, with very few queries to a model, if a particular user's texts were used to train it (among thousands of other users). We empirically show that our method can successfully audit well-generalized models that are not overfitted to the training data. We also analyze how text-generation models memorize word sequences and explain why this memorization makes them amenable to auditing.
Feature selection is the preprocessing step in machine learning that tries to select the most relevant features for the subsequent prediction task. Effective feature selection can help reduce dimensionality, improve prediction accuracy, and increase result comprehensibility. Finding the optimal feature subset is very challenging because the subset space can be very large. While existing studies have made much effort, reinforcement learning can provide a new perspective by searching in a more global way. In this paper, we propose a multi-agent reinforcement learning framework for the feature selection problem. Specifically, we first reformulate feature selection within a reinforcement learning framework by regarding each feature as an agent. Then, we obtain the state of the environment in three ways, i.e., statistical description, autoencoder, and graph convolutional network (GCN), so that the algorithm better understands the learning progress. We show how to learn the state representation in a graph-based way, which can handle the case where not only the edges but also the nodes change step by step. In addition, we study how the coordination between different features can be improved by a more reasonable reward scheme. The proposed method can search the feature subset space globally and can be easily adapted to the real-time case (real-time feature selection) due to the nature of reinforcement learning. We also provide an efficient strategy to accelerate the convergence of multi-agent reinforcement learning. Finally, extensive experimental results show significant improvement of the proposed method over conventional approaches.
Network embedding (NE) aims to embed the nodes of a network into a vector space, and serves as the bridge between machine learning and network data. Despite their widespread success, NE algorithms typically contain a large number of hyperparameters for preserving the various network properties, which must be carefully tuned in order to achieve satisfactory performance. Though automated machine learning (AutoML) has achieved promising results when applied to many types of data such as images and texts, network data poses great challenges to AutoML and remains largely ignored by the AutoML literature. The biggest obstacle is the massive scale of real-world networks, along with the coupled node relationships that make any straightforward sampling strategy problematic. In this paper, we propose a novel framework, named AutoNE, to automatically optimize the hyperparameters of an NE algorithm on massive networks. In detail, we employ a multi-start random walk strategy to sample several small sub-networks, perform each trial of configuration selection on the sampled sub-networks, and design a meta-learner to transfer the knowledge about optimal hyperparameters from the sub-networks to the original massive network. The transferred meta-knowledge greatly reduces the number of trials required when predicting the optimal hyperparameters for the original network. Extensive experiments demonstrate that our framework can significantly outperform the existing methods, in that it needs less time and fewer trials to find the optimal hyperparameters.
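The sampling step described above can be sketched roughly as follows. This is only an illustration of a multi-start random walk over an adjacency-list graph, not the AutoNE implementation; the function name `sample_subnetwork` and the parameters `num_starts` and `walk_len` are hypothetical.

```python
import random

def sample_subnetwork(adj, num_starts, walk_len, rng):
    """Return the subgraph induced by nodes visited from several random walks.

    adj        : dict mapping node -> list of neighbor nodes (undirected graph)
    num_starts : number of distinct start nodes (hypothetical parameter)
    walk_len   : number of steps taken from each start (hypothetical parameter)
    """
    visited = set()
    for start in rng.sample(list(adj), num_starts):
        cur = start
        visited.add(cur)
        for _ in range(walk_len):
            nbrs = adj[cur]
            if not nbrs:          # dead end: stop this walk early
                break
            cur = rng.choice(nbrs)
            visited.add(cur)
    # Keep only edges whose two endpoints were both visited.
    return {u: [v for v in adj[u] if v in visited] for u in visited}

# Small path-like graph with a triangle at one end.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
sub = sample_subnetwork(adj, num_starts=2, walk_len=3, rng=random.Random(0))
```

Each sampled sub-network is small enough to run many hyperparameter trials cheaply; how the results are transferred back to the full network is the job of the meta-learner and is not shown here.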
Generalized additive models (GAMs) are favored in many regression and binary classification problems because they are able to fit complex, nonlinear functions while still remaining interpretable. In the first part of this paper, we generalize a state-of-the-art GAM learning algorithm based on boosted trees to the multiclass setting, showing that this multiclass algorithm outperforms existing GAM learning algorithms and sometimes matches the performance of full complexity models such as gradient boosted trees. In the second part, we turn our attention to the interpretability of GAMs in the multiclass setting. Surprisingly, the natural interpretability of GAMs breaks down when there are more than two classes. Naive interpretation of multiclass GAMs can lead to false conclusions. Inspired by binary GAMs, we identify two axioms that any additive model must satisfy in order to not be visually misleading. We then develop a technique called Additive Post-Processing for Interpretability (API) that provably transforms a pretrained additive model to satisfy the interpretability axioms without sacrificing accuracy. The technique works not just on models trained with our learning algorithm, but on any multiclass additive model, including multiclass linear and logistic regression. We demonstrate the effectiveness of API on a 12-class infant mortality dataset.
Beyond Personalization: Social Content Recommendation for Creator Equality and Consumer Satisfaction
Effective content recommendation on modern social media platforms should benefit both creators, by bringing them genuine benefits, and consumers, by helping them find content that truly interests them. In this paper, we propose a model called Social Explorative Attention Network (SEAN) for content recommendation. SEAN uses a personalized content recommendation model to encourage recommendations driven by personal interests. Moreover, SEAN allows the personalization factors to attend to users' higher-order friends on the social network to improve the accuracy and diversity of recommendation results. We construct two datasets from a popular decentralized content distribution platform, Steemit, and compare SEAN with state-of-the-art CF and content-based recommendation approaches. Experimental results demonstrate the effectiveness of SEAN in terms of both Gini coefficients for recommendation equality and F1 scores for recommendation performance.
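As a reminder of what a Gini coefficient measures in this setting, here is a minimal sketch using the standard sorted-values formula. Applying it to per-creator exposure counts is an assumption for illustration; the abstract does not say exactly which quantity SEAN's evaluation feeds into the coefficient.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly equal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = sum_i (2i - n - 1) * x_i / (n * sum(x)), with i = 1..n over sorted x.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)

# Hypothetical exposure counts per creator.
equal_exposure = gini([1, 1, 1, 1])      # every creator recommended equally
skewed_exposure = gini([0, 0, 0, 10])    # one creator gets all recommendations
```

A lower value indicates that recommendations are spread more evenly across creators.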
Recent works show that Graph Neural Networks (GNNs) are highly non-robust with respect to adversarial attacks on both the graph structure and the node attributes, making their outcomes unreliable. We propose the first method for certifiable (non-)robustness of graph convolutional networks with respect to perturbations of the node attributes. We consider the case of binary node attributes (e.g. bag-of-words) and perturbations that are L_0-bounded. If a node has been certified with our method, it is guaranteed to be robust under any possible perturbation given the attack model. Likewise, we can certify non-robustness. Finally, we propose a robust semi-supervised training procedure that treats the labeled and unlabeled nodes jointly. As shown in our experimental evaluation, our method significantly improves the robustness of the GNN with only minimal effect on the predictive accuracy.
Graph convolutional networks (GCNs) have been successfully applied to many graph-based applications; however, training a large-scale GCN remains challenging. Current SGD-based algorithms suffer from either a high computational cost that grows exponentially with the number of GCN layers, or a large space requirement for keeping the entire graph and the embedding of each node in memory. In this paper, we propose Cluster-GCN, a novel GCN algorithm that is suitable for SGD-based training by exploiting the graph clustering structure. Cluster-GCN works as follows: at each step, it samples a block of nodes associated with a dense subgraph identified by a graph clustering algorithm, and restricts the neighborhood search to this subgraph. This simple but effective strategy leads to significantly improved memory and computational efficiency while achieving test accuracy comparable to previous algorithms. To test the scalability of our algorithm, we create a new Amazon2M dataset with 2 million nodes and 61 million edges, which is more than 5 times larger than the previous largest publicly available dataset (Reddit). For training a 3-layer GCN on this data, Cluster-GCN is faster than the previous state-of-the-art VR-GCN (1523 seconds vs. 1961 seconds) and uses much less memory (2.2GB vs. 11.2GB). Furthermore, for training a 4-layer GCN on this data, our algorithm can finish in around 36 minutes, while all existing GCN training algorithms fail to train due to out-of-memory issues. Finally, Cluster-GCN allows us to train much deeper GCNs without much time and memory overhead, which leads to improved prediction accuracy: using a 5-layer Cluster-GCN, we achieve a state-of-the-art test F1 score of 99.36 on the PPI dataset, while the previous best result was 98.71 by GaAN (Zhang et al., 2018).
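The batching idea can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the Cluster-GCN code: the cluster labels are supplied by hand rather than by a graph clustering algorithm, the adjacency matrix is dense, and `cluster_batches` and `propagate` are hypothetical names.

```python
import numpy as np

def cluster_batches(adj, cluster_of, rng):
    """Yield (node_ids, sub_adj) pairs, one per cluster, in random order.

    adj        : (n, n) dense 0/1 adjacency matrix (small demo only)
    cluster_of : length-n array of precomputed cluster labels (assumed input)
    """
    for c in rng.permutation(np.unique(cluster_of)):
        nodes = np.flatnonzero(cluster_of == c)
        # Keep only edges with both endpoints inside the cluster, so
        # neighborhood aggregation never leaves the sampled block.
        yield nodes, adj[np.ix_(nodes, nodes)]

def propagate(sub_adj, feats, weight):
    """One mean-aggregation, GCN-style layer restricted to the block."""
    a_hat = sub_adj + np.eye(sub_adj.shape[0])         # add self-loops
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)   # row-normalize
    return np.maximum(a_hat @ feats @ weight, 0.0)     # ReLU

# Two triangles joined by a single edge (2-3).
adj = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[u, v] = adj[v, u] = 1.0
cluster_of = np.array([0, 0, 0, 1, 1, 1])
batches = list(cluster_batches(adj, cluster_of, np.random.default_rng(0)))
```

Because each batch touches only one block of the adjacency matrix, memory use scales with the cluster size rather than with the full graph; the price is that between-cluster edges such as 2-3 are ignored within a batch.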
In this paper we consider clustering problems in which each point is endowed with a color. The goal is to cluster the points to minimize the classical clustering cost but with the additional constraint that no color is over-represented in any cluster. This problem is motivated by practical clustering settings, e.g., in clustering news articles where the color of an article is its source, it is preferable that no single news source dominates any cluster. For the most general version of this problem, we obtain an algorithm that has provable guarantees of performance; our algorithm is based on finding a fractional solution using a linear program and rounding the solution subsequently. For the special case of the problem where no color has an absolute majority in any cluster, we obtain a simpler combinatorial algorithm also with provable guarantees. Experiments on real-world data show that our algorithms are effective in finding good clusterings without over-representation.
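The constraint itself is easy to state in code. The sketch below only checks a given clustering for over-represented colors; it does not implement the paper's LP-rounding or combinatorial algorithms. The threshold `alpha` is a hypothetical parameter, with `alpha=0.5` corresponding to the "no absolute majority" special case.

```python
from collections import Counter

def over_represented(assignments, colors, alpha):
    """Return {cluster: [colors]} for colors whose share of a cluster exceeds alpha."""
    by_cluster = {}
    for cluster, color in zip(assignments, colors):
        by_cluster.setdefault(cluster, Counter())[color] += 1
    violations = {}
    for cluster, counts in by_cluster.items():
        size = sum(counts.values())
        bad = sorted(c for c, n in counts.items() if n > alpha * size)
        if bad:
            violations[cluster] = bad
    return violations

# Cluster 0 holds three "a" points and one "b"; cluster 1 is evenly split.
assignments = [0, 0, 0, 0, 1, 1, 1, 1]
colors = ["a", "a", "a", "b", "a", "b", "a", "b"]
result = over_represented(assignments, colors, alpha=0.5)
```

Here only cluster 0 is flagged, because color "a" makes up three quarters of it.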
Graph convolutional neural networks have attracted increasing attention in recent years. Unlike the standard convolutional neural network, graph convolutional neural networks perform the convolutional operation on graph data. Compared with generic data, graph data carry similarity information between different nodes. Thus, it is important to preserve this similarity information in the hidden layers of graph convolutional neural networks. However, existing works fail to do so, and it is challenging to force the hidden layers to preserve the similarity relationship. To address this issue, we propose a novel CRF layer for graph convolutional neural networks that encourages similar nodes to have similar hidden features. In this way, the similarity information can be preserved explicitly. In addition, the proposed CRF layer is easy to compute and optimize. Therefore, it can be easily inserted into existing graph convolutional neural networks to improve their performance. Finally, extensive experimental results verify the effectiveness of our proposed CRF layer.
Modern search engines increasingly incorporate tabular content, which consists of a set of entities, each augmented with a small set of facts. The facts can be obtained from multiple sources: an entity's knowledge base entry, the infobox on its Wikipedia page, or its row within a WebTable. Crucially, the informativeness of a fact depends not only on the entity but also on the specific context (e.g., the query). To the best of our knowledge, this paper is the first to study the problem of contextual fact ranking: given some entities and a context (i.e., a succinct natural language description), identify the most informative facts for the entities collectively within the context. We propose to contextually rank the facts by exploiting deep learning techniques. In particular, we develop pointwise and pairwise ranking models, using textual and statistical information for the given entities and context derived from their sources. We enhance the models by incorporating entity type information from an IsA (hypernym) database. We demonstrate that our approaches achieve better performance than state-of-the-art baselines in terms of MAP, NDCG, and recall. We further conduct user studies for two specific applications of contextual fact ranking, table synthesis and table compression, and show that our models can identify more informative facts than the baselines.
Concepts are often described in terms of positive integer-valued attributes that are organized in a hierarchy. For example, cities can be described in terms of how many places there are of various types (e.g. nightlife spots, residences, food venues), and these places are organized in a hierarchy (e.g. a Portuguese restaurant is a type of food venue). This hierarchy imposes particular constraints on the values of related attributes; for example, there cannot be more Portuguese restaurants than food venues. Moreover, knowing that a city has many food venues makes it less surprising that it also has many Portuguese restaurants, and vice versa. In the present paper, we attempt to characterize such concepts in terms of so-called contrastive antichains: particular kinds of subsets of their attributes and their values. We address the question of when a contrastive antichain is interesting, in the sense that it concisely describes the unique aspects of the concept while duly taking into account the known attribute dependencies implied by the hierarchy. Our approach is capable of accounting for previously identified contrastive antichains, making iterative mining possible. Besides the interestingness measure, we also present an algorithm that scales well in practice, and demonstrate the usefulness of the method in an extensive empirical results section.
Taxis and shared bikes bring great convenience to urban transportation. Much effort has been made to improve the efficiency of taxi services and bike-sharing systems by predicting the next-period pick-up or drop-off demand. Unlike existing research, this paper is motivated by the following two facts: 1) from a micro view, an observed spatial demand at any time slot can be decomposed as a combination of many hidden spatial demand bases; 2) from a macro view, the multiple transportation demands are strongly correlated with each other, both spatially and temporally. These two views have great potential to improve existing taxi and bike demand prediction methods. Along this line, this paper provides a novel Co-prediction method based on a Spatio-Temporal neural Network, namely CoST-Net. In particular, a deep convolutional neural network is constructed to decompose a spatial demand into a combination of hidden spatial demand bases. The combination weight vector is used as a representation of the decomposed spatial demand. Then, a heterogeneous Long Short-Term Memory (LSTM) network is proposed to integrate the states of multiple transportation demands and to model their dynamics jointly. Last, environmental features such as humidity and temperature are combined with the resulting overall hidden states to predict the multiple demands simultaneously. Experiments on real-world taxi and shared-bike demand data demonstrate the superiority of the proposed method over both classical and state-of-the-art transportation demand prediction methods.
Coresets are important tools to generate concise summaries of massive datasets for approximate analysis. A coreset is a small subset of points extracted from the original point set such that certain geometric properties are preserved with provable guarantees. This paper investigates the problem of maintaining a coreset to preserve the minimum enclosing ball (MEB) for a sliding window of points that are continuously updated in a data stream. Although the problem has been extensively studied in batch and append-only streaming settings, no efficient sliding-window solution is available yet. In this work, we first introduce an algorithm, called AOMEB, to build a coreset for MEB in an append-only stream. AOMEB improves the practical performance of the state-of-the-art algorithm while having the same approximation ratio. Furthermore, using AOMEB as a building block, we propose two novel algorithms, namely SWMEB and SWMEB+, to maintain coresets for MEB over the sliding window with constant approximation ratios. The proposed algorithms also support coresets for MEB in a reproducing kernel Hilbert space (RKHS). Finally, extensive experiments on real-world and synthetic datasets demonstrate that SWMEB and SWMEB+ achieve speedups of up to four orders of magnitude over the state-of-the-art batch algorithm while providing coresets for MEB with rather small errors compared to the optimal ones.
Low-rank tensor factorization has been widely used for many real-world tensor completion problems. While most existing factorization models assume a multilinear relationship between tensor entries and their corresponding factors, real-world tensors tend to have more complex interactions than multilinearity. Many recent works observe that multilinear models perform worse than nonlinear models. We identify one potential reason for this inferior performance: the nonlinearity inside the data obfuscates the underlying low-rank structure, so that the tensor appears to be high-rank. Solving this problem requires a model that simultaneously captures the complex interactions and preserves the low-rank structure. In addition, the model should be scalable and robust to missing observations in order to learn from large yet sparse real-world tensors. We propose a novel convolutional neural network (CNN) based model, named CoSTCo (Convolutional Sparse Tensor Completion). Our model leverages the expressive power of CNNs to model the complex interactions inside tensors, and their parameter sharing scheme to preserve the desired low-rank structure. CoSTCo is scalable as it does not involve computation- or memory-heavy tasks such as the Kronecker product. We conduct extensive experiments on several large, sparse real-world tensors, and the experimental results show that our model clearly outperforms both linear and nonlinear state-of-the-art tensor completion methods.
We focus on the problem of streaming recommender systems and explore novel collaborative filtering algorithms that handle data dynamicity and complexity in a streaming manner. Although deep neural networks have proven effective in recommendation tasks, there is a lack of exploration on integrating probabilistic models and deep architectures under streaming recommendation settings. Combining the complementary advantages of probabilistic models and deep neural networks could enhance both model effectiveness and the understanding of inference uncertainties. To bridge the gap, in this paper, we propose a Coupled Variational Recurrent Collaborative Filtering (CVRCF) framework based on the idea of Deep Bayesian Learning to handle the streaming recommendation problem. The framework combines stochastic processes and deep factorization models under a Bayesian paradigm to model the generation and evolution of users' preferences and items' popularities. To ensure efficient optimization and streaming updates, we further propose a sequential variational inference algorithm based on a cross variational recurrent neural network structure. Experimental results on three benchmark datasets demonstrate that the proposed framework performs favorably against the state-of-the-art methods in terms of both temporal dependency modeling and predictive accuracy. The learned latent variables also provide visualized interpretations for the evolution of temporal dynamics.
Despite the great success of many matrix factorization based collaborative filtering approaches, there is still much room for improvement in the recommender system field. One main obstacle is the cold-start and data sparseness problem, which requires better solutions. Recent studies have attempted to integrate review information into rating prediction. However, there are two main problems: (1) most existing works use a static and independent method to extract the latent feature representations of user and item reviews, ignoring the correlation between the latent features, which may fail to capture the preferences of users comprehensively; (2) there is no effective framework that unifies ratings and reviews. Therefore, we propose a novel dual attention mutual learning method between ratings and reviews for item recommendation, named DAML. Specifically, we utilize local and mutual attention of the convolutional neural network to jointly learn the features of reviews and enhance the interpretability of the proposed DAML model. Then the rating features and review features are integrated into a unified neural network model, and the higher-order nonlinear interactions of features are realized by neural factorization machines to complete the final rating prediction. Experiments on five real-world datasets show that DAML achieves significantly better rating prediction accuracy than the state-of-the-art methods. Furthermore, the attention mechanism can highlight the relevant information in reviews to increase the interpretability of rating prediction.
Although deep learning has been applied to successfully address many data mining problems, relatively limited work has been done on deep learning for anomaly detection. Existing deep anomaly detection methods, which focus on learning new feature representations to enable downstream anomaly detection methods, perform indirect optimization of anomaly scores, leading to data-inefficient learning and suboptimal anomaly scoring. Also, they are typically designed for unsupervised learning due to the lack of large-scale labeled anomaly data. As a result, it is difficult for them to leverage prior knowledge (e.g., a few labeled anomalies) when such information is available, as in many real-world anomaly detection applications. This paper introduces a novel anomaly detection framework and its instantiation to address these problems. Instead of representation learning, our method performs end-to-end learning of anomaly scores through neural deviation learning, in which we leverage a few (e.g., multiple to dozens) labeled anomalies and a prior probability to enforce statistically significant deviations of the anomaly scores of anomalies from those of normal data objects in the upper tail. Extensive results show that our method can be trained substantially more data-efficiently and achieves significantly better anomaly scoring than state-of-the-art competing methods.
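One way to read "enforcing deviations in the upper tail" is as a loss on a z-score against reference scores. The sketch below is an assumption-laden illustration, not the paper's exact formulation: the reference scores would in practice be drawn from a prior (for example a standard normal), and the name `deviation_loss` and the `margin` value are hypothetical.

```python
import statistics

def deviation_loss(score, label, ref_scores, margin=5.0):
    """Deviation-style loss for one example.

    score      : anomaly score produced by the scoring network
    label      : 1 for a labeled anomaly, 0 for (presumed) normal data
    ref_scores : reference scores standing in for samples from a prior
    margin     : required z-score deviation for anomalies (hypothetical value)
    """
    mu = statistics.mean(ref_scores)
    sigma = statistics.pstdev(ref_scores)
    dev = (score - mu) / sigma            # z-score relative to the reference
    if label == 1:
        return max(0.0, margin - dev)     # push anomalies into the upper tail
    return abs(dev)                       # keep normal scores near the mean

ref = [-1.0, 1.0]  # toy reference set with mean 0 and standard deviation 1
losses = [
    deviation_loss(0.0, 0, ref),  # normal point at the reference mean
    deviation_loss(2.0, 0, ref),  # normal point that drifted upward
    deviation_loss(2.0, 1, ref),  # anomaly not yet far enough in the tail
    deviation_loss(5.0, 1, ref),  # anomaly at the required margin
]
```

Because the loss acts on the score directly, a handful of labeled anomalies can shape the scoring function without a separate representation-learning stage.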
The emergence of real-time auctions in online advertising has drawn considerable attention to modeling the market competition, i.e., bid landscape forecasting. The problem is formulated as forecasting the probability distribution of the market price for each ad auction. Considering the censorship issue caused by the second-price auction mechanism, many researchers have devoted their efforts to bid landscape forecasting by incorporating survival analysis from the medical research field. However, most existing solutions focus on either counting-based statistics of segmented sample clusters, or learning a parameterized model based on heuristic assumptions about the distribution form. Moreover, they do not consider the sequential patterns of the features over the price space. In order to capture more sophisticated yet flexible patterns at a fine-grained level of the data, we propose a Deep Landscape Forecasting (DLF) model, which combines deep learning for probability distribution forecasting and survival analysis for censorship handling. Specifically, we utilize a recurrent neural network to flexibly model the conditional winning probability w.r.t. each bid price. Then we conduct the bid landscape forecasting through the probability chain rule with strict mathematical derivations. We optimize the model in an end-to-end manner by minimizing two negative likelihood losses with comprehensive motivations. Without any specific assumption about the distribution form of the bid landscape, our model shows great advantages over previous works in fitting various sophisticated market price distributions. In experiments on two large-scale real-world datasets, our model significantly outperforms the state-of-the-art solutions under various metrics.
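The chain-rule step can be illustrated with plain arithmetic. The sketch below assumes the price space is split into ordered buckets and that the network emits, for each bucket, the conditional probability that the market price falls there given that it exceeded all lower buckets. That reading of "conditional winning probability", and the name `bid_landscape`, are assumptions for illustration.

```python
def bid_landscape(hazards):
    """Turn per-bucket conditional probabilities into a price distribution.

    hazards[l] = P(market price in bucket l | price not in buckets 0..l-1)
    Returns (pmf, win_prob), where pmf[l] is P(market price in bucket l) and
    win_prob[l] is P(market price in buckets 0..l), i.e. the chance of winning
    when bidding at the top of bucket l.
    """
    pmf, win_prob = [], []
    survive = 1.0                      # P(price above all buckets seen so far)
    for h in hazards:
        pmf.append(survive * h)        # chain rule: survive so far, then land here
        survive *= 1.0 - h
        win_prob.append(1.0 - survive)
    return pmf, win_prob

# Hypothetical network outputs for three price buckets.
pmf, win_prob = bid_landscape([0.5, 0.5, 1.0])
```

No distributional form is imposed here: any sequence of conditional probabilities yields a valid distribution over the buckets once the last one equals 1.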
Predicting when and where events will occur in cities, like taxi pick-ups, crimes, and vehicle collisions, is a challenging and important problem with many applications in fields such as urban planning, transportation optimization and location-based marketing. Though many point processes have been proposed to model events in a continuous spatio-temporal space, none of them allow for the consideration of the rich contextual factors that affect event occurrence, such as weather, social activities, geographical characteristics, and traffic. In this paper, we propose DMPP (Deep Mixture Point Processes), a point process model for predicting spatio-temporal events with the use of rich contextual information; a key advance is its incorporation of the heterogeneous and high-dimensional context available in image and text data. Specifically, we design the intensity of our point process model as a mixture of kernels, where the mixture weights are modeled by a deep neural network. This formulation allows us to automatically learn the complex nonlinear effects of the contextual factors on event occurrence. At the same time, this formulation makes analytical integration over the intensity, which is required for point process estimation, tractable. We use real-world data sets from different domains to demonstrate that DMPP has better predictive performance than existing methods.
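The tractability claim can be made concrete with a short derivation. The notation is an assumption for illustration rather than the paper's own: x = (t, s) is a point in space-time, x_j are J representative points with context features z_j, f_θ is the neural network producing non-negative mixture weights, and k is a kernel whose integral is known in closed form (a Gaussian, for instance).

```latex
% Intensity as a mixture of kernels with network-valued weights
\lambda(x \mid \mathcal{D}) = \sum_{j=1}^{J} f_\theta(z_j)\, k(x, x_j),
\qquad f_\theta(z_j) \ge 0

% The weights do not depend on x, so the integral passes through the sum
\Lambda = \int_{\mathcal{X}} \lambda(x \mid \mathcal{D})\, dx
        = \sum_{j=1}^{J} f_\theta(z_j) \int_{\mathcal{X}} k(x, x_j)\, dx

% Standard point-process log-likelihood over observed events x_1, ..., x_N
\log L(\theta) = \sum_{i=1}^{N} \log \lambda(x_i \mid \mathcal{D}) - \Lambda
```

Since the network output multiplies the kernel instead of sitting inside the integral, the compensator Λ reduces to a weighted sum of kernel integrals and needs no numerical integration over the network.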
Online prediction has become one of the most essential tasks in many real-world applications. Two main characteristics of typical online prediction tasks are tabular input space and online data generation. Specifically, tabular input space indicates the existence of both sparse categorical features and dense numerical ones, while online data generation implies continuous task-generated data with a potentially dynamic distribution. Consequently, effective learning with tabular input space and fast adaptation to online data generation become two vital challenges for obtaining an online prediction model. Although Gradient Boosting Decision Tree (GBDT) and Neural Network (NN) have been widely used in practice, each has its own weaknesses. In particular, GBDT can hardly be adapted to dynamic online data generation, and it tends to be ineffective when facing sparse categorical features; NN, on the other hand, has difficulty achieving satisfactory performance when facing dense numerical features. In this paper, we propose a new learning framework, DeepGBM, which integrates the advantages of both NN and GBDT by using two corresponding NN components: (1) CatNN, which focuses on handling sparse categorical features, and (2) GBDT2NN, which focuses on dense numerical features with knowledge distilled from GBDT. Powered by these two components, DeepGBM can leverage both categorical and numerical features while retaining the ability to update efficiently online. Comprehensive experiments on a variety of publicly available datasets demonstrate that DeepGBM can outperform other well-recognized baselines in various online prediction tasks.
In recent years, to mitigate the problem of fake news, computational detection of fake news has been studied, producing some promising early results. However, we argue that a critical missing piece of this line of study is the explainability of such detection, i.e., why a particular piece of news is detected as fake. In this paper, therefore, we study the explainable detection of fake news. We develop a sentence-comment co-attention sub-network to exploit both news contents and user comments to jointly capture explainable top-k check-worthy sentences and user comments for fake news detection. We conduct extensive experiments on real-world datasets and demonstrate that the proposed method not only significantly outperforms 7 state-of-the-art fake news detection methods by at least 5.33% in F1-score, but also (concurrently) identifies top-k user comments that explain why a news piece is fake, better than baselines by 28.2% in NDCG and 30.7% in Precision.
Graph data exist widely in many high-impact applications. Inspired by the success of deep learning on grid-structured data, graph neural network models have been proposed to learn powerful node-level or graph-level representations. However, most existing graph neural networks suffer from the following limitations: (1) there is limited analysis regarding graph convolution properties, such as seed-oriented, degree-aware and order-free; (2) the node's degree-specific graph structure is not explicitly expressed in graph convolution for distinguishing structure-aware node neighborhoods; (3) the theoretical explanation of graph-level pooling schemes is unclear.
To address these problems, we propose a generic degree-specific graph neural network named DEMO-Net, motivated by the Weisfeiler-Lehman graph isomorphism test, which recursively identifies 1-hop neighborhood structures. In order to explicitly capture the graph topology integrated with node attributes, we argue that graph convolution should have three properties: seed-oriented, degree-aware, and order-free. To this end, we propose multi-task graph convolution where each task represents node representation learning for nodes with a specific degree value, thus preserving the degree-specific graph structure. In particular, we design two multi-task learning methods: degree-specific weight and hashing functions for graph convolution. In addition, we propose a novel graph-level pooling/readout scheme for learning graph representations that provably lie in a degree-specific Hilbert kernel space. The experimental results on several node and graph classification benchmark data sets demonstrate the effectiveness and efficiency of our proposed DEMO-Net over state-of-the-art graph neural network models.
Partial label learning is an emerging weakly-supervised learning framework where each training example is associated with multiple candidate labels among which only one is valid. Dimensionality reduction serves as an effective way to help improve the generalization ability of a learning system, while the task of partial label dimensionality reduction is challenging due to the unknown ground-truth labeling information. In this paper, the first attempt towards partial label dimensionality reduction is investigated by endowing the popular linear discriminant analysis (LDA) technique with the ability to deal with partial label training examples. Specifically, a novel learning procedure named DELIN is proposed which alternates between LDA dimensionality reduction and candidate label disambiguation based on estimated labeling confidences over candidate labels. On one hand, the projection matrix of LDA is optimized by utilizing disambiguation-guided labeling confidences. On the other hand, the labeling confidences are disambiguated by resorting to kNN aggregation in the LDA-induced feature space. Extensive experiments on synthetic as well as real-world partial label data sets clearly validate the effectiveness of DELIN in improving the generalization ability of state-of-the-art partial label learning algorithms.
Scientific computational models are crucial for analyzing and understanding complex real-life systems that are otherwise difficult for experimentation. However, the complex behavior and the vast input-output space of these models often make them opaque, slowing the discovery of novel phenomena. In this work, we present HINT (Hessian INTerestingness) -- a new algorithm that can automatically and systematically explore black-box models and highlight local nonlinear interactions in the input-output space of the model. This tool aims to facilitate the discovery of interesting model behaviors that are unknown to the researchers. Using this simple yet powerful tool, we were able to correctly rank all pairwise interactions in known benchmark models and do so faster and with greater accuracy than state-of-the-art methods. We further applied HINT to existing computational neuroscience models, and were able to reproduce important scientific discoveries that were published years after the creation of those models. Finally, we ran HINT on two real-world models (in neuroscience and earth science) and found new behaviors of the model that were of value to domain experts.
Online learning algorithms update models via one sample per iteration, which makes them efficient for processing large-scale datasets and useful for detecting malicious events on the fly for social benefit, such as disease outbreaks and traffic congestion. However, existing algorithms for graph-structured models focus on the offline setting and the least-squares loss and are unsuitable for the online setting, while methods designed for the online setting cannot be directly applied to complex (usually non-convex) graph-structured sparsity models. To address these limitations, in this paper we propose a new algorithm for graph-structured sparsity constraint problems under the online setting, which we call GraphDA. The key part of GraphDA is to project both the averaging gradient (in dual space) and the primal variables (in primal space) onto lower-dimensional subspaces, thus capturing the graph-structured sparsity effectively. Furthermore, the objective functions assumed here are generally convex so as to handle different losses for online learning settings. To the best of our knowledge, GraphDA is the first online learning algorithm for graph-structure constrained optimization problems. To validate our method, we conduct extensive experiments on both benchmark graph and real-world graph datasets. Our experimental results show that, compared to other baseline methods, GraphDA not only improves classification performance but also captures graph-structured features more effectively, and hence offers stronger interpretability.
Sequential recommendation and information dissemination are two traditional problems for sequential information retrieval. The common goal of the two problems is to predict future user-item interactions based on past observed interactions. The difference is that the former deals with users' histories of clicked items, while the latter focuses on items' histories of infected users. In this paper, we take a fresh view and propose dual sequential prediction models that unify these two thinking paradigms. One user-centered model takes a user's historical sequence of interactions as input, captures the user's dynamic states, and approximates the conditional probability of the next interaction for a given item based on the user's past clicking logs. By contrast, one item-centered model leverages an item's history, captures the item's dynamic states, and approximates the conditional probability of the next interaction for a given user based on the item's past infection records. To take advantage of the dual information, we design a new training mechanism which lets the two models play a game with each other and uses the predicted score from the opponent to design a feedback signal to guide the training. We show that the dual models can better distinguish false negative samples and true negative samples compared with single sequential recommendation or information dissemination models. Experiments on four real-world datasets demonstrate the superiority of the proposed model over some strong baselines as well as the effectiveness of the dual training mechanism between the two models.
Given a large, semi-infinite collection of co-evolving data sequences (e.g., IoT/sensor streams), which contains multiple distinct dynamic time-series patterns, our aim is to incrementally monitor current dynamic patterns and forecast future behavior. We present an intuitive model, namely OrbitMap, which provides a good summary of time-series evolution in streams. We also propose a scalable and effective algorithm for fitting and forecasting time-series data streams. Our method is designed as a dynamic, interactive and flexible system, and is based on latent non-linear differential equations. Our proposed method has the following advantages: (a) It is effective: it captures important time-evolving patterns in data streams and enables real-time, long-range forecasting; (b) It is general: our model is general and practical and can be applied to various types of time-evolving data streams; (c) It is scalable: our algorithm does not depend on data size, and thus is applicable to very large sequences. Extensive experiments on real datasets demonstrate that OrbitMap makes long-range forecasts, and consistently outperforms the best existing state-of-the-art methods as regards accuracy and execution speed.
Many real-world problems are time-evolving in nature, such as the progression of diseases, the cascading process when a post is broadcast in a social network, or the changing of climates. The observational data characterizing these complex problems are usually only available at discrete time stamps, so existing research on these problems is mostly based on cross-sectional analysis. In this paper, we model these time-evolving phenomena as a dynamical system, and treat the data sets observed at different time stamps as probability distribution functions generated by such a system. We propose a theorem which builds a mathematical relationship between a dynamical system modeled by differential equations and the distribution function (or survival function) of the cross-sectional states of this system. We then develop a survival analysis framework to learn the differential equations of a dynamical system from its cross-sectional states. With such a framework, we are able to capture the continuous-time dynamics of an evolutionary system. We validate our framework on both synthetic and real-world data sets. The experimental results show that our framework is able to discover and capture the generative dynamics of various data distributions accurately. Our study can potentially facilitate scientific discoveries of the unknown dynamics of complex systems in the real world.
Network community detection is a hot research topic in network analysis. Although many methods have been proposed for community detection, most of them only take into consideration the lower-order structure of the network at the level of individual nodes and edges. Thus, they fail to capture the higher-order characteristics at the level of small dense subgraph patterns, e.g., motifs. Recently, some higher-order methods have been developed, but they typically focus on the motif-based hypergraph, which is assumed to be a connected graph. However, such an assumption cannot be ensured in some real-world networks. In particular, the hypergraph may become fragmented. That is, it may consist of a large number of connected components and isolated nodes, despite the fact that the original network is a connected graph. Therefore, the existing higher-order methods suffer seriously from this fragmentation issue, since in these approaches, nodes without a connection in the hypergraph cannot be grouped together even if they belong to the same community. To address the fragmentation issue, we propose an Edge enhancement approach for Motif-aware community detection (EdMot). The main idea is as follows. Firstly, a motif-based hypergraph is constructed and the top K largest connected components in the hypergraph are partitioned into modules. Afterwards, the connectivity structure within each module is strengthened by constructing an edge set to derive a clique from each module. Based on the new edge set, the original connectivity structure of the input network is enhanced to generate a rewired network, whereby the motif-based higher-order structure is leveraged and the hypergraph fragmentation issue is well addressed. Finally, the rewired network is partitioned to obtain the higher-order community structure.
Extensive experiments have been conducted on eight real-world datasets and the results show the effectiveness of the proposed method in improving the community detection performance of state-of-the-art methods.
With the increasing availability of moving-object tracking data, use of this data for route search and recommendation is increasingly important. To this end, we propose a novel parallel split-and-combine approach to enable route search by locations (RSL-Psc). Given a set of routes, a set of places to visit O, and a threshold θ, we retrieve the route composed of sub-routes that (i) has similarity to O no less than θ and (ii) contains the minimum number of sub-route combinations. The resulting functionality targets a broad range of applications, including route planning and recommendation, ridesharing, and location-based services in general. To enable efficient and effective RSL-Psc computation on massive route data, we develop novel search space pruning techniques and enable use of the parallel processing capabilities of modern processors. Specifically, we develop two parallel algorithms, Fully-Split Parallel Search (FSPS) and Group-Split Parallel Search (GSPS). We divide the route split-and-combine task into $\sum_{k=0}^{M} S(|O|, k+1)$ sub-tasks, where $M$ is the maximum number of combinations and $S(\cdot)$ is the Stirling number of the second kind. In each sub-task, we use network expansion and exploit spatial similarity bounds for pruning. The algorithms split candidate routes into sub-routes and combine them to construct new routes. The sub-tasks are independent and are performed in parallel. Extensive experiments with real data offer insight into the performance of the algorithms, indicating that our RSL-Psc problem can generate high-quality results and that the two algorithms are capable of achieving high efficiency and scalability.
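To give a feel for how quickly the number of sub-tasks in the RSL-Psc abstract grows, the following minimal Python sketch evaluates the sum of Stirling numbers of the second kind S(|O|, k+1) for k = 0..M. The function names `stirling2` and `num_subtasks` are hypothetical and chosen for illustration only; the sketch counts sub-tasks and does not reproduce the FSPS or GSPS algorithms.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind: the number of ways to
    partition n labelled items into k non-empty, unlabelled blocks."""
    if n == k:
        return 1  # covers S(0, 0) = 1 and S(n, n) = 1
    if k == 0 or k > n:
        return 0
    # The n-th item either forms a new block on its own,
    # or joins one of the k blocks built from the other n - 1 items.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)


def num_subtasks(num_places: int, max_combinations: int) -> int:
    """Evaluate sum_{k=0}^{M} S(|O|, k+1) with |O| = num_places, M = max_combinations."""
    return sum(stirling2(num_places, k + 1) for k in range(max_combinations + 1))


if __name__ == "__main__":
    # Illustrative values only: 4 places to visit, at most 2 combinations.
    print(num_subtasks(4, 2))
```

Because the sub-tasks are independent, each term of this sum corresponds to a batch of work that could be dispatched to a separate worker.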
With the proliferation of commercial tracking systems, sports data is being generated at an unprecedented speed and the interest in sports play retrieval has grown dramatically as well. However, it is challenging to design an effective, efficient and robust similarity measure for sports play retrieval. To this end, we propose a deep learning approach to learn the representations of sports plays, called play2vec, which is robust against noise and takes only linear time to compute the similarity between two sports plays. We conduct experiments on real-world soccer match data, and the results show that our solution performs more effectively and efficiently compared with the state-of-the-art methods.
Express systems are widely deployed in many major cities. Couriers in an express system load parcels at a transit station and deliver them to customers. Meanwhile, they also try to serve the pick-up requests which arrive stochastically in real time during the delivery process. Having brought much convenience and promoted the development of e-commerce, express systems face challenges in courier management when completing the massive number of tasks per day. Considering this problem, we propose a reinforcement learning based framework to learn a courier management policy. Firstly, we divide the city into independent regions, in each of which a constant number of couriers deliver parcels and serve requests cooperatively. Secondly, we propose a soft-label clustering algorithm named Balanced Delivery-Service Burden (BDSB) to dispatch parcels to couriers in each region. BDSB guarantees that each courier has an almost even delivery and expected request-service burden when departing from the transit station, giving a reasonable initialization for online management later. As pick-up requests arrive in real time, a Contextual Cooperative Reinforcement Learning (CCRL) model is proposed to guide where each courier should deliver and serve in each short period. Being formulated in a multi-agent way, CCRL focuses on the cooperation among couriers while also considering the system context. Experiments on real-world data from Beijing confirm that our model outperforms the baselines.
Analysis of large-scale sequential data has been one of the most crucial tasks in areas such as bioinformatics, text, and audio mining. Existing string kernels, however, either (i) rely on local features of short substructures in the string, which hardly capture long discriminative patterns, (ii) sum over too many substructures, such as all possible subsequences, which leads to diagonal dominance of the kernel matrix, or (iii) rely on non-positive-definite similarity measures derived from the edit distance. Furthermore, while there have been works addressing the computational challenge with respect to the length of string, most of them still experience quadratic complexity in terms of the number of training samples when used in a kernel-based classifier. In this paper, we present a new class of global string kernels that aims to (i) discover global properties hidden in the strings through global alignments, (ii) maintain positive-definiteness of the kernel, without introducing a diagonal dominant kernel matrix, and (iii) have a training cost linear with respect to not only the length of the string but also the number of training string samples. To this end, the proposed kernels are explicitly defined through a series of different random feature maps, each corresponding to a distribution of random strings. We show that kernels defined this way are always positive-definite, and exhibit computational benefits as they always produce Random String Embeddings (RSE) that can be directly used in any linear classification models. Our extensive experiments on nine benchmark datasets corroborate that RSE achieves better or comparable accuracy in comparison to state-of-the-art baselines, especially with the strings of longer lengths. In addition, we empirically show that RSE scales linearly with the increase of the number and the length of string.
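The idea of defining a string kernel through an explicit random feature map can be sketched as follows. This is an illustration under stated assumptions, not the authors' construction: it assumes Levenshtein edit distance as the alignment measure, a feature of the form exp(-gamma * distance), and random strings drawn uniformly over an alphabet; all function names and parameters here are hypothetical.

```python
import math
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def sample_random_strings(alphabet: str, num_strings: int, max_len: int, seed: int = 0):
    """Draw random strings with uniform characters and uniform length in [1, max_len] (assumed distribution)."""
    rng = random.Random(seed)
    return ["".join(rng.choice(alphabet) for _ in range(rng.randint(1, max_len)))
            for _ in range(num_strings)]


def random_string_embedding(s: str, random_strings, gamma: float = 0.5):
    """Map a string to a vector with one coordinate per random string."""
    scale = 1.0 / math.sqrt(len(random_strings))
    return [scale * math.exp(-gamma * edit_distance(s, w)) for w in random_strings]


def kernel(x: str, y: str, random_strings, gamma: float = 0.5) -> float:
    """Kernel value as the inner product of the two embeddings."""
    ex = random_string_embedding(x, random_strings, gamma)
    ey = random_string_embedding(y, random_strings, gamma)
    return sum(a * b for a, b in zip(ex, ey))
```

Since the kernel is an inner product of explicit feature vectors, any Gram matrix built from it is positive semi-definite by construction, and the embeddings can be fed directly to a linear classifier, so the training cost grows linearly with the number of strings.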
This paper studies the problem of MCC-Sparse, Maximum Clique Computation over large real-world graphs that are usually Sparse. In the literature, MCC-Sparse has been studied separately and less extensively than its dense counterpart MCC-Dense, and advanced algorithmic techniques that are developed for MCC-Dense have not been utilized in the existing MCC-Sparse solvers. In this paper, we design an algorithm MC-BRB which transforms an instance of MCC-Sparse to instances of k-clique finding over dense subgraphs (KCF-Dense) that can be computed by the existing MCC-Dense solvers. To further improve the efficiency, we then develop a new branch-reduce-&-bound framework for KCF-Dense by proposing light-weight reducing techniques and leveraging the existing advanced branching and bounding techniques of MCC-Dense solvers. In addition, we also design an ego-centric algorithm MC-EGO for heuristically computing a near-maximum clique in near-linear time. We conduct extensive empirical studies on large real graphs and demonstrate the efficiency and effectiveness of our techniques.
Personalized Route Recommendation (PRR) aims to generate user-specific route suggestions in response to users' route queries. Early studies cast the PRR task as a pathfinding problem on graphs, and adopt adapted search algorithms by integrating heuristic strategies. Although these methods are effective to some extent, they require setting the cost functions with heuristics. In addition, it is difficult to utilize useful context information in the search procedure. To address these issues, we propose using neural networks to automatically learn the cost functions of a classic heuristic algorithm, namely A* algorithm, for the PRR task. Our model consists of two components. First, we employ attention-based Recurrent Neural Networks (RNN) to model the cost from the source to the candidate location by incorporating useful context information. Instead of learning a single cost value, the RNN component is able to learn a time-varying vectorized representation for the moving state of a user. Second, we propose to use a value network for estimating the cost from a candidate location to the destination. For capturing structural characteristics, the value network is built on top of improved graph attention networks by incorporating the moving state of a user and other context information. The two components are integrated in a principled way for deriving a more accurate cost of a candidate location. Extensive experiment results on three real-world datasets have shown the effectiveness and robustness of the proposed model.
Collaborative filtering (CF) has become one of the most popular and widely used methods in recommender systems, but its performance degrades sharply for users with little interaction data. Most existing hybrid CF methods try to incorporate side information such as review texts to alleviate the data sparsity problem. However, the process of exploiting and integrating side information is computationally expensive. Existing hybrid recommendation methods treat each user equally and ignore the fact that pure CF methods already achieve both effective and efficient recommendation performance for active users with sufficient interaction records, so the small improvement that side information brings to these active users is negligible. Therefore, they are not cost-effective solutions. One cost-effective idea to bypass this dilemma is to generate sufficient "real" interaction data for the inactive users with the help of side information, and then a pure CF method could be performed on this augmented dataset effectively. However, there are three major challenges in implementing this idea: first, how to ensure the correctness of the generated interaction data; second, how to combine the data augmentation process and the recommendation process into a unified model and train the model end-to-end; third, how to make the solution generalizable to various side information and recommendation tasks. In light of these challenges, we propose a generic and effective CF model called AugCF that supports a wide variety of recommendation tasks. AugCF is based on Conditional Generative Adversarial Nets that additionally consider the class (like or dislike) as a feature to generate new interaction data, which can be a sufficiently real augmentation to the original dataset. Also, AugCF adopts a novel discriminator loss and Gumbel-Softmax approximation to enable end-to-end training.
Finally, extensive experiments are conducted on two large-scale recommendation datasets, and the experimental results show the superiority of our proposed model.
We present a novel method named Latent Semantic Imputation (LSI) to transfer external knowledge into semantic space for enhancing word embedding. The method integrates graph theory to extract the latent manifold structure of the entities in the affinity space and leverages non-negative least squares with standard simplex constraints and power iteration method to derive spectral embeddings. It provides an effective and efficient approach to combining entity representations defined in different Euclidean spaces. Specifically, our approach generates and imputes reliable embedding vectors for low-frequency words in the semantic space and benefits downstream language tasks that depend on word embedding. We conduct comprehensive experiments on a carefully designed classification problem and language modeling and demonstrate the superiority of the enhanced embedding via LSI over several well-known benchmark embeddings. We also confirm the consistency of the results under different parameter settings of our method.
Reinforcement learning aims at searching for the best policy model for decision making, and has been shown to be powerful for sequential recommendations. The training of the policy by reinforcement learning, however, takes place in an environment. In many real-world applications, policy training in the real environment can incur an unbearable cost due to exploration in the environment. Environment reconstruction from past data is thus an appealing way to release the power of reinforcement learning in these applications. The reconstruction of the environment is, basically, to extract the causal effect model from the data. However, real-world applications are often too complex to offer fully observable environment information. Therefore, quite possibly there are unobserved confounding variables lying behind the data. The hidden confounder can obstruct an effective reconstruction of the environment. In this paper, by treating the hidden confounder as a hidden policy, we propose a deconfounded multi-agent environment reconstruction (DEMER) approach in order to learn the environment together with the hidden confounder. DEMER adopts a multi-agent generative adversarial imitation learning framework. It introduces the confounder embedded policy, and uses the compatible discriminator for training the policies. We then apply DEMER in an application of driver program recommendation. We first use an artificial driver program recommendation environment, abstracted from the real application, to verify and analyze the effectiveness of DEMER. We then test DEMER in the real application of Didi Chuxing. Experimental results show that DEMER can effectively reconstruct the hidden confounder, and thus can build the environment better. DEMER also derives a recommendation policy with significantly improved performance in the test phase of the real application.
Influenza leads to regular losses of lives annually and requires careful monitoring and control by health organizations. Annual influenza forecasts help policymakers implement effective countermeasures to control both seasonal and pandemic outbreaks. Existing forecasting techniques suffer from problems such as poor forecasting performance, lack of modeling flexibility, data sparsity, and/or lack of interpretability. We propose EpiDeep, a novel deep neural network approach for epidemic forecasting which tackles all of these issues by learning meaningful representations of incidence curves in a continuous feature space and accurately predicting future incidences, peak intensity, peak time, and onset of the upcoming season. We present extensive experiments on forecasting ILI (influenza-like illnesses) in the United States, leveraging multiple metrics to quantify success. Our results demonstrate that EpiDeep is successful at learning meaningful embeddings and, more importantly, that these embeddings evolve as the season progresses. Furthermore, our approach outperforms non-trivial baselines by up to 40%.
Exploratory analysis over network data is often limited by the ability to efficiently calculate graph statistics, which can provide a model-free understanding of the macroscopic properties of a network. We introduce a framework for estimating the graphlet count, i.e., the number of occurrences of a small subgraph motif (e.g. a wedge or a triangle) in the network. For massive graphs, where accessing the whole graph is not possible, the only viable algorithms are those that make a limited number of vertex neighborhood queries. We introduce a Monte Carlo sampling technique for graphlet counts, called Lifting, which can simultaneously sample all graphlets of size up to k vertices for arbitrary k. This is the first graphlet sampling method that can provably sample every graphlet with positive probability and can sample graphlets of arbitrary size k. We outline variants of lifted graphlet counts, including the ordered, unordered, and shotgun estimators, random walk starts, and parallel vertex starts. We prove that our graphlet count updates are unbiased for the true graphlet count and have a controlled variance for all graphlets. We compare the experimental performance of lifted graphlet counts to the state-of-the-art graphlet sampling procedures: Waddling and the pairwise subgraph random walk.
How can we estimate the importance of nodes in a knowledge graph (KG)? A KG is a multi-relational graph that has proven valuable for many tasks including question answering and semantic search. In this paper, we present GENI, a method for tackling the problem of estimating node importance in KGs, which enables several downstream applications such as item recommendation and resource allocation. While a number of approaches have been developed to address this problem for general graphs, they do not fully utilize information available in KGs, or lack the flexibility needed to model complex relationships between entities and their importance. To address these limitations, we explore supervised machine learning algorithms. In particular, building upon recent advances in graph neural networks (GNNs), we develop GENI, a GNN-based method designed to deal with distinctive challenges involved with predicting node importance in KGs. Our method performs an aggregation of importance scores instead of aggregating node embeddings via predicate-aware attention mechanism and flexible centrality adjustment. In our evaluation of GENI and existing methods on predicting node importance in real-world KGs with different characteristics, GENI achieves 5-17% higher NDCG@100 than the state of the art.
The $L_1$ regularization (Lasso) has proven to be a versatile tool to select relevant features and estimate the model coefficients simultaneously and has been widely used in many research areas such as genomic studies, finance, and biomedical imaging. Despite its popularity, it is very challenging to guarantee the feature selection consistency of Lasso, especially when the dimension of the data is huge. One way to improve the feature selection consistency is to select an ideal tuning parameter. Traditional tuning criteria, such as cross-validation and BIC, mainly focus on minimizing the estimated prediction error or maximizing the posterior model probability, and may either be time-consuming or fail to control the false discovery rate (FDR) when the number of features is extremely large. The other way is to introduce pseudo-features to learn the importance of the original ones. Recently, the Knockoff filter was proposed to control the FDR when performing feature selection. However, its performance is sensitive to the choice of the expected FDR threshold. Motivated by these ideas, we propose a new method using pseudo-features to obtain an ideal tuning parameter. In particular, we present the Efficient Tuning of Lasso (ET-Lasso) to separate active and inactive features by adding permuted features as pseudo-features in linear models. The pseudo-features are constructed to be inactive by nature, which can be used to obtain a cutoff to select the tuning parameter that separates active and inactive features. Experimental studies on both simulations and real-world data applications are provided to show that ET-Lasso can effectively and efficiently select active features under a wide range of scenarios.
This paper targets a novel but practical recommendation problem named exact-K recommendation. It differs from traditional top-K recommendation in that it focuses more on (constrained) combinatorial optimization, which optimizes the recommendation of a whole set of K items called a card, rather than on ranking optimization, which assumes that "better" items should be put into top positions. Thus we take the first step to give a formal problem definition, and innovatively reduce it to Maximum Clique Optimization on a graph. To tackle this specific combinatorial optimization problem, which is NP-hard, we propose Graph Attention Networks (GAttN) with a Multi-head Self-attention encoder and a decoder with attention mechanism. It can learn the joint distribution of the K items end-to-end and generate an optimal card rather than rank individual items by prediction scores. Then we propose Reinforcement Learning from Demonstrations (RLfD), which combines the advantages of behavior cloning and reinforcement learning, making it sufficient-and-efficient to train the model. Extensive experiments on three datasets demonstrate the effectiveness of our proposed GAttN with RLfD method. It outperforms several strong baselines with a relative improvement of 7.7% and 4.7% on average in Precision and Hit Ratio respectively, and achieves state-of-the-art (SOTA) performance for the exact-K recommendation problem.
Adaptive learning, also known as adaptive teaching, relies on learning path recommendation, which sequentially recommends personalized learning items (e.g., lectures, exercises) to satisfy the unique needs of each learner. Although it is well known that modeling the cognitive structure, including the knowledge level of learners and the knowledge structure (e.g., the prerequisite relations) of learning items, is important for learning path recommendation, existing methods for adaptive learning often focus separately on either the knowledge levels of learners or the knowledge structure of learning items. To fully exploit the multifaceted cognitive structure for learning path recommendation, we propose a Cognitive Structure Enhanced framework for Adaptive Learning, named CSEAL. By viewing path recommendation as a Markov Decision Process and applying an actor-critic algorithm, CSEAL can sequentially identify the right learning items for different learners. Specifically, we first utilize a recurrent neural network to trace the evolving knowledge levels of learners at each learning step. Then, we design a navigation algorithm on the knowledge structure to ensure the logicality of learning paths, which reduces the search space in the decision process. Finally, the actor-critic algorithm, whose parameters are dynamically updated along the learning path, is used to determine what to learn next. Extensive experiments on real-world data demonstrate the effectiveness and robustness of CSEAL.
In this paper, we study the problem of online influence maximization in social networks. In this problem, a learner aims to identify the set of "best influencers" in a network by interacting with the network, i.e., repeatedly selecting seed nodes and observing activation feedback in the network. We capitalize on an important property of the influence maximization problem named network assortativity, which is ignored by most existing works in online influence maximization. To realize network assortativity, we factorize the activation probability on the edges into latent factors on the corresponding nodes, including influence factor on the giving nodes and susceptibility factor on the receiving nodes. We propose an upper confidence bound based online learning solution to estimate the latent factors, and therefore the activation probabilities. Considerable regret reduction is achieved by our factorization based online influence maximization algorithm. Extensive empirical evaluations on two real-world networks showed the effectiveness of our proposed solution.
Given a dynamic graph stream, how can we detect the sudden appearance of anomalous patterns, such as link spam, follower boosting, or denial of service attacks? Additionally, can we categorize the types of anomalies that occur in practice, and theoretically analyze the anomalous signs arising from each type? In this work, we propose AnomRank, an online algorithm for anomaly detection in dynamic graphs. AnomRank uses a two-pronged approach defining two novel metrics for anomalousness. Each metric tracks the derivatives of its own version of a 'node score' (or node importance) function. This allows us to detect sudden changes in the importance of any node. We show theoretically and experimentally that the two-pronged approach successfully detects two common types of anomalies: sudden weight changes along an edge, and sudden structural changes to the graph. AnomRank is (a) Fast and Accurate: up to 49.5x faster or 35% more accurate than state-of-the-art methods, (b) Scalable: linear in the number of edges in the input graph, processing millions of edges within 2 seconds on a stock laptop/desktop, and (c) Theoretically Sound: providing theoretical guarantees of the two-pronged approach.
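The idea of tracking derivatives of a node score function can be sketched in a few lines. The sketch below is a simplification under stated assumptions, not AnomRank itself: it uses a single plain weighted PageRank as the node score (the abstract describes two score versions), approximates derivatives by finite differences between consecutive snapshots, and takes the larger L1 norm as the anomaly score. The names `pagerank` and `anomaly_scores` and the snapshot format are hypothetical.

```python
def pagerank(edges, n, damping=0.85, iters=100):
    """Weighted PageRank by power iteration.
    edges: dict mapping (source, target) -> positive weight; nodes are 0..n-1."""
    out_weight = [0.0] * n
    for (u, _v), w in edges.items():
        out_weight[u] += w
    p = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - damping) / n] * n
        # Score held by nodes without out-edges is spread uniformly.
        dangling = sum(p[u] for u in range(n) if out_weight[u] == 0.0)
        for (u, v), w in edges.items():
            nxt[v] += damping * p[u] * w / out_weight[u]
        for i in range(n):
            nxt[i] += damping * dangling / n
        p = nxt
    return p


def anomaly_scores(snapshots, n):
    """For each snapshot t >= 2, return max(||first difference||_1, ||second difference||_1)
    of the node score vectors, as a discrete stand-in for the 1st and 2nd derivatives."""
    scores = [pagerank(s, n) for s in snapshots]
    result = []
    for t in range(2, len(scores)):
        d1 = [scores[t][i] - scores[t - 1][i] for i in range(n)]
        d2 = [scores[t][i] - 2 * scores[t - 1][i] + scores[t - 2][i] for i in range(n)]
        result.append(max(sum(abs(x) for x in d1), sum(abs(x) for x in d2)))
    return result


if __name__ == "__main__":
    base = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0}
    burst = dict(base)
    burst[(0, 2)] = 10.0  # sudden heavy edge, e.g. a spam-like burst
    print(anomaly_scores([base, base, base, burst], n=4))
```

A stable graph yields scores near zero, while a sudden weight or structural change shifts the node scores and produces a spike at that time step. This sketch recomputes PageRank from scratch per snapshot and makes no attempt at the incremental updates needed for the speed reported in the abstract.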
Empirical entropy refers to the information entropy calculated from the empirical distribution of a dataset. It is a widely used aggregation function for knowledge discovery, as well as the foundation of other aggregation functions such as mutual information. However, computing the exact empirical entropy on a large-scale dataset can be expensive. Using a random subsample, we can compute an approximation of the empirical entropy efficiently. We derive probabilistic error bounds for the approximation, where the error bounds reduce in a near square root rate with respect to the subsample size. We further study two applications which can benefit from the error-bounded approximation: feature ranking and filtering based on mutual information. We develop algorithms to progressively subsample the dataset and return correct answers with high probability. The sample complexity of the algorithms is independent of data size. The empirical evaluation of our algorithms on large-scale real-world datasets demonstrates up to three orders of magnitude speedup over exact methods with \errrate\ error.
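As a small illustration of the quantity being approximated, the following Python sketch computes the empirical entropy of a dataset exactly and from a uniform random subsample drawn without replacement. It illustrates only the plug-in estimate; the probabilistic error bounds and the progressive subsampling algorithms from the abstract are not reproduced. The function names and the `seed` parameter are hypothetical.

```python
import math
import random
from collections import Counter


def empirical_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def subsampled_entropy(values, sample_size, seed=0):
    """Approximate the empirical entropy from a uniform subsample without replacement."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    return empirical_entropy(sample)


if __name__ == "__main__":
    data = [i % 4 for i in range(10_000)]  # four equally frequent values
    print(empirical_entropy(data))         # exact value over all records
    print(subsampled_entropy(data, 1_000)) # estimate from 10% of the records
```

The subsample estimate touches only `sample_size` records, which is where the savings over the exact computation come from on large datasets.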
A social network is an ecosystem, and one of its ultimate goals is to sustain itself, namely to keep users generating information and being informed. However, the reasons why some social ecosystems remain self-sustaining while others end up in inactive or dead states are largely unknown.
In this paper, rather than studying social ecosystems at the population level, we analyze, for the first time, the fates of different microscopic social ecosystems, namely the final states of their collective activity dynamics, on a real-world online social media platform with detailed individual-level records. We find huge complexities in microscopic social ecosystems, including complex species types, complex individual interaction networks, and complex dynamics and final states. In order to capture the observed complexities in the real-world data, we propose a microscopic ecological model, which is able to capture the complex fates of heterogeneous microscopic social ecosystems accurately in both synthetic and empirical datasets. Furthermore, we analyze the driving factors of the fates of microscopic social ecosystems, including interaction networks of individuals and dynamical interaction mechanisms of species, leading to the control of microscopic social ecosystems, that is, the ability to influence their temporal behaviours and their final states towards active or dead fates.
The process of opinion formation is inherently a network process, with user opinions in a social network being driven to a certain average opinion. One simple and intuitive incarnation of this opinion attractor is the average of user opinions weighted by the users' eigenvector centralities. This value is a lucrative target for control, as altering it essentially changes the mass opinion in the network. Since any potentially malicious influence upon the opinion distribution in a society is undesirable, it is important to design methods to prevent external attacks upon it. In this work, we assume that the adversary aims to maliciously change the network's average opinion by altering the opinions of some unknown users. We then state an NP-hard problem of disabling such opinion control attempts by strategically altering the users' eigencentralities through recommending a limited number of links to the users. Relying on Markov chain theory, we provide a perturbation analysis that shows how eigencentrality and, hence, our problem's objective change in response to a link's addition to the network. This analysis leads to the design of a pseudo-linear-time heuristic, relying on efficient estimation of mean first passage times in Markov chains. We have confirmed our theoretical and algorithmic findings, and studied the effectiveness and efficiency of our heuristic in experiments with synthetic and real networks.
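The opinion attractor described above, an eigencentrality-weighted average of opinions, can be sketched as follows. This is a minimal illustration using plain power iteration on an undirected toy graph; the graph, the opinions, and the function names are assumptions, and the paper's Markov-chain heuristic for choosing links is not reproduced.

```python
def eigenvector_centrality(adj, iters=200):
    """Power iteration on an adjacency list; the result is normalised to sum to 1.
    Assumes a connected, non-bipartite undirected graph so the iteration converges."""
    n = len(adj)
    c = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(c[j] for j in adj[i]) for i in range(n)]
        total = sum(nxt)
        c = [x / total for x in nxt]
    return c


def opinion_attractor(adj, opinions):
    """Average opinion weighted by (normalised) eigenvector centrality."""
    c = eigenvector_centrality(adj)
    return sum(ci * oi for ci, oi in zip(c, opinions))


# Toy network: triangle 0-1-2 with a pendant node 3 attached to node 0.
adj = [[1, 2, 3], [0, 2], [0, 1], [0]]
centrality = eigenvector_centrality(adj)

# Suppose only node 0 holds the (possibly attacked) opinion 1.0.
opinions = [1.0, 0.0, 0.0, 0.0]
before = opinion_attractor(adj, opinions)

# Adding a single link 1-3 shifts centrality away from node 0.
adj_linked = [[1, 2, 3], [0, 2, 3], [0, 1], [0, 1]]
after = opinion_attractor(adj_linked, opinions)
```

In this example one added link reduces node 0's weight in the attractor, which is the kind of effect the link-recommendation defence exploits.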
Can a system discover what a user wants without the user explicitly issuing a query? A recommender system proposes items of potential interest based on past user history. On the other hand, active search incites, and learns from, user feedback, in order to recommend items that meet a user's current tacit interests, hence promises to offer up-to-date recommendations going beyond those of a recommender system. Yet extant active search methods require an overwhelming amount of user input, relying solely on such input for each item they pick. In this paper, we propose MF-ASC, a novel active search mechanism that performs well with minimal user input. MF-ASC combines cheap, low-fidelity evaluations in the style of a recommender system with the user's high-fidelity input, using Gaussian process regression with multiple target variables (cokriging). To our knowledge, this is the first application of cokriging to active search. Our empirical study with synthetic and real-world data shows that MF-ASC outperforms the state of the art in terms of result relevance within a budget of interactions.
Precisely evaluating the effect of new policies (e.g. ad-placement models, recommendation functions, ranking functions) is one of the most important problems for improving interactive systems. Conventional policy evaluation methods rely on online A/B tests, but these are usually extremely expensive and may have undesirable impacts. Recently, Inverse Propensity Score (IPS) estimators have been proposed as alternatives that evaluate the effect of a new policy with offline logged data collected under a different policy in the past. They tend to remove the distribution shift induced by the past policy. However, they ignore the distribution shift that would be induced by the new policy, which results in imprecise evaluation. Moreover, their performance relies on accurate estimation of the propensity score, which cannot be guaranteed or validated in practice. In this paper, we propose a non-parametric method, named the Focused Context Balancing (FCB) algorithm, to learn sample weights for context balancing, so that the distribution shifts induced by the past policy and by the new policy can both be eliminated. To validate the effectiveness of our FCB algorithm, we conduct extensive experiments on both synthetic and real-world datasets. The experimental results clearly demonstrate that our FCB algorithm outperforms existing estimators by achieving more precise and robust results for offline policy evaluation.
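For reference, the baseline IPS estimator that the abstract contrasts with can be sketched in a few lines. This is the standard estimator, not the proposed FCB algorithm; the log format, the policy function, and the toy data are assumptions made for illustration.

```python
def ips_estimate(logs, new_policy):
    """Inverse Propensity Score estimate of a new policy's average reward.

    logs: iterable of (context, action, reward, propensity), where propensity
          is the probability the logging policy gave to the logged action.
    new_policy: function (context, action) -> probability under the new policy.
    """
    total = 0.0
    count = 0
    for context, action, reward, propensity in logs:
        weight = new_policy(context, action) / propensity  # importance weight
        total += weight * reward
        count += 1
    return total / count


# Hypothetical logs: the logging policy picked each of two actions with
# probability 0.5; action 1 always earned reward 1, action 0 earned 0.
logs = [
    ("u1", 0, 0.0, 0.5),
    ("u2", 1, 1.0, 0.5),
    ("u3", 0, 0.0, 0.5),
    ("u4", 1, 1.0, 0.5),
]


def always_action_one(context, action):
    """New policy to evaluate: deterministically choose action 1."""
    return 1.0 if action == 1 else 0.0


estimate = ips_estimate(logs, always_action_one)
```

Here the estimate equals 1.0, the true value of always choosing action 1. The estimate depends directly on the logged propensities, which is the sensitivity the abstract points out.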
GCN-MF: Disease-Gene Association Identification By Graph Convolutional Networks and Matrix Factorization
Discovering disease-gene associations is a fundamental and critical biomedical task, which helps biologists and physicians to discover the pathogenic mechanisms of syndromes. With various clinical biomarkers measuring the similarities among genes and disease phenotypes, network-based semi-supervised learning (NSSL) has been commonly utilized by these studies to address this class-imbalanced, large-scale data issue. However, most existing NSSL approaches are based on linear models and suffer from two major limitations: 1) They implicitly consider a local-structure representation for each candidate; 2) They are unable to capture nonlinear associations between diseases and genes. In this paper, we propose a new framework for the disease-gene association task by combining a Graph Convolutional Network (GCN) and matrix factorization, named GCN-MF. With the help of the GCN, we can capture nonlinear interactions and exploit measured similarities. Moreover, we define a margin control loss function to reduce the effect of sparsity. Empirical results demonstrate that the proposed deep learning algorithm outperforms all other state-of-the-art methods on most metrics.
Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space
Hierarchical clustering is typically performed using algorithm-based optimization that searches over the discrete space of trees. While these optimization methods are often effective, their discreteness restricts them from many of the benefits of their continuous counterparts, such as scalable stochastic optimization and the joint optimization of multiple objectives or components of a model (e.g. end-to-end training). In this paper, we present an approach for hierarchical clustering that searches over continuous representations of trees in hyperbolic space by running gradient descent. We compactly represent uncertainty over tree structures with vectors in the Poincaré ball. We show how the vectors can be optimized using an objective related to recently proposed cost functions for hierarchical clustering (Dasgupta, 2016; Wang and Wang, 2018). Using our method with a mini-batch stochastic gradient descent inference procedure, we are able to outperform prior work on clustering millions of ImageNet images by 15 points of dendrogram purity. Further, our continuous tree representation can be jointly optimized in multi-task learning applications, offering a 9 point improvement over baseline methods.
Graph neural networks, which generalize deep neural network models to graph-structured data, have attracted increasing attention in recent years. They usually learn node representations by transforming, propagating and aggregating node features, and have been shown to improve the performance of many graph-related tasks such as node classification and link prediction. To apply graph neural networks to the graph classification task, approaches that generate the graph representation from node representations are needed. A common way is to globally combine the node representations. However, rich structural information is then overlooked. Thus a hierarchical pooling procedure is desired to preserve the graph structure during graph representation learning. There are some recent works on hierarchically learning graph representations, analogous to the pooling step in conventional convolutional neural networks (CNNs). However, the local structural information is still largely neglected during the pooling process. In this paper, we introduce a pooling operator $\pooling$ based on the graph Fourier transform, which can utilize the node features and local structures during the pooling process. We then design pooling layers based on the pooling operator, which are further combined with traditional GCN convolutional layers to form a graph neural network framework $\m$ for graph classification. Theoretical analysis is provided to understand $\pooling$ from both local and global perspectives. Experimental results on the graph classification task on 6 commonly used benchmarks demonstrate the effectiveness of the proposed framework.
Random walks are widely adopted in various network analysis tasks ranging from network embedding to label propagation. They can capture geometric structures and convert them into structured sequences while alleviating the issues of sparsity and the curse of dimensionality. Though random walks on plain networks have been intensively studied, in real-world systems, nodes are often not pure vertices, but have different characteristics, described by the rich set of data associated with them. These node attributes contain plentiful information that often complements the network, and bring opportunities to random-walk-based analysis. However, it is unclear how random walks could be developed for attributed networks to achieve effective joint information extraction. Node attributes make the node interactions more complicated and are heterogeneous with respect to topological structures.
To bridge the gap, we explore performing joint random walks on attributed networks, and utilize them to boost deep node representation learning. The proposed framework GraphRNA consists of two major components, i.e., a collaborative walking mechanism - AttriWalk, and a tailored deep embedding architecture for random walks, named graph recurrent networks (GRN). AttriWalk treats node attributes as a bipartite network and uses it to make the walks more diverse and to mitigate the tendency of converging to nodes with high centralities. AttriWalk enables us to advance the prominent deep network embedding model, graph convolutional networks, towards a more effective architecture - GRN. GRN empowers node representations to interact in the same way as nodes interact in the original attributed network. Experimental results on real-world datasets demonstrate the effectiveness of GraphRNA compared with the state-of-the-art embedding algorithms.
Attention operators have been widely applied in various fields, including computer vision, natural language processing, and network embedding learning. Attention operators on graph data enable learnable weights when aggregating information from neighboring nodes. However, graph attention operators (GAOs) consume excessive computational resources, preventing their application to large graphs. In addition, GAOs belong to the family of soft attention, instead of hard attention, which has been shown to yield better performance. In this work, we propose a novel hard graph attention operator (hGAO) and a channel-wise graph attention operator (cGAO). hGAO uses the hard attention mechanism by attending to only important nodes. Compared to GAO, hGAO improves performance and saves computational cost by only attending to important nodes. To further reduce the requirements on computational resources, we propose the cGAO, which performs attention operations along channels. cGAO avoids the dependency on the adjacency matrix, leading to dramatic reductions in computational resource requirements. Experimental results demonstrate that our proposed deep models with the new operators achieve consistently better performance. Comparison results also indicate that hGAO achieves significantly better performance than GAO on both node and graph embedding tasks. Efficiency comparison shows that our cGAO leads to dramatic savings in computational resources, making it applicable to large graphs.
We address a fundamental problem in chemistry known as chemical reaction product prediction. Our main insight is that the input reactant and reagent molecules can be jointly represented as a graph, and the process of generating product molecules from reactant molecules can be formulated as a sequence of graph transformations. To this end, we propose Graph Transformation Policy Network (GTPN) - a novel generic method that combines the strengths of graph neural networks and reinforcement learning to learn reactions directly from data with minimal chemical knowledge. Compared to previous methods, GTPN has some appealing properties such as: end-to-end learning, and making no assumption about the length or the order of graph transformations. In order to guide model search through the complex discrete space of sets of bond changes effectively, we extend the standard policy gradient loss by adding useful constraints. Evaluation results show that GTPN improves the top-1 accuracy over the current state-of-the-art method by about 3% on the large USPTO dataset.
We present a graph-based semi-supervised learning (SSL) method for learning edge flows defined on a graph. Specifically, given flow measurements on a subset of edges, we want to predict the flows on the remaining edges. To this end, we develop a computational framework that imposes certain constraints on the overall flows, such as (approximate) flow conservation. These constraints render our approach different from classical graph-based SSL for vertex labels, which posits that tightly connected nodes share similar labels and leverages the graph structure accordingly to extrapolate from a few vertex labels to the unlabeled vertices. We derive bounds for our method's reconstruction error and demonstrate its strong performance on synthetic and real-world flow networks from transportation, physical infrastructure, and the Web. Furthermore, we provide two active learning algorithms for selecting informative edges on which to measure flow, which has applications for optimal sensor deployment. The first strategy selects edges to minimize the reconstruction error bound and works well on flows that are approximately divergence-free. The second approach clusters the graph and selects bottleneck edges that cross cluster-boundaries, which works well on flows with global trends.
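A minimal sketch of flow prediction under approximate flow conservation follows. It assumes one plausible objective, the sum of squared node divergences plus a small ridge penalty on the unmeasured flows, minimised by plain gradient descent; the function name, the parameters `lam`, `step`, and `iters`, and the toy graph are illustrative assumptions rather than the paper's exact formulation.

```python
def predict_flows(nodes, edges, measured, lam=1e-3, step=0.1, iters=500):
    """Fill in unmeasured edge flows so that flow is approximately conserved.

    edges: list of directed edges (u, v); a positive flow goes from u to v.
    measured: dict mapping edge index -> known flow value (kept fixed).
    Minimises sum_n divergence(n)^2 + lam * sum_{unmeasured e} flow(e)^2.
    """
    flows = [measured.get(e, 0.0) for e in range(len(edges))]
    free = [e for e in range(len(edges)) if e not in measured]
    for _ in range(iters):
        # Divergence of a node = inflow - outflow under the current flows.
        div = {n: 0.0 for n in nodes}
        for (u, v), f in zip(edges, flows):
            div[u] -= f
            div[v] += f
        # Gradient step on each unmeasured edge only.
        for e in free:
            u, v = edges[e]
            grad = 2.0 * (div[v] - div[u]) + 2.0 * lam * flows[e]
            flows[e] -= step * grad
    return flows


# Toy example: a directed 3-cycle where only edge 0 -> 1 is measured.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (2, 0)]
flows = predict_flows(nodes, edges, measured={0: 2.0})
```

On the cycle, conservation forces the two unmeasured edges to carry nearly the same flow as the measured one, which is the behaviour that distinguishes this setting from smoothness-based vertex-label propagation.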
GroupINN: Grouping-based Interpretable Neural Network for Classification of Limited, Noisy Brain Data
Mapping the human brain, or understanding how certain brain regions relate to specific aspects of cognition, has been and remains an active area of neuroscience research. Functional magnetic resonance imaging (fMRI) data (in the form of images, time series or graphs) are central in this research, but pose many challenges in phenotype prediction tasks (e.g., noisy, small training samples). Standardly employed handcrafted models and newly proposed neural network methods are limited in expressive power and interpretability, respectively, in this context. In this work focusing on fMRI-derived brain graphs, a modality that partially handles some challenges of fMRI data, we propose a grouping-based interpretable neural network model, GroupINN, that effectively classifies cognitive performance with 85% fewer model parameters than baseline deep models, while also identifying the most predictive brain subnetworks within several task-specific contexts. Our method incorporates the idea of node grouping into the design of the neural network. That way, unlike other methods that employ clustering as a preprocessing step to reorder nodes, GroupINN learns the node grouping and extracts graph features jointly. Experiments on task-based fMRI datasets show that our method is 2.6 to 69 times faster than other deep models, while achieving comparable or better accuracy and providing interpretability.
In many complex domains, the input data are often not suited for the typical vector representations used in deep learning models. For example, in relational learning and computer vision tasks, the data are often better represented as sets (e.g., the neighborhood of a node, a cloud of points). In these cases, a key challenge is to learn an embedding function that is invariant to permutations of the input. While there has been some recent work on principled methods for learning permutation-invariant representations of sets, these approaches are limited in their applicability to set-of-sets (SoS) tasks, such as subgraph prediction and scene classification. In this work, we develop a deep neural network framework to learn inductive SoS embeddings that are invariant to SoS permutations. Specifically, we propose HATS, a hierarchical sequence model with attention mechanisms for inductive set-of-sets embeddings. We develop stochastic optimization and inference methods for learning HATS, and our experiments demonstrate that HATS achieves superior performance across a wide range of set-of-sets tasks.
Representation learning in heterogeneous graphs aims to pursue a meaningful vector representation for each node so as to facilitate downstream applications such as link prediction, personalized recommendation, node classification, etc. This task, however, is challenging not only because of the demand to incorporate heterogeneous structural (graph) information consisting of multiple types of nodes and edges, but also due to the need to consider heterogeneous attributes or contents (e.g., text or image) associated with each node. Although substantial effort has been devoted to homogeneous (or heterogeneous) graph embedding, attributed graph embedding, and graph neural networks, few of these methods can jointly and effectively consider heterogeneous structural (graph) information as well as the heterogeneous content information of each node. In this paper, we propose HetGNN, a heterogeneous graph neural network model, to resolve this issue. Specifically, we first introduce a random walk with restart strategy to sample a fixed-size set of strongly correlated heterogeneous neighbors for each node and group them by node type. Next, we design a neural network architecture with two modules to aggregate feature information of those sampled neighboring nodes. The first module encodes "deep" feature interactions of heterogeneous contents and generates a content embedding for each node. The second module aggregates content (attribute) embeddings of different neighboring groups (types) and further combines them by considering the impacts of different groups to obtain the ultimate node embedding. Finally, we leverage a graph context loss and a mini-batch gradient descent procedure to train the model in an end-to-end manner.
Extensive experiments on several datasets demonstrate that HetGNN can outperform state-of-the-art baselines in various graph mining tasks, i.e., link prediction, recommendation, node classification & clustering and inductive node classification & clustering.
Spatial structured models are predictive models that capture dependency structure between samples based on their locations in the space. Learning such models plays an important role in many geoscience applications such as water surface mapping, but it also poses significant challenges due to implicit dependency structure in continuous space and high computational costs. Existing models often assume that the dependency structure is based on either spatial proximity or network topology, and thus cannot incorporate complex dependency structure such as contour and flow direction on a 3D potential surface. To fill the gap, this paper proposes a novel spatial structured model called hidden Markov contour tree (HMCT), which generalizes the traditional hidden Markov model from a total order sequence to a partial order polytree. HMCT also advances existing work on hidden Markov trees through capturing complex contour structures on a 3D surface. We propose efficient model construction and learning algorithms. Evaluations on real world hydrological datasets show that our HMCT outperforms multiple baseline methods in classification performance and that HMCT is scalable to large data sizes (e.g., classifying millions of samples in seconds).
Exploring Hidden Points of Interest (H-POIs), which are rarely referred to in online search and recommendation systems due to insufficient check-in records, benefits businesses and individuals. In this work, we investigate how to eliminate the hidden feature of H-POIs by enhancing the conventional crowdsourced ranking aggregation framework with heterogeneous (i.e., H-POI and Popular Point of Interest (P-POI)) pairwise tasks. We propose a two-phase solution focusing on both effectiveness and efficiency. In the offline phase, we substantially narrow down the search space by retrieving a set of geo-textually valid heterogeneous pairs as the initial candidates, and develop two practical data-driven strategies to compute worker qualities. In the online phase, we minimize the cost of assessment by introducing an active learning algorithm to jointly select pairs and workers, taking into account worker quality, uncertainty of P-POI rankings, and uncertainty of the model. In addition, a (Minimum Spanning) Tree-constrained Skip search strategy is proposed to reduce search time cost. Empirical experiments based on real POI datasets verify that the ranking accuracy of H-POIs can be greatly improved with a small number of query iterations.
The chronological order of user-item interactions is a key feature in many recommender systems, where the items that users will interact with may largely depend on the items that they accessed recently. However, with the tremendous increase of users and items, sequential recommender systems still face several challenging problems: (1) the difficulty of modeling long-term user interests from sparse implicit feedback; (2) the difficulty of capturing short-term user interests given the few items the user just accessed. To cope with these challenges, we propose a hierarchical gating network (HGN), integrated with Bayesian Personalized Ranking (BPR), to capture both the long-term and short-term user interests. Our HGN consists of a feature gating module, an instance gating module, and an item-item product module. In particular, our feature gating and instance gating modules select which item features can be passed to the downstream layers from the feature and instance levels, respectively. Our item-item product module explicitly captures the relations between the items that users accessed in the past and the items they will access in the future. We extensively evaluate our model against several state-of-the-art methods, using different validation metrics on five real-world datasets. The experimental results demonstrate the effectiveness of our model on Top-N sequential recommendation.
Automatic synonym recognition is of great importance for entity-centric text mining and interpretation. Due to the high variability of language use in real life, manual construction of semantic resources to cover all synonyms is prohibitively expensive and may also result in limited coverage. Although there are public knowledge bases, they only have limited coverage for languages other than English. In this paper, we focus on the medical domain and propose an automatic way to accelerate the process of medical synonymy resource development for Chinese, including both formal entities from healthcare professionals and noisy descriptions from end-users. Motivated by the success of distributed word representations, we design a multi-task model with a hierarchical task relationship to learn more representative entity/term embeddings and apply them to synonym prediction. In our model, we extend the classical skip-gram word embedding model by introducing an auxiliary task, "neighboring word semantic type prediction", and hierarchically organize the tasks based on task complexity. Meanwhile, we incorporate existing medical term-term synonymous knowledge into our word embedding learning framework. We demonstrate that the embeddings trained from our proposed multi-task model yield significant improvement for entity semantic relatedness evaluation, neighboring word semantic type prediction and synonym prediction compared with baselines. Furthermore, we create a large medical text corpus in Chinese that includes annotations for entities, descriptions and synonymous pairs for future research in this direction.
Hypothesis generation (HG) refers to the task of mining meaningful implicit associations between unlinked biomedical concepts. The majority of prior studies have focused on uncovering these implicit linkages from static snapshots of the corpus, thereby largely ignoring the temporal dynamics of medical concepts. More recently, a few initial studies attempted to overcome this issue by modelling the temporal change of concepts from natural language text. However, they still fail to leverage the evolutionary features of concepts from contemporary knowledge bases (KBs) such as semantic lexicons and ontologies. In practice such KBs contain up-to-date information that is important to incorporate, especially in highly evolving domains such as biomedicine. Furthermore, considering the complementary strength of these sources of information - corpus and ontology - a few natural questions arise: Can joint modelling of (co)-evolutionary dynamics from these resources aid in encoding the temporal features at a granular level? Can the mutual evolution between these intertwined resources lead to better predictive effects? To answer these questions, in this study, we present a novel HG framework that unearths the latent associations between concepts by modeling their co-evolution across complementary sources of information. More specifically, the proposed approach adopts a shared temporal matrix factorization framework that models the co-evolution of concepts across both corpus and KB. Extensive experiments on the largest available biomedical corpus validate the effectiveness of the proposed approach.
We consider the problem of telling apart cause from effect between two univariate continuous-valued random variables X and Y. In general, it is impossible to make definite statements about causality without making assumptions on the underlying model; one of the most important aspects of causal inference is hence to determine under which assumptions we are able to do so. In this paper we show under which general conditions we can identify cause from effect by simply choosing the direction with the best regression score. We define a general framework of identifiable regression-based scoring functions, and show how to instantiate it in practice using regression splines. Compared to existing methods that either give strong guarantees but are hardly applicable in practice, or provide no guarantees but do work well in practice, our instantiation combines the best of both worlds: it gives guarantees, while empirical evaluation on synthetic and real-world data shows that it performs at least as well as the state of the art.
Classifier explanations have been identified as a crucial component of knowledge discovery. Local explanations evaluate the behavior of a classifier in the vicinity of a given instance. A key step in this approach is to generate synthetic neighbors of the given instance. This neighbor generation process is challenging and it has considerable impact on the quality of explanations. To assess quality of generated neighborhoods, we propose a local intrinsic dimensionality (LID) based locality constraint. Based on this, we then propose a new neighborhood generation method. Our method first fits a local embedding/subspace around a given instance using the LID of the test instance as the target dimensionality, then generates neighbors in the local embedding and projects them back to the original space. Experimental results show that our method generates more realistic neighborhoods and consequently better explanations. It can be used in combination with existing local explanation algorithms.
Latent factor models (LFMs) such as matrix factorization have achieved the state-of-the-art performance among various collaborative filtering approaches for recommendation. Despite the high recommendation accuracy of LFMs, a critical issue to be resolved is their lack of interpretability. Extensive efforts have been devoted to interpreting the prediction results of LFMs. However, they either rely on auxiliary information which may not be available in practice, or sacrifice recommendation accuracy for interpretability. Influence functions, stemming from robust statistics, have been developed to understand the effect of training points on the predictions of black-box models. Inspired by this, we propose a novel explanation method named FIA (Fast Influence Analysis) to understand the prediction of trained LFMs by tracing back to the training data with influence functions. We present how to employ influence functions to measure the impact of historical user-item interactions on the prediction results of LFMs and provide intuitive neighbor-style explanations based on the most influential interactions. Our proposed FIA exploits the characteristics of two important LFMs, matrix factorization and neural collaborative filtering, and is capable of accelerating the overall influence analysis process. We provide a detailed complexity analysis for FIA over LFMs and conduct extensive experiments to evaluate its performance using real-world datasets. The results demonstrate the effectiveness and efficiency of FIA, and the usefulness of the generated explanations for the recommendation results.
As one of the most important investing approaches, technical analysis attempts to forecast stock movement by interpreting the inner rules of historic price and volume data. To cope with the inherently noisy nature of financial markets, generic technical analysis develops technical trading indicators, as mathematical summarizations of historic price and volume data, to form the foundation for robust and profitable investment strategies. However, stocks with different properties are observed to have different affinities for technical indicators, which poses a major challenge for indicator-oriented stock selection and investment. To address this problem, in this paper, we design a Technical Trading Indicator Optimization (TTIO) framework that optimizes the original technical indicator by leveraging stock-wise properties. To obtain effective representations of stock properties, we propose a Skip-gram architecture to learn stock embeddings, inspired by a valuable knowledge repository formed by fund managers' collective investment behaviors. Based on the learned stock representations, TTIO further learns a re-scaling network to optimize the indicator's performance. Extensive experiments on real-world stock market data demonstrate that our method yields stock representations that are valuable for technical indicator optimization, since the optimized indicators produce stronger investing signals than the original ones.
One of the major challenges in machine learning nowadays is to provide predictions with not only high accuracy but also user-friendly explanations. Although in recent years we have witnessed increasingly popular use of deep neural networks for sequence modeling, it is still challenging to explain the rationales behind the model outputs, which is essential for building trust and supporting the domain experts to validate, critique and refine the model.
We propose ProSeNet, an interpretable and steerable deep sequence model with natural explanations derived from case-based reasoning. The prediction is obtained by comparing the inputs to a few prototypes, which are exemplar cases in the problem domain. For better interpretability, we define several criteria for constructing the prototypes, including simplicity, diversity, and sparsity and propose the learning objective and the optimization procedure. ProSeNet also provides a user-friendly approach to model steering: domain experts without any knowledge on the underlying model or parameters can easily incorporate their intuition and experience by manually refining the prototypes.
We conduct experiments on a wide range of real-world applications, including predictive diagnostics for automobiles, ECG and protein sequence classification, and sentiment analysis on texts. The results show that ProSeNet can achieve accuracy on par with state-of-the-art deep learning models. We also evaluate the interpretability of the results with concrete case studies. Finally, through a user study on Amazon Mechanical Turk (MTurk), we demonstrate that the model selects high-quality prototypes which align well with human knowledge and can be interactively refined for better interpretability without loss of performance.
Interview Choice Reveals Your Preference on the Market: To Improve Job-Resume Matching through Profiling Memories
Online recruitment services are now rapidly changing the landscape of hiring traditions on the job market. There are hundreds of millions of registered users with resumes, and tens of millions of job postings available on the Web. Learning good job-resume matching for recruitment services is important. Existing studies on job-resume matching generally focus on learning good representations of job descriptions and resume texts with comprehensive matching structures. We assume that it would be beneficial to learn the preferences of both recruiters and job-seekers from previous interview histories, and we expect such preferences to help improve job-resume matching. To this end, in this paper, we propose a novel matching network that models preference. The key idea is to explore the latent preference given the history of all interviewed candidates for a job posting and the history of all job applications for a particular talent. To be more specific, we propose a profiling memory module to learn the latent preference representation by interacting with both the job and resume sides. We then incorporate the preference into the matching framework as an end-to-end learnable neural network. Based on real-world data from an online recruitment platform, "Boss Zhipin", the experimental results show that the proposed model could improve the job-resume matching performance against a series of state-of-the-art methods. In this way, we demonstrate that recruiters and talents indeed have preferences and that such preferences can improve job-resume matching on the job market.
User satisfaction is an important variable in Web search evaluation studies and has received more and more attention in recent years. Many studies regard user satisfaction as the ground truth for designing better evaluation metrics. However, most of the existing studies focus on designing Cranfield-like evaluation metrics to reflect user satisfaction at the query level. As information needs become more and more complex, users often need multiple queries and multi-round search interactions to complete a search task (e.g., exploratory search). In those cases, how to characterize the user's satisfaction during a search session still remains to be investigated. In this paper, we collect a dataset through a laboratory study in which users need to complete some complex search tasks. With the help of hierarchical linear models (HLM), we try to reveal how users' query-level and session-level satisfaction are affected by different cognitive effects. A number of interesting findings are made. At the query level, we found that although the relevance of top-ranked documents has an important impact (primacy effect), the average/maximum of perceived usefulness of clicked documents is a much better sign of user satisfaction. At the session level, perceived satisfaction for a particular query is also affected by the other queries in the same session (anchor effect or expectation effect). We also found that session-level satisfaction correlates mostly with the last query in the session (recency effect). The findings will help us design better session-level user behavior models and corresponding evaluation metrics.
Networks have been widely used as the data structure for abstracting real-world systems as well as organizing the relations among entities. Network embedding models are powerful tools in mapping nodes in a network into continuous vector-space representations in order to facilitate subsequent tasks such as classification and link prediction. Existing network embedding models comprehensively integrate all information of each node, such as links and attributes, into a single embedding vector to represent the node's general role in the network. However, a real-world entity could be multifaceted, where it connects to different neighborhoods due to different motives or self-characteristics that are not necessarily correlated. For example, in a movie recommender system, a user may love both comedies and horror movies, but it is not likely that these two types of movies are mutually close in the embedding space, nor could the user embedding vector be sufficiently close to both of them at the same time. In this paper, we propose a polysemous embedding approach for modeling multiple facets of nodes, as motivated by the phenomenon of word polysemy in language modeling. Each facet of a node is mapped to an embedding vector, while we also maintain an association degree between each pair of node and facet. The proposed method is adaptive to various existing embedding models, without significantly complicating the optimization process. We also discuss how to engage embedding vectors of different facets for inference tasks including classification and link prediction. Experiments on real-world datasets help comprehensively evaluate the performance of the proposed method.
Set-level problems are as important as instance-level problems. The core in solving set-level problems is: how to measure the similarity between two sets. This paper investigates data-dependent kernels that are derived directly from data. We introduce Isolation Set-Kernel which is solely dependent on data distribution, requiring neither class information nor explicit learning. In contrast, most current set-similarities are not dependent on the underlying data distribution. We theoretically analyze the characteristic of Isolation Set-Kernel. As the set-kernel has a finite feature map, we show that it can be used to speed up the set-kernel computation significantly. We apply Isolation Set-Kernel to Multi-Instance Learning (MIL) using SVM classifier, and demonstrate that it outperforms other set-kernels or other solutions to the MIL problem.
To provide more accurate, diverse, and explainable recommendation, it is compulsory to go beyond modeling user-item interactions and take side information into account. Traditional methods like factorization machine (FM) cast it as a supervised learning problem, which assumes each interaction as an independent instance with side information encoded. Due to the overlook of the relations among instances or items (e.g., the director of a movie is also an actor of another movie), these methods are insufficient to distill the collaborative signal from the collective behaviors of users. In this work, we investigate the utility of knowledge graph (KG), which breaks down the independent interaction assumption by linking items with their attributes. We argue that in such a hybrid structure of KG and user-item graph, high-order relations --- which connect two items with one or multiple linked attributes --- are an essential factor for successful recommendation. We propose a new method named Knowledge Graph Attention Network (KGAT) which explicitly models the high-order connectivities in KG in an end-to-end fashion. It recursively propagates the embeddings from a node's neighbors (which can be users, items, or attributes) to refine the node's embedding, and employs an attention mechanism to discriminate the importance of the neighbors. Our KGAT is conceptually advantageous to existing KG-based recommendation methods, which either exploit high-order relations by extracting paths or implicitly modeling them with regularization. Empirical results on three public benchmarks show that KGAT significantly outperforms state-of-the-art methods like Neural FM and RippleNet. Further studies verify the efficacy of embedding propagation for high-order relation modeling and the interpretability benefits brought by the attention mechanism. We release the codes and datasets at https://github.com/xiangwang1223/knowledge_graph_attention_network.
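The attentive propagation step described above can be illustrated with a minimal sketch. It is not the KGAT implementation: the dot-product scoring function and the names `softmax` and `attentive_aggregate` are assumptions made for illustration, and the released code at the URL above should be consulted for the actual relation-aware attention.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_aggregate(node_vec, neighbor_vecs):
    """Weight each neighbor by softmax(dot(node, neighbor)) and sum.

    Assumption: a plain dot product stands in for the attention score.
    Returns (attention weights, aggregated neighbor vector).
    """
    scores = [sum(a * b for a, b in zip(node_vec, n)) for n in neighbor_vecs]
    weights = softmax(scores)
    dim = len(node_vec)
    agg = [sum(w * n[k] for w, n in zip(weights, neighbor_vecs))
           for k in range(dim)]
    return weights, agg


if __name__ == "__main__":
    w, agg = attentive_aggregate([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
    print(w, agg)  # the first neighbor receives the larger weight
```

Applying such a step recursively, so that each neighbor's vector has itself been refined from its own neighbors, is what lets information from multi-hop (high-order) relations reach a node.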
In this paper, we extend K-means to clustering with multiple means. The popular K-means clustering uses only one center to model each class of data. However, the assumption on the shape of the clusters prevents it from capturing non-convex patterns. Moreover, many categories consist of multiple subclasses which obviously cannot be represented by a single prototype. We propose a K-Multiple-Means (KMM) method to group the data points with multiple sub-cluster means into a specified number k of clusters. Unlike the methods which use agglomerative strategies, the proposed method formalizes the multiple-means clustering problem as an optimization problem and updates the partitions of m sub-cluster means and k clusters by an alternating optimization strategy. Notably, the partition of the original data with multiple-means representation is modeled as a bipartite graph partitioning problem with the constrained Laplacian rank. We also present a theoretical analysis of the connection between our method and K-means clustering. Meanwhile, KMM scales linearly with respect to n. Experiments on several synthetic and well-known real-world data sets show the effectiveness of the proposed algorithm.
Knowledge graphs capture structured information and relations between a set of entities or items. As such knowledge graphs represent an attractive source of information that could help improve recommender systems. However, existing approaches in this domain rely on manual feature engineering and do not allow for an end-to-end training. Here we propose Knowledge-aware Graph Neural Networks with Label Smoothness regularization (KGNN-LS) to provide better recommendations. Conceptually, our approach computes user-specific item embeddings by first applying a trainable function that identifies important knowledge graph relationships for a given user. This way we transform the knowledge graph into a user-specific weighted graph and then apply a graph neural network to compute personalized item embeddings. To provide better inductive bias, we rely on label smoothness assumption, which posits that adjacent items in the knowledge graph are likely to have similar user relevance labels/scores. Label smoothness provides regularization over the edge weights and we prove that it is equivalent to a label propagation scheme on a graph. We also develop an efficient implementation that shows strong scalability with respect to the knowledge graph size. Experiments on four datasets show that our method outperforms state of the art baselines. KGNN-LS also achieves strong performance in cold-start scenarios where user-item interactions are sparse.
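The link between label smoothness and label propagation mentioned above can be illustrated with the textbook version of both ideas. This is a sketch under stated assumptions, not the KGNN-LS implementation: the graph is a plain weighted undirected graph stored as nested dictionaries, and the function names `propagate_labels` and `smoothness_energy` are hypothetical.

```python
def propagate_labels(adj, labels, iters=100):
    """Iterative label propagation with labeled nodes held fixed.

    adj:    dict node -> dict neighbor -> nonnegative weight (symmetric)
    labels: dict node -> float, the known relevance scores
    Each unlabeled node repeatedly takes the weighted mean of its neighbors.
    """
    scores = {v: labels.get(v, 0.0) for v in adj}
    for _ in range(iters):
        updated = {}
        for v, nbrs in adj.items():
            total = sum(nbrs.values())
            if v in labels or total == 0:
                updated[v] = scores[v]  # clamp labeled or isolated nodes
            else:
                updated[v] = sum(w * scores[u] for u, w in nbrs.items()) / total
        scores = updated
    return scores


def smoothness_energy(adj, scores):
    """0.5 * sum over ordered pairs (u, v) of w_uv * (s_u - s_v)^2."""
    return 0.5 * sum(w * (scores[u] - scores[v]) ** 2
                     for u, nbrs in adj.items() for v, w in nbrs.items())


if __name__ == "__main__":
    # Path graph a - b - c with a and c labeled.
    adj = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
    scores = propagate_labels(adj, {"a": 0.0, "c": 1.0})
    print(scores["b"], smoothness_energy(adj, scores))
```

The fixed point of the update is the assignment that minimizes the smoothness energy with the labeled scores held fixed, which is the sense in which a smoothness penalty and a propagation scheme describe the same solution in this simplified setting.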
Recommendation models mainly deal with categorical variables, such as user/item ID and attributes. Besides the high-cardinality issue, the interactions among such categorical variables are usually long-tailed, with the head made up of highly frequent values and a long tail of rare ones. This phenomenon results in the data sparsity issue, making it essential to regularize the models to ensure generalization. The common practice is to employ grid search to manually tune regularization hyperparameters based on the validation data. However, it requires non-trivial efforts and large computation resources to search the whole candidate space; even so, it may not lead to the optimal choice, for which different parameters should have different regularization strengths. In this paper, we propose a hyperparameter optimization method, lambdaOpt, which automatically and adaptively enforces regularization during training. Specifically, it updates the regularization coefficients based on the performance of validation data. With lambdaOpt, the notorious tuning of regularization hyperparameters can be avoided; more importantly, it allows fine-grained regularization (i.e. each parameter can have an individualized regularization coefficient), leading to better generalized models. We show how to employ lambdaOpt on matrix factorization, a classical model that is representative of a large family of recommender models. Extensive experiments on two public benchmarks demonstrate the superiority of our method in boosting the performance of top-K recommendation.
Motivated by the computational and storage challenges that dense embeddings pose, we introduce the problem of latent network summarization that aims to learn a compact, latent representation of the graph structure with dimensionality that is independent of the input graph size (i.e., #nodes and #edges), while retaining the ability to derive node representations on the fly. We propose Multi-LENS, an inductive multi-level latent network summarization approach that leverages a set of relational operators and relational functions (compositions of operators) to capture the structure of egonets and higher-order subgraphs, respectively. The structure is stored in low-rank, size-independent structural feature matrices, which along with the relational functions comprise our latent network summary. Multi-LENS is general and naturally supports both homogeneous and heterogeneous graphs with or without directionality, weights, attributes or labels. Extensive experiments on real graphs show 3.5-34.3% improvement in AUC for link prediction, while requiring 80-2152x less output storage space than baseline embedding methods on large datasets. As application areas, we show the effectiveness of Multi-LENS in detecting anomalies and events in the Enron email communication graph and Twitter co-mention graph.
Class-conditional variants of generative adversarial networks (GANs) have recently achieved great success due to their ability to selectively generate samples for given classes, as well as improving generation quality. However, their training requires a large set of class-labeled data, which is often expensive and difficult to collect in practice. In this paper, we propose an active sampling method to reduce the labeling cost for effectively training class-conditional GANs. On one hand, the most useful examples are selected for external human labeling to jointly reduce the difficulty of model learning and alleviate the missing of adversarial training; on the other hand, fake examples are actively sampled for internal model retraining to enhance the adversarial training between the discriminator and generator. By incorporating the two strategies into a unified framework, we provide a cost-effective approach to train class-conditional GANs, which achieves higher generation quality with fewer training examples. Experiments on multiple datasets, diverse GAN configurations and various metrics demonstrate the effectiveness of our approaches.
Event forecasting with an aim at modeling contextual information is an important task for applications such as automated analysis generation and resource allocation. Captured contextual information for an event of interest can aid human analysts in understanding the factors associated with that event. However, capturing contextual information within event forecasting is challenging due to several factors: (i) uncertainty of context structure and formulation, (ii) high dimensional features, and (iii) adaptation of features over time. Recently, graph representations have demonstrated success in applications such as traffic forecasting, social influence prediction, and visual question answering systems. In this paper, we study graph representations in modeling social events to identify dynamic properties of event contexts as social indicators. Inspired by graph neural networks, we propose a novel graph convolutional network for predicting future events (e.g., civil unrest movements). We extract and learn graph representations from historical/prior event documents. By employing the hidden word graph features, our proposed model predicts the occurrence of future events and identifies sequences of dynamic graphs as event context. Experimental results on multiple real-world data sets show that the proposed method is competitive against various state-of-the-art methods for social event prediction.
In plenty of real-life tasks, strongly supervised information is hard to obtain, such that there is not sufficient high-quality supervision to make traditional learning approaches succeed. Therefore, weakly supervised learning has drawn considerable attention recently. In this paper, we consider the problem of learning from incomplete and inaccurate supervision, where only a limited subset of training data is labeled but potentially with noise. This setting is challenging and of great importance but rarely studied in the literature. We notice that in many applications, the limited labeled data are usually with one-sided noise. For instance, considering the bug detection task in the software system, the identified buggy codes are indeed with defects whereas the codes that have been checked many times or newly fixed may still have other flaws due to the complexity of the system. We propose a novel method which is able to effectively alleviate the negative influence of one-sided label noise with the help of a vast number of unlabeled data. Excess risk analysis is provided as theoretical justifications on the usefulness of incomplete and one-sided inaccurate supervision. We conduct experiments on synthetic, benchmark datasets, and real-life tasks to validate the effectiveness of the proposed approach.
Graph is a standard approach to modeling structured data. Although many machine learning methods depend on the metric of the input objects, defining an appropriate distance function on graph is still a controversial issue. We propose a novel supervised metric learning method for a subgraph-based distance, called interpretable graph metric learning (IGML). IGML optimizes the distance function in such a way that a small number of important subgraphs can be adaptively selected. This optimization is computationally intractable with naive application of existing optimization algorithms. We construct a graph mining based efficient algorithm to deal with this computational difficulty. Important advantages of our method are 1) guarantee of the optimality from the convex formulation, and 2) high interpretability of results. To our knowledge, none of the existing studies provide an interpretable subgraph-based metric in a supervised manner. In our experiments, we empirically verify superior or comparable prediction performance of IGML to other existing graph classification methods which do not have clear interpretability. Further, we demonstrate usefulness of IGML through some illustrative examples of extracted subgraphs and an example of data analysis on the learned metric space.
Recently, network embedding (NE) has achieved great success in learning low-dimensional representations for network nodes and has been increasingly applied to various network analytic tasks. In this paper, we consider the representation learning problem for content-rich networks whose nodes are associated with rich content information. Content-rich network embedding is challenging because it must fuse the complex structural dependencies and the rich contents. To tackle these challenges, we propose a generative model, the Network-to-Network Network Embedding (Net2Net-NE) model, which can effectively fuse the structure and content information into one continuous embedding vector for each node. Specifically, we regard the content-rich network as a pair of networks with different modalities, i.e., a content network and a node network. By exploiting the strong correlation between the focal node and the nodes to which it is connected, a multilayer recursively composable encoder is proposed to fuse the structure and content information of the entire ego network into the egocentric node embedding. Moreover, a cross-modal decoder is deployed to map the egocentric node embeddings into node identities in an interconnected network. By learning the identity of each node according to its content, the mapping from content network to node network is learned in a generative manner. Hence the latent encoding vectors learned by Net2Net-NE can be used as effective node embeddings. Extensive experimental results on three real-world networks demonstrate the superiority of Net2Net-NE over state-of-the-art methods.
Link prediction in signed social networks is an important and challenging problem in social network analysis. To produce the most accurate prediction results, two questions must be answered: (1) Which unconnected node pairs are likely to be connected by a link in the future? (2) What will the signs of the new links be? These questions are challenging, and current research seldom solves both issues well simultaneously. Additionally, neutral social relationships, which are common in many social networks, can affect the accuracy of link prediction. Yet neutral links are not considered in most existing methods. Hence, in this paper, we propose a signed latent factor (SLF) model that answers both these questions and, additionally, considers four types of relationships: positive, negative, neutral and no relationship at all. The model links social relationships of different types to the comprehensive, but opposite, effects of positive and negative SLFs. The SLF vectors for each node are learned by minimizing a negative log-likelihood objective function. Experiments on four real-world signed social networks support the efficacy of the proposed model.
Modeling user behavior from unstructured software log-trace data is critical in providing personalized service (e.g., cross-platform recommendation). Existing user modeling approaches cannot handle the long-term temporal information in log data well, nor can they produce semantically meaningful results for interpreting user logs. To address these challenges, we propose a Log2Intent framework for interpretable user modeling in this paper. Log2Intent adopts a deep sequential modeling framework that contains a temporal encoder, a semantic encoder and a log action decoder, and it fully captures the long-term temporal information in user sessions. Moreover, to bridge the semantic gap between log-trace data and human language, a recurrent semantics memory unit (RSMU) is proposed to encode the annotation sentences from an auxiliary software tutorial dataset, and the output of RSMU is fed into the semantic encoder of Log2Intent. Comprehensive experiments on a real-world Photoshop log-trace dataset with an auxiliary Photoshop tutorial dataset demonstrate the effectiveness of the proposed Log2Intent framework over the state-of-the-art log-trace user modeling method in three different tasks, including log annotation retrieval, user interest detection and user next action prediction.
MCNE: An End-to-End Framework for Learning Multiple Conditional Network Representations of Social Network
Recently, the Network Representation Learning (NRL) techniques, which represent graph structure via low-dimension vectors to support social-oriented application, have attracted wide attention. Though large efforts have been made, they may fail to describe the multiple aspects of similarity between social users, as only a single vector for one unique aspect has been represented for each node. To that end, in this paper, we propose a novel end-to-end framework named MCNE to learn multiple conditional network representations, so that various preferences for multiple behaviors could be fully captured. Specifically, we first design a binary mask layer to divide the single vector as conditional embeddings for multiple behaviors. Then, we introduce the attention network to model interaction relationship among multiple preferences, and further utilize the adapted message sending and receiving operation of graph neural network, so that multi-aspect preference information from high-order neighbors will be captured. Finally, we utilize Bayesian Personalized Ranking loss function to learn the preference similarity on each behavior, and jointly learn multiple conditional node embeddings via multi-task learning framework. Extensive experiments on public datasets validate that our MCNE framework could significantly outperform several state-of-the-art baselines, and further support the visualization and transfer learning tasks with excellent interpretability and robustness.
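The Bayesian Personalized Ranking loss mentioned above has a simple pairwise form: for a user, an observed (positive) item should score higher than an unobserved (negative) one, and the loss is -log sigmoid(score_pos - score_neg). The sketch below shows only that generic pairwise term; the dot-product scorer and the names `bpr_loss` and `bpr_triple_loss` are assumptions for illustration, not part of MCNE.

```python
import math


def bpr_loss(pos_score, neg_score):
    """Pairwise BPR term: -log(sigmoid(pos_score - neg_score)).

    Computed as softplus(neg_score - pos_score) in a form that does not
    overflow for large score gaps.
    """
    x = neg_score - pos_score
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bpr_triple_loss(user_vec, pos_item_vec, neg_item_vec):
    """BPR term for one (user, positive item, negative item) triple.

    Assumption: preference scores are plain dot products of embeddings.
    """
    pos = sum(u * i for u, i in zip(user_vec, pos_item_vec))
    neg = sum(u * i for u, i in zip(user_vec, neg_item_vec))
    return bpr_loss(pos, neg)


if __name__ == "__main__":
    # A correctly ranked pair yields a smaller loss than a misranked one.
    print(bpr_loss(2.0, 0.0), bpr_loss(0.0, 2.0))
```

In a multi-behavior setting, one such term would typically be summed per behavior over sampled triples, with the per-behavior sums combined into a joint objective.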
This paper proposes a recommender system that alleviates the cold-start problem by estimating user preferences based on only a small number of items. To identify a user's preference in the cold state, existing recommender systems, such as Netflix, initially provide items to a user; we call those items evidence candidates. Recommendations are then made based on the items selected by the user. Previous recommendation studies have two limitations: (1) users who have consumed only a few items receive poor recommendations and (2) inadequate evidence candidates are used to identify user preferences. We propose a meta-learning-based recommender system called MeLU to overcome these two limitations. Building on meta-learning, which can rapidly adapt to a new task with a few examples, MeLU can estimate a new user's preferences from a few consumed items. In addition, we provide an evidence candidate selection strategy that determines distinguishing items for customized preference estimation. We validate MeLU with two benchmark datasets, and the proposed model reduces mean absolute error by at least 5.92% compared with two comparative models on the datasets. We also conduct a user study experiment to verify the evidence selection strategy.
The number of scientific publications is ever increasing. The long time needed to digest a scientific paper limits the number of papers people can read, which impedes a quick grasp of major activities in new research areas, especially for intelligence analysts and novice researchers. To accelerate such a process, we first define a new problem called mining algorithm roadmap in scientific publications, and then propose a new weakly supervised method to build the roadmap. The algorithm roadmap describes the evolutionary relation between different algorithms, and sketches the ongoing research and the dynamics of the area. It is a tool for analysts and researchers to locate the successors and families of algorithms when analyzing and surveying a research field. We first propose abbreviated words as candidates for algorithms and then use tables as weak supervision to extract these candidates and labels. Next we propose a new method called Cross-sentence Attention NeTwork for cOmparative Relation (CANTOR) to extract comparative algorithms from text. Finally, we derive an order for individual algorithm pairs with time and frequency to construct the algorithm roadmap. Through comprehensive experiments, our proposed algorithm shows its superiority over the baseline methods on the proposed task.
We study the problem of computing similarity joins under edit distance on a set of strings. Edit similarity joins is a fundamental problem in databases, data mining and bioinformatics. It finds important applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc. This problem has attracted significant attention in the past two decades. However, all previous algorithms either cannot scale well to long strings and large similarity thresholds, or suffer from imperfect accuracy.
In this paper we propose a new algorithm for edit similarity joins using a novel string partition based approach. We show mathematically that with high probability our algorithm achieves a perfect accuracy, and runs in linear time plus a data-dependent verification step. Experiments on real world datasets show that our algorithm significantly outperforms the state-of-the-art algorithms for edit similarity joins, and achieves perfect accuracy on all the datasets that we have tested.
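For reference, the problem itself can be stated in a few lines of code. The sketch below is the quadratic-time brute-force baseline, not the partition-based algorithm proposed in the paper: it computes Levenshtein distance with the classic dynamic program and verifies every pair that survives a length filter. The function names `edit_distance` and `naive_edit_join` are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def naive_edit_join(strings, threshold):
    """Return index pairs (i, j), i < j, with edit distance <= threshold."""
    result = []
    for (i, s), (j, t) in combinations(enumerate(strings), 2):
        # Length filter: the distance is at least the length difference.
        if abs(len(s) - len(t)) > threshold:
            continue
        if edit_distance(s, t) <= threshold:
            result.append((i, j))
    return result


if __name__ == "__main__":
    print(naive_edit_join(["kitten", "sitting", "mitten", "a"], 1))
```

Every pair is verified exactly, so this baseline has perfect accuracy, but its cost grows quadratically in the number of strings and in string length, which is the scaling problem the abstract refers to.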
Visual multimedia is one of the most prevalent sources of modern online content and engagement. However, despite its prevalence, little is known about user engagement with such content. For instance, how can we model engagement for a specific content or viewer sample, and across multiple samples? Can we model and discover patterns in these interactions, and detect outlying behaviors corresponding to abnormal engagement? In this paper, we study these questions in depth. Understanding these questions has implications in user modeling and understanding, ranking, trust and safety and more. For analysis, we consider content and viewer dwell time (engagement duration) behaviors with images and videos on Snapchat Stories, one of the largest multimedia-driven social sharing services. To our knowledge, we are the first to model and analyze dwell time behaviors on such media. Specifically, our contributions include (a) individual modeling: we propose and evaluate the LFmodel, LTmodel and Vmodel parametric models to describe dwell times of unlooped/looped media and viewers which outperform alternatives, (b) aggregate modeling: we show how to flexibly summarize the respective joint distributions of multivariate parametrized fits across many samples using Vine Copulas in the analog ALFmodel, ALTmodel and AVmodel models, which enable inferences regarding aggregate behavioral patterns, and offer the ability to simulate real-looking engagement data, (c) anomaly detection: we demonstrate our aggregate models can robustly detect anomalies present during training (0.9+ AUROC across most attack models), and also enable discovery of real dwell time anomalies.
Time series prediction is an intensively studied topic in data mining. In spite of the considerable improvements, recent deep learning-based methods overlook the existence of extreme events, which results in weak performance when applying them to real time series. Extreme events are rare and random, but do play a critical role in many real applications, such as the forecasting of financial crises and natural disasters. In this paper, we explore the central theme of improving the ability of deep learning to model extreme events for time series prediction. Through the lens of formal analysis, we first find that the weakness of deep learning methods is rooted in the conventional form of quadratic loss. To address this issue, we take inspiration from Extreme Value Theory, developing a new form of loss called Extreme Value Loss (EVL) for detecting the future occurrence of extreme events. Furthermore, we propose to employ a Memory Network in order to memorize extreme events in historical records. By incorporating EVL with an adapted memory network module, we achieve an end-to-end framework for time series prediction with extreme events. Through extensive experiments on synthetic data and two real datasets of stock and climate, we empirically validate the effectiveness of our framework. We also provide a proper choice of hyper-parameters for our proposed framework by conducting several additional experiments.
Multi-task learning is a successful machine learning framework which improves the performance of prediction models by leveraging knowledge among tasks, e.g., the relationships between different tasks. Most existing multi-task learning methods focus on guiding the learning process by predefined task relationships. In fact, these methods have not fully exploited the associated relationships during the learning process. On the one hand, replacing predefined task relationships by adaptively learned ones may result in higher prediction accuracy as it can avoid the risk of misguiding caused by improperly predefined relationships. On the other hand, apart from the task relationships, feature-task dependence and feature-feature interactions could also be employed to guide the learning process. Along this line, we propose a Multiple Relational Attention Network (MRAN) framework for multi-task learning, in which three types of relationships are considered. Correspondingly, MRAN consists of three attention-based relationship learning modules: 1) a task-task relationship learning module which captures the relationships among tasks automatically and controls the positive and negative knowledge transfer adaptively; 2) a feature-feature interaction learning module that handles the complicated interactions among features; 3) a task-feature dependence learning module, which can associate the related features with target tasks separately. To evaluate the effectiveness of the proposed MRAN, experiments are conducted on two public datasets and a real-world dataset crawled from a review hosting site. Experimental results demonstrate the superiority of our method over both classical and state-of-the-art multi-task learning methods.
The task of classifying multi-relational data spans a wide range of domains such as document classification in citation networks, classification of emails, and protein labeling in protein interaction graphs. Current state-of-the-art classification models rely on learning per-entity latent representations by mining the whole structure of the relations' graph; however, they still face two major problems. Firstly, it is very challenging to generate expressive latent representations in sparse multi-relational settings with implicit feedback relations as there is very little information per entity. Secondly, for entities with structured properties such as titles and abstracts (text) in documents, models have to be modified ad hoc. In this paper, we aim to overcome these two main drawbacks by proposing a flexible nonlinear latent embedding model (BRNLE) for the classification of multi-relational data. The proposed model can be applied to entities with structured properties such as text by utilizing the numerical vector representations of those properties. To address the sparsity problem of implicit feedback relations, the model is optimized via a sparsely-regularized multi-relational pair-wise Bayesian personalized ranking loss (BPR). Experiments on four different real-world datasets show that the proposed model significantly outperforms state-of-the-art models for multi-relational classification.
Multi-task Recurrent Neural Networks and Higher-order Markov Random Fields for Stock Price Movement Prediction: Multi-task RNN and Higher-order MRFs for Stock Price Classification
Stock price movement not only depends on the history of individual stock movements, but also complex hidden dynamics associated with other correlated stocks. Despite the substantial effort made to understand the principles of stock price movement, few attempts have been made to predict movement direction based upon a single stock's historical records together with its correlated stocks. Here, we present a multi-task recurrent neural network (RNN) with high-order Markov random fields (MRFs) to predict stock price movement direction. Specifically, we first design a multi-task RNN framework to extract informative features from the raw market data of individual stocks without considering any domain knowledge. Next, we employ binary MRFs with unary features and weighted lower linear envelopes as the higher-order energy function to capture higher-order consistency within the same stock clique (group). We also derive a latent structural SVM algorithm to learn higher-order MRFs in a polynomial number of iterations. Finally, a sub-gradient algorithm is employed to perform end-to-end training of the RNN and high-order MRFs. We conduct thorough empirical studies on three popular Chinese stock market indexes and the proposed method outperforms baseline approaches. To our best knowledge, the proposed technique is the first to investigate intra-clique relationships with higher-order MRFs for stock price movement prediction.
Spectral analysis connects graph structure to the eigenvalues and eigenvectors of associated matrices. Much of spectral graph theory descends directly from spectral geometry, the study of differentiable manifolds through the spectra of associated differential operators. But the translation from spectral geometry to spectral graph theory has largely focused on results involving only a few extreme eigenvalues and their associated eigenvectors. Unlike in geometry, the study of graphs through the overall distribution of eigenvalues (the spectral density) is largely limited to simple random graph models. The interior of the spectrum of real-world graphs remains largely unexplored, difficult to compute and to interpret. In this paper, we delve into the heart of spectral densities of real-world graphs. We borrow tools developed in condensed matter physics, and add novel adaptations to handle the spectral signatures of common graph motifs. The resulting methods are highly efficient, as we illustrate by computing spectral densities for graphs with over a billion edges on a single compute node. Beyond providing visually compelling fingerprints of graphs, we show how the estimation of spectral densities facilitates the computation of many common centrality measures, and use spectral densities to estimate meaningful information about graph structure that cannot be inferred from the extremal eigenpairs alone.
Embeddings have become a key paradigm to learn graph representations and facilitate downstream graph analysis tasks. Existing graph embedding techniques either sample a large number of node pairs from a graph to learn node embeddings via stochastic optimization, or factorize a high-order proximity/adjacency matrix of the graph via expensive matrix factorization. However, these techniques usually require significant computational resources for the learning process, which hinders their applications on large-scale graphs. Moreover, the cosine similarity preserved by these techniques shows suboptimal efficiency in downstream graph analysis tasks, compared to Hamming similarity, for example. To address these issues, we propose NodeSketch, a highly-efficient graph embedding technique preserving high-order node proximity via recursive sketching. Specifically, built on top of an efficient data-independent hashing/sketching technique, NodeSketch generates node embeddings in Hamming space. For an input graph, it starts by sketching the self-loop-augmented adjacency matrix of the graph to output low-order node embeddings, and then recursively generates k-order node embeddings based on the self-loop-augmented adjacency matrix and (k-1)-order node embeddings. Our extensive evaluation compares NodeSketch against a sizable collection of state-of-the-art techniques using five real-world graphs on two graph analysis tasks. The results show that NodeSketch achieves state-of-the-art performance compared to these techniques, while showing significant speedup of 9x-372x in the embedding learning process and 1.19x-1.68x speedup when performing downstream graph analysis tasks.
Algorithm selection and hyperparameter tuning remain two of the most challenging tasks in machine learning. Automated machine learning (AutoML) seeks to automate these tasks to enable widespread use of machine learning by non-experts. This paper introduces OBOE, a collaborative filtering method for time-constrained model selection and hyperparameter tuning. OBOE forms a matrix of the cross-validated errors of a large number of supervised learning models (algorithms together with hyperparameters) on a large number of datasets, and fits a low rank model to learn the low-dimensional feature vectors for the models and datasets that best predict the cross-validated errors. To find promising models for a new dataset, OBOE runs a set of fast but informative algorithms on the new dataset and uses their cross-validated errors to infer the feature vector for the new dataset. OBOE can find good models under constraints on the number of models fit or the total time budget. To this end, this paper develops a new heuristic for active learning in time-constrained matrix completion based on optimal experiment design. Our experiments demonstrate that OBOE delivers state-of-the-art performance faster than competing approaches on a test bed of supervised learning problems. Moreover, the success of the bilinear model used by OBOE suggests that AutoML may be simpler than was previously understood.
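The low-rank idea in the abstract above can be illustrated with a small sketch. It is not the OBOE implementation: the rank-2 `model_factors` table, the function names, and the error values are all hypothetical, and the experiment-design step that chooses which models to run first is omitted. Given latent vectors for the models (assumed to have been learned offline from the error matrix), the latent vector of a new dataset is fitted by least squares from a few observed cross-validated errors, and the errors of the remaining models are then predicted with the bilinear form.

```python
# Hypothetical rank-2 latent vectors, one per model (algorithm + hyperparameters).
# In a real system these would come from a low-rank fit of the error matrix.
model_factors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 1.0],
]


def infer_dataset_vector(observed):
    """Least-squares fit of a 2-d dataset vector x.

    `observed` maps a model index to its measured cross-validated error.
    Solves the 2x2 normal equations (Y^T Y) x = Y^T e over the observed rows.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for j, err in observed.items():
        y1, y2 = model_factors[j]
        a11 += y1 * y1
        a12 += y1 * y2
        a22 += y2 * y2
        b1 += y1 * err
        b2 += y2 * err
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("observed models do not determine the dataset vector")
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


def predict_errors(x):
    """Predicted error of every model: inner product of model and dataset vectors."""
    return [y[0] * x[0] + y[1] * x[1] for y in model_factors]


# Two fast models were run on the new dataset (made-up error values).
observed = {0: 0.30, 1: 0.10}
x = infer_dataset_vector(observed)
predicted = predict_errors(x)
best = min(range(len(predicted)), key=predicted.__getitem__)
```

In this toy setup only two models are ever fitted on the new dataset; the other entries are filled in by the bilinear model, which is what makes a tight time budget workable.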
It is well known that historical logs are used for evaluating and learning policies in interactive systems, e.g. recommendation, search, and online advertising. Since direct online policy learning usually harms user experiences, it is more crucial to apply off-policy learning in real-world applications instead. Though there have been some existing works, most focus on learning with one single historical policy. However, in practice, usually a number of parallel experiments, e.g. multiple AB tests, are performed simultaneously. To make full use of such historical data, learning policies from multiple loggers becomes necessary. Motivated by this, in this paper, we investigate off-policy learning when the training data comes from multiple historical policies. Specifically, policies, e.g. neural networks, can be learned directly from multi-logger data, with counterfactual estimators. In order to better understand the generalization ability of such estimators, we conduct generalization error analysis for the empirical risk minimization problem. We then introduce the generalization error bound as the new risk function, which can be reduced to a constrained optimization problem. Finally, we give the corresponding learning algorithm for the new constrained problem, where we can appeal to the minimax problems to control the constraints. Extensive experiments on benchmark datasets demonstrate that the proposed methods achieve better performance than state-of-the-art methods.
Dynamic extensions of the stochastic block model (SBM) are of importance in several fields that generate temporal interaction data. These models, besides producing compact and interpretable network representations, can be useful in applications such as link prediction or network forecasting. In this paper we present a conditional pseudo-likelihood based extension to dynamic SBM that can be efficiently estimated by optimizing a regularized objective. Our formulation leads to a highly scalable approach that can handle very large networks, even with millions of nodes. We also extend our formalism to causal impact for networks, which allows us to quantify the impact of external events on a time-dependent sequence of networks. We support our work with extensive results on both synthetic and real networks.
In this paper we propose and study the problem of optimizing the influence of outdoor advertising (ad) when impression counts are taken into consideration. Given a database U of billboards, each of which has a location and a non-uniform cost, a trajectory database T and a budget B, it aims to find a set of billboards that has the maximum influence under the budget. In line with the advertising consumer behavior studies, we adopt the logistic function to take into account the impression counts of an ad (placed at different billboards) to a user trajectory when defining the influence measurement. However, this poses two challenges: (1) our problem is NP-hard to approximate within a factor of O(|T|^(1-ε)) for any ε>0 in polynomial time; (2) the influence measurement is non-submodular, which means a straightforward greedy approach is not applicable. Therefore, we propose a tangent line based algorithm to compute a submodular function to estimate the upper bound of influence. Henceforth, we introduce a branch-and-bound framework with a θ-termination condition, achieving θ2/(1 - 1/e) approximation ratio. However, this framework is time-consuming when |U| is huge. Thus, we further optimize it with a progressive pruning upper bound estimation approach which achieves θ2/(1 - 1/e - ε) approximation ratio and significantly decreases the running-time. We conduct the experiments on real-world billboard and trajectory datasets, and show that the proposed approaches outperform the baselines by 95% in effectiveness. Moreover, the optimized approach is around two orders of magnitude faster than the original framework.
We investigate online group formation where members seek to increase their learning potential via collaboration. We capture two common learning models: LpA where each member learns from all higher skilled ones, and LpD where the least skilled member learns from the most skilled one. We formulate the problem of forming groups with the purpose of optimizing peer learning under different affinity structures: AffD where group affinity is the smallest between all members, and AffC where group affinity is the smallest between a designated member (e.g., the least skilled or the most skilled) and all others. This gives rise to multiple variants of a multiobjective optimization problem. We propose principled modeling of these problems and investigate theoretical and algorithmic challenges. We first present hardness results, and then develop computationally efficient algorithms with constant approximation factors. Our real-data experiments demonstrate with statistical significance that forming groups considering affinity improves learning. Our extensive synthetic experiments demonstrate the qualitative and scalability aspects of our solutions.
Origin-Destination Matrix Prediction via Graph Convolution: a New Perspective of Passenger Demand Modeling
Ride-hailing applications are becoming more and more popular for providing drivers and passengers with convenient ride services, especially in metropolises like Beijing or New York. To obtain the passengers' mobility patterns, the online platforms of ride services need to predict the number of passenger demands from one region to another in advance. We formulate this problem as an Origin-Destination Matrix Prediction (ODMP) problem. Though this problem is essential to large-scale providers of ride services for helping them make decisions and some providers have already put it forward in public, existing studies have not solved this problem well. One of the main reasons is that the ODMP problem is more challenging than the common demand prediction. Besides the number of demands in a region, it also requires the model to predict the destinations of them. In addition, data sparsity is a severe issue. To solve the problem effectively, we propose a unified model, Grid-Embedding based Multi-task Learning (GEML) which consists of two components focusing on spatial and temporal information respectively. The Grid-Embedding part is designed to model the spatial mobility patterns of passengers and neighboring relationships of different areas, the pre-weighted aggregator of which aims to sense the sparsity and range of data. The Multi-task Learning framework focuses on modeling temporal attributes and capturing several objectives of the ODMP problem. The evaluation of our model is conducted on real operational datasets from UCAR and Didi. The experimental results demonstrate the superiority of our GEML against the state-of-the-art approaches.
Inspired by applications in sports where the skill of players or teams competing against each other varies over time, we propose a probabilistic model of pairwise-comparison outcomes that can capture a wide range of time dynamics. We achieve this by replacing the static parameters of a class of popular pairwise-comparison models by continuous-time Gaussian processes; the covariance function of these processes enables expressive dynamics. We develop an efficient inference algorithm that computes an approximate Bayesian posterior distribution. Despite the flexibility of our model, our inference algorithm requires only a few linear-time iterations over the data and can take advantage of modern multiprocessor computer architectures. We apply our model to several historical databases of sports outcomes and find that our approach outperforms competing approaches in terms of predictive performance, scales to millions of observations, and generates compelling visualizations that help in understanding and interpreting the data.
Automatically matching reviewers to papers is a crucial step of the peer review process for venues receiving thousands of submissions. Unfortunately, common paper matching algorithms often construct matchings suffering from two critical problems: (1) the group of reviewers assigned to a paper do not collectively possess sufficient expertise, and (2) reviewer workloads are highly skewed. In this paper, we propose a novel local fairness formulation of paper matching that directly addresses both of these issues. Since optimizing our formulation is not always tractable, we introduce two new algorithms, FairIR and FairFlow, for computing fair matchings that approximately optimize the new formulation. FairIR solves a relaxation of the local fairness formulation and then employs a rounding technique to construct a valid matching that provably maximizes the objective and only compromises on fairness with respect to reviewer loads and papers by a small constant. In contrast, FairFlow is not provably guaranteed to produce fair matchings, however it can be 2x as efficient as FairIR and an order of magnitude faster than matching algorithms that directly optimize for fairness. Empirically, we demonstrate that both FairIR and FairFlow improve fairness over standard matching algorithms on real conference data. Moreover, in comparison to state-of-the-art matching algorithms that optimize for fairness only, FairIR achieves higher objective scores, FairFlow achieves competitive fairness, and both are capable of more evenly allocating reviewers.
In contrast to the one-size-fits-all approach to medicine, precision medicine will allow targeted prescriptions based on the specific profile of the patient thereby avoiding adverse reactions and ineffective but expensive treatments. Longitudinal observational data such as Electronic Health Records (EHRs) have become an emerging data source for personalized medicine. In this paper, we propose a unified computational framework, called PerDREP, to predict the unique response patterns of each individual patient from EHR data. PerDREP models individual responses of each patient to the drug exposure by introducing a linear system to account for patients' heterogeneity, and incorporates a patient similarity graph as a network regularization. We formulate PerDREP as a convex optimization problem and develop an iterative gradient descent method to solve it. In the experiments, we identify the effect of drugs on Glycated hemoglobin test results. The experimental results provide evidence that the proposed method is not only more accurate than state-of-the-art methods, but is also able to automatically cluster patients into multiple coherent groups, thus paving the way for personalized medicine.
Modeling sequential interactions between users and items/products is crucial in domains such as e-commerce, social networking, and education. Representation learning presents an attractive opportunity to model the dynamic evolution of users and items, where each user/item can be embedded in a Euclidean space and its evolution can be modeled by an embedding trajectory in this space. However, existing dynamic embedding methods generate embeddings only when users take actions and do not explicitly model the future trajectory of the user/item in the embedding space. Here we propose JODIE, a coupled recurrent neural network model that learns the embedding trajectories of users and items. JODIE employs two recurrent neural networks to update the embedding of a user and an item at every interaction. Crucially, JODIE also models the future embedding trajectory of a user/item. To this end, it introduces a novel projection operator that learns to estimate the embedding of the user at any time in the future. These estimated embeddings are then used to predict future user-item interactions. To make the method scalable, we develop a t-Batch algorithm that creates time-consistent batches and leads to 9x faster training. We conduct six experiments to validate JODIE on two prediction tasks (future interaction prediction and state change prediction) using four real-world datasets. We show that JODIE outperforms six state-of-the-art algorithms in these tasks by at least 20% in predicting future interactions and 12% in state change prediction.
In this paper we use a time-evolving graph which consists of a sequence of graph snapshots over time to model many real-world networks. We study the path classification problem in a time-evolving graph, which has many applications in real-world scenarios, for example, predicting path failure in a telecommunication network and predicting path congestion in a traffic network in the near future. In order to capture the temporal dependency and graph structure dynamics, we design a novel deep neural network named Long Short-Term Memory R-GCN (LRGCN). LRGCN considers temporal dependency between time-adjacent graph snapshots as a special relation with memory, and uses relational GCN to jointly process both intra-time and inter-time relations. We also propose a new path representation method named self-attentive path embedding (SAPE), to embed paths of arbitrary length into fixed-length vectors. Through experiments on a real-world telecommunication network and a traffic network in California, we demonstrate the superiority of LRGCN to other competing methods in path failure prediction, and prove the effectiveness of SAPE on path representation.
Traffic signal control is essential for transportation efficiency in road networks. It has been a challenging problem because of the complexity in traffic dynamics. Conventional transportation research suffers from the incompetency to adapt to dynamic traffic situations. Recent studies propose to use reinforcement learning (RL) to search for more efficient traffic signal plans. However, most existing RL-based studies design the key elements - reward and state - in a heuristic way. This results in highly sensitive performances and a long learning process. To avoid the heuristic design of RL elements, we propose to connect RL with recent studies in transportation research. Our method is inspired by the state-of-the-art method max pressure (MP) in the transportation field. The reward design of our method is well supported by the theory in MP, which can be proved to be maximizing the throughput of the traffic network, i.e., minimizing the overall network travel time. We also show that our concise state representation can fully support the optimization of the proposed reward function. Through comprehensive experiments, we demonstrate that our method outperforms both conventional transportation approaches and existing learning-based methods.
Privacy is a big hurdle for collaborative data mining across multiple parties. We present PrivPy, a multi-party computation (MPC) framework designed for large-scale data mining tasks. PrivPy combines an easy-to-use and highly flexible Python programming interface with a state-of-the-art secret-sharing-based MPC backend. With essential data types and operations (such as NumPy arrays and broadcasting), as well as automatic code-rewriting, programmers can write modern data mining algorithms conveniently in familiar Python. We demonstrate that we can support many real-world machine learning algorithms (e.g. logistic regression and convolutional neural networks) and large datasets (e.g. 5000-by-1-million matrix) with minimal algorithm porting effort.
Network embedding has attracted increasing attention in recent years. It aims to learn a low-dimensional representation for each node of a network to benefit downstream tasks, such as node classification, link prediction, and network visualization. Essentially, the task of network embedding can be decoupled into discovering the proximity in the original space and preserving it in the low-dimensional space. Only with the well-discovered proximity can we preserve it in the low-dimensional space. Thus, it is critical to discover the proximity between different nodes to learn good node representations. To address this problem, in this paper, we propose a novel proximity generative adversarial network (ProGAN) which can generate proximities. As a result, the generated proximity can help to discover the complicated underlying proximity to benefit network embedding. To generate proximities, we design a novel neural network architecture. In particular, the generation of proximities is instantiated as the generation of triplets of nodes, which encode the similarity relationship between different nodes. In this way, the proposed ProGAN can generate proximities successfully to benefit network embedding. Finally, extensive experimental results have verified the effectiveness of ProGAN.
Characterizing temporal dependence patterns is a critical step in understanding the statistical properties of sequential data. Long Range Dependence (LRD), referring to long-range correlations decaying as a power law rather than exponentially w.r.t. distance, demands a different set of tools for modeling the underlying dynamics of the sequential data. While it has been widely conjectured that LRD is present in language modeling and sequential recommendation, the amount of LRD in the corresponding sequential datasets has not yet been quantified in a scalable and model-independent manner. We propose a principled estimation procedure of LRD in sequential datasets based on established LRD theory for real-valued time series and apply it to sequences of symbols with million-item-scale dictionaries. In our measurements, the procedure reliably estimates the LRD in the behavior of users as they write Wikipedia articles and as they interact with YouTube. We further show that measuring LRD better informs modeling decisions, in particular for RNNs, whose ability to capture LRD is still an active area of research. The quantitative measure informs new Evolutive Recurrent Neural Networks (EvolutiveRNNs) designs, leading to state-of-the-art results on language understanding and sequential recommendation tasks at a fraction of the computational cost.
Understanding learning materials (e.g. test questions) is a crucial issue in online learning systems, which can promote many applications in the education domain. Unfortunately, many supervised approaches suffer from the problem of scarce human-labeled data, whereas abundant unlabeled resources are highly underutilized. To alleviate this problem, an effective solution is to use pre-trained representations for question understanding. However, existing pre-training methods in the NLP area are infeasible for learning test question representations due to several domain-specific characteristics in education. First, questions usually comprise heterogeneous data including content text, images and side information. Second, there exists both basic linguistic information as well as domain logic and knowledge. To this end, in this paper, we propose a novel pre-training method, namely QuesNet, for comprehensively learning question representations. Specifically, we first design a unified framework to aggregate question information with its heterogeneous inputs into a comprehensive vector. Then we propose a two-level hierarchical pre-training algorithm to learn better understanding of test questions in an unsupervised way. Here, a novel holed language model objective is developed to extract low-level linguistic features, and a domain-oriented objective is proposed to learn high-level logic and knowledge. Moreover, we show that QuesNet has good capability of being fine-tuned in many question-based tasks. We conduct extensive experiments on large-scale real-world question data, where the experimental results clearly demonstrate the effectiveness of QuesNet for question understanding as well as its superior applicability.
Forecasting a large set of time series with hierarchical aggregation constraints is a central problem for many organizations. However, it is particularly challenging to forecast these hierarchical structures. In fact, it requires not only good forecast accuracy at each level of the hierarchy, but also the coherency between different levels, i.e. the forecasts should satisfy the hierarchical aggregation constraints. Given some incoherent base forecasts, the state-of-the-art methods compute revised forecasts based on forecast combination which ensures that the aggregation constraints are satisfied. However, these methods assume the base forecasts are unbiased and constrain the revised forecasts to be also unbiased. We propose a new forecasting method which relaxes these unbiasedness conditions, and seeks the revised forecasts with the best tradeoff between bias and forecast variance. We also present a regularization method which allows us to deal with high-dimensional hierarchies, and provide its theoretical justification. Finally, we compare the proposed method with the state-of-the-art methods both theoretically and empirically. The results on both simulated and real-world data indicate that our methods provide competitive results compared to the state-of-the-art methods.
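The coherency requirement in the abstract above can be made concrete with a small sketch. It shows the classical least-squares (orthogonal projection) reconciliation for a one-level hierarchy where a total must equal the sum of its children. This is a standard forecast-combination baseline and not the bias-variance method the abstract proposes. The function name and the numbers are illustrative assumptions.

```python
def reconcile_ols(total_forecast, child_forecasts):
    """Project incoherent base forecasts onto the set where total == sum(children).

    For a one-level hierarchy with n children, the least-squares projection
    spreads the gap evenly: the total moves by gap / (n + 1) toward the
    children's sum, and every child moves by the same amount toward the total.
    """
    n = len(child_forecasts)
    gap = total_forecast - sum(child_forecasts)
    shift = gap / (n + 1)
    revised_children = [c + shift for c in child_forecasts]
    revised_total = total_forecast - shift
    return revised_total, revised_children


# Made-up base forecasts: the total (100) disagrees with the children (40 + 50).
total, children = reconcile_ols(100.0, [40.0, 50.0])
```

After the call, `total` equals `sum(children)` up to floating-point error, so the revised forecasts satisfy the aggregation constraint while staying as close as possible (in squared error) to the base forecasts.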
Relation extraction in knowledge base construction has been researched over the last decades due to its applicability to many problems. Most classical works, such as supervised information extraction and distant supervision, focus on how to construct the knowledge base (KB) by utilizing a large number of labels or certain related KBs. However, in many real-world scenarios, the existing methods may not perform well when a new knowledge base is required but only scarce labels or few related KBs are available. In this paper, we propose a novel approach called Relation Extraction via Domain-aware Transfer Learning (ReTrans) to extract relation mentions from a given text corpus by exploring the experience from a large amount of existing KBs which may not be closely related to the target relation. We first propose to initialize the representation of relation mentions from the massive text corpus and update those representations according to existing KBs. Based on the representations of relation mentions, we investigate the contribution of each KB to the target task and propose to select useful KBs for boosting the effectiveness of the proposed approach. Based on selected KBs, we develop a novel domain-aware transfer learning framework to transfer knowledge from source domains to the target domain, aiming to infer the true relation mentions in the unstructured text corpus. Most importantly, we give the stability and generalization bound of ReTrans. Experimental results on real-world datasets demonstrate the effectiveness of our approach, which outperforms all the state-of-the-art baselines.
Network embedding (or graph embedding) has been widely used in many real-world applications. However, existing methods mainly focus on networks with single-typed nodes/edges and cannot scale well to handle large networks. Many real-world networks consist of billions of nodes and edges of multiple types, and each node is associated with different attributes. In this paper, we formalize the problem of embedding learning for the Attributed Multiplex Heterogeneous Network and propose a unified framework to address this problem. The framework supports both transductive and inductive learning. We also give the theoretical analysis of the proposed framework, showing its connection with previous works and proving its better expressiveness. We conduct systematical evaluations for the proposed framework on four different genres of challenging datasets: Amazon, YouTube, Twitter, and Alibaba. Experimental results demonstrate that with the learned embeddings from the proposed framework, we can achieve statistically significant improvements (e.g., 5.99-28.23% lift by F1 scores; p<<0.01, t-test) over previous state-of-the-art methods for link prediction. The framework has also been successfully deployed on the recommendation system of a worldwide leading e-commerce company, Alibaba Group. Results of the offline A/B tests on product recommendation further confirm the effectiveness and efficiency of the framework in practice.
Knowledge transfer has been of great interest in current machine learning research, as many have speculated about its importance in modeling the human ability to rapidly generalize learned models to new scenarios. Particularly in cases where training samples are limited, knowledge transfer improves both the learning speed and the generalization performance of related tasks. Recently, Learning Using Privileged Information (LUPI) has presented a new direction in knowledge transfer by modeling the transfer of prior knowledge as a Teacher-Student interaction process. Under LUPI, a Teacher model uses Privileged Information (PI) that is only available at training time to improve the sample complexity required to train a Student learner for a given task. In this work, we present a LUPI formulation that allows privileged information to be retained in a multi-task learning setting. We propose a novel feature matching algorithm that projects samples from the original feature space and the privileged information space into a joint latent space in a way that informs similarity between training samples. Our experiments show that useful knowledge from PI is maintained in the latent space and greatly improves the sample efficiency of other related learning tasks. We also provide an analysis of the sample complexity of the proposed LUPI method, which under some favorable assumptions can achieve greater sample efficiency than brute-force methods.
The k-d tree (Friedman et al., 1976) has long been deemed unsuitable for exact nearest-neighbor search in high-dimensional data. The theoretical guarantees and the empirical performance of the k-d tree do not show significant improvements over brute-force nearest-neighbor search in moderate to high dimensions. The k-d tree has been used relatively more successfully for approximate search (Muja and Lowe, 2009) but lacks theoretical guarantees. In this article, we build upon randomized-partition trees (Dasgupta and Sinha, 2013) to propose k-d tree based approximate search schemes with $O(d \log d + \log n)$ query time for data sets with $n$ points in $d$ dimensions and rigorous theoretical guarantees on the search accuracy. We empirically validate the search accuracy and the query time guarantees of our proposed schemes, demonstrating significantly improved scaling for the same level of accuracy.
This work studies product question answering (PQA) which aims to answer product-related questions based on customer reviews. Most recent PQA approaches adopt end2end semantic matching methodologies, which map questions and answers to a latent vector space to measure their relevance. Such methods often achieve superior performance but it tends to be difficult to interpret why. On the other hand, simple keyword-based search methods exhibit natural interpretability through matched keywords, but often suffer from the lexical gap problem. In this work, we develop a new PQA framework (named Riker) that enjoys the benefits of both interpretability and effectiveness. Riker mines rich keyword representations of a question with two major components, internal word re-weighting and external word association, which predict the importance of each question word and associate the question with outside relevant keywords respectively, and can be jointly trained under weak supervision with large-scale QA pairs. The keyword representations from Riker can be directly used as input to a keyword-based search module, enabling the whole process to be effective while preserving good interpretability. We conduct extensive experiments using Amazon QA and review datasets from 5 different departments, and our results show that Riker substantially outperforms previous state-of-the-art methods in both synthetic settings and real user evaluations. In addition, we compare keyword representations from Riker and those from attention mechanisms popularly used for deep neural networks through case studies, showing that the former are more effective and interpretable.
Graph Convolutional Networks (GCNs) are an emerging type of neural network model on graphs which have achieved state-of-the-art performance in the task of node classification. However, recent studies show that GCNs are vulnerable to adversarial attacks, i.e. small deliberate perturbations in graph structures and node attributes, which poses great challenges for applying GCNs to real world applications. How to enhance the robustness of GCNs remains a critical open problem. To address this problem, we propose Robust GCN (RGCN), a novel model that "fortifies" GCNs against adversarial attacks. Specifically, instead of representing nodes as vectors, our method adopts Gaussian distributions as the hidden representations of nodes in each convolutional layer. In this way, when the graph is attacked, our model can automatically absorb the effects of adversarial changes in the variances of the Gaussian distributions. Moreover, to remedy the propagation of adversarial attacks in GCNs, we propose a variance-based attention mechanism, i.e. assigning different weights to node neighborhoods according to their variances when performing convolutions. Extensive experimental results demonstrate that our proposed method can effectively improve the robustness of GCNs. On three benchmark graphs, our RGCN consistently shows a substantial gain in node classification accuracy compared with state-of-the-art GCNs against various adversarial attack strategies.
Multi-task learning aims to learn multiple tasks jointly by sharing information among related tasks such that the generalization performance over different tasks can be improved. Although multi-task learning has been demonstrated to obtain performance gains in comparison with single-task learning, the main challenge of learning what to share with whom is still not fully resolved. In this paper, we propose a robust clustered multi-task learning approach that clusters tasks into several groups by learning the representative tasks. The main assumption behind our approach is that each task can be represented by a linear combination of some representative tasks that can characterize all tasks. The correlation between tasks can be indicated by the corresponding combination coefficient. By imposing a row-sparse constraint on the correlation matrix, our approach can select the representative tasks and encourage information sharing among the related tasks. In addition, the $\ell_{1,2}$-norm is applied to the representation loss to enhance the robustness of our approach. To solve the resulting bi-convex optimization problem, we design an efficient optimization method based on the alternating direction method of multipliers and the accelerated proximal gradient method. Finally, experimental results on synthetic and real-world data sets validate the effectiveness of the proposed approach.
Scalable Global Alignment Graph Kernel Using Random Features: From Node Embedding to Graph Embedding
Graph kernels are widely used for measuring the similarity between graphs. Many existing graph kernels, which focus on local patterns within graphs rather than their global properties, suffer from significant structure information loss when representing graphs. Some recent global graph kernels, which utilize the alignment of geometric node embeddings of graphs, yield state-of-the-art performance. However, these graph kernels are not necessarily positive-definite. More importantly, computing the graph kernel matrix has at least quadratic time complexity in terms of the number and the size of the graphs. In this paper, we propose a new family of global alignment graph kernels, which take into account the global properties of graphs by using geometric node embeddings and an associated node transportation based on earth mover's distance. Compared to existing global kernels, the proposed kernel is positive-definite. Our graph kernel is obtained by defining a distribution over random graphs, which can naturally yield random feature approximations. The random feature approximations lead to our graph embeddings, which we name "random graph embeddings" (RGE). In particular, RGE is shown to achieve (quasi-)linear scalability with respect to the number and the size of the graphs. The experimental results on nine benchmark datasets demonstrate that RGE outperforms or matches twelve state-of-the-art graph classification algorithms.
Graph embedding learns low-dimensional representations for nodes in a graph and effectively preserves the graph structure. Recently, a significant amount of progress has been made toward this emerging research area. However, there are several fundamental problems that remain open. First, existing methods fail to preserve the out-degree distributions on directed graphs. Second, many existing methods employ random walk based proximities and thus suffer from conflicting optimization goals on undirected graphs. Finally, existing factorization methods are unable to achieve scalability and non-linearity simultaneously. This paper presents an in-depth study on graph embedding techniques on both directed and undirected graphs. We analyze the fundamental reasons that lead to the distortion of out-degree distributions and to the conflicting optimization goals. We propose transpose proximity, a unified approach that solves both problems. Based on the concept of transpose proximity, we design STRAP, a factorization based graph embedding algorithm that achieves scalability and non-linearity simultaneously. STRAP makes use of the backward push algorithm to efficiently compute the sparse Personalized PageRank (PPR) as its transpose proximities. By imposing the sparsity constraint, we are able to apply non-linear operations to the proximity matrix and perform efficient matrix factorization to derive the embedding vectors. Finally, we present an extensive experimental study that evaluates the effectiveness of various graph embedding algorithms, and we show that STRAP outperforms the state-of-the-art methods in terms of effectiveness and scalability.
We introduce Grinch, a new algorithm for large-scale, non-greedy hierarchical clustering with general linkage functions that compute arbitrary similarity between two point sets. The key components of Grinch are its rotate and graft subroutines that efficiently reconfigure the hierarchy as new points arrive, supporting discovery of clusters with complex structure. Grinch is motivated by a new notion of separability for clustering with linkage functions: we prove that when the linkage function is consistent with a ground-truth clustering, Grinch is guaranteed to produce a cluster tree containing the ground-truth, independent of data arrival order. Our empirical results on benchmark and author coreference datasets (with standard and learned linkage functions) show that Grinch is more accurate than other scalable methods, and orders of magnitude faster than hierarchical agglomerative clustering.
The Multi-Armed Bandit (MAB) is a fundamental model capturing the dilemma between exploration and exploitation in sequential decision making. At every time step, the decision maker selects a set of arms and observes a reward from each of the chosen arms. In this paper, we present a variant of the problem, which we call the Scaling MAB (S-MAB): The goal of the decision maker is not only to maximize the cumulative rewards, i.e., choosing the arms with the highest expected reward, but also to decide how many arms to select so that, in expectation, the cost of selecting arms does not exceed the rewards. This problem is relevant to many real-world applications, e.g., online advertising, financial investments or data stream monitoring. We propose an extension of Thompson Sampling, which has strong theoretical guarantees and is reported to perform well in practice. Our extension dynamically controls the number of arms to draw. Furthermore, we combine the proposed method with ADWIN, a state-of-the-art change detector, to deal with non-static environments. We illustrate the benefits of our contribution via a real-world use case on predictive maintenance.
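As a point of reference for the abstract above, the following is a minimal sketch of standard Thompson Sampling with Bernoulli rewards and a fixed number of arms played per step. It is not the S-MAB extension: the number of arms is held constant here, whereas the paper controls it dynamically, and there is no change detection. The function names `thompson_select` and `run_bandit`, the uniform Beta(1, 1) prior, and the simulated reward probabilities are all assumptions made for illustration.

```python
import random


def thompson_select(successes, failures, num_arms_to_play, rng):
    """Sample each arm's Beta posterior and return the indices of the top draws."""
    # Beta(s + 1, f + 1) is the posterior under a uniform Beta(1, 1) prior (assumed).
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    ranked = sorted(range(len(samples)), key=lambda i: samples[i], reverse=True)
    return ranked[:num_arms_to_play]


def run_bandit(true_means, num_arms_to_play, steps, seed=0):
    """Simulate Bernoulli arms and return per-arm success and failure counts."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    for _ in range(steps):
        for arm in thompson_select(successes, failures, num_arms_to_play, rng):
            # Observe a 0/1 reward from every chosen arm and update its posterior.
            if rng.random() < true_means[arm]:
                successes[arm] += 1
            else:
                failures[arm] += 1
    return successes, failures
```

Calling `run_bandit([0.9, 0.1, 0.5], num_arms_to_play=1, steps=2000)` concentrates most plays on the first arm, since its posterior quickly dominates the others.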
We study the problem of scaling Multinomial Logistic Regression (MLR) to datasets with a very large number of data points in the presence of a large number of classes. At a scale where neither the data nor the parameters fit on a single machine, we argue that simultaneous data and model parallelism (Hybrid Parallelism) is inevitable. The key challenge in achieving such a form of parallelism in MLR is the log-partition function, which needs to be computed across all K classes per data point, thus making model parallelism non-trivial. To overcome this problem, we propose a reformulation of the original objective that exploits double-separability, an attractive property that naturally leads to hybrid parallelism. Our algorithm (DS-MLR) is asynchronous and completely de-centralized, requiring minimal communication across workers while keeping both data and parameter workloads partitioned. Unlike standard data parallel approaches, DS-MLR avoids bulk-synchronization by maintaining local normalization terms on each worker and accumulating them incrementally using a token-ring topology. We demonstrate the versatility of DS-MLR under various scenarios in data and model parallelism, through an empirical study consisting of real-world datasets. In particular, to demonstrate scaling via hybrid parallelism, we created a new benchmark dataset (Reddit-Full) by pre-processing 1.7 billion Reddit user comments spanning the period 2007-2015. We used DS-MLR to solve an extreme multi-class classification problem of classifying 211 million data points into their corresponding subreddits. Reddit-Full is a massive data set with data occupying 228 GB and 44 billion parameters occupying 358 GB. To the best of our knowledge, no other existing methods can handle MLR in this setting.
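To make the role of the log-partition function concrete, the standard MLR negative log-likelihood can be written as below. This is the textbook objective, not the paper's doubly-separable reformulation; the symbols $w_k$, $x_i$, $y_i$, and $N$ are notation assumed here for illustration.

```latex
% Standard multinomial logistic regression with K classes and N data points.
p(y = k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{j=1}^{K} \exp(w_j^\top x)},
\qquad
\mathcal{L}(W) = -\sum_{i=1}^{N} w_{y_i}^\top x_i
  + \sum_{i=1}^{N} \log \sum_{k=1}^{K} \exp(w_k^\top x_i).
```

The second term is the log-partition function: for every data point it sums over all $K$ weight vectors, so it couples parameters that one would like to place on different workers.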
In this work, we propose a moderate policy update method for reinforcement learning, which encourages the agent to explore more boldly in early episodes but updates the policy more cautiously. Based on the maximum entropy framework, we propose a softer objective with more conservative constraints and build separated trust regions for optimization. To reduce the variance of the expected entropy return, a calculated state policy entropy of the Gaussian distribution is preferred over collecting log probabilities by sampling. This new method, which we call separated trust region for policy mean and variance (STRMV), can be viewed as an extension to proximal policy optimization (PPO), but it is gentler in its policy updates and more lively in its exploration. We test our approach on a wide variety of continuous control benchmark tasks in the MuJoCo environment. The experiments demonstrate that STRMV outperforms the previous state-of-the-art on-policy methods, not only achieving higher rewards but also improving the sample efficiency.
One of the most interesting application scenarios in anomaly detection is when sequential data are targeted. For example, in a safety-critical environment, it is crucial to have an automatic detection system to screen the streaming data gathered by monitoring sensors and to report abnormal observations if detected in real-time. Oftentimes, stakes are much higher when these potential anomalies are intentional or goal-oriented. We propose an end-to-end framework for sequential anomaly detection using inverse reinforcement learning (IRL), whose objective is to determine the decision-making agent's underlying function which triggers his/her behavior. The proposed method takes the sequence of actions of a target agent (and possibly other meta information) as input. The agent's normal behavior is then understood by the reward function which is inferred via IRL.
We use a neural network to represent a reward function. Using a learned reward function, we evaluate whether a new observation from the target agent follows a normal pattern. In order to construct a reliable anomaly detection method and take into consideration the confidence of the predicted anomaly score, we adopt a Bayesian approach for IRL. The empirical study on publicly available real-world data shows that our proposed method is effective in identifying anomalies.
Given past sequential sets of elements, predicting the subsequent sets of elements is an important problem in different domains. Given the past orders of customers, predicting the items that are likely to be bought in their following orders can provide information about future purchase intentions. Given the past clinical records of patients at each hospital visit, predicting the clinical records of subsequent visits can provide information about future disease progression. This useful information can help to make better decisions in different domains. However, existing methods have not studied this problem well. In this paper, we formulate this problem as a sequential sets to sequential sets learning problem. We propose an end-to-end learning approach based on an encoder-decoder framework to solve the problem. In the encoder, our approach maps the set of elements at each past time step into a vector. In the decoder, our method decodes the set of elements at each subsequent time step from the vectors with a set-based attention mechanism. The repeated elements pattern is also considered in our method to further improve the performance. In addition, our objective function addresses the imbalance and correlation existing among the predicted elements. The experimental results on three real-world data sets show that our method outperforms the best performance of the compared methods with respect to recall and person-wise hit ratio by 2.7-20.6% and 2.1-26.3%, respectively. Our analysis also shows that our decoder generalizes well to output sequential sets that are even longer than the outputs of the training instances.
Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on 686,765 data columns retrieved from the VizNet corpus by matching 78 semantic types from DBpedia to column headers. We characterize each matched column with 1,588 features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
In this paper we consider the following important problem: when we explore data visually and observe patterns, how can we determine their statistical significance? Patterns observed in exploratory analysis are traditionally met with scepticism, since the hypotheses are formulated while viewing the data, rather than before doing so. In contrast to this belief, we show that it is, in fact, possible to evaluate the significance of patterns also during exploratory analysis, and that the knowledge of the analyst can be leveraged to improve statistical power by reducing the amount of simultaneous comparisons. We develop a principled framework for determining the statistical significance of visually observed patterns. Furthermore, we show how the significance of visual patterns observed during iterative data exploration can be determined. We perform an empirical investigation on real and synthetic tabular data and time series, using different test statistics and methods for generating surrogate data. We conclude that the proposed framework allows determining the significance of visual patterns during exploratory analysis.
Social recommendation has been playing an important role in suggesting items to users by utilizing information from social connections. However, most existing approaches do not consider the attention factor: people can only accept a limited amount of information due to the limited strength of mind, which social science has identified as an intrinsic physiological property of humans. We address this issue by resorting to the concept of limited attention in social science and combining it with machine learning techniques. Introducing the idea of limited attention into social recommendation raises two challenges that existing methods fail to solve: i) how to develop a mathematical model that can optimally choose a subset of friends for each user such that these friends' preferences best influence the target user, and ii) how the model can learn an optimal attention for each of these selected friends. To tackle these challenges, we first formulate the problem of optimal limited attention in social recommendation. We then develop a novel algorithm that employs an EM-style strategy to jointly optimize users' latent preferences, the optimal number of their most influential friends, and the corresponding attentions. We also give a rigorous proof to guarantee the algorithm's optimality. The proposed model is capable of efficiently finding an optimal number of friends whose preferences have the best impact on the target user, as well as adaptively learning an optimal personalized attention towards every selected friend with respect to the best recommendation accuracy. Extensive experiments on real-world datasets demonstrate the superiority of our proposed model over several state-of-the-art algorithms.
We present SPuManTE, an efficient algorithm for mining significant patterns from a transactional dataset. SPuManTE controls the Family-wise Error Rate: it ensures that the probability of reporting one or more false discoveries is less than a user-specified threshold. A key ingredient of SPuManTE is UT, our novel unconditional statistical test for evaluating the significance of a pattern, which requires fewer assumptions on the data generation process and is more appropriate for a knowledge discovery setting than classical conditional tests, such as the widely used Fisher's exact test. Computational requirements have limited the use of unconditional tests in significant pattern discovery, but UT overcomes this issue by obtaining the required probabilities in a novel efficient way. SPuManTE combines UT with recent results on the supremum of the deviations of pattern frequencies from their expectations, founded in statistical learning theory. This combination allows SPuManTE to be very efficient, while also enjoying high statistical power. The results of our experimental evaluation show that SPuManTE allows the discovery of statistically significant patterns while properly accounting for uncertainties in patterns' frequencies due to the data generation process.
Inspired by convolutional neural networks on 1D and 2D data, graph convolutional neural networks (GCNNs) have been developed for various learning tasks on graph data, and have shown superior performance on real-world datasets. Despite their success, there is a dearth of theoretical explorations of GCNN models such as their generalization properties. In this paper, we take a first step towards developing a deeper theoretical understanding of GCNN models by analyzing the stability of single-layer GCNN models and deriving their generalization guarantees in a semi-supervised graph learning setting. In particular, we show that the algorithmic stability of a GCNN model depends upon the largest absolute eigenvalue of its graph convolution filter. Moreover, to ensure the uniform stability needed to provide strong generalization guarantees, the largest absolute eigenvalue must be independent of the graph size. Our results shed new insights on the design of new & improved graph convolution filters with guaranteed algorithmic stability. We evaluate the generalization gap and stability on various real-world graph datasets and show that the empirical results indeed support our theoretical findings. To the best of our knowledge, we are the first to study stability bounds on graph learning in a semi-supervised setting and derive generalization bounds for GCNN models.
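The dependence on graph size mentioned above can be illustrated with a small numerical sketch. It compares two filters on star graphs: the unnormalized A + I and the symmetrically normalized D^{-1/2}(A + I)D^{-1/2}. Both filters, the star-graph example, and the helper names are assumptions chosen for illustration and are not taken from the paper's analysis. Power iteration is used to estimate the largest absolute eigenvalue.

```python
def star_with_self_loops(n):
    """Adjacency matrix A + I of a star graph: node 0 is linked to all others."""
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0
        if i > 0:
            matrix[0][i] = matrix[i][0] = 1.0
    return matrix


def symmetric_normalize(matrix):
    """Return D^{-1/2} M D^{-1/2}, where D holds the row sums of M."""
    degrees = [sum(row) for row in matrix]
    n = len(matrix)
    return [[matrix[i][j] / (degrees[i] * degrees[j]) ** 0.5 for j in range(n)]
            for i in range(n)]


def power_iteration(matrix, iterations=200):
    """Estimate the largest absolute eigenvalue of a symmetric matrix."""
    n = len(matrix)
    vector = [1.0 + 0.1 * i for i in range(n)]  # generic positive start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(row[j] * vector[j] for j in range(n)) for row in matrix]
        norm = sum(x * x for x in product) ** 0.5
        if norm == 0.0:
            return 0.0
        vector = [x / norm for x in product]
        eigenvalue = norm  # norm of M v for unit v approaches |lambda_max|
    return eigenvalue
```

For a star graph with n nodes, the unnormalized filter has largest absolute eigenvalue 1 + sqrt(n - 1), which grows with n, while the normalized filter stays at 1 regardless of n.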
The Hidden Markov Model (HMM) is a powerful tool that has been widely adopted in sequence modeling tasks, such as mobility analysis, healthcare informatics, and online recommendation. However, using HMMs for modeling personalized sequences remains a challenging problem: training a unified HMM with all the sequences often fails to uncover interesting personalized patterns, yet training one HMM for each individual inevitably suffers from data scarcity. We address this challenge by proposing a state-sharing sparse hidden Markov model (S3HMM) that can uncover personalized sequential patterns without suffering from data scarcity. This is achieved by two design principles: (1) all the HMMs in the ensemble share the same set of latent states; and (2) each HMM has its own transition matrix to model the personalized transitions. The resulting optimization problem for S3HMM is nontrivial because of its two-layer hidden state design and the non-convexity in parameter estimation. We design a new Expectation-Maximization based algorithm, which treats difference of convex programming as a sub-solver to optimize the non-convex function in the M-step with a convergence guarantee. Our experimental results show that S3HMM can successfully uncover personalized sequential patterns in various applications and significantly outperforms baselines in downstream prediction tasks.
We present ARU, an Adaptive Recurrent Unit for streaming adaptation of deep globally trained time-series forecasting models. The ARU combines the advantages of learning complex data transformations across multiple time series from deep global models, with per-series localization offered by closed-form linear models. Unlike existing methods of adaptation that are either memory-intensive or non-responsive after training, ARUs require only fixed sized state and adapt to streaming data via an easy RNN-like update operation. The core principle driving ARU is simple --- maintain sufficient statistics of conditional Gaussian distributions and use them to compute local parameters in closed form. Our contribution is in embedding such local linear models in globally trained deep models while allowing end-to-end training on the one hand, and easy RNN-like updates on the other. Across several datasets we show that ARU is more effective than recently proposed local adaptation methods that tax the global network to compute local parameters.
Session-based Recommendation (SR) is the task of recommending the next item based on previously recorded user interactions. In this work, we study SR in a practical streaming scenario, namely Streaming Session-based Recommendation (SSR), which is a more challenging task due to (1) the uncertainty of user behaviors, and (2) the continuous, large-volume, high-velocity nature of the session data. Recent studies address (1) by exploiting the attention mechanism in Recurrent Neural Network (RNN) to better model the user's current intent, which leads to promising improvements. However, the proposed attention models are based solely on the current session. Moreover, existing studies only perform SR under static offline settings and none of them explore (2). In this work, we target SSR and propose a Streaming Session-based Recommendation Machine (SSRM) to tackle these two challenges. Specifically, to better understand the uncertainty of user behaviors, we propose a Matrix Factorization (MF) based attention model, which improves the commonly used attention mechanism by leveraging the user's historical interactions. To deal with the large-volume and high-velocity challenge, we introduce a reservoir-based streaming model where an active sampling strategy is proposed to improve the efficiency of model updating. We conduct extensive experiments on two real-world datasets. The experimental results demonstrate the superiority of the SSRM method compared to several state-of-the-art methods in terms of MRR and Recall.
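For readers unfamiliar with reservoirs, the sketch below shows classic uniform reservoir sampling over a stream. It is a generic baseline only: the active sampling strategy described in the abstract is not reproduced here, and the function name `reservoir_sample` and its parameters are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, capacity, seed=0):
    """Keep a uniform random sample of at most `capacity` items from a stream."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < capacity:
            # Fill the reservoir with the first `capacity` items.
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so slot is uniform over
            # 0..index and the new item is kept with probability
            # capacity / (index + 1).
            slot = rng.randint(0, index)
            if slot < capacity:
                reservoir[slot] = item
    return reservoir
```

The memory footprint stays fixed at `capacity` items no matter how long the stream runs, which is the property that makes a reservoir attractive for high-volume session data.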
Unstructured clinical texts contain rich health-related information. To better utilize the knowledge buried in clinical texts, discovering synonyms for a medical query term has become an important task. Recent automatic synonym discovery methods leveraging raw text information have been developed. However, to preserve patient privacy and security, it is usually quite difficult to get access to large-scale raw clinical texts. In this paper, we study a new setting named synonym discovery on privacy-aware clinical data (i.e., medical terms extracted from the clinical texts and their aggregated co-occurrence counts, without raw clinical texts). To solve the problem, we propose a new framework SurfCon that leverages two important types of information in the privacy-aware clinical data, i.e., the surface form information, and the global context information for synonym discovery. In particular, the surface form module enables us to detect synonyms that look similar while the global context module plays a complementary role to discover synonyms that are semantically similar but in different surface forms, and both allow us to deal with the OOV query issue (i.e., when the query is not found in the given data). We conduct extensive experiments and case studies on publicly available privacy-aware clinical data, and show that SurfCon can outperform strong baseline methods by large margins under various settings.
The Semi-Supervised Support Vector Machine (S3VM) is one of the most popular methods for semi-supervised learning. To avoid the trivial solution of classifying all the unlabeled examples into the same class, a balancing constraint is often used with S3VM (denoted as BCS3VM). Recently, a novel incremental learning algorithm (IL-S3VM) based on the path following technique was proposed to significantly scale up S3VM. However, the dynamic relationship of the balancing constraint with previous labeled and unlabeled samples impedes this incremental method from handling BCS3VM. To fill this gap, in this paper, we propose a new incremental S3VM algorithm (IL-BCS3VM) based on IL-S3VM which can effectively handle the balancing constraint and directly update the solution of BCS3VM. Specifically, to handle the dynamic relationship of the balancing constraint with previous labeled and unlabeled samples, we design two unique procedures which respectively eliminate the balancing constraint from S3VM and add it back. More importantly, we provide a finite convergence analysis for our IL-BCS3VM algorithm. Experimental results on a variety of benchmark datasets not only confirm the finite convergence of IL-BCS3VM, but also show a huge reduction in computational time compared with existing batch and incremental learning algorithms, while retaining similar generalization performance.
In this paper, we propose Task-Adversarial co-Generative Nets (TAGN) for learning from multiple tasks. It aims to address the two fundamental issues of multi-task learning, i.e., domain shift and limited labeled data, in a principled way. To this end, TAGN first learns the task-invariant representations of features to bridge the domain shift among tasks. Based on the task-invariant features, TAGN generates the plausible examples for each task to tackle the data scarcity issue. In TAGN, we leverage multiple game players to gradually improve the quality of the co-generation of features and examples by using an adversarial strategy. It simultaneously learns the marginal distribution of task-invariant features across different tasks and the joint distributions of examples with labels for each task. The theoretical study shows the desired results: at the equilibrium point of the multi-player game, the feature extractor exactly produces the task-invariant features for different tasks, while both the generator and the classifier perfectly replicate the joint distribution for each task. The experimental results on the benchmark data sets demonstrate the effectiveness of the proposed approach.
Interest in determinantal point processes (DPPs) is increasing in machine learning due to their ability to provide an elegant parametric model over combinatorial sets. In particular, the number of required parameters in a DPP grows only quadratically with the size of the ground set (e.g., item catalog), while the number of possible sets of items grows exponentially. Recent work has shown that DPPs can be effective models for product recommendation and basket completion tasks, since they are able to account for both the diversity and quality of items within a set. We present an enhanced DPP model that is specialized for the task of basket completion, the tensorized DPP. We leverage ideas from tensor factorization in order to customize the model for the next-item basket completion task, where the next item is captured in an extra dimension of the model. We evaluate our model on several real-world datasets, and find that the tensorized DPP provides significantly better predictive quality in several settings than a number of state-of-the-art models.
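As a rough illustration of the quadratic-versus-exponential point above, the following sketch evaluates a standard L-ensemble DPP, in which a subset S has probability det(L_S) / det(L + I). It is not the tensorized DPP of the paper; the 3-item kernel `L` and the feature matrix `B` are made-up values chosen so that items 0 and 1 are nearly identical.

```python
from itertools import combinations


def det(m):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    n = len(m)
    if n == 0:
        return 1.0  # determinant of the empty matrix, needed for the empty subset
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def submatrix(L, idx):
    """Principal submatrix of L restricted to the rows and columns in idx."""
    return [[L[i][j] for j in idx] for i in idx]


def dpp_prob(L, subset):
    """Probability of `subset` under an L-ensemble DPP: det(L_S) / det(L + I)."""
    n = len(L)
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return det(submatrix(L, subset)) / det(L_plus_I)


# Hypothetical item features: items 0 and 1 are similar, item 2 is different.
B = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
# Kernel L = B B^T has only 3 x 3 entries but defines a distribution over 2^3 subsets.
L = [[sum(a * b for a, b in zip(B[i], B[j])) for j in range(3)] for i in range(3)]

all_subsets = [s for k in range(4) for s in combinations(range(3), k)]
probs = {s: dpp_prob(L, s) for s in all_subsets}
```

In this toy kernel the dissimilar pair (0, 2) receives far more probability mass than the near-duplicate pair (0, 1), which is the diversity behavior the abstract refers to.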
The question of transparency has become a key point of contention between buyers and sellers of display advertising space: ads are allocated via complex, black-box auction systems whose mechanics can be difficult to model let alone optimize against. Motivated by this concern, this paper takes the perspective of a single advertiser and develops statistical tests to confirm whether an underlying auction mechanism is dynamically incentive compatible (IC), so that truthful bidding in each individual auction and across time is an optimal strategy. The most general notion of dynamic-IC presumes that the seller knows how buyers discount future surplus, which is questionable in practice. We characterize dynamic mechanisms that are dynamic-IC for all possible discounting factors according to two intuitive conditions: the mechanism should be IC at each stage in the usual sense, and expected present utility (under truthful bidding) should be independent of past bids. The conditions motivate two separate experiments based on bid perturbations that can be run simultaneously on the same impression traffic. We provide a novel statistical test of stage-IC along with a test for utility-independence that can detect lags in how the seller uses past bid information. We evaluate our tests on display ad data from a major ad exchange and show how they can accurately uncover evidence of first- or second-price auctions coupled with dynamic reserve prices, among other types of dynamic mechanisms.
The Impact of Person-Organization Fit on Talent Management: A Structure-Aware Convolutional Neural Network Approach
Person-Organization fit (P-O fit) refers to the compatibility between employees and their organizations. The study of P-O fit is important for enhancing proactive talent management. While considerable efforts have been made in this direction, a quantitative and holistic way of measuring P-O fit and its impact on talent management is still lacking. To this end, in this paper, we propose a novel data-driven neural network approach for dynamically modeling the compatibility in P-O fit and its meaningful relationships with two critical issues in talent management, namely talent turnover and job performance. Specifically, inspired by practical management scenarios, we first design an Organizational Structure-aware Convolutional Neural Network (OSCN) for hierarchically extracting organization-aware compatibility features for measuring P-O fit. Then, to capture the dynamic nature of P-O fit and its consequent impact, we further exploit an adapted Recurrent Neural Network with attention mechanism to model the temporal information of P-O fit. Finally, we compare our approach with a number of state-of-the-art baseline methods on real-world talent data. Experimental results clearly demonstrate its effectiveness in terms of turnover prediction and job performance prediction. Moreover, we also show some interesting indicators of talent management through the visualization of network layers.
Conditions play an essential role in scientific observations, hypotheses, and statements. Unfortunately, existing scientific knowledge graphs (SciKGs) represent factual knowledge as a flat relational network of concepts, just like KGs in the general domain, without considering the conditions under which the facts are valid, which loses important context for inference and exploration. In this work, we propose a novel representation of SciKG, which has three layers. The first layer has concept nodes, attribute nodes, as well as the attaching links from attribute to concept. The second layer represents both fact tuples and condition tuples. Each tuple is a node of the relation name, connecting to the subject and object that are concept or attribute nodes in the first layer. The third layer has nodes of statement sentences traceable to the original paper and authors. Each statement node connects to a set of fact tuples and/or condition tuples in the second layer. We design a semi-supervised Multi-Input Multi-Output sequence labeling model that learns complex dependencies between the sequence tags from multiple signals and generates output sequences for fact and condition tuples. It has a self-training module of multiple strategies to leverage the massive scientific data for better performance when manual annotation is limited. Experiments on a data set of 141M sentences show that our model outperforms existing methods and the SciKGs we constructed provide a good understanding of the scientific statements.
The popularity of mobile Internet techniques and Online-To-Offline (O2O) business models has led to the emergence of various spatial crowdsourcing (SC) platforms in our daily life. A core issue of SC platforms is to assign tasks to suitable crowd workers. Existing approaches usually focus on the matching of two types of objects, tasks and workers, and let workers travel to the location of users to provide services, which is a 2D matching problem. However, recent services provided by some new platforms, such as personalized haircut service and station ride-sharing, need users and workers to travel together to a third workplace to complete the service, which is indeed a 3D matching problem. Approaches in the existing studies either cannot solve such a 3D matching problem, or lack an assignment plan satisfying both users' and workers' preferences in real applications. Thus, in this paper, we propose 3-Dimensional Stable Spatial Matching (3D-SSM) for the 3D matching problem in new SC services. We prove that the 3D-SSM problem is NP-hard, and propose two baseline algorithms and two efficient approximation algorithms with bounded approximation ratios to solve it. Finally, we conduct extensive experimental studies which verify the efficiency and effectiveness of the proposed algorithms on real and synthetic datasets.
Employing an optimal traffic light control policy has the potential of having a positive impact, both economic and environmental, on urban mobility. Reinforcement learning techniques have shown promising results in optimizing control policies for basic intersections and low volume traffic. This paper addresses the traffic light control problem in a complex scenario, such as a signalized roundabout with heavy traffic volumes, with the aim of maximizing throughput and avoiding traffic jams. We formulate the environment with a realistic representation of states and actions and a capacity-based reward. We enforce episode terminal conditions to avoid unwanted states, such as long queues interfering with other junctions in the vehicular network. A time-dependent baseline is proposed to reduce the variance of Policy Gradient updates in the setting of episodic conditions, thus improving the algorithm convergence to an optimal solution. We evaluate the method on real data and highly congested traffic, implementing a signalized simulated roundabout with 11 phases. The proposed method is able to avoid traffic jams and achieves higher performance than traditional time-splitting policies and standard Policy Gradient on average delay and effective capacity, while drastically decreasing the emissions.
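To make the idea of a time-dependent baseline concrete, here is a minimal sketch under one common assumption: the baseline at step t is the mean return-to-go over all sampled episodes that are still running at step t, so episodes cut short by a terminal condition simply drop out of later averages. The abstract does not specify the exact form used in the paper, and the function names and reward values below are illustrative only.

```python
def returns_to_go(rewards, gamma=1.0):
    """Discounted return from each time step to the end of the episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def time_dependent_baseline(episodes, gamma=1.0):
    """Per-step baseline b_t = mean return-to-go over episodes still active at step t.

    Returns the baseline and the advantages G_t - b_t that would weight the
    log-probability gradients in a policy gradient update.
    """
    all_returns = [returns_to_go(ep, gamma) for ep in episodes]
    horizon = max(len(g) for g in all_returns)
    baseline = []
    for t in range(horizon):
        active = [g[t] for g in all_returns if len(g) > t]
        baseline.append(sum(active) / len(active))
    advantages = [[g_t - baseline[t] for t, g_t in enumerate(g)] for g in all_returns]
    return baseline, advantages


# Hypothetical reward sequences; the second episode terminates early.
episodes = [[1.0, 1.0, 1.0], [1.0, 0.0], [0.0, 2.0, 1.0, 1.0]]
baseline, advantages = time_dependent_baseline(episodes)
```

Because the baseline depends only on the time step and not on the action taken, subtracting it leaves the gradient estimate unbiased while centering the advantages at each step.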
Towards Robust and Discriminative Sequential Data Learning: When and How to Perform Adversarial Training?
The last decade has witnessed a surge of interest in applying deep learning models for discovering sequential patterns from a large volume of data. Recent works show that deep learning models can be further improved by encouraging models to learn a smooth output distribution around each data point. This can be achieved by augmenting training data with slight perturbations that are designed to alter model outputs. Such adversarial training approaches have shown much success in improving the generalization performance of deep learning models on static data, e.g., transaction data or image data captured on a single snapshot. However, when applied to sequential data, the standard adversarial training approaches cannot fully capture the discriminative structure of a sequence. This is because real-world sequential data are often collected over a long period of time and may include much information irrelevant to the classification task. To this end, we develop a novel adversarial training approach for sequential data classification by investigating when and how to perturb a sequence for an effective data augmentation. Finally, we demonstrate the superiority of the proposed method over baselines on a variety of real-world sequential datasets.
Quantum computers promise significant advantages over classical computers for a number of different applications. We show that the complete loss function landscape of a neural network can be represented as the quantum state output by a quantum computer. We demonstrate this explicitly for a binary neural network and, further, show how a quantum computer can train the network by manipulating this state using the well-known quantum amplitude amplification algorithm. We further show that with minor adaptation, this method can also represent the meta-loss landscape of a number of neural network architectures simultaneously. We search this meta-loss landscape with the same method to simultaneously train and design a binary neural network.
Given a project plan and the goal, can we predict the plan's success rate? The key challenge is to learn the feature vectors of billions of the plan's components for effective prediction. However, existing methods model component proximities rather than behavior outcomes. In this work, we define a measurement of behavior outcomes, which forms a test tube-shaped region to represent "success" in a vector space. We propose a novel representation learning method to learn the embeddings of behavior components (including contexts, plans, and goals) by preserving the behavior outcome information. Experiments on real datasets show that our proposed method significantly improves the performance of goal prediction as well as context recommendation over the state-of-the-art.
Pattern formation is a ubiquitous phenomenon that describes the generation of orderly outcomes by self-organization. In both physical society and online social media, patterns formed by social interactions are mainly driven by information flow. Despite an increasing number of studies aiming to understand the spread of information flow, little is known about the geometry of these spreading patterns and how they were formed during the spreading. In this paper, by exploring 432 million information flow patterns extracted from a large-scale online social media dataset, we uncover a wide range of complex geometric patterns characterized by a three-dimensional metric space. In contrast, the existing understanding of spreading patterns is limited to fanning-out or narrow tree-like geometries. We discover three key ingredients that govern the formation of complex geometric patterns of information flow. As a result, we propose a stochastic process model incorporating these ingredients, demonstrating that it successfully reproduces the diverse geometries discovered from the empirical spreading patterns. Our discoveries provide a theoretical foundation for the microscopic mechanisms of information flow, potentially leading to wide implications for prediction, control and policy decisions in social media.
Unifying Inter-region Autocorrelation and Intra-region Structures for Spatial Embedding via Collective Adversarial Learning
Unsupervised spatial representation learning aims to automatically identify effective features of geographic entities (i.e., regions) from unlabeled yet structural geographical data. Existing network embedding methods can partially address the problem by: (1) regarding a region as a node in order to reformulate the problem into node embedding; (2) regarding a region as a graph in order to reformulate the problem into graph embedding. However, these studies can be improved by preserving (1) intra-region geographic structures, which are represented by multiple spatial graphs, leading to a reformulation of collective learning from relational graphs; (2) inter-region spatial autocorrelations, which are represented by pairwise graph regularization, leading to a reformulation of adversarial learning. Moreover, since field data in real systems usually lack labels, an unsupervised approach helps practical deployment. Along these lines, we develop an unsupervised Collective Graph-regularized dual-Adversarial Learning (CGAL) framework for multi-view graph representation learning and also a Graph-regularized dual-Adversarial Learning (GAL) framework for single-view graph representation learning. Finally, our experimental results demonstrate the enhanced effectiveness of our method.
Universal Representation Learning of Knowledge Bases by Jointly Embedding Instances and Ontological Concepts
Many large-scale knowledge bases simultaneously represent two views of knowledge graphs (KGs): an ontology view for abstract and commonsense concepts, and an instance view for specific entities that are instantiated from ontological concepts. Existing KG embedding models, however, merely focus on representing one of the two views alone. In this paper, we propose a novel two-view KG embedding model, JOIE, with the goal of producing better knowledge embeddings and enabling new applications that rely on multi-view knowledge. JOIE employs both cross-view and intra-view modeling that learn on multiple facets of the knowledge base. The cross-view association model is learned to bridge the embeddings of ontological concepts and their corresponding instance-view entities. The intra-view models are trained to capture the structured knowledge of instance and ontology views in separate embedding spaces, with a hierarchy-aware encoding technique enabled for ontologies with hierarchies. We explore multiple representation techniques for the two model components and investigate nine variants of JOIE. Our model is trained on large-scale knowledge bases that consist of massive instances and their corresponding ontological concepts connected via a (small) set of cross-view links. Experimental results on public datasets show that the best variant of JOIE significantly outperforms previous models on the instance-view triple prediction task as well as ontology population on the ontology-view KG. In addition, our model successfully extends the use of KG embeddings to entity typing with promising performance.
Predicting urban traffic is of great importance to intelligent transportation systems and public safety, yet is very challenging because of two aspects: 1) complex spatio-temporal correlations of urban traffic, including spatial correlations between locations along with temporal correlations among timestamps; 2) diversity of such spatio-temporal correlations, which vary from location to location and depend on the surrounding geographical information, e.g., points of interest and road networks. To tackle these challenges, we propose a deep-meta-learning based model, entitled ST-MetaNet, to collectively predict traffic in all locations at once. ST-MetaNet employs a sequence-to-sequence architecture, consisting of an encoder to learn historical information and a decoder to make predictions step by step. Specifically, the encoder and decoder have the same network structure, consisting of a recurrent neural network to encode the traffic, a meta graph attention network to capture diverse spatial correlations, and a meta recurrent neural network to consider diverse temporal correlations. Extensive experiments were conducted based on two real-world datasets to illustrate the effectiveness of ST-MetaNet over several state-of-the-art methods.
SESSION: Applied Data Science Track Papers
Booking.com is the world's largest online travel agent where millions of guests find their accommodation and millions of accommodation providers list their properties including hotels, apartments, bed and breakfasts, guest houses, and more. During the last years we have applied Machine Learning to improve the experience of our customers and our business. While most of the Machine Learning literature focuses on the algorithmic or mathematical aspects of the field, not much has been published about how Machine Learning can deliver meaningful impact in an industrial environment where commercial gains are paramount. We conducted an analysis on about 150 successful customer facing applications of Machine Learning, developed by dozens of teams in Booking.com, exposed to hundreds of millions of users worldwide and validated through rigorous Randomized Controlled Trials. Following the phases of a Machine Learning project we describe our approach, the many challenges we found, and the lessons we learned while scaling up such a complex technology across our organization. Our main conclusion is that an iterative, hypothesis driven process, integrated with other disciplines was fundamental to build 150 successful products enabled by Machine Learning.
Tags of a Point of Interest (POI) can facilitate location-based services from many aspects like location search and place recommendation. However, many POI tags are often incomplete or imprecise, which may lead to performance degradation of tag-dependent applications. In this paper, we study the POI tag refinement problem which aims to automatically fill in the missing tags as well as correct noisy tags for POIs. We propose a tri-adaptive collaborative learning framework to search for an optimal POI-tag score matrix. The framework integrates three components to collaboratively (i) model the similarity matching between POI and tag, (ii) recover the POI-tag pattern via matrix factorization and (iii) learn to infer the most possible tags by maximum likelihood estimation. We devise an adaptively joint training process to optimize the model and regularize each component simultaneously. And the final refinement results are the consensus of multiple views from different components. We also discuss how to utilize various data sources to construct features for tag refinement, including user profile data, query data on Baidu Maps and basic properties of POIs. Finally, we conduct extensive experiments to demonstrate the effectiveness of our framework. And we further present a case study of the deployment of our framework on Baidu Maps.
The bin packing problem is one of the most fundamental optimization problems. Owing to its hardness as a combinatorial optimization problem class and its wide range of applications in different domains, different variations of the problem have emerged and many heuristics have been proposed for obtaining approximate solutions.
In this paper, we solve a Multi-Level Bin Packing (MLBP) problem in a real make-to-order industry scenario. Existing solutions are not applicable to the problem because: 1. the final packing may consist of multiple levels of sub-packings; 2. the geometric shapes of objects as well as the packing constraints may be unknown. We design an automatic packing framework which extracts the packing knowledge from historical records to support packing without geometric shape and constraint information. Furthermore, we propose a dynamic programming approach to find the optimal solution for normal size problems, and a heuristic multi-level fuzzy-matching algorithm for large size problems. An inverted index is used to accelerate strategy search. The proposed auto packing framework has been deployed in Huawei Process & Engineering System to assist the packing engineers. It reduces the execution time for processing 5,000 packing orders to about 8 minutes with an average successful packing rate of 80.54%, which relieves at least 30% of the workload of packing workers.
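For readers unfamiliar with the classical problem this abstract starts from, the sketch below shows first-fit decreasing, one well-known heuristic for one-dimensional bin packing. It is not the MLBP framework, the dynamic programming approach, or the fuzzy-matching algorithm described above; the item sizes and capacity are invented for illustration.

```python
def first_fit_decreasing(sizes, capacity):
    """Pack items into bins of a fixed capacity with the first-fit decreasing heuristic.

    Items are sorted from largest to smallest, and each item goes into the
    first bin that still has room; a new bin is opened when none fits.
    """
    bins = []  # each bin is a list of the item sizes placed in it
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError(f"item of size {size} exceeds bin capacity {capacity}")
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([size])
    return bins


# Hypothetical order: six items, bins of capacity 10.
items = [4, 8, 1, 4, 2, 1]
packing = first_fit_decreasing(items, capacity=10)
```

The heuristic needs each item's size up front, which is exactly the kind of information the abstract says may be unavailable in the make-to-order setting.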
Related search query recommendation is a standard feature in many modern search engines. Interesting and relevant queries often increase the active time of users and improve the overall search experience. However, conventional approaches based on tag extraction, keywords matching or click graph link analysis suffer from the common problem of limited coverage and generalizability, which means the system could only make suggestions for a small portion of well-formed search queries.
In this work, we propose a deep generative approach to construct a related search query for recommendation in a word-by-word fashion, given either an input query or the title of a document. We propose a novel two-stage learning framework that partitions the task into two simpler sub-problems, namely, relevant context words discovery and context-dependent query generation. We carefully design a Relevant Words Generator (RWG) based on recurrent neural networks and a Dual-Vocabulary Sequence-to-Sequence (DV-Seq2Seq) model to address these problems. We also propose automated strategies that have retrieved three large datasets with 500K to 1 million instances, from a search click graph constructed based on 8 days of search histories in Tencent QQ Browser, for model training. By leveraging the dynamically discovered context words, our proposed framework outperforms other Seq2Seq generative baselines on a wide range of BLEU, ROUGE and Exact Match (EM) metrics.
Recent works on ride-sharing order dispatching have highlighted the importance of taking into account both the spatial and temporal dynamics in the dispatching process for improving the transportation system efficiency. At the same time, deep reinforcement learning has advanced to the point where it achieves superhuman performance in a number of fields. In this work, we propose a deep reinforcement learning based solution for order dispatching and we conduct large scale online A/B tests on DiDi's ride-dispatching platform to show that the proposed method achieves significant improvement on both total driver income and user experience related metrics.
In particular, we model the ride dispatching problem as a Semi Markov Decision Process to account for the temporal aspect of the dispatching actions. To improve the stability of the value iteration with nonlinear function approximators like neural networks, we propose Cerebellar Value Networks (CVNet) with a novel distributed state representation layer. We further derive a regularized policy evaluation scheme for CVNet that penalizes a large Lipschitz constant of the value network for additional robustness against adversarial perturbation and noise. Finally, we adapt various transfer learning methods to CVNet for increased learning adaptability and efficiency across multiple cities. We conduct extensive offline simulations based on real dispatching data as well as online A/B tests on DiDi's platform. Results show that CVNet consistently outperforms other recently proposed dispatching methods. We finally show that the performance can be further improved through the efficient use of transfer learning.
Population Based Training (PBT) is a recent approach that jointly optimizes neural network weights and hyperparameters by periodically copying the weights of the best performers and mutating hyperparameters during training. Previous PBT implementations have been synchronized glass-box systems. We propose a general, black-box PBT framework that distributes many asynchronous "trials" (a small number of training steps with warm-starting) across a cluster, coordinated by the PBT controller. The black-box design does not make assumptions on model architectures, loss functions or training procedures. Our system supports dynamic hyperparameter schedules to optimize both differentiable and non-differentiable metrics. We apply our system to train a state-of-the-art WaveNet generative model for human voice synthesis. We show that our PBT system achieves better accuracy and faster convergence compared to existing methods, given the same computational resource.
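The copy-and-mutate loop that PBT relies on can be sketched on a toy problem. The example below is a simplified, synchronous, single-process illustration and not the asynchronous black-box system described in the abstract: the "model" is one scalar `theta` trained by gradient descent on the loss theta squared, the only hyperparameter is the learning rate `lr`, and the population size, perturbation factors (0.8 and 1.2), and the 0.9 cap on `lr` are all assumptions made for this sketch.

```python
import random


def train_step(theta, lr):
    """One gradient-descent step on the toy loss f(theta) = theta**2."""
    return theta - lr * 2.0 * theta


def loss(theta):
    return theta ** 2


def pbt(num_workers=8, rounds=10, steps_per_round=5, seed=0):
    """Run a toy PBT loop and return (initial best loss, final best loss)."""
    rng = random.Random(seed)
    population = [
        {"theta": rng.uniform(1.0, 5.0), "lr": rng.uniform(0.01, 0.4)}
        for _ in range(num_workers)
    ]
    initial_best = min(loss(w["theta"]) for w in population)

    for _ in range(rounds):
        # Each worker trains for a few steps with its own hyperparameters.
        for w in population:
            for _ in range(steps_per_round):
                w["theta"] = train_step(w["theta"], w["lr"])

        # Rank workers by loss; the bottom quarter is replaced from the top quarter.
        ranked = sorted(population, key=lambda w: loss(w["theta"]))
        quarter = max(1, num_workers // 4)
        top, bottom = ranked[:quarter], ranked[-quarter:]
        for w in bottom:
            parent = rng.choice(top)
            w["theta"] = parent["theta"]  # exploit: copy weights (warm start)
            # explore: perturb the copied learning rate, capped to keep training stable
            w["lr"] = min(0.9, parent["lr"] * rng.choice([0.8, 1.2]))

    final_best = min(loss(w["theta"]) for w in population)
    return initial_best, final_best


initial_best, final_best = pbt()
```

Since the controller only needs each worker's score, weights, and hyperparameters, the same loop structure applies regardless of what `train_step` does internally, which is the sense in which a black-box design avoids assumptions about the model.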
Electronic Health Records (EHR) containing longitudinal information about millions of patient lives are increasingly being utilized by organizations across the healthcare spectrum. Studies on EHR data have enabled real world applications like understanding of disease progression, outcomes analysis, and comparative effectiveness research. However, each study is often commissioned independently, with data gathered by surveys or purchased specifically for the study through a long and often painful process. This is followed by an arduous repetitive cycle of analysis, model building, and generation of insights. This process can take anywhere between 1 and 3 years. In this paper, we present a robust end-to-end machine learning based SaaS system to perform analysis on a very large EHR dataset. The framework consists of a proprietary EHR datamart spanning ~55 million patient lives in the USA and over 20 billion data points. To the best of our knowledge, this framework is the largest in the industry to analyze medical records at this scale, with such efficacy and ease. We developed an end-to-end ML framework with carefully chosen components to support EHR analysis at scale and suitable for further downstream clinical analysis. Specifically, it consists of a ridge regularized Survival Support Vector Machine (SSVM) with a clinical kernel, coupled with Chi-square distance-based feature selection, to uncover relevant risk factors by exploiting the weak correlations in EHR. Our results on multiple real use cases indicate that the framework identifies relevant factors effectively without expert supervision. The framework is stable, generalizable over outcomes, and also found to contribute to better out-of-bound prediction over known expert features. Importantly, the ML methodologies used are interpretable, which is critical for acceptance of our system in the targeted user base.
With the system operational, all of these studies were completed within a time frame of 3-4 weeks compared to the industry standard of 12-36 months. As such, our system can accelerate analysis and discovery, and result in better ROI due to reduced investments as well as quicker turnaround of studies.
Retinopathy of Prematurity (ROP) is a leading cause of childhood blindness worldwide. An automated ROP detection system could significantly improve the chance of a child receiving proper diagnosis and treatment. We propose a means of producing a continuous severity score in an automated fashion, regressed from both (a) diagnostic class labels and (b) comparison outcomes. Our generative model combines the two sources, and successfully addresses inherent variability in diagnostic outcomes. In particular, our method exhibits excellent predictive performance on both diagnostic and comparison outcomes over a broad array of metrics, including AUC, precision, and recall.
While marketing budget allocation has been studied for decades in traditional business, online business now brings many more challenges due to the dynamic environment and complex decision-making process. In this paper, we present a novel unified framework for marketing budget allocation. By leveraging abundant data, the proposed data-driven approach can help us to overcome the challenges and make more informed decisions. In our approach, a semi-black-box model is built to forecast the dynamic market response and an efficient optimization method is proposed to solve the complex allocation task. First, the response in each market segment is forecasted by exploring historical data through a semi-black-box model, where the capability of the logit demand curve is enhanced by neural networks. The response model reveals the relationship between sales and marketing cost. Based on the learned model, budget allocation is then formulated as an optimization problem, and we design efficient algorithms to solve it in both continuous and discrete settings. Several kinds of business constraints are supported in one unified optimization paradigm, including cost upper bound, profit lower bound, or ROI lower bound. The proposed framework is easy to implement and readily handles large-scale problems. It has been successfully applied to many scenarios in Alibaba Group. The results of both offline experiments and online A/B testing demonstrate its effectiveness.
Concepts embody the knowledge of the world and facilitate the cognitive processes of human beings. Mining concepts from web documents and constructing the corresponding taxonomy are core research problems in text understanding and support many downstream tasks such as query analysis, knowledge base construction, recommendation, and search. However, we argue that most prior studies extract formal and overly general concepts from Wikipedia or static web pages, which do not represent the user perspective.
In this paper, we describe our experience of implementing and deploying ConcepT in Tencent QQ Browser. It discovers user-centered concepts at the right granularity conforming to user interests, by mining a large amount of user queries and interactive search click logs. The extracted concepts have the proper granularity, are consistent with user language styles and are dynamically updated. We further present our techniques to tag documents with user-centered concepts and to construct a topic-concept-instance taxonomy, which has helped to improve search as well as news feeds recommendation in Tencent QQ Browser. We performed extensive offline evaluation to demonstrate that our approach could extract concepts of higher quality compared to several other existing methods.
Our system has been deployed in Tencent QQ Browser. Results from online A/B testing involving a large number of real users suggest that the Impression Efficiency of feeds users increased by 6.01% after incorporating the user-centered concepts into the recommendation framework of Tencent QQ Browser.
Since air pollution seriously affects human health and daily life, air quality prediction has attracted increasing attention and become an active and important research topic. In this paper, we present AccuAir, our winning solution to the KDD Cup 2018 of Fresh Air, where the proposed solution won 1st place in two tracks and 2nd place in the other one. Our solution got the best accuracy on average in all the evaluation days. The task is to accurately predict the air quality (as indicated by the concentration of PM2.5, PM10 or O3) of the next 48 hours for each monitoring station in Beijing and London. Aiming at a cutting-edge solution, we first present an analysis of the air quality data, identifying the fundamental challenges, such as the long-term but suddenly changing air quality, and complex spatial-temporal correlations in different stations. To address the challenges, we carefully design both global and local air quality features, and develop three prediction models including LightGBM, Gated-DNN and Seq2Seq, each with novel ingredients developed for better solving the problem. Specifically, a spatial-temporal gate is proposed in our Gated-DNN model, to effectively capture the spatial-temporal correlations as well as temporal relatedness, making the prediction more sensitive to spatial and temporal signals. In addition, the Seq2Seq model is adapted in such a way that the encoder summarizes useful historical features while the decoder concatenates weather forecasts as input, which significantly improves prediction accuracy. Assembling all these components together, the ensemble of three models outperforms all competing methods in terms of the prediction accuracy of 31 days average, 10 days average and 24-48 hours.
Assessing the impact of the individual actions performed by soccer players during games is a crucial aspect of the player recruitment process. Unfortunately, most traditional metrics fall short in addressing this task as they either focus on rare actions like shots and goals alone or fail to account for the context in which the actions occurred. This paper introduces (1) a new language for describing individual player actions on the pitch and (2) a framework for valuing any type of player action based on its impact on the game outcome while accounting for the context in which the action happened. By aggregating soccer players' action values, their total offensive and defensive contributions to their team can be quantified. We show how our approach considers relevant contextual information that traditional player evaluation metrics ignore and present a number of use cases related to scouting and playing style characterization in the 2016/2017 and 2017/2018 seasons in Europe's top competitions.
Machine learning models are bounded by the credibility of ground truth data used for both training and testing. Regardless of the problem domain, this ground truth annotation is objectively manual and tedious as it needs a considerable amount of human intervention. With the advent of Active Learning with multiple annotators, the burden can be somewhat mitigated by actively acquiring labels of the most informative data instances. However, multiple annotators with varying degrees of expertise pose a new set of challenges in terms of the quality of the labels received and the availability of the annotators. Due to the limited amount of ground truth information addressing the variabilities of Activities of Daily Living (ADLs), activity recognition models using wearable and mobile devices are still not robust enough for real-world deployment. In this paper, we first propose an active learning combined deep model which updates its network parameters based on the optimization of a joint loss function. We then propose a novel annotator selection model by exploiting the relationships among the users while considering their heterogeneity with respect to their expertise, physical and spatial context. Our proposed model leverages model-free deep reinforcement learning in a partially observable environment setting to capture the action-reward interaction among multiple annotators. Our experiments in real-world settings show that our active deep model converges to optimal accuracy with fewer labeled instances and achieves ~8% improvement in accuracy in fewer iterations.
Many datasets feature seemingly disparate entries that actually refer to the same entity. Reconciling these entries, or "matching," is challenging, especially in situations where there are errors in the data. In certain contexts, the situation is even more complicated: an active adversary may have a vested interest in having the matching process fail. By leveraging eight years of data, we investigate one such adversarial context: matching different online anonymous marketplace vendor handles to unique sellers. Using a combination of random forest classifiers and hierarchical clustering on a set of features that would be hard for an adversary to forge or mimic, we manage to obtain reasonable performance (over 75% precision and recall on labels generated using heuristics), despite generally lacking any ground truth for training. Our algorithm performs particularly well for the top 30% of accounts by sales volume, and hints that 22,163 accounts with at least one confirmed sale map to 15,652 distinct sellers, of which 12,155 operate only one account, and the remainder between 2 and 11 different accounts. Case study analysis further confirms that our algorithm manages to identify non-trivial matches, as well as impersonation attempts.
Sponsored search has more than 20 years of history, and it has proven to be a successful business model for online advertising. Based on the pay-per-click pricing model and the keyword targeting technology, the sponsored search system runs online auctions to determine the allocations and prices of search advertisements. In the traditional setting, advertisers must manually create many ad creatives and bid on relevant keywords to target their audience. Due to the huge amount of search traffic and the wide variety of ad creatives, the limits of manual optimization by advertisers become the main bottleneck for improving the efficiency of this market. Moreover, as many emerging advertising forms and supplies are growing, it is crucial for a sponsored search platform to pay more attention to the ROI metrics of ads in order to win the marketing budgets of advertisers. In this paper, we present the AiAds system developed at Baidu, which uses machine learning techniques to build an automated and intelligent advertising system. By designing and implementing the automated bidding strategy, the intelligent targeting and the intelligent creation models, the AiAds system can transform the manual optimizations into multiple automated tasks and optimize these tasks with advanced methods. AiAds is a brand-new architecture of sponsored search system which changes the bidding language and allocation mechanism, breaks the limit of keyword targeting with an end-to-end ad retrieval framework and provides global optimization of ad creation. This system can increase the advertiser's campaign performance, the user experience and the revenue of the advertising platform simultaneously and significantly. We present the overall architecture and modeling techniques for each module of the system and share our lessons learned in solving several key challenges. Finally, an online A/B test and a long-term grouping experiment demonstrate the advancement and effectiveness of this system.
Recently, much attention has been paid to the usage of knowledge graph within the context of recommender systems to alleviate the data sparsity and cold-start problems. However, when incorporating entities from a knowledge graph to represent users, most existing works are unaware of the relationships between these entities and users. As a result, the recommendation results may suffer a lot from some unrelated entities.
In this paper, we investigate how to explore these relationships, which are essentially determined by the interactions among entities. First, we categorize the interactions among entities into two types: inter-entity-interaction and intra-entity-interaction. Inter-entity-interaction refers to the interactions among entities that affect their importance for representing users. Intra-entity-interaction refers to the interactions within an entity that describe the different characteristics of this entity when it is involved in different relations.
Then, considering these two types of interactions, we propose a novel model named Attention-enhanced Knowledge-aware User Preference Model (AKUPM) for click-through rate (CTR) prediction. More specifically, a self-attention network is utilized to capture the inter-entity-interaction by learning appropriate importance of each entity w.r.t the user. Moreover, the intra-entity-interaction is modeled by projecting each entity into its connected relation spaces to obtain the suitable characteristics. By doing so, AKUPM is able to figure out the most related part of incorporated entities (i.e., filter out the unrelated entities). Extensive experiments on two real-world public datasets demonstrate that AKUPM achieves substantial gains in terms of common evaluation metrics (e.g., AUC, ACC and [email protected]) over several state-of-the-art baselines.
AlphaStock: A Buying-Winners-and-Selling-Losers Investment Strategy using Interpretable Deep Reinforcement Attention Networks
Recent years have witnessed the successful marriage of finance innovations and AI techniques in various finance applications including quantitative trading (QT). Despite great research efforts devoted to leveraging deep learning (DL) methods for building better QT strategies, existing studies still face serious challenges especially from the side of finance, such as the balance of risk and return, the resistance to extreme loss, and the interpretability of strategies, which limit the application of DL-based strategies in real-life financial markets. In this work, we propose AlphaStock, a novel reinforcement learning (RL) based investment strategy enhanced by interpretable deep attention networks, to address the above challenges. Our main contributions are summarized as follows: i) We integrate deep attention networks with a Sharpe ratio-oriented reinforcement learning framework to achieve a risk-return balanced investment strategy; ii) We suggest modeling interrelationships among assets to avoid selection bias and develop a cross-asset attention mechanism; iii) To our best knowledge, this work is among the first to offer an interpretable investment strategy using deep reinforcement learning models. The experiments on long-periodic U.S. and Chinese markets demonstrate the effectiveness and robustness of AlphaStock over diverse market states. It turns out that AlphaStock tends to select the stocks as winners with high long-term growth, low volatility, high intrinsic value, and being undervalued recently.
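For readers unfamiliar with the Sharpe ratio mentioned in the abstract above, here is a minimal sketch of how it is commonly computed from a series of periodic returns. It is not the AlphaStock training objective itself; the function name `sharpe_ratio`, the `risk_free` parameter, and the sample returns are assumptions made for illustration.

```python
import math


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the sample standard deviation of excess returns.

    `returns` is a list of per-period returns; `risk_free` is an assumed
    per-period risk-free rate (hypothetical default of 0).
    """
    excess = [r - risk_free for r in returns]
    n = len(excess)
    if n < 2:
        raise ValueError("need at least two returns")
    mean = sum(excess) / n
    # Unbiased sample variance (divides by n - 1).
    var = sum((x - mean) ** 2 for x in excess) / (n - 1)
    if var == 0:
        raise ValueError("returns have zero variance; ratio is undefined")
    return mean / math.sqrt(var)


# Illustrative data only: two strategies with the same mean return.
steady = [0.02, 0.021, 0.019]
volatile = [0.10, -0.06, 0.02]
print(sharpe_ratio(steady) > sharpe_ratio(volatile))
```

With equal mean returns, the steadier series scores higher, which is the sense in which a Sharpe-oriented objective balances risk against return.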
We develop an algorithm that accurately detects Atrial Fibrillation (AF) episodes from photoplethysmograms (PPG) recorded in ambulatory free-living conditions. We collect and annotate a dataset containing more than 4000 hours of PPG recorded from a wrist-worn device. Using a 50-layer convolutional neural network, we achieve a test AUC of 95% in presence of motion artifacts inherent to PPG signals. Such continuous and accurate detection of AF has the potential to transform consumer wearable devices into clinically useful medical monitoring tools.
Online retailers execute a very large number of price updates when compared to brick-and-mortar stores. Even a few mis-priced items can have a significant business impact and result in a loss of customer trust. Early detection of anomalies in an automated real-time fashion is an important part of such a pricing system. In this paper, we describe unsupervised and supervised anomaly detection approaches we developed and deployed for a large-scale online pricing system at Walmart. Our system detects anomalies both in batch and real-time streaming settings, and the items flagged are reviewed and actioned based on priority and business impact. We found that having the right architecture design was critical to facilitate model performance at scale, and business impact and speed were important factors influencing model selection, parameter choice, and prioritization in a production environment for a large-scale system. We conducted analyses on the performance of various approaches on a test set using real-world retail data and fully deployed our approach into production. We found that our approach was able to detect the most important anomalies with high precision.
The application to search ranking is one of the biggest machine learning success stories at Airbnb. Much of the initial gains were driven by a gradient boosted decision tree model. The gains, however, plateaued over time. This paper discusses the work done in applying neural networks in an attempt to break out of that plateau. We present our perspective not with the intention of pushing the frontier of new modeling techniques. Instead, ours is a story of the elements we found useful in applying neural networks to a real life product. Deep learning was steep learning for us. To other teams embarking on similar journeys, we hope an account of our struggles and triumphs will provide some useful pointers. Bon voyage!
Feature crossing captures interactions among categorical features and is useful to enhance learning from tabular data in real-world businesses. In this paper, we present AutoCross, an automatic feature crossing tool provided by 4Paradigm to its customers, ranging from banks, hospitals, to Internet corporations. By performing beam search in a tree-structured space, AutoCross enables efficient generation of high-order cross features, which is not yet visited by existing works. Additionally, we propose successive mini-batch gradient descent and multi-granularity discretization to further improve efficiency and effectiveness, while ensuring simplicity so that no machine learning expertise or tedious hyper-parameter tuning is required. Furthermore, the algorithms are designed to reduce the computational, transmitting, and storage costs involved in distributed computing. Experimental results on both benchmark and real-world business datasets demonstrate the effectiveness and efficiency of AutoCross. It is shown that AutoCross can significantly enhance the performance of both linear and deep models.
Neural architecture search (NAS) has been proposed to automatically tune deep neural networks, but existing search algorithms, e.g., NASNet, PNAS, usually suffer from expensive computational cost. Network morphism, which keeps the functionality of a neural network while changing its neural architecture, could be helpful for NAS by enabling more efficient training during the search. In this paper, we propose a novel framework enabling Bayesian optimization to guide the network morphism for efficient neural architecture search. The framework develops a neural network kernel and a tree-structured acquisition function optimization algorithm to efficiently explore the search space. Extensive experiments on real-world benchmark datasets have been done to demonstrate the superior performance of the developed framework over the state-of-the-art methods. Moreover, we build an open-source AutoML system based on our method, namely Auto-Keras. The code and documentation are available at https://autokeras.com. The system runs in parallel on CPU and GPU, with an adaptive search strategy for different GPU memory limits.
Dialogue summarization extracts useful information from a dialogue. It helps people quickly capture the highlights of a dialogue without going through long and sometimes twisted utterances. For customer service, it saves human resources currently required to write dialogue summaries. A main challenge of dialogue summarization is to design a mechanism to ensure the logic, integrity, and correctness of the summaries. In this paper, we introduce auxiliary key point sequences to solve this problem. A key point sequence describes the logic of the summary. In our training procedure, a key point sequence acts as an auxiliary label. It helps the model learn the logic of the summary. In the prediction procedure, our model predicts the key point sequence first and then uses it to guide the prediction of the summary. Along with the auxiliary key point sequence, we propose a novel Leader-Writer network. The Leader net predicts the key point sequence, and the Writer net predicts the summary based on the decoded key point sequence. The Leader net ensures the summary is logical and integral. The Writer net focuses on generating fluent sentences. We test our model on customer service scenarios. The results show that our model outperforms other models not only on BLEU and ROUGE-L score but also on logic and integrity.
Real-Time Bidding (RTB) is an important paradigm in display advertising, where advertisers utilize extended information and algorithms served by Demand Side Platforms (DSPs) to improve advertising performance. A common problem for DSPs is to help advertisers gain as much value as possible with budget constraints. However, advertisers would routinely add certain key performance indicator (KPI) constraints that the advertising campaign must meet due to practical reasons. In this paper, we study the common case where advertisers aim to maximize the quantity of conversions, and set cost-per-click (CPC) as a KPI constraint. We convert such a problem into a linear programming problem and leverage the primal-dual method to derive the optimal bidding strategy. To address the applicability issue, we propose a feedback control-based solution and devise the multivariable control system. The empirical study based on real-world data from Taobao.com verifies the effectiveness and superiority of our approach compared with the state of the art in industry practice.
Worldwide displacement due to war and conflict is at an all-time high. Unfortunately, determining if, when, and where people will move is a complex problem. This paper proposes integrating publicly available organic data from social media and newspapers with more traditional indicators of forced migration to determine when and where people will move. We combine movement and organic variables with spatial and temporal variation within different Bayesian models and show the viability of our method using a case study involving displacement in Iraq. Our analysis shows that incorporating open-source generated conversation and event variables maintains or improves predictive accuracy over traditional variables alone. This work is an important step toward understanding how to leverage organic big data for societal-scale problems.
Buying or Browsing?: Predicting Real-time Purchasing Intent using Attention-based Deep Network with Multiple Behavior
E-commerce platforms are becoming a primary place for people to find, compare and ultimately purchase products. One of the fundamental questions that arises in e-commerce is how to predict user purchasing intent, which is an important part of user understanding and allows for providing better services for both sellers and customers. However, previous work cannot predict real-time user purchasing intent with high accuracy, being limited by the representation capability of the traditional browse-interactive behavior it adopts. In this paper, we propose a novel end-to-end deep network, named Deep Intent Prediction Network (DIPN), to predict real-time user purchasing intent. In particular, besides the traditional browse-interactive behavior, we collect a new type of user interactive behavior, called touch-interactive behavior, which can capture more fine-grained real-time user features. To combine these behaviors effectively, we propose a hierarchical attention mechanism, where the bottom attention layer focuses on the inner parts of each behavior sequence while the top attention layer learns the inter-view relations between different behavior sequences. In addition, we propose to train DIPN with multi-task learning to better distinguish user behavior patterns. In the experiments conducted on a large-scale industrial dataset, DIPN significantly outperforms the baseline solutions. Notably, DIPN gains about 18.96% improvement in AUC over the state-of-the-art solution that uses only traditional browse-interactive behavior sequences. Moreover, DIPN has been deployed in the operational system of Taobao. Online A/B testing results with more than 12.9 million users reveal the potential of knowing users' real-time purchasing intent.
Yahoo's native advertising (also known as Gemini native) serves billions of ad impressions daily, reaching a yearly run-rate of many hundreds of millions of USD. Driving Gemini native models for predicting both click probability (pCTR) and conversion probability (pCONV) is OFFSET, a feature-enhanced collaborative-filtering (CF) based event prediction algorithm. The predicted pCTRs are then used in Gemini native auctions to determine which ads to present for each serving event. A fast-growing segment of Gemini native is Carousel ads, which include several cards (or assets) that are used to populate several slots within the ad. Since Carousel ad slots are not symmetrical and some are more conspicuous than others, it is beneficial to render assets to slots in a way that maximizes revenue.
In this work we present a post-auction successive elimination based approach for ranking assets according to their click-through rate (CTR) and rendering the carousel accordingly, placing higher CTR assets in more conspicuous slots. After a successful online bucket showing 8.6% CTR and 4.3% CPM (or revenue) lifts over a control bucket that uses the advertisers' predefined assets-to-slots mapping, the carousel asset optimization (CAO) system was pushed to production and has been serving all Gemini native traffic since. A few months after CAO deployment, we have already measured an almost 40% increase in carousel ads revenue. Moreover, the entire revenue growth is related to a CAO traffic increase due to additional advertiser demand, which demonstrates high advertiser satisfaction with the product.
Software frameworks for neural networks play a key role in the development and application of deep learning methods. In this paper, we introduce the Chainer framework, which intends to provide a flexible, intuitive, and high performance means of implementing the full range of deep learning models needed by researchers and practitioners. Chainer provides acceleration using Graphics Processing Units with a familiar NumPy-like API through CuPy, supports general and dynamic models in Python through Define-by-Run, and also provides add-on packages for state-of-the-art computer vision models as well as distributed training.
Characterizing and Detecting Malicious Accounts in Privacy-Centric Mobile Social Networks: A Case Study
Malicious accounts are one of the biggest threats to the security and privacy of online social networks (OSNs). In this work, we study a new type of OSN, called privacy-centric mobile social network (PC-MSN), such as KakaoTalk and LINE, which has attracted billions of users recently. The design of a PC-MSN aims to protect its users' privacy from strangers: (1) it is not easy for a stranger to send a friend request to a user who does not want to make friends with strangers; and (2) strangers cannot view a user's posts. Such a design mitigates the security issue of malicious accounts. At the same time, it also moves the battleground between attackers and defenders to an earlier stage, i.e., making friendship, than the one studied in previous works. Also, previous defense proposals mostly rely on certain assumptions about the attacker, which may not hold in the new PC-MSNs. As a result, previous malicious account detection approaches are less effective on a PC-MSN.
To mitigate this issue, we study the patterns in friend requests to distinguish malicious accounts, and perform a systematic study over 1 million labeled samples from WLink, a real PC-MSN with billions of users, to confirm our hypothesis. Based on the results, we propose dozens of new features and leverage machine learning to detect malicious accounts. We evaluate our method and compare it with existing methods, and the results show that our method achieves a precision of 99.5% and a recall of 98.4%, which significantly outperforms previous state-of-the-art methods. Importantly, we qualitatively analyze the robustness of the designed features, and our evaluation shows that using only robust features can achieve the same level of performance as using all features. WLink has deployed our detection method. Our method can detect 0.59 million malicious accounts daily, which is 6 times higher than the previous deployment on WLink, with a precision of over 90%.
While mobile social apps have become increasingly important in people's daily life, we have limited understanding of what motivates users to engage with these apps. In this paper, we ask whether users' in-app activity patterns help inform their future app engagement (e.g., active days in a future time window). Previous studies on predicting user app engagement mainly focus on various macroscopic features (e.g., time-series of activity frequency), while ignoring fine-grained inter-dependencies between different in-app actions at the microscopic level. Here we propose to formalize an individual user's in-app action transition patterns as a temporally evolving action graph, and analyze its characteristics in terms of informing future user engagement. Our analysis suggests that action graphs are able to characterize user behavior patterns and inform future engagement. We derive a number of high-order graph features to capture in-app usage patterns and construct interpretable models for predicting trends of engagement changes and active rates. To further enhance predictive power, we design an end-to-end, multi-channel neural model to encode temporal action graphs, activity sequences, and other macroscopic features. Experiments on predicting user engagement for 150k Snapchat new users over a 28-day period demonstrate the effectiveness of the proposed prediction models. The analysis and prediction framework is also deployed at Snapchat to deliver real-world business insights. Our proposed framework is also general and can be applied to any online platform.
Decision Trees (DTs) like LambdaMART have been one of the most effective types of learning-to-rank algorithms in the past decade. They typically work well with hand-crafted dense features (e.g., BM25 scores). Recently, Neural Networks (NNs) have shown impressive results in leveraging sparse and complex features (e.g., query and document keywords) directly when a large amount of training data is available. While there is a large body of work on how to use NNs for semantic matching between queries and documents, relatively less work has been conducted to compare NNs with DTs for general learning-to-rank tasks, where dense features are also available and DTs can achieve state-of-the-art performance. In this paper, we study how to combine DTs and NNs to effectively bring the benefits from both sides in the learning-to-rank setting. Specifically, we focus our study on personal search where clicks are used as the primary labels with unbiased learning-to-rank algorithms and a significantly large amount of training data is easily available. Our combination methods are based on ensemble learning. We design 12 variants and compare them based on two aspects, ranking effectiveness and ease-of-deployment, using two of the largest personal search services: Gmail search and Google Drive search. We show that direct application of existing ensemble methods can not achieve both aspects. We thus design a novel method that uses NNs to compensate DTs via boosting. We show that such a method is not only easier to deploy, but also gives comparable or better ranking accuracy.
A large payment network contains millions of merchants and billions of transactions, and the merchants are described by a large number of attributes with incomplete values. Understanding the network's community structures is crucial to ensuring that it is sustainable and long-lasting. Knowing a merchant's community is also important for many applications: risk management, compliance, legal and marketing. To detect communities, an algorithm has to take advantage of both attribute and topological information. Further, the method has to be able to handle incomplete and complex attributes. In this paper, we propose a framework named AGGMMR to effectively address the challenges arising from scalability, mixed attributes, and incomplete values. We evaluate our proposed framework on four benchmark datasets against five strong baselines. More importantly, we provide a case study of running AGGMMR on a large network from PayPal which contains 100 million merchants with 1.5 billion transactions. The results demonstrate AGGMMR's effectiveness and practicability.
Knowledge bases (KBs) are the backbone of many ubiquitous applications and are thus required to exhibit high precision. However, for KBs that store subjective attributes of entities, e.g., whether a movie is kid friendly, simply estimating precision is complicated by the inherent ambiguity in measuring subjective phenomena. In this work, we develop a method for constructing KBs with tunable precision, i.e., KBs that can be made to operate at a specific false positive rate, despite storing both difficult-to-evaluate subjective attributes and more traditional factual attributes. The key to our approach is probabilistically modeling user consensus with respect to each entity-attribute pair, rather than modeling each pair as either True or False. Uncertainty in the model is explicitly represented and used to control the KB's precision. We propose three neural networks for fitting the consensus model and evaluate each one on data from Google Maps, a large KB of locations and their subjective and factual attributes. The results demonstrate that our learned models are well-calibrated and thus can successfully be used to control the KB's precision. Moreover, when constrained to maintain 95% precision, the best consensus model matches the F-score of a baseline that models each entity-attribute pair as a binary variable and does not support tunable precision. When unconstrained, our model dominates the same baseline by 12% F-score. Finally, we perform an empirical analysis of attribute-attribute correlations and show that leveraging them effectively contributes to reduced uncertainty and better performance in attribute prediction.
Contextual anomalies arise only under special internal or external stimuli in a system, often making it infeasible to detect them by a rule-based approach. Labelling the underlying problem sources is hard because complex, time-dependent relationships between the inputs arise. We propose a novel unsupervised approach that combines tools from deep learning and signal processing, working in a purely data-driven way. Many systems show a desirable target behaviour which can be used as a proxy quantity removing the need to manually label data. The methodology was evaluated on real-life test car traces in the form of multivariate state message sequences. We successfully identified contextual anomalies during the cars' timeout process along with possible explanations. Novel input encodings allow us to summarise the entire system context including the timing such that more information is available during the decision process.
Conversion Prediction Using Multi-task Conditional Attention Networks to Support the Creation of Effective Ad Creatives
Accurately predicting conversions in advertisements is generally a challenging task, because such conversions do not occur frequently. In this paper, we propose a new framework to support creating high-performing ad creatives, including the accurate prediction of ad creative text conversions before delivering to the consumer. The proposed framework includes three key ideas: multi-task learning, conditional attention, and attention highlighting. Multi-task learning is an idea for improving the prediction accuracy of conversion, which predicts clicks and conversions simultaneously, to solve the difficulty of data imbalance. Furthermore, conditional attention focuses attention of each ad creative with the consideration of its genre and target gender, thus improving conversion prediction accuracy. Attention highlighting visualizes important words and/or phrases based on conditional attention. We evaluated the proposed framework with actual delivery history data (14,000 creatives displayed more than a certain number of times from Gunosy Inc.), and confirmed that these ideas improve the prediction performance of conversions, and visualize noteworthy words according to the creatives' attributes.
Click-through rate (CTR) prediction is a critical task in online advertising systems. A large body of research considers each ad independently, but ignores its relationship to other ads that may impact the CTR. In this paper, we investigate various types of auxiliary ads for improving the CTR prediction of the target ad. In particular, we explore auxiliary ads from two viewpoints: one is from the spatial domain, where we consider the contextual ads shown above the target ad on the same page; the other is from the temporal domain, where we consider historically clicked and unclicked ads of the user. The intuitions are that ads shown together may influence each other, clicked ads reflect a user's preferences, and unclicked ads may indicate what a user dislikes to a certain extent. In order to effectively utilize these auxiliary data, we propose the Deep Spatio-Temporal neural Networks (DSTNs) for CTR prediction. Our model is able to learn the interactions between each type of auxiliary data and the target ad, to emphasize more important hidden information, and to fuse heterogeneous data in a unified framework. Offline experiments on one public dataset and two industrial datasets show that DSTNs outperform several state-of-the-art methods for CTR prediction. We have deployed the best-performing DSTN in Shenma Search, which is the second largest search engine in China. The A/B test results show that the online CTR is also significantly improved compared to our last serving model.
Weather forecasting is usually solved through numerical weather prediction (NWP), which can sometimes lead to unsatisfactory performance due to inappropriate setting of the initial states. In this paper, we design a data-driven method augmented by an effective information fusion mechanism to learn from historical data that incorporates prior knowledge from NWP. We cast the weather forecasting problem as an end-to-end deep learning problem and solve it by proposing a novel negative log-likelihood error (NLE) loss function. A notable advantage of our proposed method is that it simultaneously implements single-value forecasting and uncertainty quantification, which we refer to as deep uncertainty quantification (DUQ). Efficient deep ensemble strategies are also explored to further improve performance. This new approach was evaluated on a public dataset collected from weather stations in Beijing, China. Experimental results demonstrate that the proposed NLE loss significantly improves generalization compared to mean squared error (MSE) loss and mean absolute error (MAE) loss. Compared with NWP, this approach significantly improves accuracy by 47.76%, which is a state-of-the-art result on this benchmark dataset.
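The abstract above does not give the exact form of the NLE loss. As a rough illustration of how a single loss can cover both a point forecast and its uncertainty, the sketch below computes the average negative log-likelihood of observations under a Gaussian whose mean and variance are both predicted. The Gaussian assumption and the function name `gaussian_nll` are assumptions made here, not details taken from the paper.

```python
import math


def gaussian_nll(y_true, mean, variance):
    """Average negative log-likelihood of y_true under N(mean, variance).

    Assumed illustration only: each forecast is a predicted mean plus a
    predicted variance, so the loss rewards accurate means and penalizes
    overconfident (too small) variances.
    """
    if not (len(y_true) == len(mean) == len(variance)):
        raise ValueError("inputs must have equal length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    total = 0.0
    for y, mu, var in zip(y_true, mean, variance):
        if var <= 0:
            raise ValueError("variance must be positive")
        total += 0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)
    return total / len(y_true)


if __name__ == "__main__":
    # Same error of 1.0, but the second forecast claims far more certainty.
    print(gaussian_nll([1.0], [0.0], [1.0]))
    print(gaussian_nll([1.0], [0.0], [0.01]))
```

Under this assumed form, a forecast that is wrong and overconfident is penalized far more heavily than one that is equally wrong but reports a wider variance, which is the behavior a squared-error loss cannot express.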
DeepHoops: Evaluating Micro-Actions in Basketball Using Deep Feature Representations of Spatio-Temporal Data
Basketball is one of a number of sports which, within the past decade, have seen an explosion in quantitative metrics and methods for evaluating players and teams. However, it is still challenging to evaluate individual off-ball events (e.g., screens, cuts away from the ball etc.) in terms of how they contribute to the success of a possession. In this study, we develop a deep learning framework DeepHoops to process a unique dataset composed of spatio-temporal tracking data from NBA games in order to generate a running stream of predictions on the expected points to be scored as a possession progresses. We frame the problem as a multi-class sequence classification problem in which our model estimates probabilities of terminal actions taken by players (e.g. take field goal, turnover, foul etc.) at each moment of a possession based on a sequence of ball and player court locations preceding the said moment. Each of these terminal actions is associated with an expected point value, which is used to estimate the expected points to be scored. One of the challenges associated with this problem is the high imbalance in the action classes. To solve this problem, we parameterize a downsampling scheme for the training phase. We demonstrate that DeepHoops is well-calibrated, estimating accurately the probabilities of each terminal action and we further showcase the model's capability to evaluate individual actions (potentially off-ball) within a possession that are not captured by boxscore statistics.
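The step from terminal-action probabilities to expected points can be sketched as a probability-weighted sum. The action names and point values below are hypothetical placeholders chosen for illustration; the abstract does not list the actual classes or their expected values.

```python
def expected_points(action_probs, action_values):
    """Expected points at one moment of a possession.

    action_probs: dict mapping terminal action -> predicted probability.
    action_values: dict mapping terminal action -> expected point value.
    Both dictionaries hold hypothetical values in this sketch.
    """
    if abs(sum(action_probs.values()) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return sum(prob * action_values[action] for action, prob in action_probs.items())


if __name__ == "__main__":
    probs = {"field_goal": 0.5, "turnover": 0.3, "foul": 0.2}
    values = {"field_goal": 1.0, "turnover": 0.0, "foul": 1.5}
    print(expected_points(probs, values))
```

Evaluating this quantity at successive moments gives a running estimate, and the change between two moments is one plausible way to attribute value to whatever happened in between.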
Rooftop solar deployments are an excellent source for generating clean energy. As a result, their popularity among homeowners has grown significantly over the years. Unfortunately, estimating the solar potential of a roof requires homeowners to consult solar consultants, who manually evaluate the site. Recently there have been efforts to automatically estimate the solar potential for any roof within a city. However, current methods work only for places where LIDAR data is available, thereby limiting their reach to just a few places in the world. In this paper, we propose DeepRoof, a data-driven approach that uses widely available satellite images to assess the solar potential of a roof. Using satellite images, DeepRoof determines the roof's geometry and leverages publicly available real-estate and solar irradiance data to provide a pixel-level estimate of the solar potential for each planar roof segment. Such estimates can be used to identify ideal locations on the roof for installing solar panels. Further, we evaluate our approach on an annotated roof dataset, validate the results with solar experts and compare it to a LIDAR-based approach. Our results show that DeepRoof can accurately extract the roof geometry such as the planar roof segments and their orientation, achieving a true positive rate of 91.1% in identifying roofs and a low mean orientation error of 9.3 degrees. We also show that DeepRoof's median estimate of the available solar installation area is within 11% of a LIDAR-based approach.
Event crowd management has been a significant research topic with high social impact. When a big event such as an earthquake, typhoon, or national festival occurs, crowd management becomes the first priority for governments (e.g. police) and public service operators (e.g. subway/bus operators) to protect people's safety or maintain the operation of public infrastructure. However, in such event situations, human behavior becomes very different from daily routines, which makes predicting crowd dynamics at big events highly challenging, especially at a citywide level. Therefore in this study, we aim to extract the deep trend only from the current momentary observations and generate an accurate prediction for the trend in the near future, which is considered to be an effective way to deal with such event situations. Motivated by this, we build an online system called DeepUrbanEvent which can iteratively take citywide crowd dynamics from the current one hour as input and report the prediction results for the next one hour as output. A novel deep learning architecture built with recurrent neural networks is designed to effectively model these highly-complex sequential data in an analogous manner to video prediction tasks. Experimental results demonstrate the superior performance of our proposed methodology to the existing approaches. Lastly, we apply our prototype system to multiple big real-world events and show that it is highly deployable as an online crowd management system.
Detecting Anomalies in Space using Multivariate Convolutional LSTM with Mixtures of Probabilistic PCA
Anomaly detection is important not only for many terrestrial applications but also for space applications. Satellite missions in particular are highly risky because unexpected hardware and software failures can occur due to sudden or unforeseen space environment changes. Anomaly detection and spacecraft health monitoring systems have heavily relied on human expertise to investigate whether a flagged event is a true anomaly or not. Also, it is practically infeasible to produce labels on data due to the enormous amount of telemetry generated from a satellite. In this work, we propose a data-driven anomaly detection algorithm for Korea Multi-Purpose Satellite 2 (KOMPSAT-2). We develop a Multivariate Convolutional LSTM with Mixtures of Probabilistic Principal Component Analyzers, where our approach uses both neural networks and probabilistic clustering to improve the anomaly detection performance. We evaluated our approach with a total of 22 million telemetry samples collected for 10 months from KOMPSAT-2. We also compare our approach with other state-of-the-art approaches. We show that our proposed approach is 35.8% better in precision, and 18.2% better in F-1 score than the best baseline approach. We plan to deploy our algorithm in the second half of 2019 for use in the real operation of KOMPSAT-2.
Product reviews and ratings on e-commerce websites provide customers with detailed insights about various aspects of the product such as quality, usefulness, etc. Since they influence customers' buying decisions, product reviews have become a fertile ground for abuse by sellers (colluding with reviewers) to promote their own products or to tarnish the reputation of competitors' products. In this paper, our focus is on detecting such abusive entities (both sellers and reviewers) by applying tensor decomposition on the product reviews data. While tensor decomposition is mostly unsupervised, we formulate our problem as a semi-supervised binary multi-target tensor decomposition, to take advantage of currently known abusive entities. We empirically show that our multi-target semi-supervised model achieves higher precision and recall in detecting abusive entities as compared to unsupervised techniques. Finally, we show that our proposed stochastic partial natural gradient inference for our model empirically achieves faster convergence than stochastic gradient and Online-EM with sufficient statistics.
Developing Measures of Cognitive Impairment in the Real World from Consumer-Grade Multimodal Sensor Streams
The ubiquity and remarkable technological progress of wearable consumer devices and mobile-computing platforms (smart phone, smart watch, tablet), along with the multitude of sensor modalities available, have enabled continuous monitoring of patients and their daily activities. Such rich, longitudinal information can be mined for physiological and behavioral signatures of cognitive impairment and provide new avenues for detecting mild cognitive impairment (MCI) in a timely and cost-effective manner. In this work, we present a platform for remote and unobtrusive monitoring of symptoms related to cognitive impairment using several consumer-grade smart devices. We demonstrate how the platform has been used to collect a total of 16TB of data during the Lilly Exploratory Digital Assessment Study, a 12-week feasibility study which monitored 31 people with cognitive impairment and 82 without cognitive impairment in free living conditions. We describe how careful data unification, time-alignment, and imputation techniques can handle missing data rates inherent in real-world settings and ultimately show utility of these disparate data in differentiating symptomatics from healthy controls based on features computed purely from device data.
Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners
Accurately learning what delivers value to customers is difficult. Online Controlled Experiments (OCEs), aka A/B tests, are becoming a standard operating procedure in software companies to address this challenge as they can detect small causal changes in user behavior due to product modifications (e.g. new features). However, like any data analysis method, OCEs are sensitive to trustworthiness and data quality issues which, if left unaddressed or unnoticed, may result in making wrong decisions. One of the most useful indicators of a variety of data quality issues is a Sample Ratio Mismatch (SRM): the situation in which the observed sample ratio in the experiment differs from the expected one. Just as fever is a symptom of multiple types of illness, an SRM is a symptom of a variety of data quality issues. While a simple statistical check is used to detect an SRM, correctly identifying the root cause and preventing it from happening in the future is often extremely challenging and time consuming. Ignoring the SRM without knowing the root cause may result in a bad product modification appearing to be good and getting shipped to users, or vice versa. The goal of this paper is to make diagnosing, fixing, and preventing SRMs easier. Based on our experience of running OCEs in four different software companies in over 25 different products used by hundreds of millions of users worldwide, we have derived a taxonomy for different types of SRMs. We share examples, detection guidelines, and best practices for preventing SRMs of each type. We hope that the lessons and practical tips we describe in this paper will speed up SRM investigations and prevent some of them. Ultimately, this should lead to improved decision making based on trustworthy experiment analysis.
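The abstract above mentions a simple statistical check without naming it. A common choice for this kind of check is a chi-squared goodness-of-fit test on the observed user counts, sketched below for a two-variant experiment. The function name, the 50/50 default split, and the 0.001 alert threshold used in the example are assumptions for illustration, not values stated in the abstract.

```python
import math


def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """P-value of a chi-squared goodness-of-fit test with two groups.

    A very small p-value suggests the observed split is unlikely under the
    configured ratio, i.e. a possible Sample Ratio Mismatch.
    """
    total = control_users + treatment_users
    if total <= 0:
        raise ValueError("need at least one user")
    if not 0.0 < expected_control_share < 1.0:
        raise ValueError("expected_control_share must be strictly between 0 and 1")
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # With one degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    p = srm_p_value(50_000, 48_000)
    # 0.001 is an assumed, conservative alert threshold.
    print("possible SRM" if p < 0.001 else "no SRM detected", p)
```

The check only signals that a mismatch exists. It says nothing about the root cause, which is the harder problem the abstract is concerned with.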
In talent recruitment, the job interview aims at selecting the right candidates for the right jobs through assessing their skills and experiences in relation to the job positions. While tremendous efforts have been made in improving job interviews, a long-standing challenge is how to design appropriate interview questions for comprehensively assessing the competencies that may be deemed relevant and representative for person-job fit. To this end, in this research, we focus on the development of a personalized question recommender system, namely DuerQuiz, for enhancing the job interview assessment. DuerQuiz is a fully deployed system, in which a knowledge graph of job skills, Skill-Graph, has been built for comprehensively modeling the relevant competencies that should be assessed in the job interview. Specifically, we first develop a novel skill entity extraction approach based on a bidirectional Long Short-Term Memory (LSTM) with a Conditional Random Field (CRF) layer (LSTM-CRF) neural network enhanced with an adapted gate mechanism. In particular, to improve the reliability of extracted skill entities, we design a label propagation method based on more than 10 billion click-through records from the large-scale Baidu query logs. Furthermore, we discover the hypernym-hyponym relations between skill entities and construct the Skill-Graph by leveraging the classifier trained with extensive contextual features. Finally, we design a personalized question recommendation algorithm based on the Skill-Graph for improving the efficiency and effectiveness of job interview assessment. Extensive experiments on real-world recruitment data clearly validate the effectiveness of DuerQuiz, which was deployed to generate written exercises in the 2018 Baidu campus recruitment event and performed remarkably well in terms of efficiency and effectiveness in selecting outstanding talent compared with a traditional non-personalized human-only assessment approach.
Ancillaries have become a major source of revenue and profitability in the travel industry. Yet, conventional pricing strategies are based on business rules that are poorly optimized and do not respond to changing market conditions. This paper describes the dynamic pricing model developed by Deepair solutions, an AI technology provider for travel suppliers. We present a pricing model that provides dynamic pricing recommendations specific to each customer interaction and optimizes expected revenue per customer. The unique nature of personalized pricing provides the opportunity to search over the market space to find the optimal price-point of each ancillary for each customer, without violating customer privacy.
In this paper, we present and compare three approaches for dynamic pricing of ancillaries, with increasing levels of sophistication: (1) a two-stage forecasting and optimization model using a logistic mapping function; (2) a two-stage model that uses a deep neural network for forecasting, coupled with a revenue maximization technique using discrete exhaustive search; (3) a single-stage end-to-end deep neural network that recommends the optimal price. We describe the performance of these models based on both offline and online evaluations. We also measure the real-world business impact of these approaches by deploying them in an A/B test on an airline's internet booking website. We show that traditional machine learning techniques outperform human rule-based approaches in an online setting by improving conversion by 36% and revenue per offer by 10%. We also provide results for our offline experiments which show that deep learning algorithms outperform traditional machine learning techniques for this problem. Our end-to-end deep learning model is currently being deployed by the airline in their booking system.
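The two-stage idea described above (forecast purchase probability, then search for the revenue-maximizing price) can be sketched in a few lines. The sketch pairs a logistic purchase-probability curve with an exhaustive search over a discrete price grid. The parameter names `intercept` and `price_sensitivity`, their values, and the price grid are hypothetical, and the pairing itself is a simplification rather than the authors' exact models.

```python
import math


def purchase_probability(price, intercept, price_sensitivity):
    """Logistic curve: purchase probability falls as the price rises."""
    return 1.0 / (1.0 + math.exp(-(intercept - price_sensitivity * price)))


def best_price(candidate_prices, intercept, price_sensitivity):
    """Exhaustively search a discrete price grid for maximum expected revenue."""
    if not candidate_prices:
        raise ValueError("need at least one candidate price")
    return max(
        candidate_prices,
        key=lambda price: price * purchase_probability(price, intercept, price_sensitivity),
    )


if __name__ == "__main__":
    # Hypothetical price grid and demand parameters for one customer context.
    grid = [5, 10, 15, 20, 25, 30]
    print(best_price(grid, intercept=2.0, price_sensitivity=0.2))
```

In a personalized setting the demand parameters would come from a forecasting model conditioned on the customer interaction, so the same search yields a different price for each context.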
In this paper we present a novel approach to credit scoring of retail customers in the banking industry based on deep learning methods. We used RNNs on fine-grained transactional data to compute credit scores for the loan applicants. We demonstrate that our approach significantly outperforms the baselines based on the customer data of a large European bank. We also conducted a pilot study on loan applicants of the bank, and the study produced significant financial gains for the organization. In addition, our method has several other advantages described in the paper that are very significant for the bank.
Data analysis and machine learning methods have great potential to aid in planetary exploration. Spacecraft often operate at great distances from the Earth, and the ability to autonomously detect features of interest onboard can enable content-sensitive downlink prioritization to increase mission science return. We describe algorithms that we designed to assist in three specific scientific investigations to be conducted during flybys of Jupiter's moon Europa: the detection of thermal anomalies, compositional anomalies, and plumes of icy matter from Europa's subsurface ocean. We also share the unique constraints imposed by the onboard computing environment and several lessons learned in our collaboration with planetary scientists and mission designers.
Optimization-based models have been used to predict cellular behavior for over 25 years. The constraints in these models are derived from genome annotations, measured macromolecular composition of cells, and by measuring the cell's growth rate and metabolism in different conditions. The cellular goal (the optimization problem that the cell is trying to solve) can be challenging to derive experimentally for many organisms, including human or mammalian cells, which have complex metabolic capabilities and are not well understood. Existing approaches to learning goals from data include (a) estimating a linear objective function, or (b) estimating linear constraints that model complex biochemical reactions and constrain the cell's operation. The latter approach is important because often the known reactions are not enough to explain observations; therefore, there is a need to automatically extend the model complexity by learning new reactions. However, this leads to nonconvex optimization problems, and existing tools cannot scale to realistically large metabolic models. Hence, constraint estimation is still used sparingly despite its benefits for modeling cell metabolism, which is important for developing novel antimicrobials against pathogens, discovering cancer drug targets, and producing value-added chemicals. Here, we develop the first approach to estimating constraint reactions from data that can scale to realistically large metabolic models. Previous tools were used on problems having fewer than 75 reactions and 60 metabolites, which limits real-life-size applications. We perform extensive experiments using 75 large-scale metabolic network models for different organisms (including bacteria, yeasts, and mammals) and show that our algorithm can recover cellular constraint reactions.
The recovered constraints enable accurate prediction of metabolic states in hundreds of growth environments not seen in training data, and we recover useful cellular goals even when some measurements are missing.
Recommender systems are one of the most pervasive applications of machine learning in industry, with many services using them to match users to products or information. As such it is important to ask: what are the possible fairness risks, how can we quantify them, and how should we address them? In this paper we offer a set of novel metrics for evaluating algorithmic fairness concerns in recommender systems. In particular we show how measuring fairness based on pairwise comparisons from randomized experiments provides a tractable means to reason about fairness in rankings from recommender systems. Building on this metric, we offer a new regularizer to encourage improving this metric during model training and thus improve fairness in the resulting rankings. We apply this pairwise regularization to a large-scale, production recommender system and show that we are able to significantly improve the system's pairwise fairness.
Fairness-Aware Ranking in Search & Recommendation Systems with Application to LinkedIn Talent Search
We present a framework for quantifying and mitigating algorithmic bias in mechanisms designed for ranking individuals, typically used as part of web-scale search and recommendation systems. We first propose complementary measures to quantify bias with respect to protected attributes such as gender and age. We then present algorithms for computing fairness-aware re-ranking of results. For a given search or recommendation task, our algorithms seek to achieve a desired distribution of top ranked results with respect to one or more protected attributes. We show that such a framework can be tailored to achieve fairness criteria such as equality of opportunity and demographic parity depending on the choice of the desired distribution. We evaluate the proposed algorithms via extensive simulations over different parameter choices, and study the effect of fairness-aware ranking on both bias and utility measures. We finally present the online A/B testing results from applying our framework towards representative ranking in LinkedIn Talent Search, and discuss the lessons learned in practice. Our approach resulted in tremendous improvement in the fairness metrics (nearly three fold increase in the number of search queries with representative results) without affecting the business metrics, which paved the way for deployment to 100% of LinkedIn Recruiter users worldwide. Ours is the first large-scale deployed framework for ensuring fairness in the hiring domain, with the potential positive impact for more than 630M LinkedIn members.
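One way to picture re-ranking toward a desired distribution is a greedy pass that, at each position, first serves any group that has fallen below its minimum share of the results so far and otherwise takes the highest-scored remaining candidate. The sketch below is a simplified illustration of that idea, not the algorithm deployed at LinkedIn. The candidate tuples, group labels, and target shares are hypothetical.

```python
import math


def rerank(candidates, desired_share, top_k):
    """Greedy fairness-aware re-ranking (simplified, assumed sketch).

    candidates: list of (candidate_id, group, score) tuples.
    desired_share: dict mapping group -> target proportion of the top results.
    """
    # One queue per group, best score first.
    queues = {}
    for candidate in sorted(candidates, key=lambda c: c[2], reverse=True):
        queues.setdefault(candidate[1], []).append(candidate)
    counts = {group: 0 for group in queues}
    ranked = []
    for position in range(1, top_k + 1):
        available = [group for group in queues if queues[group]]
        if not available:
            break
        # Groups below their minimum count for a prefix of this length.
        behind = [
            group for group in available
            if counts[group] < math.floor(desired_share.get(group, 0.0) * position)
        ]
        pool = behind if behind else available
        # Among eligible groups, take the one whose best candidate scores highest.
        group = max(pool, key=lambda g: queues[g][0][2])
        ranked.append(queues[group].pop(0))
        counts[group] += 1
    return ranked


if __name__ == "__main__":
    people = [("a1", "A", 0.9), ("a2", "A", 0.8), ("a3", "A", 0.7),
              ("a4", "A", 0.6), ("b1", "B", 0.5), ("b2", "B", 0.4)]
    print([c[0] for c in rerank(people, {"A": 0.5, "B": 0.5}, top_k=4)])
```

Changing `desired_share` changes which fairness notion the ranking approximates, which mirrors the abstract's point that the criterion depends on the choice of the desired distribution.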
Most current distributed machine learning systems try to scale up model training by using a data-parallel architecture that divides the computation for different samples among workers. We study distributed machine learning from a different motivation, where the information about the same samples, e.g., users and objects, is owned by several parties that wish to collaborate but do not want to share raw data with each other.
We propose an asynchronous stochastic gradient descent (SGD) algorithm for such a feature distributed machine learning (FDML) problem, to jointly learn from distributed features, with theoretical convergence guarantees under bounded asynchrony. Our algorithm does not require sharing the original features or even local model parameters between parties, thus preserving the data locality. The system can also easily incorporate differential privacy mechanisms to preserve a higher level of privacy. We implement the FDML system in a parameter server architecture and compare our system with fully centralized learning (which violates data locality) and learning based on only local features, through extensive experiments performed on both the public dataset a9a and a large dataset of 5,000,000 records and 8700 decentralized features from three collaborating apps at Tencent including Tencent MyApp, Tencent QQ Browser and Tencent Mobile Safeguard. Experimental results have demonstrated that the proposed FDML system can be used to significantly enhance app recommendation in Tencent MyApp by leveraging user and item features from other apps, while preserving the locality and privacy of features in each individual app to a high degree.
Social media platforms bring together content creators and content consumers through recommender systems like newsfeed. The focus of such recommender systems has thus far been primarily on modeling the content consumer preferences and optimizing for their experience. However, it is equally critical to nurture content creation by prioritizing the creators' interests, as quality content forms the seed for sustainable engagement and conversations, bringing in new consumers while retaining existing ones. In this work, we propose a modeling approach to predict how feedback from content consumers incentivizes creators. We then leverage this model to optimize the newsfeed experience for content creators by reshaping the feedback distribution, leading to a more active content ecosystem. Practically, we discuss how we balance the user experience for both consumers and creators, and how we carry out online A/B tests with strong network effects. We present a deployed use case on the LinkedIn newsfeed, where we used this approach to improve content creation significantly without compromising the consumers' experience.
Audience Look-alike Targeting is an online advertising technique in which an advertiser specifies a set of seed customers and tasks the advertising platform with finding an expanded audience of similar users. We will describe a two-stage embedding-based audience expansion model that is deployed in production at Pinterest. For the first stage we trained a global user embedding model on sitewide user activity logs. In the second stage, we use transfer learning and statistical techniques to create lightweight seed list representations in the embedding space for each advertiser. We create a (user, seed list) affinity scoring function that makes use of these lightweight advertiser representations. We describe the end-to-end system that computes and serves this model at scale. Finally, we propose an ensemble approach that combines single-advertiser classifiers with the embedding-based technique. We show offline evaluation and online experiments to prove that the expanded audience generated by the ensemble model has the best results for all seed list sizes.
An important aspect of health monitoring is effective logging of food consumption. This can help in the management of diet-related diseases like obesity, diabetes, and even cardiovascular diseases. Moreover, food logging can help fitness enthusiasts and people who want to achieve a target weight. However, food logging is cumbersome, and requires not only taking additional effort to note down the food item consumed regularly, but also sufficient knowledge of the food item consumed (which is difficult due to the availability of a wide variety of cuisines). With increasing reliance on smart devices, we exploit the convenience offered through the use of smart phones and propose a smart food-logging system: FoodAI, which offers state-of-the-art deep-learning based image recognition capabilities. FoodAI has been developed in Singapore and is particularly focused on food items commonly consumed in Singapore. FoodAI models were trained on a corpus of 400,000 food images from 756 different classes.
In this paper we present extensive analysis and insights into the development of this system. FoodAI has been deployed as an API service and is one of the components powering Healthy 365, a mobile app developed by Singapore's Health Promotion Board. We have over 100 registered organizations (universities, companies, start-ups) subscribing to this service and actively receive several API requests a day. FoodAI has made food logging convenient, aiding smart consumption and a healthy lifestyle.
Deep Reinforcement Learning has been applied in a number of fields to directly optimize non-differentiable reward functions, including in sequence to sequence settings using Self Critical Sequence Training (SCST). Previously, SCST has primarily been applied to bring conditional language models closer to the distribution of their training set, as in traditional neural machine translation and abstractive summarization. We frame the generation of search engine text ads as a sequence to sequence problem, and consider two related goals: to generate ads similar to those a human would write, and to generate ads with high click-through rates. We jointly train a model to minimize cross-entropy on an existing corpus of Landing Page/Text Ad pairs using typical sequence to sequence training techniques while also optimizing the expected click-through rate (CTR) as predicted by an existing oracle model using SCST. Through joint training we achieve a 6.7% increase in expected CTR without a meaningful drop in ROUGE score. Human experiments demonstrate that SCST training produces significantly more attractive ads without reducing grammatical quality.
Prediction of glaucomatous visual field loss has significant clinical benefits because it can help with early detection of glaucoma as well as decision-making for treatments. Glaucomatous visual loss is conventionally captured through visual field sensitivity (VF) measurement, which is costly and time-consuming. Thus, existing approaches mainly predict future VF utilizing limited VF data collected in the past. Recently, optical coherence tomography (OCT) has been adopted to measure retinal layer thickness (RT) for considerably more low-cost treatment assistance. There then arises an important question in the context of ophthalmology: are RT measurements beneficial for VF prediction? In this paper, we propose a novel method to demonstrate the benefits provided by RT measurements. The challenge is management of the two heterogeneities of VF data and RT data, as RT data are collected according to different clinical schedules and lie in a different space from VF data. To tackle these heterogeneities, we propose latent progression patterns (LPPs), a novel type of representation for glaucoma progression. Along with LPPs, we propose a method to transform VF series to an LPP based on matrix factorization and a method to transform RT series to an LPP based on deep neural networks. Partial VF and RT information is integrated in LPPs to provide accurate prediction. The proposed framework is named deeply-regularized latent-space linear regression (DLLR). We empirically demonstrate that our proposed method outperforms the state-of-the-art technique by 12% for the best case in terms of the mean of the root mean square error on a real dataset.
In this paper, we present Smart Compose, a novel system for generating interactive, real-time suggestions in Gmail that assists users in writing mails by reducing repetitive typing. In the design and deployment of such a large-scale and complicated system, we faced several challenges including model selection, performance evaluation, serving and other practical issues. At the core of Smart Compose is a large-scale neural language model. We leveraged state-of-the-art machine learning techniques for language model training which enabled high-quality suggestion prediction, and constructed novel serving infrastructure for high-throughput and real-time inference. Experimental results show the effectiveness of our proposed system design and deployment approach. This system is currently being served in Gmail.
In this paper we consider the problem of estimating the difficulty of parking at a particular time and place; this problem is a critical sub-component for any system providing parking assistance to users. We describe an approach to this problem that is currently in production in Google Maps, providing inferences in cities across the world. We present a wide range of features intended to capture different aspects of parking difficulty and study their effectiveness both alone and in combination. We also evaluate various model architectures for the prediction problem. Finally, we present challenges faced in estimating parking difficulty in different regions of the world, and the approaches we have taken to address them.
Recognizing entities that follow or closely resemble a regular expression (regex) pattern is an important task in information extraction. Common approaches for extraction of such entities require humans to either write a regex recognizing an entity or manually label entity mentions in a document corpus. While human effort is critical to build an entity recognition model, surprisingly little is known about how to best invest that effort given a limited time budget. To get an answer, we consider an iterative human-in-the-loop (HIL) framework that allows users to write a regex or manually label entity mentions, followed by training and refining a classifier based on the provided information. We demonstrate on 5 entity recognition tasks that classification accuracy improves over time with either approach. When a user is allowed to choose between regex construction and manual labeling, we discover that (1) if the time budget is low, spending all the time on regex construction is often advantageous, (2) if the time budget is high, spending all the time on manual labeling seems to be superior, and (3) between those two extremes, writing regexes followed by manual labeling is typically the best approach. Our code and data are available at https://github.com/nymph332088/HILRecognizer.
Transportation recommendation is an important map service in navigation applications. Previous transportation recommendation solutions fail to deliver a satisfactory user experience because their recommendations only consider routes in one transportation mode (uni-modal, e.g., taxi, bus, cycle) and largely overlook situational context. In this work, we propose Hydra, a recommendation system that offers multi-modal transportation planning and adapts to various situational contexts (e.g., nearby point-of-interest (POI) distribution and weather). We leverage the availability of existing routing engines and big urban data, and design a novel two-level framework that integrates uni-modal and multi-modal (e.g., taxi-bus, bus-cycle) routes as well as heterogeneous urban data for intelligent multi-modal transportation recommendation. In addition to urban context features constructed from multi-source urban data, we learn the latent representations of users, origin-destination (OD) pairs and transportation modes based on implicit user feedback, which capture the collaborative transportation mode preferences of users and OD pairs. A gradient boosting tree based model is then introduced to recommend the proper route among various uni-modal and multi-modal transportation routes. We also optimize the framework to support real-time, large-scale route query and recommendation. We deploy Hydra on Baidu Maps, one of the world's largest map services. Real-world urban-scale experiments demonstrate the effectiveness and efficiency of our proposed system. Since its deployment in August 2018, Hydra has answered over a hundred million route recommendation queries made by over ten million distinct users with 82.8% relative improvement of user click ratio.
Water managers in the western United States (U.S.) rely on long-term forecasts of temperature and precipitation to prepare for droughts and other wet weather extremes. To improve the accuracy of these long-term forecasts, the U.S. Bureau of Reclamation and the National Oceanic and Atmospheric Administration (NOAA) launched the Subseasonal Climate Forecast Rodeo, a year-long real-time forecasting challenge in which participants aimed to skillfully predict temperature and precipitation in the western U.S. two to four weeks and four to six weeks in advance. Here we present and evaluate our machine learning approach to the Rodeo and release our SubseasonalRodeo dataset, collected to train and evaluate our forecasting system.
Our system is an ensemble of two nonlinear regression models. The first integrates the diverse collection of meteorological measurements and dynamic model forecasts in the SubseasonalRodeo dataset and prunes irrelevant predictors using a customized multitask feature selection procedure. The second uses only historical measurements of the target variable (temperature or precipitation) and introduces multitask nearest neighbor features into a weighted local linear regression. Each model alone is significantly more accurate than the debiased operational U.S. Climate Forecasting System (CFSv2), and our ensemble skill exceeds that of the top Rodeo competitor for each target variable and forecast horizon. Moreover, over 2011-2018, an ensemble of our regression models and debiased CFSv2 improves debiased CFSv2 skill by 40-50% for temperature and 129-169% for precipitation. We hope that both our dataset and our methods will help to advance the state of the art in subseasonal forecasting.
Understanding users' context is essential for successful recommendations, especially for Online-to-Offline (O2O) recommendation on platforms such as Yelp, Groupon, and Koubei. Unlike traditional recommendation, where individual preference is mostly static, O2O recommendation should be dynamic to capture variation in users' purposes across time and location. However, precisely inferring users' real-time context information, especially implicit context, is extremely difficult, and it is a central challenge for O2O recommendation. In this paper, we propose a new approach, called Mixture Attentional Constrained Denoise AutoEncoder (MACDAE), to infer implicit contexts and, consequently, to improve the quality of real-time O2O recommendation. In MACDAE, we first leverage the interaction among users, items, and explicit contexts to infer users' implicit contexts, then combine the learned implicit-context representation into an end-to-end model to make the recommendation. MACDAE performs well in a production system. We conducted both offline and online evaluations of the proposed approach. Experiments on several real-world datasets (Yelp, Dianping, and Koubei) show that our approach achieves significant improvements over state-of-the-art methods. Furthermore, an online A/B test suggests a 2.9% increase in click-through rate and a 5.6% improvement in conversion rate on real-world traffic. Our model has been deployed in the "Guess You Like" recommendation product in Koubei.
IntentGC: A Scalable Graph Convolution Framework Fusing Heterogeneous Information for Recommendation
The remarkable progress of network embedding has led to state-of-the-art algorithms in recommendation. However, the sparsity of user-item interactions (i.e., explicit preferences) on websites remains a big challenge for predicting users' behaviors. Although research efforts have been made in utilizing some auxiliary information (e.g., social relations between users) to solve the problem, the existing rich heterogeneous auxiliary relationships are still not fully exploited. Moreover, previous works relied on linearly combined regularizers and suffered from parameter tuning. In this work, we collect abundant relationships from common user behaviors and item information, and propose a novel framework named IntentGC to leverage both explicit preferences and heterogeneous relationships by graph convolutional networks. In addition to the capability of modeling heterogeneity, IntentGC can learn the importance of different relationships automatically by the neural model in a nonlinear sense. To apply IntentGC to web-scale applications, we design a faster graph convolutional model named IntentNet by avoiding unnecessary feature interactions. Empirical experiments on two large-scale real-world datasets and online A/B tests in Alibaba demonstrate the superiority of our method over state-of-the-art algorithms. We also release the source code of our work at https://github.com/peter14121/intentgc-models.
Most large Internet companies run internal promotions to cross-promote their different products and/or to educate members on how to obtain additional value from the products that they already use. This in turn drives engagement and/or revenue for the company. However, since these internal promotions can distract a member away from the product or page where these are shown, there is a non-zero cannibalization loss incurred for showing these internal promotions. This loss has to be carefully weighed against the gain from showing internal promotions. This can be a complex problem if different internal promotions optimize for different objectives. In that case, it is difficult to compare not just the gain from a conversion through an internal promotion against the loss incurred for showing that internal promotion, but also the gains from conversions through different internal promotions. Hence, we need a principled approach for deciding which internal promotion (if any) to serve to a member in each opportunity to serve an internal promotion. This approach should optimize not just for the net gain to the company, but also for the member's experience. In this paper, we discuss our approach for optimization of internal promotions at LinkedIn. In particular, we present a cost-benefit analysis of showing internal promotions, our formulation of internal promotion optimization as a constrained optimization problem, the architecture of the system for solving the optimization problem and serving internal promotions in real-time, and experimental results from online A/B tests.
Increasing rates of opioid drug abuse and heightened prevalence of online support communities underscore the necessity of employing data mining techniques to better understand drug addiction using these rapidly developing online resources. In this work, we obtained data from Reddit, an online collection of forums, to gather insight into drug use/misuse using text snippets from users' narratives. Specifically, using users' posts, we trained a binary classifier which predicts a user's transitions from casual drug discussion forums to drug recovery forums. We also proposed a Cox regression model that outputs likelihoods of such transitions. In doing so, we found that utterances of select drugs and certain linguistic features contained in one's posts can help predict these transitions. Using unfiltered drug-related posts, our research delineates drugs that are associated with higher rates of transitions from recreational drug discussion to support/recovery discussion, offers insight into modern drug culture, and provides tools with potential applications in combating the opioid crisis.
Investment Behaviors Can Tell What Inside: Exploring Stock Intrinsic Properties for Stock Trend Prediction
Stock trend prediction, which aims to predict the future price trend of stocks, plays a key role in seeking maximized profit from stock investment. Recent years have witnessed increasing efforts in applying machine learning techniques, especially deep learning, to pursue more promising stock prediction. While deep learning has given rise to significant improvement, human investors still retain the leading position due to their understanding of stock intrinsic properties, which can imply invaluable principles for stock prediction. In this paper, we propose to extract and explore stock intrinsic properties to enhance stock trend prediction. Fortunately, we discover that the repositories of investment behaviors within mutual fund portfolio data form a gold mine from which to extract latent representations of stock properties, since such collective investment behaviors can reflect the professional fund managers' common beliefs on stock intrinsic properties. Powered by extracted stock properties, we further propose to model the dynamic market state and trend using stock representations so as to generate the dynamic correlation between the stock and the market, and then we aggregate such correlation with dynamic stock indicators to achieve more accurate stock prediction. Extensive experiments on real-world stock market data demonstrate the effectiveness of stock properties extracted from collective investment behaviors in the task of stock prediction.
Materials discovery is crucial for making scientific advances in many domains. Collections of data from experiments and first-principle computations have spurred interest in applying machine learning methods to create predictive models capable of mapping from composition and crystal structures to materials properties. Generally, these are regression problems with the input being a 1D vector composed of numerical attributes representing the material composition and/or crystal structure. While neural networks consisting of fully connected layers have been applied to such problems, their performance often suffers from the vanishing gradient problem when network depth is increased. Hence, predictive modeling for such tasks has been mainly limited to traditional machine learning techniques such as Random Forest. In this paper, we study and propose design principles for building deep regression networks composed of fully connected layers with numerical vectors as input. We introduce a novel deep regression network with individual residual learning, IRNet, that places shortcut connections after each layer so that each layer learns the residual mapping between its output and input. We use the problem of learning properties of inorganic materials from numerical attributes derived from material composition and/or crystal structure to compare IRNet's performance against that of other machine learning techniques. Using multiple datasets from the Open Quantum Materials Database (OQMD) and Materials Project for training and evaluation, we show that IRNet provides significantly better prediction performance than the state-of-the-art machine learning approaches currently used by domain scientists. We also show that IRNet's use of individual residual learning leads to better convergence during the training phase than when shortcut connections are between multi-layer stacks while maintaining the same number of parameters.
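To make the idea of individual residual learning concrete, here is a minimal sketch in plain Python. It assumes every layer preserves the vector width and uses a ReLU activation; the helper names (`dense_relu`, `irnet_forward`) and the toy weights are illustrative assumptions and are not taken from the paper, whose actual architecture may differ in detail. The point shown is only that a shortcut is placed after each single layer, so each layer learns the residual between its output and input.

```python
def dense_relu(x, weights, bias):
    """Fully connected layer followed by ReLU: max(0, W x + b)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def irnet_forward(x, layers):
    """Apply each layer with its own shortcut: h <- layer(h) + h."""
    h = list(x)
    for weights, bias in layers:
        out = dense_relu(h, weights, bias)
        # Shortcut connection after every single layer (same width assumed).
        h = [o + hi for o, hi in zip(out, h)]
    return h


# Toy usage: two 3x3 layers with all-zero parameters act as the identity.
zero_layer = ([[0.0] * 3 for _ in range(3)], [0.0] * 3)
features = [0.5, -1.0, 2.0]
print(irnet_forward(features, [zero_layer, zero_layer]))
```

Because a layer with zero weights passes its input through unchanged, stacking more such layers cannot make the starting point worse, which is one common intuition for why per-layer shortcuts ease training of deep fully connected networks.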
Video is one of the richest sources of information available online but extracting deep insights from video content at internet scale is still an open problem, both in terms of depth and breadth of understanding, as well as scale. Over the last few years, the field of video understanding has made great strides due to the availability of large-scale video datasets and core advances in image, audio, and video modeling architectures. However, the state-of-the-art architectures on small scale datasets are frequently impractical to deploy at internet scale, both in terms of the ability to train such deep networks on hundreds of millions of videos, and to deploy them for inference on billions of videos. In this paper, we present a MapReduce-based training framework, which exploits both data parallelism and model parallelism to scale training of complex video models. The proposed framework uses alternating optimization and full-batch fine-tuning, and supports large Mixture-of-Experts classifiers with hundreds of thousands of mixtures, which enables a trade-off between model depth and breadth, and the ability to shift model capacity between shared (generalization) layers and per-class (specialization) layers. We demonstrate that the proposed framework is able to reach state-of-the-art performance on the largest public video datasets, YouTube-8M and Sports-1M, and can scale to 100 times larger datasets.
Large-scale User Visits Understanding and Forecasting with Deep Spatial-Temporal Tensor Factorization Framework
Understanding and forecasting user visits is of great importance for a variety of tasks, e.g., online advertising, which is one of the most profitable business models for Internet services. Publishers sell advertising spaces in advance with user visit volume and attribute guarantees. There are usually tens of thousands of attribute combinations in an online advertising system. The key problem is how to accurately forecast the number of user visits for each attribute combination. Many traditional methods that characterize the temporal trend of each individual time series are quite inefficient for large-scale time series. Recently, a number of models based on deep learning or matrix factorization have been proposed for high-dimensional time series forecasting. However, most of them neglect correlations among attribute combinations, or are tailored for specific applications, resulting in poor adaptability to different business scenarios. Besides, sophisticated deep learning models usually incur high time and space complexity. There is still a lack of an efficient, highly scalable, and adaptable solution for accurate high-dimensional time series forecasting. To address this issue, in this work, we conduct a thorough analysis of large-scale user visits data and propose a novel deep spatial-temporal tensor factorization framework, which provides a general design for high-dimensional time series forecasting. We deployed the proposed framework in the Tencent online guaranteed delivery advertising system, and extensively evaluated the effectiveness and efficiency of the framework in two different large-scale application scenarios. The results show that our framework outperforms existing methods in prediction accuracy. Meanwhile, it significantly reduces the number of parameters and is resistant to incomplete data with up to 20% missing values.
At Pinterest, we utilize image embeddings throughout our search and recommendation systems to help our users navigate through visual content by powering experiences like browsing of related content and searching for exact products for shopping. In this work we describe a multi-task deep metric learning system to learn a single unified image embedding which can be used to power our multiple visual search products. The solution we present not only allows us to train for multiple application objectives in a single deep neural network architecture, but takes advantage of correlated information in the combination of all training data from each application to generate a unified embedding that outperforms all specialized embeddings previously deployed for each product.
We discuss the challenges of handling images from different domains such as camera photos, high quality web images, and clean product catalog images. We also detail how to jointly train for multiple product objectives and how to leverage both engagement data and human labeled data. In addition, our trained embeddings can also be binarized for efficient storage and retrieval without compromising precision and recall. Through comprehensive evaluations on offline metrics, user studies, and online A/B experiments, we demonstrate that our proposed unified embedding improves both relevance and engagement of our visual search products for both browsing and searching purposes when compared to existing specialized embeddings. Finally, the deployment of the unified embedding at Pinterest has drastically reduced the operation and engineering cost of maintaining multiple embeddings while improving quality.
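The abstract does not say how the embeddings are binarized, so the following is only a sketch of one common scheme, assumed here for illustration: threshold each dimension at zero, pack the resulting bits into an integer, and compare codes by Hamming distance. The function names and the example vectors are hypothetical.

```python
def binarize(embedding):
    """Pack a float vector into an int bit mask: bit i is 1 if value i > 0."""
    bits = 0
    for i, value in enumerate(embedding):
        if value > 0:
            bits |= 1 << i
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


# Hypothetical 4-dimensional embeddings for a query image and a candidate.
query = binarize([0.3, -0.2, 0.9, -0.5])
candidate = binarize([0.1, 0.4, 0.8, -0.1])
print(query, candidate, hamming_distance(query, candidate))
```

Each dimension then costs one bit instead of a 32-bit float, and distance computation reduces to an XOR plus a bit count, which is why binary codes are attractive for large-scale retrieval.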
Precision psychiatry is a new research field that uses advanced data mining over a wide range of neural, behavioral, psychological, and physiological data sources for classification of mental health conditions. This study presents a computational framework for predicting sleep efficiency of insomnia sufferers. A smart band experiment is conducted to collect heterogeneous data, including sleep records, daily activities, and demographics, whose missing values are imputed via Improved Generative Adversarial Imputation Networks (Imp-GAIN). Equipped with the imputed data, we predict sleep efficiency of individual users with a proposed interpretable LSTM-Attention (LA Block) neural network model. We also propose a model, Pairwise Learning-based Ranking Generation (PLRG), to rank users with high insomnia potential in the next day. We discuss implications of our findings from the perspective of a psychiatric practitioner. Our computational framework can be used for other applications that analyze and handle noisy and incomplete time-series human activity data in the domain of precision psychiatry.
Digital Adherence Technologies (DATs) are an increasingly popular method for verifying patient adherence to many medications. We analyze data from one city served by 99DOTS, a phone-call-based DAT deployed for Tuberculosis (TB) treatment in India where nearly 3 million people are afflicted with the disease each year. The data contains nearly 17,000 patients and 2.1M dose records. We lay the groundwork for learning from this real-world data, including a method for avoiding the effects of unobserved interventions in training data used for machine learning. We then construct a deep learning model, demonstrate its interpretability, and show how it can be adapted and trained in three different clinical scenarios to better target and improve patient care. In the real-time risk prediction setting our model could be used to proactively intervene with 21% more patients and before 76% more missed doses than current heuristic baselines. For outcome prediction, our model performs 40% better than baseline methods, allowing cities to target more resources to clinics with a heavier burden of patients at risk of failure. Finally, we present a case study demonstrating how our model can be trained in an end-to-end decision focused learning setting to achieve 15% better solution quality in an example decision problem faced by health workers.
Lightning as a natural phenomenon poses serious threats to human life, aviation and electrical infrastructures. Lightning prediction plays a vital role in lightning disaster reduction. Existing prediction methods, usually based on numerical weather models, rely on lightning parameterization schemes for forecasting. These methods, however, have two drawbacks. Firstly, simulations of the numerical weather models usually have deviations in space and time domains, which introduces irreparable biases to subsequent parameterization processes. Secondly, the lightning parameterization schemes are designed manually by experts in meteorology, which means these schemes can hardly benefit from abundant historical data. In this work, we propose a data-driven model based on neural networks, referred to as LightNet, for lightning prediction. Unlike the conventional prediction methods which are fully based on numerical weather models, LightNet introduces recent lightning observations in an attempt to calibrate the simulations and assist the prediction. LightNet first extracts spatiotemporal features of the simulations and observations via dual encoders. These features are then combined by a fusion module. Finally, the fused features are fed into a spatiotemporal decoder to make forecasts. We conduct experimental evaluations on a real-world North China lightning dataset, which shows that LightNet achieves a threefold improvement in equitable threat score for six-hour prediction compared with three established forecast methods.
Machine Learning is transitioning from an art and science into a technology available to every developer. In the near future, every application on every platform will incorporate trained models to encode data-based decisions that would be impossible for developers to author. This presents a significant engineering challenge, since currently data science and modeling are largely decoupled from standard software development processes. This separation makes incorporating machine learning capabilities inside applications unnecessarily costly and difficult, and furthermore discourages developers from embracing ML in the first place. In this paper we present ML.NET, a framework developed at Microsoft over the last decade in response to the challenge of making it easy to ship machine learning models in large software applications. We present its architecture, and illuminate the application demands that shaped it. Specifically, we introduce DataView, the core data abstraction of ML.NET which allows it to capture full predictive pipelines efficiently and consistently across training and inference lifecycles. We close the paper with a surprisingly favorable performance study of ML.NET compared to more recent entrants, and a discussion of some lessons learned.
Mathematical Notions vs. Human Perception of Fairness: A Descriptive Approach to Fairness for Machine Learning
Fairness for Machine Learning has received considerable attention recently. Various mathematical formulations of fairness have been proposed, and it has been shown that it is impossible to satisfy all of them simultaneously. The literature so far has dealt with these impossibility results by quantifying the tradeoffs between different formulations of fairness. Our work takes a different perspective on this issue. Rather than requiring all notions of fairness to (partially) hold at the same time, we ask which one of them is the most appropriate given the societal domain in which the decision-making model is to be deployed. We take a descriptive approach and set out to identify the notion of fairness that best captures lay people's perception of fairness. We run adaptive experiments designed to pinpoint the most compatible notion of fairness with each participant's choices through a small number of tests. Perhaps surprisingly, we find that the most simplistic mathematical definition of fairness, namely demographic parity, most closely matches people's idea of fairness in two distinct application scenarios. This conclusion remains intact even when we explicitly tell the participants about the alternative, more complicated definitions of fairness, and we reduce the cognitive burden of evaluating those notions for them. Our findings have important implications for the Fair ML literature and the discourse on formalizing algorithmic fairness.
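For readers unfamiliar with demographic parity, the criterion asks that the rate of favorable decisions be the same across groups. The sketch below measures how far a set of decisions is from that ideal; the function names and the toy data are assumptions for illustration and do not come from the paper's experiments.

```python
def positive_rate(decisions):
    """Fraction of positive (1) decisions in a list of 0/1 decisions."""
    return sum(decisions) / len(decisions)


def demographic_parity_gap(decisions, groups):
    """Largest difference in positive-decision rate between any two groups."""
    by_group = {}
    for decision, group in zip(decisions, groups):
        by_group.setdefault(group, []).append(decision)
    rates = [positive_rate(d) for d in by_group.values()]
    return max(rates) - min(rates)


# Hypothetical toy data: 1 = favorable decision, groups "a" and "b".
decisions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_gap(decisions, groups))
```

A gap of zero means demographic parity holds exactly; note that the measure ignores the true labels entirely, which is what makes it simpler than error-rate-based notions of fairness.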
In the recent political climate, the topic of news quality has drawn attention both from the public and the academic communities. The growing distrust of traditional news media makes it harder to find a common base of accepted truth. In this work, we design and build MediaRank (www.media-rank.com), a fully automated system to rank over 50,000 online news sources around the world. MediaRank collects and analyzes one million news webpages and two million related tweets every day. We base our algorithmic analysis on four properties journalists have established to be associated with reporting quality: peer reputation, reporting bias/breadth, bottomline financial pressure, and popularity. The major contributions of this paper include: (i) Open, interpretable quality rankings for over 50,000 of the world's major news sources. Our rankings are validated against 35 published news rankings, including French, German, Russian, and Spanish language sources. MediaRank scores correlate positively with 34 of 35 of these expert rankings. (ii) New computational methods for measuring influence and bottomline pressure. To the best of our knowledge, we are the first to study the large-scale news reporting citation graph in-depth. We also propose new ways to measure the aggressiveness of advertisements and identify social bots, establishing a connection between both types of bad behavior. (iii) Analyzing the effect of media source bias and significance. We prove that news sources cite others despite different political views in accord with quality measures. However, in four English-speaking countries (US, UK, Canada, and Australia), the highest ranking sources all disproportionately favor left-wing parties, even when the majority of news sources exhibited conservative slants.
With the prevalence of mobile e-commerce nowadays, a new type of recommendation service, called intent recommendation, is widely used in many mobile e-commerce Apps, such as Taobao and Amazon. Unlike traditional query recommendation and item recommendation, intent recommendation automatically recommends user intent according to user historical behaviors without any input when users open the App. Intent recommendation has become very popular in the past two years because it reveals users' latent intents and avoids tedious input on mobile phones. Existing methods used in industry usually need laborious feature engineering. Moreover, they only utilize attribute and statistic information of users and queries, and fail to take full advantage of rich interaction information in intent recommendation, which may result in limited performance. In this paper, we propose to model the complex objects and rich interactions in intent recommendation as a Heterogeneous Information Network. Furthermore, we present a novel Metapath-guided Embedding method for Intent Recommendation (called MEIRec). In order to fully utilize rich structural information, we design a metapath-guided heterogeneous Graph Neural Network to learn the embeddings of objects in intent recommendation. In addition, in order to reduce the huge number of embedding parameters to learn, we propose a uniform term embedding mechanism, in which embeddings of objects are composed from the same term embedding space. Offline experiments on real large-scale data show the superior performance of the proposed MEIRec, compared to representative methods. Moreover, the results of online experiments on the Taobao e-commerce platform show that MEIRec not only gains a performance improvement of 1.54% on the CTR metric, but also attracts up to 2.66% of new users to search queries.
In recent years, large amounts of health data, such as patient Electronic Health Records (EHR), have become readily available. This provides an unprecedented opportunity for knowledge discovery and data mining algorithms to extract insights from them, which can later help improve the quality of care delivery. Predictive modeling of clinical risks, including in-hospital mortality, hospital readmission, chronic disease onset, condition exacerbation, etc., from patient EHR is one of the health data analytics problems that attracts considerable interest. The problem is not only important in clinical settings but also challenging because of EHR characteristics such as sparsity, irregularity, and temporality. Unlike applications in other domains such as computer vision and natural language processing, the data samples in medicine (patients) are relatively limited, which makes it difficult to build effective predictive models, especially complicated ones such as deep learning models. In this paper, we propose MetaPred, a meta-learning framework for clinical risk prediction from longitudinal patient EHR. In particular, in order to predict the target risk with limited data samples, we train a meta-learner from a set of related risk prediction tasks, which learns how a good predictor is trained. The meta-learner can then be directly used in target risk prediction, and the limited available samples in the target domain can be used for further fine-tuning the model performance. The effectiveness of MetaPred is tested on a real patient EHR repository from Oregon Health & Science University. We are able to demonstrate that with Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) as base predictors, MetaPred can achieve much better performance for predicting target risk with low resources compared with the predictor trained on the limited samples available for this risk alone.
Baidu runs the largest commercial web search engine in China, serving hundreds of millions of online users every day in response to a great variety of queries. In order to build a high-efficiency sponsored search engine, we used to adopt a three-layer funnel-shaped structure to screen and sort hundreds of ads from billions of ad candidates subject to the requirement of low response latency and the restraints of computing resources. Given a user query, the top matching layer is responsible for providing semantically relevant ad candidates to the next layer, while the ranking layer at the bottom is more concerned with business indicators (e.g., CPM, ROI, etc.) of those ads. The clear separation between the matching and ranking objectives results in a lower commercial return. The Mobius project has been established to address this serious issue. It is our first attempt to train the matching layer to consider CPM as an additional optimization objective besides the query-ad relevance, via directly predicting CTR (click-through rate) from billions of query-ad pairs. Specifically, this paper elaborates on how we adopt active learning to overcome the insufficiency of click history at the matching layer when training our neural click networks offline, and how we use the state-of-the-art ANN (approximate nearest neighbor) search technique for retrieving ads more efficiently. We contribute the solutions to Mobius-V1 as the first version of our next generation query-ad matching system.
In this paper we present a deployed image recognition system used in a large scale commerce search engine, which we call MSURU. It is designed to process product images uploaded daily to Facebook Marketplace. Social commerce is a growing area within Facebook and understanding visual representations of product content is important for search and recommendation applications on Marketplace. In this paper, we present techniques we used to develop efficient large-scale image classifiers using weakly supervised search log data. We perform extensive evaluation of the presented techniques, explain practical experience of developing large-scale classification systems, and discuss challenges we faced. Our system, MSURU, outperformed the current state-of-the-art system developed at Facebook by 16% in the e-commerce domain. MSURU is deployed to production with significant improvements in search success rate and active interactions on Facebook Marketplace.
We propose a novel data-driven approach for solving multi-horizon probabilistic forecasting tasks that predicts the full distribution of a time series on future horizons. We illustrate that temporal patterns hidden in historical information play an important role in accurate forecasting of long time series. Traditional methods rely on setting up temporal dependencies manually to explore related patterns in historical data, which is unrealistic in forecasting long-term series on real-world data. Instead, we propose to explicitly learn constructing hidden patterns' representations with deep neural networks and attending to different parts of the history for forecasting the future.
In this paper, we propose an end-to-end deep-learning framework for multi-horizon time series forecasting, with temporal attention mechanisms to better capture latent patterns in historical data which are useful in predicting the future. Forecasts of multiple quantiles on multiple future horizons can be generated simultaneously based on the learned latent pattern features. We also propose a multimodal fusion mechanism which is used to combine features from different parts of the history to better represent the future. Experiment results demonstrate our approach achieves state-of-the-art performance on two large-scale forecasting datasets in different domains.
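Forecasts of several quantiles are usually trained and scored with the quantile (pinball) loss. The abstract does not name the loss it uses, so the following is only a minimal sketch under that assumption; the function names `pinball_loss` and `multi_quantile_loss` are hypothetical and not taken from the paper.

```python
def pinball_loss(y_true, y_pred, q):
    """Average quantile (pinball) loss at quantile level q, with 0 < q < 1.

    Under-prediction is penalized with weight q and over-prediction with
    weight (1 - q), so minimizing it pushes y_pred toward the q-th quantile.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def multi_quantile_loss(y_true, preds_by_quantile):
    """Mean pinball loss over several quantile levels (assumed aggregation).

    preds_by_quantile maps a quantile level to the list of predictions for
    that level, one prediction per future horizon.
    """
    losses = [pinball_loss(y_true, preds, q)
              for q, preds in preds_by_quantile.items()]
    return sum(losses) / len(losses)
```

For example, `multi_quantile_loss(actuals, {0.1: low, 0.5: median, 0.9: high})` scores three quantile forecasts over the same horizons with a single number.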
Online gaming is a multi-billion dollar industry that entertains a large, global population. However, one unfortunate phenomenon known as real money trading harms the competition and the fun. Real money trading is an economic activity in which assets in a virtual world are exchanged for real-world currencies, leading to an imbalanced game economy and inequality of wealth and opportunity. Game operation teams have devoted much effort to real money trading detection; however, it remains a challenging task. To overcome the limitations of the traditional methods used by game operation teams, we propose MVAN, the first multi-view attention network for detecting real money trading with multi-view data sources. We present a multi-graph attention network (MGAT) in the graph structure view, a behavior attention network (BAN) in the vertex content view, a portrait attention network (PAN) in the vertex attribute view, and a data source attention network (DSAN) in the data source view. Experiments conducted on real-world game logs from a commercial NetEase MMORPG (JusticePC) show that our method consistently achieves promising results compared with other competitive methods over time, and verify the importance and rationality of the attention mechanisms. MVAN is deployed in several NetEase MMORPGs in practice, achieving remarkable performance improvement and acceleration. Our method can easily generalize to other related real-world tasks, such as fraud detection, drug tracking, and money laundering tracking.
In the clinical domain, it is important to understand whether an adverse drug reaction (ADR) is caused by a particular medication. Clinical judgement studies help judge the causal relation between a medication and its ADRs. In this study, we present the first attempt to automatically infer the causality between a drug and an ADR from electronic health records (EHRs) by answering the Naranjo questionnaire, the validated clinical question answering set used by domain experts for ADR causality assessment. Using physicians' annotations as the gold standard, our proposed joint model, which uses multi-task learning to predict the answers to a subset of the Naranjo questionnaire, significantly outperforms the baseline pipeline model by a good margin, achieving a macro-weighted F-score between 0.3652 and 0.5271 and a micro-weighted F-score between 0.9523 and 0.9918.
Nonparametric Mixture of Sparse Regressions on Spatio-Temporal Data -- An Application to Climate Prediction
Climate prediction is a very challenging problem. Many institutes around the world try to predict climate variables by building climate models called General Circulation Models (GCMs), which are based on mathematical equations that describe the physical processes. The prediction abilities of different GCMs may vary dramatically across different regions and time. Motivated by the need of identifying which GCMs are more useful for a particular region and time, we introduce a clustering model combining Dirichlet Process (DP) mixture of sparse linear regression with Markov Random Fields (MRFs). This model incorporates DP to automatically determine the number of clusters, imposes MRF constraints to guarantee spatio-temporal smoothness, and selects a subset of GCMs that are useful for prediction within each spatio-temporal cluster with a spike-and-slab prior. We derive an effective Gibbs sampling method for this model. Experimental results are provided for both synthetic and real-world climate data.
What did it feel like to walk through a city from the past? In this work, we describe Nostalgin (Nostalgia Engine), a method that can faithfully reconstruct cities from historical images. Unlike existing work in city reconstruction, we focus on the task of reconstructing 3D cities from historical images. Working with historical image data is substantially more difficult, as there are significantly fewer buildings available and the details of the camera parameters which captured the images are unknown. Nostalgin can generate a city model even if there is only a single image per facade, regardless of viewpoint or occlusions. To achieve this, our novel architecture combines image segmentation, rectification, and inpainting. We motivate our design decisions with experimental analysis of individual components of our pipeline, and show that we can improve on baselines in both speed and visual realism. We demonstrate the efficacy of our pipeline by recreating two 1940s Manhattan city blocks. We aim to deploy Nostalgin as an open source platform where users can generate immersive historical experiences from their own photos.
News recommendation is very important for helping users find news of interest and alleviating information overload. Different users usually have different interests, and the same user may have various interests. Thus, different users may click the same news article while paying attention to different aspects. In this paper, we propose a neural news recommendation model with personalized attention (NPA). The core of our approach is a news representation model and a user representation model. In the news representation model, we use a CNN to learn hidden representations of news articles based on their titles. In the user representation model, we learn the representations of users based on the representations of their clicked news articles. Since different words and different news articles may differ in how informative they are for representing news and users, we propose to apply both word-level and news-level attention mechanisms to help our model attend to important words and news articles. In addition, the same news article and the same word may have different informativeness for different users. Thus, we propose a personalized attention network which exploits the embedding of the user ID to generate the query vector for the word-level and news-level attentions. Extensive experiments are conducted on a real-world news recommendation dataset collected from MSN news, and the results validate the effectiveness of our approach on news recommendation.
Linking entities from different sources is a fundamental task in building open knowledge graphs. Despite much research conducted in related fields, the challenges of linking large-scale heterogeneous entity graphs are far from resolved. Employing two billion-scale academic entity graphs (Microsoft Academic Graph and AMiner) as sources for our study, we propose a unified framework, LinKG, to address the problem of building a large-scale linked entity graph. LinKG is coupled with three linking modules, each of which addresses one category of entities. To link word-sequence-based entities (e.g., venues), we present a long short-term memory network-based method for capturing the dependencies. To link large-scale entities (e.g., papers), we leverage locality-sensitive hashing and convolutional neural networks for scalable and precise linking. To link entities with ambiguity (e.g., authors), we propose heterogeneous graph attention networks to model different types of entities. Our extensive experiments and systematic analysis demonstrate that LinKG can achieve linking accuracy with an F1-score of 0.9510, significantly outperforming the state-of-the-art. LinKG has been deployed to Microsoft Academic Search and AMiner to integrate the two large graphs. We have published the linked results as the Open Academic Graph (OAG, https://www.openacademic.ai/oag/), making it the largest publicly available heterogeneous academic graph to date.
Smart reply systems have been developed for various messaging platforms. In this paper, we introduce Uber's smart reply system, one-click-chat (OCC), which is a key enhanced feature on top of the Uber in-app chat system. It enables driver-partners to quickly respond to rider messages using smart replies. The smart replies are dynamically selected according to conversation content using machine learning algorithms. Our system consists of two major components, intent detection and reply retrieval, which makes it very different from standard smart reply systems where the task is to directly predict a reply. It is designed specifically for mobile applications with short and non-canonical messages. Reply retrieval utilizes pairings between intent and reply based on their popularity in chat messages as derived from historical data. For intent detection, we experimented with a set of embedding and classification techniques, and we chose to deploy a solution using unsupervised distributed embeddings and a nearest-neighbor classifier. It has the advantages of requiring only a small amount of labeled training data, being simple to develop and deploy to production, and offering fast inference during serving, which makes it highly scalable. At the same time, it performs comparably with deep learning architectures such as a word-level convolutional neural network. Overall, the system achieves a high accuracy of 76% on intent detection. Currently, the system is deployed in production for English-speaking countries, and 71% of in-app communications between riders and driver-partners adopted the smart replies to speed up the communication process.
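The two-stage design described above (nearest-neighbor intent detection followed by popularity-based reply retrieval) can be sketched roughly as follows. This is an illustrative sketch only: the toy vectors, intent labels, reply strings, and the function names `detect_intent` and `top_replies` are all hypothetical, and cosine similarity is an assumed choice of distance.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def detect_intent(message_vec, labeled_examples):
    """1-nearest-neighbor intent detection.

    labeled_examples is a list of (embedding, intent) pairs; the intent of
    the most similar labeled example is returned.
    """
    best = max(labeled_examples, key=lambda ex: cosine(message_vec, ex[0]))
    return best[1]


def top_replies(intent, reply_counts, k=2):
    """Return the k most popular replies for an intent.

    reply_counts maps intent -> {reply text: historical count}. Ties are
    broken alphabetically so the output is deterministic.
    """
    counts = reply_counts.get(intent, {})
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [reply for reply, _ in ranked[:k]]
```

A caller would embed the incoming rider message, pass the vector to `detect_intent`, and show the driver-partner the strings returned by `top_replies` for that intent.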
In manufacturing, a golden batch is an idealized realization of the perfect process to produce the desired item, typically represented as a multidimensional time series of pressures, temperatures, flow-rates and so forth. The golden batch is sometimes produced from first-principle models, but it is typically created by recording a batch produced by the most experienced engineers on carefully cleaned and calibrated machines. In most cases, the golden batch is only used in post-mortem analysis of a product with an unexpectedly inferior quality, as plant managers attempt to understand where and when the last production attempt went wrong. In this work, we make two contributions to golden batch processing. We introduce an online algorithm that allows practitioners to understand if the process is currently deviating from the golden batch in real-time, allowing engineers to intervene and potentially save the batch. This may be done, for example, by cooling a boiler that is running unexpectedly hot. In addition, we show that our ideas can greatly expand the purview of golden batch monitoring beyond industrial manufacturing. In particular, we show that golden batch monitoring can be used for anomaly detection, attention focusing, and personalized training/skill assessment in a host of novel domains.
Online purchase forecasting is of great importance in e-commerce platforms, as it is the basis for presenting personalized lists of interesting products to individual customers. However, predicting online purchases is not trivial, as it is influenced by many factors, including (i) complex temporal patterns with hierarchical inter-correlations and (ii) arbitrary category dependencies. To address these factors, we develop a Graph Multi-Scale Pyramid Networks (GMP) framework to fully exploit users' latent behavioral patterns with both multi-scale temporal dynamics and arbitrary inter-dependencies among product categories. In GMP, we first design a multi-scale pyramid modulation network architecture which seamlessly preserves the underlying hierarchical temporal factors governing users' purchase behaviors. Then, we employ a convolutional recurrent neural network to encode the categorical temporal pattern at each scale. After that, we develop a resolution-wise recalibration gating mechanism to automatically re-weight the importance of each scale-view representation. Finally, a context-graph neural network module is proposed to adaptively uncover complex dependencies among category-specific purchases. Extensive experiments on real-world e-commerce datasets demonstrate the superior performance of our method over state-of-the-art baselines across various settings.
The purpose of this study is to introduce new design-criteria for next-generation hyperparameter optimization software. The criteria we propose include (1) define-by-run API that allows users to construct the parameter search space dynamically, (2) efficient implementation of both searching and pruning strategies, and (3) easy-to-setup, versatile architecture that can be deployed for various purposes, ranging from scalable distributed computing to light-weight experiment conducted via interactive interface. In order to prove our point, we will introduce Optuna, an optimization software which is a culmination of our effort in the development of a next generation optimization software. As an optimization software designed with define-by-run principle, Optuna is particularly the first of its kind. We will present the design-techniques that became necessary in the development of the software that meets the above criteria, and demonstrate the power of our new design through experimental results and real world applications. Our software is available under the MIT license (https://github.com/pfnet/optuna/).
We study a novel problem of sponsored search (SS) for E-Commerce platforms: how we can attract query users to click product advertisements (ads) by presenting them features of products that attract them. This not only benefits merchants and the platform, but also improves user experience. The problem is challenging due to the following reasons: (1) We need to carefully manipulate the ad content without affecting user search experience. (2) It is difficult to obtain users' explicit feedback of their preference in product features. (3) Nowadays, a great portion of the search traffic in E-Commerce platforms is from their mobile apps (e.g., nearly 90% in Taobao). The situation would get worse in the mobile setting due to limited space. We are focused on the mobile setting and propose to manipulate ad titles by adding a few selling point keywords (SPs) to attract query users. We model it as a personalized attractive SP prediction problem and carry out both large-scale offline evaluation and online A/B tests in Taobao. The contributions include: (1) We explore various exhibition schemes of SPs. (2) We propose a surrogate of user explicit feedback for SP preference. (3) We also explore multi-task learning and various additional features to boost the performance. A variant of our best model has already been deployed in Taobao, leading to a 2% increase in revenue per thousand impressions and an opt-out rate of merchants less than 4%.
Personalization in marketing aims at improving the shopping experience of customers by tailoring services to individuals. In order to achieve this, businesses must be able to make personalized predictions regarding the next purchase. That is, one must forecast the exact list of items that will comprise the next purchase, i.e., the so-called market basket. Despite its relevance to firm operations, this problem has received surprisingly little attention in prior research, largely due to its inherent complexity. In fact, state-of-the-art approaches are limited to intuitive decision rules for pattern extraction, so that repeat purchases or co-purchases can be identified. However, the simplicity of the pre-coded rules impedes performance, since decision rules operate in an autoregressive fashion: the rules can only make inferences from past purchases of a single customer without taking into account the knowledge transfer that takes place between customers.
In contrast, our research overcomes the limitations of pre-set rules by contributing a novel predictor of market baskets from sequential purchase histories: our predictions are based on similarity matching in order to identify similar purchase habits among the complete shopping histories of all customers. Our contributions are as follows: (1) We propose similarity matching based on subsequential dynamic time warping (SDTW) as a novel predictor of market baskets. Thereby, we can effectively identify cross-customer patterns. (2) We leverage the Wasserstein distance for measuring the similarity among embedded purchase histories. If desired, this can further be interpreted as a proxy to the prediction quality. (3) We develop a fast approximation algorithm for computing a lower bound of the Wasserstein distance in our setting. An extensive series of computational experiments demonstrates the effectiveness of our approach. The accuracy of identifying the exact market baskets based on state-of-the-art decision rules from the literature is outperformed by a factor of 4.0. This contributes to a further personalization in the provision of retail services. The actual use cases are widespread and include making customers tailored offerings in marketing, extending recommender systems for the purpose of suggesting personalized market baskets, and triggering product deliveries before purchase in order to accelerate delivery.
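Subsequence dynamic time warping finds the best-matching stretch of a long sequence for a shorter query, which is the matching idea named in contribution (1). The following is a minimal textbook sketch, not the paper's implementation: it works on plain numbers and uses the absolute difference as the local cost, standing in for the Wasserstein distance between embedded baskets that the paper uses. The function name `subsequence_dtw` is hypothetical.

```python
def subsequence_dtw(query, series, dist=lambda a, b: abs(a - b)):
    """Best alignment of `query` against any contiguous part of `series`.

    Returns (cost, end_index), where end_index is the position in `series`
    at which the best-matching subsequence ends. Unlike ordinary DTW, the
    match may start anywhere in `series` at no cost.
    """
    if not query or not series:
        raise ValueError("query and series must be non-empty")
    m = len(series)
    inf = float("inf")
    # Row 0 is all zeros: the match is free to start at any column.
    prev = [0.0] * (m + 1)
    for q_item in query:
        cur = [inf] * (m + 1)  # cur[0] stays inf: the query cannot match nothing
        for j in range(1, m + 1):
            step = dist(q_item, series[j - 1])
            cur[j] = step + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    # The match is also free to end at any column of the last row.
    best_col = min(range(1, m + 1), key=lambda j: prev[j])
    return prev[best_col], best_col - 1
```

In a market-basket setting, `query` would be the recent purchase history of one customer and `series` the full history of another; the basket that follows `end_index` in the best-matching history is then a natural candidate prediction.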
Text embedding is a fundamental component for extracting text features in production-level data mining and machine learning systems, given that textual information is the most ubiquitous signal. However, practitioners often face a tradeoff between the effectiveness of the underlying embedding algorithms and the cost of training and maintaining various embedding results in large-scale applications. In this paper, we propose a multitask text embedding solution called PinText for three major vertical surfaces in Pinterest (homefeed, related pins, and search), which consolidates existing text embedding algorithms into a single solution and produces state-of-the-art performance. Specifically, we learn word-level semantic vectors by enforcing that the similarity between positive engagement pairs is larger than the similarity between randomly sampled background pairs. Based on the learned semantic vectors, we derive the embedding vector of a user, a pin, or a search query by simply averaging its word-level vectors. In this common compact vector space, we are able to do unified nearest neighbor search with hashing, using Hadoop jobs or dockerized images on a Kubernetes cluster. Both offline evaluation and online experiments show the effectiveness of the PinText system, which also significantly reduces the storage cost of maintaining multiple open-sourced embeddings.
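The averaging step and the pairwise training constraint described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the use of cosine similarity, and the hinge form with a `margin` parameter are not taken from the paper, which states only that positive pairs should score higher than background pairs.

```python
import math


def average_embedding(tokens, word_vectors):
    """Embed a user, pin, or query as the mean of its known word vectors.

    Tokens missing from `word_vectors` are skipped; returns None when no
    token is known.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ranking_hinge(sim_positive, sim_background, margin=0.5):
    """Assumed hinge loss: zero once the positive pair beats the background
    pair by at least `margin`, positive otherwise."""
    return max(0.0, margin - sim_positive + sim_background)
```

Because users, pins, and queries all land in the same vector space, a single nearest-neighbor index over these averaged vectors can serve all three surfaces.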
Increasing demand for fashion recommendation raises a lot of challenges for online shopping platforms and fashion communities. In particular, there exist two requirements for fashion outfit recommendation: the Compatibility of the generated fashion outfits, and the Personalization in the recommendation process. In this paper, we demonstrate these two requirements can be satisfied via building a bridge between outfit generation and recommendation. Through large data analysis, we observe that people have similar tastes in individual items and outfits. Therefore, we propose a Personalized Outfit Generation (POG) model, which connects user preferences regarding individual items and outfits with Transformer architecture. Extensive offline and online experiments provide strong quantitative evidence that our method outperforms alternative methods regarding both compatibility and personalization metrics. Furthermore, we deploy POG on a platform named Dida in Alibaba to generate personalized outfits for the users of the online application iFashion.
This work represents a first step towards an industrial-scale fashion outfit generation and recommendation solution, which goes beyond generating outfits based on explicit queries, or merely recommending from existing outfit pools. As part of this work, we release a large-scale dataset consisting of 1.01 million outfits with rich context information, and 0.28 billion user click actions from 3.57 million users. To the best of our knowledge, this dataset is the largest, publicly available, fashion related dataset, and the first to provide user behaviors relating to both outfits and fashion items.
Click-through rate (CTR) prediction is critical for industrial applications such as recommender systems and online advertising. In practice, mining user interest from rich historical behavior data plays an important role in CTR modeling for these applications. Driven by the development of deep learning, deep CTR models with ingeniously designed architectures for user interest modeling have been proposed, bringing remarkable improvements in model performance on offline metrics. However, great effort is needed to deploy these complex models in an online serving system for real-time inference under massive traffic. Matters become more difficult with long sequential user behavior data, as the system latency and storage cost increase approximately linearly with the length of the user behavior sequence.
In this paper, we face directly the challenge of long sequential user behavior modeling and introduce our hands-on practice with the co-design of machine learning algorithm and online serving system for CTR prediction task. (i) From serving system view, we decouple the most resource-consuming part of user interest modeling from the entire model by designing a separate module named UIC (User Interest Center). UIC maintains the latest interest state for each user, whose update depends on realtime user behavior trigger event, rather than on traffic request. Hence UIC is latency free for realtime CTR prediction. (ii) From machine learning algorithm view, we propose a novel memory-based architecture named MIMN (Multi-channel user Interest Memory Network) to capture user interests from long sequential behavior data, achieving superior performance over state-of-the-art models. MIMN is implemented in an incremental manner with UIC module.
Theoretically, the co-design solution of UIC and MIMN enables us to handle the user interest modeling with unlimited length of sequential behavior data. Comparison between model performance and system efficiency proves the effectiveness of proposed solution. To our knowledge, this is one of the first industrial solutions that are capable of handling long sequential user behavior data with length scaling up to thousands. It now has been deployed in the display advertising system in Alibaba.
Precipitation nowcasting is a short-range forecast of rain/snow (up to 2 hours), often displayed on top of the geographical map by the weather service. Modern precipitation nowcasting algorithms rely on the extrapolation of observations by ground-based radars via optical flow techniques or neural network models. Dependent on these radars, typical nowcasting is limited to the regions around their locations. We have developed a method for precipitation nowcasting based on geostationary satellite imagery and incorporated the resulting data into the Yandex.Weather precipitation map (including an alerting service with push notifications for products in the Yandex ecosystem), thus expanding its coverage and paving the way to a truly global nowcasting service.
Conversion prediction plays an important role in online advertising since Cost-Per-Action (CPA) has become one of the primary campaign performance objectives in the industry. Unlike click prediction, conversions have different types in nature, and each type may be associated with different decisive factors. In this paper, we formulate conversion prediction as a multi-task learning problem, so that the prediction models for different types of conversions can be learned together. These models share feature representations, but have their specific parameters, providing the benefit of information-sharing across all tasks. We then propose Multi-Task Field-weighted Factorization Machine (MT-FwFM) to solve these tasks jointly. Our experiment results show that, compared with two state-of-the-art models, MT-FwFM improves the AUC by 0.74% and 0.84% on two types of conversions, and the weighted AUC across all conversion types is also improved by 0.50%.
Progress on the UN Sustainable Development Goals (SDGs) is hampered by a persistent lack of data regarding key social, environmental, and economic indicators, particularly in developing countries. For example, data on poverty - the first of seventeen SDGs - is both spatially sparse and infrequently collected in Sub-Saharan Africa due to the high cost of surveys. Here we propose a novel method for estimating socioeconomic indicators using open-source, geolocated textual information from Wikipedia articles. We demonstrate that modern NLP techniques can be used to predict community-level asset wealth and education outcomes using nearby geolocated Wikipedia articles. When paired with nightlights satellite imagery, our method outperforms all previously published benchmarks for this prediction task, indicating the potential of Wikipedia to inform both research in the social sciences and future policy decisions.
Predicting Evacuation Decisions using Representations of Individuals' Pre-Disaster Web Search Behavior
Predicting the evacuation decisions of individuals before the disaster strikes is crucial for planning first response strategies. In addition to the studies on post-disaster analysis of evacuation behavior, there are various works that attempt to predict the evacuation decisions beforehand. Most of these predictive methods, however, require real time location data for calibration, which are becoming much harder to obtain due to the rising privacy concerns. Meanwhile, web search queries of anonymous users have been collected by web companies. Although such data raise less privacy concerns, they have been under-utilized for various applications. In this study, we investigate whether web search data observed prior to the disaster can be used to predict the evacuation decisions. More specifically, we utilize a session-based query encoder that learns the representations of each user's web search behavior prior to evacuation. Our proposed approach is empirically tested using web search data collected from users affected by a major flood in Japan. Results are validated using location data collected from mobile phones of the same set of users as ground truth. We show that evacuation decisions can be accurately predicted (84%) using only the users' pre-disaster web search data as input. This study proposes an alternative method for evacuation prediction that does not require highly sensitive location data, which can assist local governments to prepare effective first response strategies.
Health research has an increasing focus on promoting well-being and positive mental health, to prevent disease and to more effectively treat disorders. The availability of rich multi-modal datasets and advances in machine learning methods are now enabling data science research to begin to objectively assess well-being. However, most existing studies focus on detecting the current state or predicting the future state of well-being using stand-alone health behaviors. There is a need for methods that can handle a complex combination of health behaviors, as arise in real-world data.
In this paper, we present a framework to 1) map multi-modal messy data collected in the "wild" to meaningful feature representations of health behavior, and 2) uncover latent patterns comprising multiple health behaviors that best predict well-being. We show how to use supervised latent Dirichlet allocation (sLDA) to model the observed behaviors, and we apply variational inference to uncover the latent patterns. Implementing and evaluating the model on 5,397 days of data from a group of 244 college students, we find that these latent patterns are indeed predictive of self-reported stress, one of the largest components affecting well-being. We investigate the modifiable behaviors present in these patterns and uncover some ways in which the factors work together to influence well-being. This work contributes a new method using objective data analysis to help individuals monitor their well-being using real-world measurements. Insights from this study advance scientific knowledge on how combinations of daily modifiable human behaviors relate to human well-being.
In this paper, we propose a novel end-to-end approach for AI-assisted code completion called Pythia. It generates ranked lists of method and API recommendations which can be used by software developers at edit time. The system is currently deployed as part of Intellicode extension in Visual Studio Code IDE. Pythia exploits state-of-the-art large-scale deep learning models trained on code contexts extracted from abstract syntax trees. It is designed to work at a high throughput predicting the best matching code completions on the order of 100 ms.
We describe the architecture of the system, perform comparisons to frequency-based approach and invocation-based Markov Chain language model, and discuss challenges serving Pythia models on lightweight client devices.
The offline evaluation results obtained on 2700 Python open source software GitHub repositories show a top-5 accuracy of 92%, surpassing the baseline models by 20% averaged over classes, for both intra and cross-project settings.
The two most common ways to activate intelligent voice assistants (IVAs) are button presses and trigger phrases. This paper describes a new way to invoke IVAs on smartwatches: simply raise your hand and speak naturally. To achieve this experience, we designed an accurate, low-power detector that works across a wide range of environments and activity scenarios with minimal impact on battery life, memory footprint, and processor utilization. The raise to speak (RTS) detector consists of four main components: an on-device gesture convolutional neural network (CNN) that uses accelerometer data to detect specific poses; an on-device speech CNN to detect proximal human speech; a policy model to combine signals from the motion and speech detectors; and an off-device false trigger mitigation (FTM) system to reduce unintentional invocations triggered by the on-device detector. The majority of the components of the detector run on-device to preserve user privacy. The RTS detector was released in watchOS 5.0 and is running on millions of devices worldwide.
Web-based services often run randomized experiments to improve their products. A popular way to run these experiments is to use geographical regions as units of experimentation, since this does not require tracking of individual users or browser cookies. Since users may issue queries from multiple geographical locations, geo-regions cannot be considered independent and interference may be present in the experiment. In this paper, we study this problem, and first present GeoCUTS, a novel algorithm that forms geographical clusters to minimize interference while preserving balance in cluster size. We use a random sample of anonymized traffic from Google Search to form a graph representing user movements, then construct a geographically coherent clustering of the graph. Our main technical contribution is a statistical framework to measure the effectiveness of clusterings. Furthermore, we perform empirical evaluations showing that the performance of GeoCUTS is comparable to hand-crafted geo-regions with respect to both novel and existing metrics.
Genealogy research is the study of family history using available resources such as historical records. Ancestry provides its customers with one of the world's largest online genealogical indexes, with billions of records from a wide range of sources, including vital records such as birth and death certificates, census records, and court and probate records, among many others. Search at Ancestry aims to return relevant records from various record types, allowing our subscribers to build their family trees, research their family history, and make meaningful discoveries about their ancestors from diverse perspectives.
In a modern search engine designed for genealogical study, the appropriate ranking of search results to provide highly relevant information represents a daunting challenge. In particular, the disparity in historical records makes it inherently difficult to score records in an equitable fashion. Herein, we provide an overview of our solutions to overcome such record disparity problems in the Ancestry search engine. Specifically, we introduce customized coordinate ascent (customized CA) to speed up ranking within a specific record type. We then propose stochastic search (SS) that linearly combines ranked results federated across contents from various record types. Furthermore, we propose a novel information retrieval metric, normalized cumulative entropy (NCE), to measure the diversity of results. We demonstrate the effectiveness of these two algorithms in terms of relevance (by NDCG) and diversity (by NCE) if applicable in the offline experiments using real customer data at Ancestry.
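The relevance metric named above, NDCG, can be sketched as follows. This is a minimal illustration of the standard linear-gain definition; the abstract does not say which gain variant was used, and the function names `dcg` and `ndcg` are chosen here for illustration. NCE is not sketched because the abstract does not define it.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    # rank is 0-based, so the discount for the top result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    # By convention here, a list with no relevant results scores 0.
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A list that is already sorted by relevance scores 1.0, and pushing a relevant record further down the list lowers the score.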
Deep learning models play an increasingly important role in content recommender systems. However, although recommendation performance has greatly improved, the "Matthew effect" has become increasingly evident. While head content gets more and more popular, many competitive long-tail items struggle to achieve timely exposure because they lack behavior features. This issue has badly impacted the quality and diversity of recommendations. To solve this problem, a look-alike algorithm is a good choice for extending the audience of high-quality long-tail content. But the traditional look-alike models widely used in online advertising are not suitable for recommender systems because of the strict requirements on both real-time performance and effectiveness. This paper introduces a real-time attention-based look-alike model (RALM) for recommender systems, which tackles the conflict between real-time performance and effectiveness. RALM achieves real-time look-alike audience extension through seeds-to-user similarity prediction and improves effectiveness by optimizing user representation learning and look-alike learning. For user representation learning, we propose a novel neural network structure named the attention merge layer to replace the concatenation layer, which significantly improves the expressive ability of multi-field feature learning. In addition, considering the varied members of the seeds, we design a global attention unit and a local attention unit to learn a robust and adaptive seeds representation with respect to a given target user. Finally, we introduce a seeds clustering mechanism which not only reduces the time complexity of attention unit prediction but also minimizes the loss of seeds information. According to our experiments, RALM achieves superior effectiveness and performance compared with popular look-alike models.
RALM has been successfully deployed in "Top Stories" Recommender System of WeChat, leading to great improvement on diversity and quality of recommendations. As far as we know, this is the first real-time look-alike model applied in recommender systems.
Social networks are quickly becoming the primary medium for discussing what is happening around real-world events. The information that is generated on social platforms like Twitter can produce rich data streams for immediate insights into ongoing matters and the conversations around them. To tackle the problem of event detection, we model events as a list of clusters of trending entities over time. We describe a real-time system for discovering events that is modular in design and novel in scale and speed: it applies clustering on a large stream with millions of entities per minute and produces a dynamically updated set of events. In order to assess clustering methodologies, we build an evaluation dataset derived from a snapshot of the full Twitter Firehose and propose novel metrics for measuring clustering quality. Through experiments and system profiling, we highlight key results from the offline and online pipelines. Finally, we visualize a high profile event on Twitter to show the importance of modeling the evolution of events, especially those detected from social data streams.
Billions of people use smartphones every day, and they often face problems with both the hardware and the software. Such problems lead to frustrated users and low customer satisfaction. Developing an automatic machine learning-based solution that detects that the user has a problem and engages in troubleshooting has the potential to significantly improve customer satisfaction and retention. Here, we design and implement a system that, based on the user's smartphone activity, detects that the user has a problem and requires help. The system then helps with troubleshooting by recommending possible solutions to the identified problem. We train our system on large-scale customer support center data and show that it can both detect that a user has a problem and predict the category of the problem (89.7% accuracy), and quickly provide a solution (in 10.4ms). Our system has been deployed in commercial service since January 2019. Online evaluation results show that the machine learning based approach outperforms the existing method by approximately 30% in user problem-solving rate.
The limited attentional resource of users is a bottleneck to delivery of push notifications in today's mobile and ubiquitous computing environments. Adaptive mobile notification scheduling, which detects opportune timings based on mobile sensing and machine learning, has been proposed as a way of alleviating this problem. However, it is still not clear if such adaptive notifications are effective in a large-scale product deployment with real-world situations and configurations, such as users' context changes, personalized content in notifications, and sudden external factors that users commonly experience (such as breaking news). In this paper, we construct a new interruptibility estimation and adaptive notification scheduling system with redesigned technical components. From a deployment study of the system in the real product stack of the Yahoo! JAPAN Android application and an evaluation with 382,518 users over 28 days, we confirmed several significant results, including a maximum 60.7% increase in users' click rate, a 10 times larger gain compared to the previous system, a significantly better gain for personalized notification content, and unexpectedly better performance in a situation with exceptional breaking news notifications. With these results, the proposed system has officially been deployed and enabled for all users of the Yahoo! JAPAN product environment, where more than 10 million Android app users benefit from it.
Bidding in real-time auctions can be a difficult stochastic control task; especially in a very uncertain market and if underdelivery incurs strong penalties. Most current works and implementations focus on optimally delivering a campaign given a reasonable forecast of the market. Practical implementations have a feedback loop to adjust and be robust to forecasting errors, but no implementation, to the best of our knowledge, uses a model of market risk and actively anticipates market shifts. Solving such stochastic control problems in practice is actually very challenging and an approximate solution based on a Recurrent Neural Network (RNN) architecture is both effective and practical for implementation in a production environment. The RNN bidder provisions everything it needs to avoid underdelivery. It also deliberately falls short of its goal when buying the missing impressions would cost more than the penalty for not reaching it.
Recommender systems play a crucial role in our daily lives. The feed streaming mechanism has been widely used in recommender systems, especially in mobile apps. The feed streaming setting offers users an interactive form of recommendation in never-ending feeds. In such a setting, a good recommender system should pay more attention to user stickiness, which goes far beyond classical instant metrics and is typically measured by long-term user engagement. Directly optimizing long-term user engagement is a non-trivial problem, as the learning target is usually not available for conventional supervised learning methods. Though reinforcement learning (RL) naturally fits the problem of maximizing long-term rewards, applying RL to optimize long-term user engagement still faces challenges: user behaviors are diverse and hard to model, typically consisting of both instant feedback (e.g., clicks) and delayed feedback (e.g., dwell time, revisits); in addition, effective off-policy learning is still immature, especially when combining bootstrapping and function approximation. To address these issues, in this work we introduce an RL framework, FeedRec, to optimize long-term user engagement. FeedRec includes two components: 1) a Q-Network, designed as a hierarchical LSTM, which models complex user behaviors, and 2) an S-Network, which simulates the environment, assists the Q-Network, and avoids instability of convergence in policy learning. Extensive experiments on synthetic data and a real-world large-scale dataset show that FeedRec effectively optimizes long-term user engagement and outperforms state-of-the-art methods.
The revenue of online display advertising in the U.S. is projected to be 7.9 billion U.S. dollars by 2022. One main channel for display advertising is real-time bidding (RTB). In RTB, an ad exchange runs a second-price auction among multiple advertisers to sell each ad impression. Publishers usually set a reserve price, the lowest price acceptable for an ad impression. If there are bids higher than the reserve price, then the revenue is the higher of the reserve price and the second-highest bid; otherwise, the revenue is zero. Thus, a higher reserve price can potentially increase revenue, but it carries a higher risk. In this paper, we study the problem of estimating the failure rate of a reserve price, i.e., the probability that a reserve price fails to be outbid. The solution to this problem has managerial implications for publishers, who need to set appropriate reserve prices in order to minimize risk and optimize expected revenue. This problem is highly challenging since most publishers do not know the historical highest bidding prices offered by RTB advertisers. To address this problem, we develop a parametric survival model for reserve price failure rate prediction. The model is further improved by considering user and page interactions, and header bidding information. The experimental results demonstrate the effectiveness of the proposed approach.
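The revenue rule stated in this abstract can be written down directly. The sketch below is only an illustration of that rule and of an empirical failure rate computed from fully observed bids; it is not the parametric survival model the paper proposes, which targets the setting where the highest bids are not observed. The function names are hypothetical.

```python
from typing import Sequence


def publisher_revenue(bids: Sequence[float], reserve: float) -> float:
    """Revenue for one impression in a second-price auction with a reserve price."""
    ranked = sorted(bids, reverse=True)
    # The reserve fails when no bid is strictly higher than it: nothing is sold.
    if not ranked or ranked[0] <= reserve:
        return 0.0
    # With a single bidder there is no second bid, so the reserve sets the price.
    second_highest = ranked[1] if len(ranked) > 1 else 0.0
    return max(reserve, second_highest)


def empirical_failure_rate(auctions: Sequence[Sequence[float]], reserve: float) -> float:
    """Fraction of auctions in which the reserve price is not outbid."""
    failures = sum(1 for bids in auctions if not bids or max(bids) <= reserve)
    return failures / len(auctions)
```

Raising `reserve` can raise the price paid when the second-highest bid is low, but it also raises the share of auctions that return zero, which is the trade-off the abstract describes.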
Industry devices (i.e., entities) such as server machines, spacecraft, engines, etc., are typically monitored with multivariate time series, whose anomaly detection is critical for an entity's service quality management. However, due to the complex temporal dependence and stochasticity of multivariate time series, their anomaly detection remains a big challenge. This paper proposes OmniAnomaly, a stochastic recurrent neural network for multivariate time series anomaly detection that works robustly for various devices. Its core idea is to capture the normal patterns of multivariate time series by learning their robust representations with key techniques such as stochastic variable connection and planar normalizing flow, reconstruct input data from the representations, and use the reconstruction probabilities to determine anomalies. Moreover, for a detected entity anomaly, OmniAnomaly can provide interpretations based on the reconstruction probabilities of its constituent univariate time series. The evaluation experiments are conducted on two public datasets from aerospace and a new server machine dataset (collected and released by us) from an Internet company. OmniAnomaly achieves an overall F1-Score of 0.86 across three real-world datasets, significantly outperforming the best performing baseline method by 0.09. The interpretation accuracy for OmniAnomaly is up to 0.89.
Satellite-based positioning systems such as GPS often suffer from a large amount of noise that dramatically degrades positioning accuracy, especially in real-time applications. In this work, we consider a data-mining approach to enhance the GPS signal. We build a large-scale high-precision GPS receiver grid system to collect real-time GPS signals for training. Gaussian Process (GP) regression is chosen to model the vertical Total Electron Content (vTEC) distribution of the Earth's ionosphere. Our experiments show that the noise in the real-time GPS signals often exceeds the breakdown point of conventional robust regression methods, resulting in sub-optimal system performance. We propose a three-step approach to address this challenge. In the first step, we perform a set of signal validity tests to separate the signals into clean and dirty groups. In the second step, we train an initial model on the clean signals and then reweight the dirty signals based on the residual error. In the third step, a final model is retrained on both the clean signals and the reweighted dirty signals. In the theoretical analysis, we prove that the proposed three-step approach is able to tolerate a much higher noise level than vanilla robust regression methods if two reweighting rules are followed. We validate the superiority of the proposed method in our real-time high-precision positioning system against several popular state-of-the-art robust regression methods. Our method achieves centimeter positioning accuracy in the benchmark region with probability 78.4%, outperforming the second-best baseline method by a margin of 8.3%. The benchmark takes 6 hours on 20,000 CPU cores or 14 years on a single CPU.
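The clean/dirty split, residual-based reweighting, and final refit described in this abstract can be illustrated on a toy problem. The sketch below makes several assumptions that are not in the abstract: it swaps the Gaussian Process for a one-dimensional weighted least-squares line, it takes the clean/dirty split as given rather than running validity tests, and the weight formula `1 / (1 + (residual / scale)^2)` is a hypothetical stand-in for the paper's two reweighting rules.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def weighted_line_fit(points: Sequence[Point], weights: Sequence[float]) -> Tuple[float, float]:
    """Weighted least-squares fit of y = slope * x + intercept."""
    total = sum(weights)
    mean_x = sum(w * x for (x, _), w in zip(points, weights)) / total
    mean_y = sum(w * y for (_, y), w in zip(points, weights)) / total
    sxy = sum(w * (x - mean_x) * (y - mean_y) for (x, y), w in zip(points, weights))
    sxx = sum(w * (x - mean_x) ** 2 for (x, _), w in zip(points, weights))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def three_step_fit(clean: Sequence[Point], dirty: Sequence[Point],
                   scale: float = 1.0) -> Tuple[float, float]:
    # Step 1 (assumed already done): validity tests produced `clean` and `dirty`.
    # Step 2: fit an initial model on the clean points only ...
    slope, intercept = weighted_line_fit(clean, [1.0] * len(clean))
    # ... then down-weight each dirty point by its residual under that model.
    dirty_weights: List[float] = [
        1.0 / (1.0 + ((y - (slope * x + intercept)) / scale) ** 2) for x, y in dirty
    ]
    # Step 3: refit on clean points (weight 1) plus reweighted dirty points.
    all_points = list(clean) + list(dirty)
    return weighted_line_fit(all_points, [1.0] * len(clean) + dirty_weights)
```

Dirty points that agree with the initial model keep a weight near 1 and still contribute information, while gross outliers are almost ignored in the final fit.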
Railway points are among the key components of railway infrastructure. As a part of signal equipment, points control the routes of trains at railway junctions, having a significant impact on the reliability, capacity, and punctuality of rail transport. Meanwhile, they are also one of the most fragile parts in railway systems. Points failures cause a large portion of railway incidents. Traditionally, maintenance of points is based on a fixed time interval or raised after the equipment failures. Instead, it would be of great value if we could forecast points' failures and take action beforehand, minimising any negative effect. To date, most of the existing prediction methods are either lab-based or relying on specially installed sensors which makes them infeasible for large-scale implementation. Besides, they often use data from only one source. We, therefore, explore a new way that integrates multi-source data which are ready to hand to fulfil this task. We conducted our case study based on Sydney Trains rail network which is an extensive network of passenger and freight railways. Unfortunately, the real-world data are usually incomplete due to various reasons, e.g., faults in the database, operational errors or transmission faults. Besides, railway points differ in their locations, types and some other properties, which means it is hard to use a unified model to predict their failures. Aiming at this challenging task, we firstly constructed a dataset from multiple sources and selected key features with the help of domain experts. In this paper, we formulate our prediction task as a multiple kernel learning problem with missing kernels. We present a robust multiple kernel learning algorithm for predicting points failures. Our model takes into account the missing pattern of data as well as the inherent variance on different sets of railway points. Extensive experiments demonstrate the superiority of our algorithm compared with other state-of-the-art methods.
Seasonal-adjustment Based Feature Selection Method for Predicting Epidemic with Large-scale Search Engine Logs
Search engine logs have great potential for tracking and predicting outbreaks of infectious disease. More precisely, one can use the search volume of some search terms to predict the infection rate of an infectious disease in nearly real time. However, accurate and stable prediction of outbreaks using search engine logs is a challenging task due to the following two-way instability of the search logs. First, the search volume of a search term may change irregularly in the short term, for example, due to environmental factors such as the amount of media or news coverage. Second, the search volume may also change in the long term due to demographic change among the search engine's users. That is to say, if a model is trained on such search logs while ignoring these characteristics, the resulting predictions will contain serious errors when these changes occur. In this work, we propose a novel feature selection method to overcome this instability problem. In particular, we employ a seasonal-adjustment method that decomposes each time series into three components (seasonal, trend, and irregular) and build prediction models for each component individually. We also carefully design a feature selection method to select proper search terms to predict each component. We conducted comprehensive experiments on ten different kinds of infectious diseases. The experimental results show that the proposed method outperforms all comparative methods in prediction accuracy for seven of ten diseases, in both now-casting and forecasting settings. Also, the proposed method is more successful in selecting search terms that are semantically related to target diseases.
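The split into seasonal, trend, and irregular components can be sketched with a classical additive decomposition. This is an assumption for illustration only: the abstract does not name the seasonal-adjustment procedure the authors used, and the function `decompose` and its centered moving-average trend are chosen here as the simplest textbook variant.

```python
from typing import List, Optional, Sequence, Tuple


def decompose(series: Sequence[float], period: int
              ) -> Tuple[List[Optional[float]], List[float], List[Optional[float]]]:
    """Additive decomposition: series = trend + seasonal + irregular."""
    n = len(series)
    half = period // 2
    trend: List[Optional[float]] = [None] * n
    # Trend: centered moving average; undefined (None) near both ends.
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 1:
            trend[t] = sum(window) / period
        else:
            # Even period: half weight on the two end points of the window.
            trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
    # Seasonal: mean detrended value for each phase of the cycle.
    sums = [0.0] * period
    counts = [0] * period
    for t in range(n):
        if trend[t] is not None:
            sums[t % period] += series[t] - trend[t]
            counts[t % period] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    offset = sum(means) / period  # center the seasonal effects around zero
    seasonal = [means[t % period] - offset for t in range(n)]
    # Irregular: whatever trend and seasonal components do not explain.
    irregular = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
                 for t in range(n)]
    return trend, seasonal, irregular
```

Each component could then be fed to its own predictor, which is the structure the abstract describes; the predictors themselves are outside this sketch.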
This paper introduces Seeker, a system that allows users to adaptively refine search rankings in real time, through a series of feedbacks in the form of likes and dislikes. When searching online, users may not know how to accurately describe their product of choice in words. An alternative approach is to search an embedding space, allowing the user to query using a representation of the item (like a tune for a song, or a picture for an object). However, this approach requires the user to possess an example representation of their desired item. Additionally, most current search systems do not allow the user to dynamically adapt the results with further feedback. On the other hand, users often have a mental picture of the desired item and are able to answer ordinal questions of the form: "Is this item similar to what you have in mind?" With this assumption, our algorithm allows for users to provide sequential feedback on search results to adapt the search feed. We show that our proposed approach works well both qualitatively and quantitatively. Unlike most previous representation-based search systems, we can quantify the quality of our algorithm by evaluating humans-in-the-loop experiments.
We study the problem of semantic matching in product search, that is, given a customer query, retrieve all semantically related products from the catalog. Pure lexical matching via an inverted index falls short in this respect due to several factors: a) lack of understanding of hypernyms, synonyms, and antonyms, b) fragility to morphological variants (e.g. "woman" vs. "women"), and c) sensitivity to spelling errors. To address these issues, we train a deep learning model for semantic matching using customer behavior data. Much of the recent work on large-scale semantic search using deep learning focuses on ranking for web search. In contrast, semantic matching for product search presents several novel challenges, which we elucidate in this paper. We address these challenges by a) developing a new loss function that has an inbuilt threshold to differentiate between random negative examples, impressed but not purchased examples, and positive examples (purchased items), b) using average pooling in conjunction with n-grams to capture short-range linguistic patterns, c) using hashing to handle out of vocabulary tokens, and d) using a model parallel training architecture to scale across 8 GPUs. We present compelling offline results that demonstrate at least 4.7% improvement in Recall@100 and 14.5% improvement in mean average precision (MAP) over baseline state-of-the-art semantic search methods using the same tokenization method. Moreover, we present results and discuss learnings from online A/B tests which demonstrate the efficacy of our method.
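Two of the ingredients listed above, word n-grams and hashing of out-of-vocabulary tokens, can be sketched in a few lines. This is an illustrative sketch, not the paper's tokenizer: the function names, the use of CRC32 as the hash, and the convention that vocabulary ids occupy `0..len(vocab)-1` are all assumptions made here.

```python
import zlib
from typing import Dict, List, Sequence


def word_ngrams(tokens: Sequence[str], n: int) -> List[str]:
    """Contiguous word n-grams, e.g. bigrams capture short-range word order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def token_ids(tokens: Sequence[str], vocab: Dict[str, int], num_oov_bins: int) -> List[int]:
    """Map tokens to ids; unseen tokens are hashed into a fixed set of extra bins."""
    ids = []
    for token in tokens:
        if token in vocab:
            ids.append(vocab[token])
        else:
            # CRC32 is deterministic across runs, unlike Python's salted hash().
            bin_index = zlib.crc32(token.encode("utf-8")) % num_oov_bins
            ids.append(len(vocab) + bin_index)
    return ids
```

With this scheme a misspelled or rare token still receives an embedding row (shared with other tokens that hash to the same bin) instead of being dropped.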
Smartphones have started to be used as self-reporting tools for mental health state, as they accompany individuals throughout their days and can therefore gather temporally fine-grained data. However, the analysis of self-reported mood data poses challenges related to the non-homogeneity of mood assessment among individuals, due to the complexity of the feeling and the reporting scales, as well as the noise and sparseness of the reports when collected in the wild. In this paper, we propose a new end-to-end ML model, inspired by video frame prediction and machine translation, that forecasts future sequences of mood from previous self-reported moods collected in the real world using mobile devices. In contrast to traditional time series forecasting algorithms, our multi-task encoder-decoder recurrent neural network learns patterns from different users, enabling and improving prediction for users with a limited number of self-reports. Unlike traditional feature-based machine learning algorithms, the encoder-decoder architecture makes it possible to forecast a sequence of future moods rather than a single step. Meanwhile, multi-task learning exploits some unique characteristics of the data (mood is bi-dimensional), achieving better results than training single-task networks or other classifiers.
Our experiments using a real-world dataset of 33,000 user-weeks revealed that (i) 3 weeks of sparsely reported mood is the optimal number to accurately forecast mood, (ii) multi-task learning models both dimensions of mood "valence and arousal" with higher accuracy than separate or traditional ML models, and (iii) mood variability, personality traits and day of the week play a key role in the performance of our model. We believe this work provides psychologists and developers of future mobile mental health applications with a ready-to-use and effective tool for early diagnosis of mental health issues at scale.
Cold-start problems are long-standing challenges for practical recommendations. Most existing recommendation algorithms rely on extensive observed data and are brittle in recommendation scenarios with few interactions. This paper addresses such problems using few-shot learning and meta learning. Our approach is based on the insight that good generalization from a few examples relies on both a generic model initialization and an effective strategy for adapting this model to newly arising tasks. To accomplish this, we combine scenario-specific learning with model-agnostic sequential meta-learning and unify them into an integrated end-to-end framework, namely the Scenario-specific Sequential Meta learner (or s^2Meta). By doing so, our meta-learner produces a generic initial model by aggregating contextual information from a variety of prediction tasks while effectively adapting to specific tasks by leveraging learning-to-learn knowledge. Extensive experiments on various real-world datasets demonstrate that our proposed model can achieve significant gains over the state of the art for cold-start problems in online recommendation. The model is deployed in the Guess You Like session on the front page of Mobile Taobao; an illustration video is available at https://youtu.be/TNHLZqWnQwc.
Pattern discovery in geo-spatiotemporal data (such as traffic and weather data) is about finding patterns of collocation, co-occurrence, cascading, or cause and effect between geospatial entities. Using simplistic definitions of spatiotemporal neighborhood (a common characteristic of the existing general-purpose frameworks) is not semantically representative of geo-spatiotemporal data. We therefore introduce a new geo-spatiotemporal pattern discovery framework which defines a semantically correct definition of neighborhood; and then provides two capabilities, one to explore propagation patterns and the other to explore influential patterns. Propagation patterns reveal common cascading forms of geospatial entities in a region. Influential patterns demonstrate the impact of temporally long-term geospatial entities on their neighborhood. We apply this framework on a large dataset of traffic and weather data at countrywide scale, collected for the contiguous United States over two years. Our important findings include the identification of 90 common propagation patterns of traffic and weather entities (e.g., rain --> accident --> congestion), which results in identification of four categories of states within the US; and interesting influential patterns with respect to the "location", "duration", and "type" of long-term entities (e.g., a major construction --> more traffic incidents). These patterns and the categorization of the states provide useful insights on the driving habits and infrastructure characteristics of different regions in the US, and could be of significant value for applications such as urban planning and personalized insurance.
We develop and analyze empirical Bayes Stein-type estimators for use in the estimation of causal effects in large-scale online experiments. While online experiments are generally thought to be distinguished by their large sample size, we focus on the multiplicity of treatment groups. The typical analysis practice is to use simple differences-in-means (perhaps with covariate adjustment) as if all treatment arms were independent. In this work we develop consistent, small bias, shrinkage estimators for this setting. In addition to achieving lower mean squared error these estimators retain important frequentist properties such as coverage under most reasonable scenarios. Modern sequential methods of experimentation and optimization such as multi-armed bandit optimization (where treatment allocations adapt over time to prior responses) benefit from the use of our shrinkage estimators. Exploration under empirical Bayes focuses more efficiently on near-optimal arms, improving the resulting decisions made under uncertainty. We demonstrate these properties by examining seventeen routine experiments conducted on Facebook from April to June 2017.
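As background for the shrinkage idea, the textbook positive-part James-Stein estimator is sketched below. It is not the estimator developed in the paper, whose exact form the abstract does not give. The sketch assumes every treatment-arm estimate has the same known sampling variance and shrinks toward the grand mean; the function name is hypothetical.

```python
from typing import List, Sequence


def james_stein_shrink(estimates: Sequence[float], variance: float) -> List[float]:
    """Positive-part James-Stein shrinkage of arm estimates toward their grand mean.

    Assumes each estimate has the same known sampling variance and that there
    are at least four arms (the grand-mean variant uses the factor k - 3).
    """
    k = len(estimates)
    if k < 4:
        raise ValueError("shrinkage toward the grand mean needs at least 4 arms")
    grand_mean = sum(estimates) / k
    spread = sum((x - grand_mean) ** 2 for x in estimates)
    if spread == 0.0:
        return list(estimates)  # all arms identical: nothing to shrink
    # Noisier estimates (large variance relative to spread) are shrunk harder;
    # the positive part keeps the factor from going negative.
    factor = max(0.0, 1.0 - (k - 3) * variance / spread)
    return [grand_mean + factor * (x - grand_mean) for x in estimates]
```

For example, five arm estimates `[0, 1, 2, 3, 4]` with variance 1 are pulled 20% of the way toward their mean of 2, so extreme arms look less extreme than their raw differences in means suggest.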
Email is ubiquitous in the workplace. Naturally, machine learning models that make third-party email clients "smarter" can dramatically impact employees' productivity and efficiency. Motivated by this potential, we study the task of professional role inference from email data, which is crucial for email prioritization and contact recommendation systems. The central question we address is: Given limited data about employees, as is common in third-party email applications, can we infer where in the organizational hierarchy these employees belong based on their email behavior? Toward our goal, in this paper we study professional role inference on a unique new email dataset comprising billions of email exchanges across thousands of organizations. Taking a network approach in which nodes are employees and edges represent email communication, we propose EMBER, or EMBedding Email-based Roles, which finds email-centric embeddings of network nodes to be used in professional role inference tasks. EMBER automatically captures behavioral similarity between employees in the email network, leading to embeddings that naturally distinguish employees of different hierarchical roles. EMBER often outperforms the state-of-the-art by 2-20% in role inference accuracy and 2.5-344x in speed. We also use EMBER with our unique dataset to study how inferred professional roles compare between organizations of different sizes and sectors, gaining new insights into organizational hierarchy.
Product brands employ shopper marketing (SM) strategies to convert shoppers along the path to purchase. Traditional marketing mix models (MMMs), which leverage regression techniques and historical data, can be used to predict the component of sales lift due to SM tactics. The resulting predictive model is a critical input to plan future SM strategies. The implementation of traditional MMMs, however, requires significant ad-hoc manual intervention due to their limited flexibility in (i) explicitly capturing the temporal link between decisions; (ii) accounting for the interaction between business rules and past (sales and decision) data during the attribution of lift to SM; and (iii) ensuring that future decisions adhere to business rules. These issues necessitate MMMs with tailored structures for specific products and retailers, each requiring significant hand-engineering to achieve satisfactory performance -- a major implementation challenge. We propose an SM Optimization and Inverse Learning Engine (SMOILE) that combines optimization and inverse reinforcement learning to streamline implementation. SMOILE learns a model of lift by viewing SM tactic choice as a sequential process, leverages inverse reinforcement learning to explicitly couple sales and decision data, and employs an optimization approach to handle a wide-array of business rules. Using a unique dataset containing sales and SM spend information across retailers and products, we illustrate how SMOILE standardizes the use of data to prescribe future SM decisions. We also track an industry benchmark to showcase the importance of encoding SM lift and decision structures to mitigate spurious results when uncovering the impact of SM decisions.
The main mission of LinkedIn is to connect 610M+ members to the right opportunities. To find the right opportunities, LinkedIn needs to understand each member's skill set and their expertise levels accurately. However, estimating members' skill expertise is challenging due to lack of ground-truth. So far, the industry relied on either hand-created small scale data, or large scale social gestures containing a lot of social bias (e.g., endorsements).
In this paper, we develop the Social Skill Validation, a novel framework of collecting validations for members' skill expertise at the scale of billions of member-skill pairs. Unlike social gestures, we collect signals in an anonymous way to ensure objectiveness. We also develop a machine learning model to make smart suggestions to collect validations more efficiently.
With the social skill validation data, we discover insights on how people evaluate other people in professional social networks. For example, we find that members with higher seniority do not necessarily get more positive evaluations than more junior members. We evaluate the value of social skill validation data for predicting who is hired for a job requiring a certain skill, and a model using social skill validation outperforms state-of-the-art methods for skill expertise estimation by 10%. Our experiments show that the Social Skill Validation we built provides a novel way to estimate members' skill expertise accurately at large scale and offers a benchmark to validate social theories on peer evaluation.
Real-valued data sequences are often affected by structured noise in addition to random noise. For example, in pressure transient analysis (PTA), semi-log derivatives of log-log diagnostic plots show such contamination by structured noise, especially under multiphase flow conditions. In PTA data, structured noise refers to the response to physical phenomena that do not originate in the reservoir, such as fluid segregation in the wellbore or a pressure leak due to a brief opening of a valve. Such noisy responses commonly mix with flow regimes, hindering further reservoir flow analysis. In this paper, we use Singular Spectrum Analysis (SSA) to decompose PTA data into additive components; subsequently we use the eigenvalues associated with the decomposed components to identify the components that contain most of the structured noise information. We develop a semi-supervised process that requires minimal expert supervision in tuning the solitary parameter of our algorithm using only one pressure buildup scenario. An empirical evaluation using real pressure data from oil and gas wells shows that our approach can detect a multitude of structured noise with 74.25% accuracy.
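The SSA decomposition step mentioned in this abstract can be illustrated with a minimal sketch of basic SSA, assuming NumPy. The function name `ssa_decompose`, the `window` parameter, and the toy series are hypothetical; the abstract does not state how the authors choose the window length or which eigenvalue rule flags structured noise, so that selection step is not shown.

```python
import numpy as np

def ssa_decompose(series, window):
    """Split a 1-D series into additive components via basic SSA.

    Returns (components, eigenvalues); components[i] has the same length
    as the input and the components sum back to the original series.
    """
    x = np.asarray(series, dtype=float)
    n = len(x)
    k = n - window + 1
    # Trajectory (Hankel) matrix: column j holds x[j : j + window].
    traj = np.column_stack([x[j:j + window] for j in range(k)])
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    components = []
    for i in range(len(s)):
        elem = s[i] * np.outer(u[:, i], vt[i])  # rank-one elementary matrix
        # Diagonal averaging: each anti-diagonal maps back to one time step.
        flipped = elem[::-1]
        comp = np.array(
            [flipped.diagonal(d).mean() for d in range(-window + 1, k)]
        )
        components.append(comp)
    # Squared singular values are the eigenvalues of traj @ traj.T.
    return np.array(components), s ** 2

# Toy series (hypothetical): a trend plus an oscillation.
t = np.arange(60)
toy = 0.05 * t + np.sin(0.3 * t)
parts, eigvals = ssa_decompose(toy, window=20)
```

The eigenvalues come back in descending order, so a rule that inspects them (as the abstract describes) can rank components by the share of variance each one carries.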
Sepsis is a condition caused by the body's overwhelming and life-threatening response to infection, which can lead to tissue damage, organ failure, and finally death. Today, sepsis is one of the leading causes of mortality among populations in intensive care units (ICUs). Sepsis is difficult to predict, diagnose, and treat, as it involves analyzing different sets of multivariate time-series, usually with problems of missing data, different sampling frequencies, and random noise. Here, we propose a new dynamic-behavior-based model, which we call a Temporal Probabilistic proFile (TPF), for classification and prediction tasks of multivariate time series. In the TPF method, the raw, time-stamped data are first abstracted into a series of higher-level, meaningful concepts, which hold over intervals characterizing time periods. We then discover frequently repeating temporal patterns within the data. Using the discovered patterns, we create a probabilistic distribution of the temporal patterns of the overall entity population, of each target class in it, and of each entity. We then exploit TPFs as meta-features to classify the time series of new entities, or to predict their outcome, by measuring their TPF distance, either to the aggregated TPF of each class, or to the individual TPFs of each of the entities, using negative cross entropy. Our experimental results on a large benchmark clinical data set show that TPFs improve sepsis prediction capabilities, and perform better than other machine learning approaches.
Learning-to-Rank deals with maximizing the utility of a list of examples presented to the user, with items of higher relevance being prioritized. It has several practical applications such as large-scale search, recommender systems, document summarization and question answering. While there is widespread support for classification and regression based learning, support for learning-to-rank in deep learning has been limited. We introduce TensorFlow Ranking, the first open source library for solving large-scale ranking problems in a deep learning framework. It is highly configurable and provides easy-to-use APIs to support different scoring mechanisms, loss functions and evaluation metrics in the learning-to-rank setting. Our library is developed on top of TensorFlow and can thus fully leverage the advantages of this platform. TensorFlow Ranking has been deployed in production systems within Google; it is highly scalable, both in training and in inference, and can be used to learn ranking models over massive amounts of user activity data, which can include heterogeneous dense and sparse features. We empirically demonstrate the effectiveness of our library in learning ranking functions for large-scale search and recommendation applications in Gmail and Google Drive. We also show that ranking models built using our model scale well for distributed training, without significant impact on metrics. The proposed library is available to the open source community, with the hope that it facilitates further academic research and industrial applications in the field of learning-to-rank.
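As an illustration of the kind of evaluation metric used in the learning-to-rank setting, the following is a plain-Python sketch of normalized discounted cumulative gain (NDCG), a widely used ranking metric. It is not the TensorFlow Ranking API; the function names and the exponential gain form are assumptions chosen for this example.

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    rel = relevances[:k] if k is not None else relevances
    # Gain 2^r - 1, discounted by log2 of the (1-based) position plus one.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rel))

def ndcg(relevances, k=None):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels in the order a hypothetical model ranked the items.
perfect = ndcg([3, 2, 1, 0])
reversed_order = ndcg([0, 1, 2, 3])
```

A list ranked in descending relevance scores 1.0, and any ordering that places relevant items lower scores less, which is the property "items of higher relevance being prioritized" refers to.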
Despite the progress within the last decades, weather forecasting is still a challenging and computationally expensive task. Current satellite-based approaches to predict thunderstorms are usually based on the analysis of the observed brightness temperatures in different spectral channels and emit a warning if a critical threshold is reached. Recent progress in data science however demonstrates that machine learning can be successfully applied to many research fields in science, especially in areas dealing with large datasets. We therefore present a new approach to the problem of predicting thunderstorms based on machine learning. The core idea of our work is to use the error of two-dimensional optical flow algorithms applied to images of meteorological satellites as a feature for machine learning models. We interpret that optical flow error as an indication of convection potentially leading to thunderstorms and lightning. To factor in spatial proximity we use various manual convolution steps. We also consider effects such as the time of day or the geographic location. We train different tree classifier models as well as a neural network to predict lightning within the next few hours (called nowcasting in meteorology) based on these features. In our evaluation section we compare the predictive power of the different models and the impact of different features on the classification result. Our results show a high accuracy of 96% for predictions over the next 15 minutes which slightly decreases with increasing forecast period but still remains above 83% for forecasts of up to five hours. The high false positive rate of nearly 6% however needs further investigation to allow for an operational use of our approach.
The Identification and Estimation of Direct and Indirect Effects in A/B Tests through Causal Mediation Analysis
E-commerce companies have a number of online products, such as organic search, sponsored search, and recommendation modules, to fulfill customer needs. Although each of these products provides a unique opportunity for users to interact with a portion of the overall inventory, they are all similar channels for users and compete for limited time and monetary budgets of users. To optimize users' overall experiences on an E-commerce platform, instead of understanding and improving different products separately, it is important to gain insights into the evidence that a change in one product would induce users to change their behaviors in others, which may be due to the fact that these products are functionally similar. In this paper, we introduce causal mediation analysis as a formal statistical tool to reveal the underlying causal mechanisms. Existing literature provides little guidance on cases where multiple unmeasured causally-dependent mediators exist, which are common in A/B tests. We seek a novel approach to identify direct and indirect effects of the treatment in those scenarios. Finally, we demonstrate the effectiveness of the proposed method on data from Etsy's real A/B tests and shed light on complex relationships between different products.
Your name tells a lot about you: your gender, ethnicity and so on. It has been shown that name embeddings are more effective in representing names than traditional substring features. However, our previous name embedding model was trained on private email data and is not publicly accessible. In this paper, we explore learning name embeddings from public Twitter data. We argue that Twitter embeddings have two key advantages: (i) they can and will be publicly released to support the research community; and (ii) even with a smaller training corpus, Twitter embeddings achieve similar performance on multiple tasks compared to email embeddings.
As a test case to show the power of name embeddings, we investigate the modeling of lifespans. We find it interesting that adding name embeddings can further improve the performances of models using demographic features, which are traditionally used for lifespan modeling. Through residual analysis, we observe that fine-grained groups (potentially reflecting socioeconomic status) are the latent contributing factors encoded in name embeddings. These were previously hidden to demographic models, and may help to enhance the predictive power of a wide class of research studies.
Large companies need to monitor various metrics (for example, Page Views and Revenue) of their applications and services in real time. At Microsoft, we develop a time-series anomaly detection service which helps customers monitor time series continuously and raise timely alerts for potential incidents. In this paper, we introduce the pipeline and algorithm of our anomaly detection service, which is designed to be accurate, efficient and general. The pipeline consists of three major modules: data ingestion, experimentation platform and online compute. To tackle the problem of time-series anomaly detection, we propose a novel algorithm based on Spectral Residual (SR) and Convolutional Neural Network (CNN). Our work is the first attempt to borrow the SR model from the visual saliency detection domain for time-series anomaly detection. Moreover, we combine SR and CNN to improve the performance of the SR model. Our approach achieves superior experimental results compared with state-of-the-art baselines on both public datasets and Microsoft production data.
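The Spectral Residual idea borrowed from visual saliency detection can be sketched as follows, assuming NumPy. This is a minimal illustration of the generic SR transform (log-amplitude spectrum minus its local average, recombined with the original phase); the function name, the filter width `q`, and the toy series are assumptions, and the CNN stage and any production details of the service are not shown.

```python
import numpy as np

def spectral_residual_saliency(series, q=3):
    """Return a saliency map with the same length as the input series."""
    x = np.asarray(series, dtype=float)
    spectrum = np.fft.fft(x)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    # Log-amplitude spectrum; the small constant guards against log(0).
    log_amp = np.log(amplitude + 1e-8)
    # Local average of the log spectrum with a length-q mean filter.
    avg_log_amp = np.convolve(log_amp, np.ones(q) / q, mode="same")
    # The spectral residual is what the smooth average fails to explain.
    residual = log_amp - avg_log_amp
    # Back to the time domain, keeping the original phase.
    return np.abs(np.fft.ifft(np.exp(residual + 1j * phase)))

# Toy series (hypothetical): a flat metric with one spike at index 50.
toy = np.ones(100)
toy[50] = 10.0
saliency = spectral_residual_saliency(toy)
spike_index = int(np.argmax(saliency))
```

On this toy input the saliency map peaks at the spike, so a simple threshold on the map would flag it; real series with trend and seasonality generally need more care in choosing `q` and the threshold.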
Point-of-Interest (POI) recommender systems play a vital role in people's lives by recommending unexplored POIs to users and have drawn extensive attention from both academia and industry. Despite their value, however, they still suffer from the challenges of capturing complicated user preferences and fine-grained user-POI relationship for spatio-temporal sensitive POI recommendation. Existing recommendation algorithms, including both shallow and deep approaches, usually embed the visiting records of a user into a single latent vector to model user preferences: this has limited power of representation and interpretability. In this paper, we propose a novel topic-enhanced memory network (TEMN), a deep architecture to integrate the topic model and memory network capitalising on the strengths of both the global structure of latent patterns and local neighbourhood-based features in a nonlinear fashion. We further incorporate a geographical module to exploit user-specific spatial preference and POI-specific spatial influence to enhance recommendations. The proposed unified hybrid model is widely applicable to various POI recommendation scenarios. Extensive experiments on real-world WeChat datasets demonstrate its effectiveness (improvement ratio of 3.25% and 29.95% for context-aware and sequential recommendation, respectively). Also, qualitative analysis of the attention weights and topic modeling provides insight into the model's recommendation process and results.
An essential step in the customer care routine of cellular service carriers is determining whether an individual user is impacted by on-going service issues. This is traditionally done by monitoring the network and the services. However, user feedback data, generated when users call customer care agents with problems, is a complementary source of data for this purpose. User feedback data is particularly valuable as it provides the user perspective of the service issues. However, this data is extremely noisy, due to the range of issues that users have and the diversity of the language used by care agents. In this paper, we present LOTUS, a system that identifies users impacted by a common root cause (such as a network outage) from user feedback. LOTUS is based on a novel algorithmic framework that tightly couples co-training and spatial scan statistics. To model the text in the user feedback, LOTUS also incorporates custom-built language models using deep sequence learning. Through experimental analysis on synthetic and live data, we demonstrate the accuracy of LOTUS. LOTUS has been deployed for several months, and has identified the impact of over 200 events.
Quality product descriptions are critical for providing a competitive customer experience on an E-commerce platform. An accurate and attractive description not only helps customers make an informed decision but also improves the likelihood of purchase. However, crafting a successful product description is tedious and highly time-consuming. Due to its importance, automating product description generation has attracted considerable interest from both research and industrial communities. Existing methods mainly use templates or statistical methods, and their performance could be rather limited. In this paper, we explore a new way to generate personalized product descriptions by combining the power of neural networks and a knowledge base. Specifically, we propose a KnOwledge Based pErsonalized (or KOBE) product description generation model in the context of E-commerce.
In KOBE, we extend the encoder-decoder framework, the Transformer, to a sequence modeling formulation using self-attention. In order to make the description both informative and personalized, KOBE considers a variety of important factors during text generation, including product aspects, user categories, and the knowledge base. Experiments on real-world datasets demonstrate that the proposed method outperforms the baseline on various metrics. KOBE can achieve an improvement of 9.7% over state-of-the-art methods in terms of BLEU. We also present several case studies as anecdotal evidence to further support the effectiveness of the proposed approach. The framework has been deployed in Taobao, the largest online E-commerce platform in China.
Our research tackles the challenge of milk production resource use efficiency in dairy farms with machine learning methods. Reproduction is a key factor for dairy farm performance since a cow's milk production begins with the birth of a calf. Therefore, detecting estrus, the only period when the cow is susceptible to pregnancy, is crucial for farm efficiency. Our goal is to enhance estrus detection (performance, interpretability), especially on the currently undetected silent estrus (35% of total estrus), and allow farmers to rely on automatic estrus detection solutions based on affordable data (activity, temperature). In this paper, we first propose a novel approach with real-world data analysis to address both behavioral and silent estrus detection through machine learning methods. Second, we present LCE, a local cascade based algorithm that significantly outperforms a typical commercial solution for estrus detection, driven by its ability to detect silent estrus. Then, our study reveals the pivotal role of activity sensor deployment in estrus detection. Finally, we propose an approach relying on global and local (behavioral versus silent) algorithm interpretability (SHAP) to reduce the mistrust in estrus detection solutions.
Trajectory data has been widely used in many urban applications. Sharing trajectory data with effective supervision is a vital task, as it contains private information of moving objects. However, malicious data users can modify trajectories in various ways to avoid data distribution tracking by hashing-based data signatures, e.g., MD5. Moreover, existing trajectory data protection schemes can only protect trajectories from either spatial or temporal modifications. Finally, so far there is no authoritative third party for the trajectory data sharing process, as trajectory data is too sensitive. To this end, we propose a novel trajectory copyright protection scheme, which can protect trajectory data from comprehensive types of data modifications/attacks. Three main techniques are employed to effectively guarantee the robustness and comprehensiveness of the proposed data sharing scheme: 1) the identity information is embedded distributively across a set of sub-trajectories partitioned based on the spatio-temporal regions; 2) the centroid distance of the sub-trajectories serves as a stable trajectory attribute in which to embed the information; and 3) the blockchain technique is used as a trusted third party to log all data transaction history for data distribution tracking in a decentralized manner. Extensive experiments were conducted based on two real-world trajectory datasets to demonstrate the effectiveness of our proposed scheme.
This paper considers the automation of a typical complex advertisement scheduling system in broadcast television (TV) networks. Compared to traditional TV advertisement scheduling, we consider the case where not all requests for advertising slots are known at the same time, and time-consuming negotiations related to balancing the TV network's and advertisers' priorities have to be minimized. Although there are existing works that automatically provide schedules using mathematical optimization, the applicability of these techniques to our problem is limited due to the cumbersome formulations necessary for handling vague conditions and aesthetic domain-specific rules necessary for advertisers' satisfaction.
To automate the system, we propose a data-driven approach that uses intention learning on top of mathematical optimization and clustering to imitate the decision-making process of scheduling experts. The scheduling of TV ads is automated via mathematical programming, and the expert objectives and constraints are learned from historical demonstrations using inverse optimization. The clustering of TV ads and the learning of associated intentions are used to deal with the cold start problem related to ad requests from new companies or products. Our proposed system is validated on an actual dataset from a Japanese TV network. Experiments show that our system can more closely reproduce the experts' schedules compared to standard optimization approaches, demonstrating the potential of our work in reducing personnel costs and improving advertisers' experience. Based on its promising results, our proposed system is being prepared for commercial deployment.
Two-Sided Fairness for Repeated Matchings in Two-Sided Markets: A Case Study of a Ride-Hailing Platform
Ride hailing platforms, such as Uber, Lyft, Ola or DiDi, have traditionally focused on the satisfaction of the passengers, or on boosting successful business transactions. However, recent studies provide a multitude of reasons to worry about the drivers in the ride hailing ecosystem. The concerns range from bad working conditions and worker manipulation to discrimination against minorities. With the sharing economy ecosystem growing, more and more drivers financially depend on online platforms and their algorithms to secure a living. It is pertinent to ask what a fair distribution of income on such platforms is and what power and means the platform has in shaping these distributions.
In this paper, we analyze job assignments of a major taxi company and observe that there is significant inequality in the driver income distribution. We propose a novel framework to think about fairness in the matching mechanisms of ride hailing platforms. Specifically, our notion of fairness relies on the idea that, spread over time, all drivers should receive benefits proportional to the amount of time they are active in the platform. We postulate that by not requiring every match to be fair, but rather distributing fairness over time, we can achieve better overall benefit for the drivers and the passengers. We experiment with various optimization problems and heuristics to explore the means of achieving two-sided fairness, and investigate their caveats and side-effects. Overall, our work takes the first step towards rethinking fairness in ride hailing platforms with an additional emphasis on the well-being of drivers.
Recent years have witnessed the merging of social networks and user-generated content (UGC) platforms. In these new platforms, users establish links to others not only driven by their social relationships in the physical world but also driven by the contents published by others. During this merging process, social networks gradually integrate both social and content links and become unprecedentedly complicated, with the motivation to exploit both the advantages of social viscosity and content attractiveness to reach the best customer retention situation. However, due to the lack of fine-grained data recording such merging phenomena, the co-driven mechanism of social and content links in churn remains unexplored. How do social and content factors jointly influence customers' churn? What is the best ratio of social and content links for retention? Is there a model to capture this co-driven mechanism in churn phenomena? In this paper, we collect a real-world dataset with more than 5.77 million users and 1.15 billion links, with each link being tagged as a social one or a content one. We find that both social and content links have a significant impact on users' churn and they work jointly as a complicated mixture effect. As a result, we propose a novel survival model, which incorporates both social and content factors, to predict churn probability over time. Our model successfully fits the churn distribution in reality and accurately predicts the churn rate of different subpopulations in the future. By analyzing the modeling parameters, we try to strike a balance between social-driven and content-driven links in a user's social network to reach the lowest churn rate. Our model and findings may have potential implications for the design of future social media.
Paths of online users towards a purchase event (conversion) can be very complex, and guiding them through their journey is an integral part of online advertising. Studies in marketing indicate that a conversion event is typically preceded by one or more purchase funnel stages, viz., unaware, aware, interest, consideration, and intent. Intuitively, some online activities, including web searches, site visits and ad interactions, can serve as markers for the user's funnel stage. Identifying such markers can potentially refine conversion prediction, guide the design of ad creatives (text and images), and lead to higher ad effectiveness. We explore this hypothesis through a set of experiments designed for two tasks: (i) conversion prediction given a user's activity trail, and (ii) funnel stage specific targeting and creatives. To address challenges in the two tasks, we propose an attention based recurrent neural network (RNN) which ingests a user activity trail, and predicts the user's conversion probability along with attention weights for each activity (analogous to its position in the funnel). Specifically, we propose novel attention mechanisms, which maintain a global weight for each activity across all user trails, and also indicate the activity's funnel stage. Use of the proposed attention mechanisms for the first task of conversion prediction shows significant AUC lifts of 0.9% on a public dataset (RecSys 2015 challenge), and up to 3.6% on three proprietary datasets from a major advertising platform (Yahoo Gemini). To address the second task, the activity weights from the proposed mechanisms are used to automatically assign users to funnel stages via a scalable scoring method. Offline evaluation shows that such activity weights are more aligned with editorially tagged activity-funnel stages compared to weights from existing attention mechanisms and simpler conversion models like logistic regression. 
In addition, results of online ad campaigns in Yahoo Gemini with funnel specific user targeting and ad creatives show strong performance lifts further validating the connection across online activities, purchase funnel stages, stage-specific custom creatives, and conversions.
Aesthetic style is the crux of many purchasing decisions. When considering an item for purchase, buyers need to be aligned not only with the functional aspects (e.g. description, category, ratings) of an item's specification, but also with its stylistic and aesthetic aspects (e.g. modern, classical, retro). Style becomes increasingly important on e-commerce sites like Etsy, an online marketplace for handmade and vintage goods, where hundreds of thousands of items can differ by style and aesthetic alone. As such, it is important for industry recommender systems to properly model style when understanding shoppers' buying preferences. In past work, because of its abstract nature, style is often approached in an unsupervised manner, represented by nameless latent factors or embeddings. As a result, there has been no previous work on predictive models or analysis devoted to understanding how style, or even the presence of style, impacts a buyer's purchase decision. In this paper, we discuss a novel process by which we leverage 43 named styles given by merchandising experts in order to bootstrap large-scale style prediction and analysis of how style impacts purchase decisions. We train a supervised, style-aware deep neural network that is shown to predict item style with high accuracy, while generating style-aware embeddings that can be used in downstream recommendation tasks. We share our analysis, based on over a year's worth of transaction data, and show that these findings are crucial to understanding how to more explicitly leverage style signal in industry-scale recommender systems.
As patients' access to their doctors' clinical notes becomes common, translating professional, clinical jargon to layperson-understandable language is essential to improve patient-clinician communication. Such translation yields better clinical outcomes by enhancing patients' understanding of their own health conditions, and thus improving patients' involvement in their own care. Existing research has used dictionary-based word replacement or definition insertion to approach the need. However, these methods are limited by expert curation, which is hard to scale and has trouble generalizing to unseen datasets that do not share an overlapping vocabulary. In contrast, we approach the clinical word and sentence translation problem in a completely unsupervised manner. We show that a framework using representation learning, bilingual dictionary induction and statistical machine translation yields the best precision at 10 of 0.827 on professional-to-consumer word translation, and mean opinion scores of 4.10 and 4.28 out of 5 for clinical correctness and layperson readability, respectively, on sentence translation. Our fully-unsupervised strategy overcomes the curation problem, and the clinically meaningful evaluation reduces biases from inappropriate evaluators, which are critical in clinical machine learning.
Urban flow monitoring systems play important roles in smart city efforts around the world. However, the ubiquitous deployment of monitoring devices, such as CCTVs, induces a long-lasting and enormous cost for maintenance and operation. This suggests the need for a technology that can reduce the number of deployed devices, while preventing the degeneration of data accuracy and granularity. In this paper, we aim to infer the real-time and fine-grained crowd flows throughout a city based on coarse-grained observations. This task is challenging for two essential reasons: the spatial correlations between coarse- and fine-grained urban flows, and the complexities of external impacts. To tackle these issues, we develop a method entitled UrbanFM based on deep neural networks. Our model consists of two major parts: 1) an inference network to generate fine-grained flow distributions from coarse-grained inputs by using a feature extraction module and a novel distributional upsampling module; 2) a general fusion subnet to further boost the performance by considering the influences of different external factors. Extensive experiments on two real-world datasets validate the effectiveness and efficiency of our method, demonstrating its state-of-the-art performance on this problem.
When a new cyber-vulnerability is detected, a Common Vulnerability and Exposure (CVE) number is attached to it. Malicious "exploits" may use these vulnerabilities to carry out attacks. Unlike works which study if a CVE will be used in an exploit, we study the problem of predicting when an exploit is first seen. This is an important question for system administrators as they need to devote scarce resources to take corrective action when a new vulnerability emerges. Moreover, past works assume that CVSS scores (released by NIST) are available for predictions, but we show that on average 49% of real-world exploits occur before CVSS scores are published. This means that past works, which use CVSS scores, miss almost half of the exploits. In this paper, we propose a novel framework to predict when a vulnerability will be exploited via Twitter discussion, without using CVSS score information. We introduce the unique concept of a family of CVE-Author-Tweet (CAT) graphs and build a novel set of features based on such graphs. We define recurrence relations capturing "hotness" of tweets, "expertise" of Twitter users on CVEs, and "availability" of information about CVEs, and prove that we can solve these recurrences via a fixed-point algorithm. Our second innovation adopts Hawkes processes to estimate the number of tweets/retweets related to the CVEs. Using the above two sets of novel features, we propose two ensemble forecast models, FEEU (for classification) and FRET (for regression), to predict when a CVE will be exploited. Compared with natural adaptations of past works (which predict if an exploit will be used), FEEU increases F1 score by 25.1%, while FRET decreases MAE by 37.2%.
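To give a feel for how a Hawkes process models bursts of tweets and retweets, the following plain-Python sketch evaluates the conditional intensity of a self-exciting process with an exponential kernel. The kernel form and the parameter values `mu`, `alpha` and `beta` are assumptions made for illustration; the abstract does not specify the kernel or the fitting procedure the authors use.

```python
import math

def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Conditional intensity at time t given past event times.

    lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i)),
    where mu is the background rate, alpha the jump added by each event,
    and beta the rate at which that excitement decays.
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i)) for t_i in event_times if t_i < t
    )
    return mu + excitation

# Hypothetical tweet timestamps (in hours) about one CVE.
tweets = [0.0, 0.2, 0.3]
soon_after = hawkes_intensity(0.5, tweets)
much_later = hawkes_intensity(5.0, tweets)
quiet = hawkes_intensity(5.0, [])
```

Each past tweet raises the expected rate of further tweets, and the effect fades over time, so the intensity shortly after a burst is higher than the intensity hours later. Turning the intensity into an expected count requires integrating or simulating it over the forecast window, which is not shown here.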
The Amazon video homepage is the primary gateway for customers looking to explore the large collection of content and find something interesting to watch. Typically, the page is personalized for a customer, and consists of a series of widgets or carousels, with each widget containing multiple items (e.g., movies, TV shows, etc.). Ranking the widgets needs to maximize relevance and maintain diversity, while simultaneously satisfying business constraints. We present the first unified framework for dealing with relevance, diversity, and business constraints simultaneously. Towards this end, we derive a novel primal-dual algorithm which incorporates local diversity constraints as well as global business constraints for whole page optimization. Through extensive offline experiments and an online A/B test, we show that our proposed method achieves significantly higher user engagement compared to existing methods, while also satisfying business constraints. For instance, in an online A/B test, our framework improved key metrics such as customer streaming minutes by 0.77% and customer distinct streaming days by 0.32% over a state-of-the-art submodular diversity model.
SESSION: Applied Data Science Invited Talks
Machine learning (ML) has had a tremendous impact across the world over the last decade. As we think about ML solving complex tasks, sometimes at super-human levels, it is easy to forget that there is no machine learning without humans in the loop. Humans define tasks and metrics, develop and program algorithms, collect and label data, debug and optimize systems, and are (usually) ultimately the users of the ML-based applications we are developing.
In this talk, we will cover 4 human-centered perspectives in the ML development process, along with methods and systems, to empower humans to maximize the ultimate impact of their ML-based applications.
Data science in modern applications is pushing the limits of tools and organizations. The scale of data, the breadth of required skill sets, and the complexity of workflows all cause organizations to stumble when developing data-powered applications and moving them to production. This talk will discuss these challenges and Databricks' efforts to overcome them within open source software projects like Apache Spark and MLflow.
Apache Spark has simplified large-scale ETL and analytics, and its Project Hydrogen helps to bridge the gap between Spark and ML tools such as TensorFlow and Horovod. MLflow, an open source platform for managing ML lifecycles, facilitates experimentation, reproducibility and deployment. We will present insights from our collaborations on these projects, as well as our perspective at Databricks in facilitating data science for a wide variety of organizations and applications.
Small businesses are the lifeblood of the U.S. economy, representing an astounding 99.9 percent of all businesses, creating two-thirds of net new jobs, and accounting for 44 percent of economic activity. Yet, 50 percent of small businesses go out of business in the first 5 years.
What's behind this dismal statistic? Among the top contributing factors is cash flow management. Owners who cannot efficiently manage the inflow and outflow of cash are almost certain to fail. And, those who can are more likely to break through the statistical 5-year barrier to build thriving businesses.
In this talk, we'll describe novel applications of artificial intelligence and large-scale machine learning aimed at addressing the problem of forecasting cash flow for small businesses. These are sparse, high-dimensional correlated time series. We'll present new results on forecasting this type of time series, using scalable Gaussian Processes with kernels formed through the use of deep learning. These methods yield highly accurate predictions but also include a principled approach for generating confidence intervals.
An increasing number of machine learning tasks require dealing with large graph datasets, which capture rich and complex relationships among potentially billions of elements. Graph Neural Networks (GNNs) have become an effective way to address the graph learning problem by converting the graph data into a low dimensional space while keeping both the structural and property information to the maximum extent and constructing a neural network for training and referencing. However, it is challenging to provide efficient graph storage and computation capabilities to facilitate GNN training and enable development of new GNN algorithms. In this paper, we present a comprehensive graph neural network system, namely AliGraph, which consists of distributed graph storage, optimized sampling operators and runtime to efficiently support not only existing popular GNNs but also a series of in-house developed ones for different scenarios. The system is currently deployed at Alibaba to support a variety of business scenarios, including product recommendation and personalized search at Alibaba's E-Commerce platform. By conducting extensive experiments on a real-world dataset with 492.90 million vertices, 6.82 billion edges and rich attributes, AliGraph performs an order of magnitude faster in terms of graph building (5 minutes vs hours reported from the state-of-the-art PowerGraph platform). At training, AliGraph runs 40%-50% faster with the novel caching strategy and demonstrates around 12 times speed up with the improved runtime. In addition, our in-house developed GNN models all showcase their statistically significant superiorities in terms of both effectiveness and efficiency (e.g., 4.12%-17.19% lift by F1 scores).
Data scientists today spend significant portions of their time in finding and preparing data rather than analyzing, building and deploying models. The challenges and opportunities will be explored by using a design thinking approach, to increase Analytics / ML throughput at scale.
With the advent of advanced modeling for general machine learning, and in particular computer vision, speech recognition and natural language processing, applications of AI are enabling classical businesses to reinvent themselves, and new business fields to arise which were not even imaginable a few years back. Hassan will present some of these use cases, and dive into some in more detail, showing where current and future AI/ML technology is accelerating innovation.
Lyft's mission is to improve people's lives with the world's best transportation. Self-driving vehicles have the potential to deliver unprecedented improvements to safety and quality, at a price and convenience that challenges traditional models of vehicle ownership. A combination of hardware, software, and knowledge technologies is needed to build self-driving cars. In this talk, I'll present the core problems in self-driving and how recent advances in computer vision, robotics, and machine learning are powering this revolution. The car is carefully designed with a variety of sensors that complement each other to address a wide variety of driving scenarios. Sensor fusion brings all of these signals together into an interpretable AI engine comprising perception, prediction, planning, and controls. For example, deep learning models and large scale machine learning have closed the gap between human and machine perception. In contrast, predicting the behavior of other humans and effectively planning and negotiating maneuvers continue to be hard problems. Combining AI technologies with deep knowledge about the real world is key to addressing these.
I plan to talk about a few big challenges we have at LinkedIn in the space of Data Science. The ones coming to my mind are (1) Measuring long-term impact; (2) Learning while preserving privacy; (3) Fairness. I can also touch upon productivity and efficiency -- which is a very practical challenge I'm sure all DS organizations face.
The latest generation of geostationary satellites carry sensors such as the Advanced Baseline Imager (GOES-16/17) and the Advanced Himawari Imager (Himawari-8/9) that closely mimic the spatial and spectral characteristics of widely used polar orbiting sensors such as EOS/MODIS. More importantly, they provide observations at 1-5-15 minute intervals, instead of twice a day from MODIS, offering unprecedented opportunities for monitoring large parts of the Earth. In addition to serving the needs of weather forecasting, these observations offer new and exciting opportunities in managing solar power, fighting wildfires, and tracking air pollution. Creation of actionable information in near realtime from these data streams is a challenge that is best addressed through collaborative efforts among the industry, academia and government agencies.
There are two aspects of data that make them big: sample size and dimensionality. The advantages of large sample size have long been touted. In contrast, high dimensionality has typically been seen as an obstacle to successful analysis. In this talk, using the area of genomics as an example, I will illustrate some of the advantages of high dimensionality.
After a natural disaster or other crisis, humanitarian organizations need to know where affected people are located and what resources they need. While this information is difficult to capture quickly through conventional methods, aggregate usage patterns of social media apps like Facebook can help fill these information gaps. This talk will describe the data and methodology that power Facebook Disaster Maps. These maps utilize information about Facebook usage in areas impacted by natural hazards, producing insights into how the population is affected by and responding to the hazard. In addition to methodology details, including efforts taken to ensure the security and privacy of Facebook users, I'll also discuss how we worked with humanitarian partners to develop the maps, which are actively used in disaster response today. I'll give examples of insights generated from the maps and I'll also discuss some limitations of the current methodologies, challenges, and opportunities for improvement.
Friends Don't Let Friends Deploy Black-Box Models: The Importance of Intelligibility in Machine Learning
Every data set is flawed, often in ways that are unanticipated and difficult to detect. If you can't understand what your model has learned, then you almost certainly are shipping models that are less accurate than they could be and which might even be risky. Historically there has been a tradeoff between accuracy and intelligibility: accurate models such as neural nets, boosted trees and random forests are not very intelligible, and intelligible models such as logistic regression and small trees or decision lists usually are less accurate. In mission-critical domains such as healthcare, where being able to understand, validate, edit and ultimately trust a model is important, one often had to choose less accurate models. But this is changing. We have developed a learning method based on generalized additive models with pairwise interactions (GA2Ms) that is as accurate as full complexity models yet even more interpretable than logistic regression. In this talk I'll highlight the kinds of problems that are lurking in all of our datasets, and how these interpretable, high-performance GAMs are making what was previously hidden, visible. I'll also show how we're using these models to uncover bias in models where fairness and transparency are important. (Code for the models has recently been released open-source.)
The last decade has seen three great phenomena in computing - the rebirth of AI algorithms and AI hardware; the evolution of cloud computing and distributed software development; and the explosive growth of open source software that has led to the availability of code as data, and its associated metadata, at scale. In this talk, we will describe how we take advantage of innovations in these dimensions to improve developer productivity and infuse AI and automation into software processes. We will discuss examples of how we built intelligent software by creating AI algorithms driven by deep understanding of code as data. In addition, we will talk about how data can also be treated as code for the next-generation AI-infused software development.
In this talk, I will first discuss deep learning models that can find semantically meaningful representations of words, learn to read documents and answer questions about their content. I will show how we can encode external linguistic knowledge as an explicit memory in recurrent neural networks, and use it to model co-reference relations in text. I will further introduce methods that can augment neural representation of text with structured data from Knowledge Bases for question answering, and show how we can use structured prior knowledge from Knowledge Graphs for image classification. Finally, I will introduce the notion of structured memory as being a crucial part of an intelligent agent's ability to plan and reason in partially observable environments. I will present a modular hierarchical reinforcement learning agent that can learn to store arbitrary information about the environment over long time lags, perform efficient exploration and long-term planning, while generalizing across domains and tasks.
Official figures in Africa indicate that 1,349 rhinos were killed in 2015. This marked the most critical moment of the current rhino poaching crisis that began in 2008. This trend has since reversed to the minimum of 1,124 poached rhinos achieved in 2017. Although this brings hope, this emergency is still a formidable challenge that needs to be tackled from a multi-angle approach including anti-poaching collaboration among countries to enforce effective wildlife crime laws, targeted campaigns in the illegal rhino horn end-user countries like China and Vietnam, and the adoption of cutting-edge technology by governmental agencies and conservationist NGOs.
In this talk, we will highlight the recent efforts of Peace Parks Foundation (PPF), the advocate for the creation of transfrontier conservation areas in South Africa, and Microsoft to address this crisis. We will explain how, through the joint use of deep learning and Cloud, PPF and Microsoft developed a fast and accurate potential poacher detection solution that allows PPF to allocate the park resources in a smarter and more efficient manner.
Artificial Intelligence (AI) is behind practically every product experience at LinkedIn. From ranking the member's feed to recommending new jobs, AI is used to fulfill our mission to connect the world's professionals to make them more productive and successful. While product functionality can be decomposed into separate components, they are deeply interconnected; thus, creating interesting questions and challenging AI problems that need to be solved in a sound and practical manner. In this talk, I will provide an overview of lessons learned and approaches we have developed to address these problems, including scaling to large problem sizes, handling multiple conflicting objective functions, efficient model tuning, and our progress toward using AI to optimize the LinkedIn product ecosystem more holistically.
A unified graph engine has been playing increasingly critical roles in many applications, especially for those requiring cross-domain analysis and near real-time decision-making on massive data, aiming to offer integrated and efficient end-to-end capabilities in concurrent graph data query, interactive graph analysis, and large scale graph-based (deep) learning. However, there is barely a unified graph system for enterprise use to the best of our knowledge. Simply assembling some frameworks/libraries together can result in significant performance degradation, due to the sparsity of graph data and irregular data access patterns, which adversely impacts its adoption in industry. In this talk, we will exemplify the challenges using our three efforts on making metropolitan districts smart within an integrated engine, which consist of managing a complex synergy of heterogeneous urban data via property graphs, modeling traffic flow patterns from the stored data through heterogeneous information network analysis, and predicting traffic interactively using emerging graph neural networks. Although the smart city is projected to become one of the most promising scenarios in the AI era, the unified graph engine can address many other domains such as knowledge graph analysis and multi-modality medical research. We will address in the talk the progress towards the above three scientific directions and also point out relevant future research opportunities from the industrial perspective.
In this talk I will share my experience at the start of "Data Science" in 2012, joining one of the fastest growing early stage unicorns and then building data science over 7 years. What we think of when we think of data science has evolved rapidly during this time. At Airbnb we now use data across product and operations, and for decisions by humans and machines. I will share my predictions for the future and where I believe data science can make the biggest impact in the years to come.
Advances in supervised machine learning have frequently been fueled by access to large-scale labeled data. In business, however, natural labels may not exist. In these cases, a common industry playbook involves using manual, human annotation to label enough data points in order to train a model. This paper describes a general structure for accelerating the annotation process using artificial intelligence and combining it with model quality assurance (QA).
In this talk we walk through this process in detail. We start with rich manual annotation of a small number of unlabeled data points. These can then be used to train a series of coarse predictive models that are used to prepopulate some default selections in the annotation tool and speed up annotator performance. With more data points, models can be retrained on a regular cadence and less human intervention is required. Finally, models can provide defaults for all fields, and re-training continues until the annotator override rate reaches a production-grade level.
Tradeoffs of this type of approach include balancing the increased annotation efficiency with engineering costs associated with building annotation and quality assurance tools. We will walk through these tradeoffs, which depend on the problem class and complexity of the model.
Finally, we will include a detailed industry case study based on the use of artificial intelligence in the annotation process at KeepTruckin, where we use annotation to label vehicle location history data.
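The stopping rule described in this abstract (re-train until the annotator override rate reaches a production-grade level) can be illustrated with a minimal sketch. The function names and the 5% threshold below are hypothetical choices for illustration; the abstract does not state how the override rate is computed or what level counts as production-grade.

```python
from typing import Hashable, Sequence


def override_rate(model_defaults: Sequence[Hashable],
                  final_labels: Sequence[Hashable]) -> float:
    """Fraction of fields where the annotator changed the model's default."""
    if len(model_defaults) != len(final_labels):
        raise ValueError("defaults and final labels must align one-to-one")
    if not model_defaults:
        raise ValueError("need at least one annotated field")
    overrides = sum(d != f for d, f in zip(model_defaults, final_labels))
    return overrides / len(model_defaults)


def is_production_grade(rate: float, threshold: float = 0.05) -> bool:
    """Hypothetical criterion: stop re-training once overrides are rare."""
    return rate <= threshold


# Illustrative data only: the model pre-filled four fields and the
# annotator corrected one of them.
defaults = ["parked", "driving", "driving", "idle"]
final = ["parked", "driving", "idle", "idle"]
rate = override_rate(defaults, final)
print(rate, is_production_grade(rate))
```

In practice the rate would be tracked per re-training round, so that the trend, not a single batch, decides when human review can be reduced.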
The discipline of Software Engineering has evolved over the past 5+ decades to good levels of maturity. This maturity is in fact both a blessing and a necessity, since the modern world largely depends on it.
At the same time, the popularity of Machine Learning (ML) has been steadily increasing over the past 2+ decades, and over the last decade ML is being increasingly used for both experimentation and production workloads. It is no longer uncommon for ML to power widely used applications and products that are integral parts of our life. Much like what was the case for Software Engineering, the proliferation of use of ML technology necessitates the evolution of the ML discipline from "Coding" to "Engineering".
Gus Katsiapis offers a view from the trenches of using and building end-to-end ML platforms, and shares collective knowledge and experience, gathered over more than a decade of applied ML at Google. We hope this helps pave the way towards a world of ML Engineering.
Kevin Haas offers an overview of TensorFlow Extended (TFX), the end-to-end machine learning platform for TensorFlow that powers products across all of Alphabet (and beyond). TFX helps effectively manage the end-to-end training and production workflow including model management, versioning, and serving, thereby helping one realize aspects of ML Engineering.
Didi Chuxing is the world's leading mobile transportation platform that offers a full range of app-based transportation options for 550 million users. Every day, DiDi's platform receives over 100TB new data, processes more than 40 billion routing requests, and acquires over 15 billion location points. Machine learning has been used in numerous components of DiDi's platform to improve travel safety, experience and efficiency. This talk systematically presents the challenges and opportunities in the core area of modern transportation systems, and highlights some of our recent works on order dispatching and fleet management via deep reinforcement learning.
Recent years have witnessed the rise of many successful e-commerce marketplace platforms like AirBnB, Uber/Lyft, and Upwork, where a central platform mediates economic transactions between buyers and sellers. Some common features that distinguish such marketplaces from more traditional marketplaces are search and discovery of the service providers which could result in asymmetric matching of services; sharing of a service by multiple users such as ride-sharing; and handling different preferences such as patience, desired level of service expressed by participating agents. In this talk, I will summarize our work on different welfare maximizing strategies arising out of the aforementioned scenarios.
Information diffusion and social influence are more and more present in today's Web ecosystem. Having algorithms that optimize the presence and message diffusion on social media is indeed crucial to all actors (media companies, political parties, corporations, etc.) who advertise on the Web. Motivated by the need for effective viral marketing strategies, influence estimation and influence maximization have therefore become important research problems, leading to a plethora of methods. However, the majority of these methods are non-adaptive, and therefore not appropriate for scenarios in which influence campaigns may be run and observed over multiple rounds, nor for scenarios which cannot assume full knowledge over the diffusion networks and the ways information spreads in them.
In this tutorial we intend to present the recent research on adaptive influence maximization, which aims to address these limitations. This can be seen as a particular case of the influence maximization problem (seeds in a social graph are selected to maximize information spread), one in which the decisions are taken as the influence campaign unfolds, over multiple rounds, and where knowledge about the graph topology and the influence process may be partial or even entirely missing. This setting, depending on the underlying assumptions, leads to varied and original approaches and algorithmic techniques, as we have witnessed in recent literature. We will review the most relevant research in this area, by organizing it along several key dimensions, and by discussing the methods' advantages and shortcomings, along with open research questions and the practical aspects of their implementation. Tutorial slides will become publicly available on https://sites.google.com/view/aim-tutorial/home.
Classification is an important problem for data mining and knowledge discovery and comes with a wide range of applications. Different applications usually evaluate the classification performance with different criteria. The variety of criteria calls for cost-sensitive classification algorithms, which take the specific criterion as input to the learning algorithm and adapt to different criteria more easily. While the cost-sensitive binary classification problem has been relatively well-studied, the cost-sensitive multiclass and multilabel classification problems are harder to solve because of the sophisticated nature of their evaluation criteria. The tutorial aims to review current techniques for solving cost-sensitive multiclass and multilabel classification problems, with the hope of helping more real-world applications enjoy the benefits of cost-sensitive classification.
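To make the idea of taking the evaluation criterion as an input concrete, here is a minimal sketch of one standard cost-sensitive decision rule for the multiclass case: given estimated class probabilities and a cost matrix, predict the class with the lowest expected cost. It is an illustration only, not one of the specific techniques the tutorial reviews; the function name, probabilities, and cost matrices are assumptions.

```python
from typing import Sequence


def min_expected_cost_class(probs: Sequence[float],
                            cost: Sequence[Sequence[float]]) -> int:
    """Pick the prediction k that minimizes sum_y probs[y] * cost[y][k].

    cost[y][k] is the cost of predicting class k when the true class is y.
    """
    n_classes = len(probs)
    expected = [
        sum(probs[y] * cost[y][k] for y in range(n_classes))
        for k in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


# Illustrative numbers only.
probs = [0.6, 0.3, 0.1]

# Under 0/1 cost the rule reduces to picking the most probable class.
zero_one = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]

# Here, missing true class 2 is ten times as costly as any other error.
asymmetric = [[0, 1, 1],
              [1, 0, 1],
              [10, 10, 0]]

print(min_expected_cost_class(probs, zero_one))
print(min_expected_cost_class(probs, asymmetric))
```

The same probability estimates lead to different predictions under the two cost matrices, which is why the criterion has to be part of the learning or decision procedure rather than only the evaluation.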
A/B Testing is the gold standard to estimate the causal relationship between a change in a product and its impact on key outcome measures. It is widely used in the industry to test changes ranging from simple copy change or UI change to more complex changes like using machine learning models to personalize user experience. The key aspect of A/B testing is evaluation of experiment results. Designing the right set of metrics - correct outcome measures, data quality indicators, guardrails that prevent harm to business, and a comprehensive set of supporting metrics to understand the "why" behind the key movements is the #1 challenge practitioners face when trying to scale their experimentation program [18, 22]. On the technical side, improving sensitivity of experiment metrics is a hard problem and an active research area, with large practical implications as more and more small and medium size businesses are trying to adopt A/B testing and suffer from insufficient power. In this tutorial we will discuss challenges, best practices, and pitfalls in evaluating experiment results, focusing on both lessons learned and practical guidelines as well as open research questions.
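The remark about insufficient power can be made concrete with the textbook sample-size approximation for a two-sided, two-sample z-test on a difference in means, n = 2 (z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2 per group. The sketch below assumes equal group sizes and a known, common standard deviation; the function name and example values are illustrative and not taken from the tutorial.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma: float,
                         min_detectable_effect: float,
                         alpha: float = 0.05,
                         power: float = 0.8) -> int:
    """Approximate users needed per group for a two-sided two-sample z-test."""
    if sigma <= 0 or min_detectable_effect <= 0:
        raise ValueError("sigma and min_detectable_effect must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    n = 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / min_detectable_effect ** 2
    return math.ceil(n)


# Illustrative values: metric standard deviation 1.0, and we want to
# detect a shift of 0.1 or 0.05 in the mean.
print(required_sample_size(1.0, 0.1))
print(required_sample_size(1.0, 0.05))
```

Because the required sample size grows with the inverse square of the detectable effect, halving the effect roughly quadruples the traffic needed, which is one reason lower-variance (more sensitive) metrics matter for smaller businesses.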
Real-world data exists largely in the form of unstructured texts. A grand challenge in data mining research is to develop effective and scalable methods that may transform unstructured text into structured knowledge. Based on our vision, it is highly beneficial to transform such text into structured heterogeneous information networks, on which actionable knowledge can be generated based on the user's need. In this tutorial, we provide a comprehensive overview on recent research and development in this direction. First, we introduce a series of effective methods that construct heterogeneous information networks from massive, domain-specific text corpora. Then we discuss methods that mine such text-rich networks based on the user's need. Specifically, we focus on scalable, effective, weakly supervised, language-agnostic methods that work on various kinds of text. We further demonstrate, on real datasets (including news articles, scientific publications, and product reviews), how information networks can be constructed and how they can assist further exploratory analysis.
As data volume and variety have increased, the ties between machine learning and data integration have become stronger. For machine learning to be effective, one must utilize data from the greatest possible variety of sources; and this is why data integration plays a key role. At the same time machine learning is driving automation in data integration, resulting in overall reduction of integration costs and improved accuracy. This tutorial focuses on three aspects of the synergistic relationship between data integration and machine learning: (1) we survey how state-of-the-art data integration solutions rely on machine learning-based approaches for accurate results and effective human-in-the-loop pipelines, (2) we review how end-to-end machine learning applications rely on data integration to identify accurate, clean, and relevant data for their analytics exercises, and (3) we discuss open research challenges and opportunities that span across data integration and machine learning.
In silico modeling of medicine refers to the direct use of computational methods in support of drug discovery and development. Machine learning and data mining methods have become an integral part of in silico modeling and demonstrated promising performance at various phases of the drug discovery and development process. In this tutorial we will introduce data analytic methods in drug discovery and development. For the first half, we will provide an overview about related data and analytic tasks, and then present the enabling data analytic methods for these tasks. For the second half, we will describe concrete applications of each of those tasks. The tutorial will be concluded with open problems and a Q&A session.
This tutorial addresses the advances in deep Bayesian mining and learning for natural language with ubiquitous applications ranging from speech recognition to document summarization, text classification, text segmentation, information extraction, image caption generation, sentence generation, dialogue control, sentiment classification, recommendation system, question answering and machine translation, to name a few. Traditionally, "deep learning" is taken to be a learning process where the inference or optimization is based on the real-valued deterministic model. The "semantic structure" in words, sentences, entities, actions and documents drawn from a large vocabulary may not be well expressed or correctly optimized in mathematical logic or computer programs. The "distribution function" in discrete or continuous latent variable model for natural language may not be properly decomposed or estimated. This tutorial addresses the fundamentals of statistical models and neural networks, and focuses on a series of advanced Bayesian models and deep models including hierarchical Dirichlet process, Chinese restaurant process, hierarchical Pitman-Yor process, Indian buffet process, recurrent neural network, long short-term memory, sequence-to-sequence model, variational auto-encoder, generative adversarial network, attention mechanism, memory-augmented neural network, skip neural network, stochastic neural network, predictive state neural network, policy neural network. We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in natural language. The variational inference and sampling method are formulated to tackle the optimization for complicated models. The word and sentence embeddings, clustering and co-clustering are merged with linguistic and semantic constraints. A series of case studies are presented to tackle different issues in deep Bayesian mining, learning and understanding. Finally, we will point out a number of directions and outlooks for future studies.
Search and recommender systems share many fundamental components including language understanding, retrieval and ranking, and language generation. Building powerful search and recommender systems requires processing natural language effectively and efficiently. Recent rapid growth of deep learning technologies has presented both opportunities and challenges in this area. This tutorial offers an overview of deep learning based natural language processing (NLP) for search and recommender systems from an industry perspective. It first introduces deep learning based NLP technologies, including language understanding and language generation. Then it details how those technologies can be applied to common tasks in search and recommender systems, including query and document understanding, retrieval and ranking, and language generation. Applications in LinkedIn production systems are presented. The tutorial concludes with a discussion of future trends.
This tutorial aims to provide the audience with a guided introduction to deep reinforcement learning (DRL) with specially curated application case studies in transportation. The tutorial covers both theory and practice, with more emphasis on the practical aspects of DRL that are pertinent to tackle transportation challenges. Some core examples include online ride order dispatching, fleet management, traffic signals control, route planning, and autonomous driving.
Artificial Intelligence is increasingly playing an integral role in determining our day-to-day experiences. Moreover, with proliferation of AI based solutions in areas such as hiring, lending, criminal justice, healthcare, and education, the resulting personal and professional implications of AI are far-reaching. The dominant role played by AI models in these domains has led to a growing concern regarding potential bias in these models, and a demand for model transparency and interpretability. In addition, model explainability is a prerequisite for building trust and adoption of AI systems in high stakes domains requiring reliability and safety such as healthcare and automated transportation, and critical industrial applications with significant economic implications such as predictive maintenance, exploration of natural resources, and climate change modeling.
As a consequence, AI researchers and practitioners have focused their attention on explainable AI to help them better trust and understand models at scale. The challenges for the research community include (i) defining model explainability, (ii) formulating explainability tasks for understanding model behavior and developing solutions for these tasks, and finally (iii) designing measures for evaluating the performance of models in explainability tasks.
In this tutorial, we will present an overview of model interpretability and explainability in AI, key regulations/laws, and techniques/tools for providing explainability as part of AI/ML systems. Then, we will focus on the application of explainability techniques in industry, wherein we present practical challenges/ guidelines for using explainability techniques effectively and lessons learned from deploying explainable models for several web-scale machine learning and data mining applications. We will present case studies across different companies, spanning application domains such as search and recommendation systems, sales, lending, and fraud detection. Finally, based on our experiences in industry, we will identify open problems and research directions for the data mining/machine learning community.
Researchers and practitioners from different disciplines have highlighted the ethical and legal challenges posed by the use of machine learned models and data-driven systems, and the potential for such systems to discriminate against certain population groups, due to biases in algorithmic decision-making systems. This tutorial aims to present an overview of algorithmic bias / discrimination issues observed over the last few years and the lessons learned, key regulations and laws, and evolution of techniques for achieving fairness in machine learning systems. We will motivate the need for adopting a "fairness-first" approach (as opposed to viewing algorithmic bias / fairness considerations as an afterthought), when developing machine learning based models and systems for different consumer and enterprise applications. Then, we will focus on the application of fairness-aware machine learning techniques in practice, by highlighting industry best practices and case studies from different technology companies. Based on our experiences in industry, we will identify open problems and research challenges for the data mining / machine learning community.
Fake news has become a global phenomenon due to its explosive growth, particularly on social media. The goal of this tutorial is to (1) clearly introduce the concept and characteristics of fake news and how it can be formally differentiated from other similar concepts such as mis-/dis-information, satire news, and rumors, among others, which helps deepen the understanding of fake news; (2) provide a comprehensive review of fundamental theories across disciplines and illustrate how they can be used to conduct interdisciplinary fake news research, facilitating a concerted effort of experts in computer and information science, political science, journalism, social science, psychology and economics, which can result in highly efficient and explainable fake news detection; (3) systematically present fake news detection strategies from four perspectives (i.e., knowledge, style, propagation, and credibility) and the ways that each perspective utilizes techniques developed in data/graph mining, machine learning, natural language processing, and information retrieval; and (4) detail open issues within current fake news studies to reveal great potential research opportunities, hoping to attract researchers within a broader area to work on fake news detection and further facilitate its development. The tutorial aims to promote a fair, healthy and safe online information and news dissemination ecosystem, hoping to attract more researchers, engineers and students with various interests to fake news research. Few prerequisites are required for KDD participants to attend.
Time series forecasting is a key ingredient in the automation and optimization of business processes: in retail, deciding which products to order and where to store them depends on the forecasts of future demand in different regions; in cloud computing, the estimated future usage of services and infrastructure components guides capacity planning; and workforce scheduling in warehouses and factories requires forecasts of the future workload. Recent years have witnessed a paradigm shift in forecasting techniques and applications, from computer-assisted model- and assumption-based to data-driven and fully-automated. This shift can be attributed to the availability of large, rich, and diverse time series data sources and results in a set of challenges that need to be addressed, such as the following. How can we build statistical models to efficiently and effectively learn to forecast from large and diverse data sources? How can we leverage the statistical power of "similar" time series to improve forecasts in the case of limited observations? What are the implications for building forecasting systems that can handle large data volumes?
The objective of this tutorial is to provide a concise and intuitive overview of the most important methods and tools available for solving large-scale forecasting problems. We review the state of the art in three related fields: (1) classical modeling of time series, (2) modern methods including tensor analysis and deep learning for forecasting. Furthermore, we discuss the practical aspects of building a large scale forecasting system, including data integration, feature generation, backtest framework, error tracking and analysis, etc. Our focus is on providing an intuitive overview of the methods and practical issues, which we will illustrate via case studies and interactive materials with Jupyter notebooks.
Large-scale sequential hypothesis testing (A/B-testing) is rampant in the tech industry, with internet companies running hundreds of thousands of tests per year. This experimentation is actually "doubly-sequential", since it consists of a sequence of sequential experiments. In this tutorial, the audience will learn about the various problems encountered in large-scale, asynchronous, doubly-sequential experimentation, both for the inner sequential process (a single sequential test) and for the outer sequential process (the sequence of tests), and learn about recently developed principles to tackle these problems. We will discuss error metrics both within and across experiments, and present state-of-the-art methods that provably control these errors, both with and without resorting to parametric or asymptotic assumptions. In particular, we will demonstrate how current common practices of peeking and marginal testing fail to control errors both within and across experiments, but how these can be alleviated using simple yet nuanced changes to the experimentation setup. We will also briefly discuss the role of multi-armed bandit methods for testing hypotheses (as opposed to minimizing regret), and the potential pitfalls due to selection bias introduced by adaptive sampling. This tutorial is timely because while almost every single internet company runs such tests, most practitioners in the tech industry focus mainly on how to run a single test correctly. However, ignoring the interplay with the outer sequential process could unknowingly inflate the number of false discoveries, as we will carefully explain in the second half of the tutorial.
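The claim that peeking fails to control errors within a single experiment can be illustrated with a small simulation. The following is a minimal sketch, not the tutorial's own material: it assumes an A/A test (no true difference) with unit-variance Gaussian observations, a two-sided z-test at level 0.05, and hypothetical names (`simulate`, `peek_every`). It compares testing once at the final sample size against rejecting as soon as any interim look crosses the same fixed threshold.

```python
import random
from statistics import NormalDist


def z_statistic(sum_a, sum_b, n):
    # Two-sample z-statistic for two equal-size arms with known unit variance.
    return (sum_a / n - sum_b / n) / (2.0 / n) ** 0.5


def simulate(num_tests=1000, n_max=500, peek_every=25, alpha=0.05, seed=0):
    """Return (false-positive rate with one final test, rate with peeking)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    fixed_rejections = 0
    peeking_rejections = 0
    for _ in range(num_tests):
        sum_a = sum_b = 0.0
        rejected_at_some_peek = False
        for n in range(1, n_max + 1):
            # Both arms come from the same distribution: the null is true.
            sum_a += rng.gauss(0.0, 1.0)
            sum_b += rng.gauss(0.0, 1.0)
            if n % peek_every == 0:
                if abs(z_statistic(sum_a, sum_b, n)) > z_crit:
                    rejected_at_some_peek = True
        if abs(z_statistic(sum_a, sum_b, n_max)) > z_crit:
            fixed_rejections += 1
        if rejected_at_some_peek:
            peeking_rejections += 1
    return fixed_rejections / num_tests, peeking_rejections / num_tests


if __name__ == "__main__":
    fixed_rate, peeking_rate = simulate()
    print("single final test:", fixed_rate)
    print("reject at any peek:", peeking_rate)
```

Because the final look is itself one of the peeks in this setup, the peeking rate can never be lower than the single-test rate; the simulation shows how much higher it becomes when the unadjusted threshold is reused at every look.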
Gold Panning from the Mess: Rare Category Exploration, Exposition, Representation, and Interpretation
In contrast to the massive volume of data, it is often the rare categories that are of great importance in many high impact domains, ranging from financial fraud detection in online transaction networks to emerging trend detection in social networks, from spam image detection in social media to rare disease diagnosis in medical decision support systems. The unique challenges of rare category analysis include: (1) the highly-skewed class-membership distribution; (2) the non-separability of the rare categories from the majority classes; (3) the data and task heterogeneity, e.g., the multi-modal representation of examples, and the analysis of similar rare categories across multiple related tasks. This tutorial aims to provide a concise review of state-of-the-art techniques on complex rare category analysis, where the majority classes have a smooth distribution, while the minority classes exhibit a compactness property in the feature space or subspace. In particular, we start with the context, problem definition and unique challenges of complex rare category analysis; then we present a comprehensive overview of recent advances that are designed for this problem setting, from rare category exploration without any label information to the exposition step that characterizes rare examples with a compact representation, from representing rare patterns in a salient embedding space to interpreting the prediction results and providing relevant clues for the end users' interpretation; finally, we will discuss the potential challenges and shed light on the future directions of complex rare category analysis.
The availability of massive datasets has highlighted the need for computationally efficient and statistically sound methods to extract patterns while providing rigorous guarantees on the quality of the results, in particular with respect to false discoveries. In this tutorial we survey recent methods that properly combine computational and statistical considerations to efficiently mine statistically reliable patterns from large datasets. We start by introducing the fundamental concepts in statistical hypothesis testing, including conditional and unconditional tests, which may not be familiar to everyone in the data mining community. We then explain how the computational and statistical challenges in pattern mining have been tackled in different ways. Finally, we describe the application of these methods in areas such as market basket analysis, subgraph mining, social networks analysis, and cancer genomics.
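As one concrete instance of a conditional test, the sketch below computes a one-sided Fisher exact p-value for the association between a pattern and a binary class label, followed by a Bonferroni correction over the number of patterns tested. This is an illustrative assumption about what such a test could look like, not necessarily the method the tutorial covers; the function names and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: pattern present / absent. Columns: class 1 / class 0.
    Conditioning on all margins, returns P(X >= a) under the
    hypergeometric null of no association.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1))
    return tail / total


def bonferroni_significant(p_values, alpha=0.05):
    # Controls the family-wise error rate by testing each p-value at alpha / m.
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


if __name__ == "__main__":
    # Hypothetical counts: the pattern occurs in 10 transactions, 8 of them in class 1.
    p = fisher_exact_greater(8, 2, 1, 9)
    print("p-value:", p)
    print("significant among 10 tested patterns:", bonferroni_significant([p] + [1.0] * 9)[0])
```

Bonferroni is the simplest correction and becomes very conservative when the number of candidate patterns is large, which is part of the computational and statistical tension the abstract describes.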
Most network analysis is conducted on existing incomplete samples of much larger complete, fully observed graphs. For example, many researchers obtain graphs from online data repositories without knowing how these graphs were collected. Thus, these graphs can be poor representations of the fully observed networks. More complete data would lead to more accurate analyses, but data acquisition can be at best costly and at worst error-prone. For example, think of an adversary that deliberately poisons the answer to a query. Given a query budget for identifying additional nodes and edges, how can one improve the observed graph sample so that it is a more accurate representation of the complete, fully observed network? How does the approach change if one is interested in learning the best function (e.g. node classifier) on the network for a down-stream task? This is a novel problem that is related to, but distinct from, topics such as graph sampling and crawling. Given the prevailing use of graph samples in the research literature, this problem is of considerable importance, even though it has been ignored. In this tutorial, we discuss latent biases in incomplete networks and present methods for enriching such networks through active probing of nodes and edges. We focus on active learning and sequential decision-making formulations of this problem (a.k.a. the network discovery problem). We present distinctions between learning to grow the network (a.k.a. active exploration) vs. learning the "best" function on the network (a.k.a. active learning). In addition, we will discuss issues surrounding adversarial machine learning when querying for more data to reduce incompleteness.
This tutorial covers the state-of-the-art research, development, and applications in the KDD area of interpretable knowledge discovery reinforced by visual methods to stimulate and facilitate future work. It serves the KDD mission and objectives of gaining insight from the data. The topic is interdisciplinary bridging of scientific research and applied communities in KDD, Visual Analytics, Information Visualization, and HCI. This is a novel and fast growing area with significant applications, and potential. First, in KDD, these studies have grown under the name of visual data mining. The recent growth under the names of deep visualization, and visual knowledge discovery, is motivated considerably by deep learning success in accuracy of prediction and its failure in explanation of the produced models without special interpretation efforts. In the areas of Visual Analytics, Information Visualization, and HCI, the increasing trend toward machine learning tasks, including deep learning, is also apparent. This tutorial reviews progress in these areas with a comparative analysis of what each area brings to the joint table. The comparison includes the approaches: (1) to visualize Machine Learning (ML) models produced by the analytical ML methods, (2) to discover ML models by visual means, (3) to explain deep and other ML models by visual means, (4) to discover visual ML models assisted by analytical ML algorithms, (5) to discover analytical ML models assisted by visual means. The presenter will use multiple relevant publications including his books: "Visual and Spatial Analysis: Advances in Visual Data Mining, Reasoning, and Problem Solving" (Springer, 2005), and "Visual Knowledge Discovery and Machine Learning" (Springer, 2018). The target audience of this tutorial consists of KDD researchers, graduate students, and practitioners with the basic knowledge of machine learning.
Arguably, every entity in this universe is networked in one way or another. With the prevalence of network data collected, such as social media and biological networks, learning from networks has become an essential task in many applications. It is well recognized that network data is intricate and large-scale, and analytic tasks on network data become more and more sophisticated. In this tutorial, we systematically review the area of learning from networks, including algorithms, theoretical analysis, and illustrative applications. Starting with a quick recollection of the exciting history of the area, we formulate the core technical problems. Then, we introduce the fundamental approaches, that is, the feature selection based approaches and the network embedding based approaches. Next, we extend our discussion to attributed networks, which are popular in practice. Last, we cover the latest hot topic, graph neural network based approaches. For each group of approaches, we also survey the associated theoretical analysis and real-world application examples. Our tutorial also inspires a series of open problems and challenges that may lead to future breakthroughs. The authors are productive and seasoned researchers active in this area who represent a nice combination of academia and industry.
What are the basic forms of healthcare data? How are Electronic Health Records and Cohorts structured? How can we identify the key variables in such data and how important are temporal abstractions? What are the main challenges in knowledge extraction from medical data sources? What are the key machine learning algorithms used for this purpose? What are the main questions that clinicians and medical experts pose to machine learning researchers?
In this tutorial, we provide answers to these questions by presenting state-of-the-art methods, workflows, and tools for mining and understanding medical data. Particular emphasis is given on temporal abstractions, knowledge extraction from cohorts, machine learning model interpretability, and mHealth.
Networks (or graphs) are used to represent and analyze large datasets of objects and their relations. Naturally, real-world networks have a temporal component: for instance, interactions between objects have a timestamp and a duration. In this tutorial we present models and algorithms for mining temporal networks, i.e., network data with temporal information. We overview different models used to represent temporal networks. We highlight the main differences between static and temporal networks, and discuss the challenges arising from introducing the temporal dimension in the network representation. We present recent papers addressing the most well-studied problems in the setting of temporal networks, including computation of centrality measures, motif detection and counting, community detection and monitoring, event and anomaly detection, analysis of epidemic processes and influence spreading, network summarization, and structure prediction.
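A key difference between static and temporal networks is that paths must respect time: an edge can only be used if one has already arrived at its tail when it departs. The sketch below illustrates this with earliest-arrival times computed in a single pass over edges sorted by start time. It assumes each interaction is a tuple `(u, v, t, duration)` with strictly positive duration; this representation and the function name are illustrative choices, not taken from the tutorial.

```python
from math import inf


def earliest_arrival(edges, source, t_start=0.0):
    """Earliest arrival time at every vertex reachable by a time-respecting path.

    edges: iterable of (u, v, t, duration), meaning an interaction leaves u
    at time t and reaches v at time t + duration (duration > 0 is assumed).
    """
    arrival = {source: t_start}
    for u, v, t, duration in sorted(edges, key=lambda e: e[2]):
        # The edge is usable only if we are already at u when it departs.
        if arrival.get(u, inf) <= t and t + duration < arrival.get(v, inf):
            arrival[v] = t + duration
    return arrival


if __name__ == "__main__":
    edges = [
        ("a", "b", 5, 1),  # a -> b departs at time 5
        ("b", "c", 2, 1),  # b -> c departs at time 2, too early to follow a -> b
        ("a", "d", 1, 1),
        ("d", "c", 3, 2),
    ]
    # In the static graph a -> b -> c is a path, but it is not time-respecting;
    # c is reached only via d.
    print(earliest_arrival(edges, "a"))
```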
Real-world entities' behaviors, associated with their side information, are often recorded over time as asynchronous event sequences. Such event sequences are the basis of many practical applications, such as neural spike train study, earthquake prediction, crime analysis, infectious disease diffusion forecasting, condition-based preventative maintenance, information retrieval, and behavior-based network analysis and services. The temporal point process (TPP) is a principled mathematical tool for the modeling and learning of asynchronous event sequences, which captures the instantaneous happening rate of the events and the temporal dependency between historical and current events. TPP provides us with an interpretable model to describe the generative mechanism of event sequences, which is beneficial for event prediction and causality analysis. Recently, it has been shown that TPP has potential for many machine learning and data science applications and can be combined with other cutting-edge machine learning techniques like deep learning, reinforcement learning, adversarial learning, and so on.
We will start with an elementary introduction to the TPP model, including the basic concepts of the model and the simulation method of event sequences; in the second part of the tutorial, we will introduce typical TPP models and their traditional learning methods; in the third part of the tutorial, we will discuss the recent progress on the modeling and learning of TPP, including neural network-based TPP models, generative adversarial networks (GANs) for TPP, and deep reinforcement learning of TPP. We will further talk about the practical application of TPP, including useful data augmentation methods for learning from imperfect observations, typical applications and examples like healthcare and industry maintenance, and existing open source toolboxes.
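To make the idea of an intensity that depends on past events concrete, here is a minimal sketch of simulating one well-known TPP, a Hawkes process with an exponential kernel, using Ogata's thinning method. The choice of model, the parameter names (`mu`, `alpha`, `beta`), and the values are assumptions made for illustration; the abstract does not specify which models or simulation method the tutorial uses.

```python
import math
import random


def hawkes_intensity(t, events, mu, alpha, beta):
    # lambda(t) = mu + sum over past events s < t of alpha * exp(-beta * (t - s))
    return mu + sum(alpha * math.exp(-beta * (t - s)) for s in events if s < t)


def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate event times on [0, horizon) by thinning.

    Assumes alpha < beta so that the process does not explode.
    """
    rng = random.Random(seed)
    events = []
    t = 0.0
    while True:
        # The kernel only decays between events, so the intensity just after
        # time t is an upper bound on the intensity until the next event.
        lam_bar = mu + sum(alpha * math.exp(-beta * (t - s)) for s in events)
        t += rng.expovariate(lam_bar)
        if t >= horizon:
            break
        # Accept the candidate with probability lambda(t) / lam_bar.
        if rng.random() * lam_bar <= hawkes_intensity(t, events, mu, alpha, beta):
            events.append(t)
    return events


if __name__ == "__main__":
    times = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, horizon=50.0)
    print(len(times), "events; first few:", [round(x, 2) for x in times[:5]])
```

Each accepted event raises the intensity by `alpha`, which then decays at rate `beta`; this self-excitation is what produces the clustered event times that a homogeneous Poisson process cannot capture.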
When considering a data set it is often unknown how complex it is, and hence it is difficult to assess how rich a model for the data should be. Often these choices are swept under the carpet, ignored, left to the domain expert, but in practice this is highly unsatisfactory; domain experts do not know how to set k, what prior to choose, or how many degrees of freedom is optimal any more than we do. The Minimum Description Length (MDL) principle can answer the model selection problem from an intuitively appealing and clear viewpoint of information theory and data compression. In a nutshell, it asserts that the best model is the one that best compresses both the data and that model. It does not only imply the best strategy for model selection, but also gives a unifying viewpoint of designing optimal data mining algorithms for a wide range of issues, and has been very successfully applied to a wide range of data mining tasks, ranging from pattern mining, clustering, classification, text mining, graph mining, anomaly detection, up to causal inference. In this tutorial we do not only give an introduction to the basics of model selection, show important properties of MDL-based modelling, successful examples as well as pitfalls for how to apply MDL to solve data mining problems, but also introduce advanced topics on important new concepts in modern MDL (e.g, normalized maximum likelihood (NML), sequential NML, decomposed NML, and MDL change statistics) and emerging applications in dynamic settings.
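The statement that the best model is the one that best compresses both the data and the model is commonly written as the two-part MDL criterion. The symbols below are notation introduced here for illustration and do not appear in the abstract: $\mathcal{M}$ is the set of candidate models, $L(M)$ is the number of bits needed to describe model $M$, and $L(D \mid M)$ is the number of bits needed to describe the data $D$ once $M$ is known.

```latex
\hat{M} \;=\; \operatorname*{arg\,min}_{M \in \mathcal{M}} \bigl[\, L(M) + L(D \mid M) \,\bigr]
```

A richer model typically lowers $L(D \mid M)$ but raises $L(M)$, so minimizing the sum trades goodness of fit against model complexity without a separately tuned penalty parameter.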
The increasing need for labeled data has brought the booming growth of crowdsourcing in a wide range of high-impact real-world applications, such as collaborative knowledge (e.g., data annotations, language translations), collective creativity (e.g., analogy mining, crowdfunding), and reverse Turing tests (e.g., CAPTCHA-like systems). In the context of supervised learning, crowdsourcing refers to the annotation procedure where the data items are outsourced and processed by a group of mostly unskilled online workers. Thus, researchers or organizations are able to collect large amounts of information via the feedback of the crowd in a short time and at a low cost.
Despite the wide adoption of crowdsourcing, several of its fundamental problems remain unsolved especially at the information and cognitive levels with respect to incentive design, information aggregation, and heterogeneous learning. This tutorial aims to: (1) provide a comprehensive review of recent advances in exploring the power of crowdsourcing from the perspective of optimizing the wisdom of the crowd; and (2) identify the open challenges and provide insights to the future trends in the context of human-in-the-loop learning. We believe this is an emerging and potentially high-impact topic in computational data science, which will attract both researchers and practitioners from academia and industry.
Recent Progress in Zeroth Order Optimization and Its Applications to Adversarial Robustness in Data Mining and Machine Learning
Zeroth-order (ZO) optimization is increasingly embraced for solving big data and machine learning problems when explicit expressions of the gradients are difficult or infeasible to obtain. It achieves gradient-free optimization by approximating the full gradient via efficient gradient estimators. Some recent important applications include: a) generation of prediction-evasive, black-box adversarial attacks on deep neural networks, b) online network management with limited computation capacity, c) parameter inference of black-box/complex systems, and d) bandit optimization in which a player receives partial feedback in terms of loss function values revealed by her adversary.
This tutorial aims to provide a comprehensive introduction to recent advances in ZO optimization methods in both theory and applications. On the theory side, we will cover convergence rate and iteration complexity analysis of ZO algorithms and make comparisons to their first-order counterparts. On the application side, we will highlight one appealing application of ZO optimization to studying the robustness of deep neural networks - practical and efficient adversarial attacks that generate adversarial examples from a black-box machine learning model. We will also summarize potential research directions regarding ZO optimization, big data challenges and some open-ended data mining and machine learning problems.
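The idea of approximating a gradient from function values alone can be sketched as follows. This is a minimal illustration, assuming a two-point random-direction estimator with Gaussian directions and a plain gradient-descent update on a toy quadratic; the function names, step size, and smoothing parameter `mu` are hypothetical choices, and the estimators covered in the tutorial may differ.

```python
import random


def zo_gradient(f, x, mu, rng):
    """Two-point zeroth-order gradient estimate along a random Gaussian direction.

    Uses only two evaluations of f and no analytic gradient.
    """
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    # Finite-difference estimate of the directional derivative along u.
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [scale * ui for ui in u]


def zo_gradient_descent(f, x0, steps=500, lr=0.02, mu=1e-3, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, mu, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def sphere(x):
    # Toy black-box objective; only its values are queried.
    return sum(xi * xi for xi in x)


if __name__ == "__main__":
    x0 = [1.0, -2.0, 3.0, 0.5, -1.5]
    x = zo_gradient_descent(sphere, x0)
    print("objective before:", sphere(x0), "after:", sphere(x))
```

In a black-box attack setting, `f` would instead be a loss computed from a model's output scores, so each step costs two model queries; the variance of the estimate grows with the input dimension, which is one reason query efficiency matters for such attacks.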
The abundance of user generated content on social networks provides the opportunity to build models that are able to accurately and effectively extract, mine and predict users' interests with the hopes of enabling more effective user engagement, better quality delivery of appropriate services and higher user satisfaction. While traditional methods for building user profiles relied on AI-based preference elicitation techniques that could have been considered to be intrusive and undesirable by the users, more recent advances are focused on a non-intrusive yet accurate way of determining users' interests and preferences. In this tutorial, we cover five important aspects related to the effective mining of user interests: (1) the information sources that are used for extracting user interests, (2) various types of user interest profiles that have been proposed in the literature, (3) techniques that have been adopted or proposed for mining user interests, (4) the scalability and resource requirements of the state of the art methods, and finally (5) the evaluation methodologies that are adopted in the literature for validating the appropriateness of the mined user interest profiles. We also introduce existing challenges, open research questions and exciting opportunities for further work.
Spatio-temporal societal event forecasting, which has traditionally been prohibitively challenging, is now becoming possible and experiencing rapid growth thanks to the big data from Open Source Indicators (OSI) such as social media, news sources, blogs, economic indicators, and other meta-data sources. Spatio-temporal societal event forecasting and their precursor discovery benefit the society by providing insight into events such as political crises, humanitarian crises, mass violence, riots, mass migrations, disease outbreaks, economic instability, resource shortages, natural disasters, and others. In contrast to traditional event detection that identifies ongoing events, event forecasting focuses on predicting future events yet to happen. Also different from traditional spatio-temporal predictions on numerical indices, spatio-temporal event forecasting needs to leverage the heterogeneous information from OSI to discover the predictive indicators and mappings to future societal events. While studying large scale societal events, policy makers and practitioners aim to identify precursors to such events to help understand causative attributes and ensure accountability. The resulting problems typically require the predictive modeling techniques that can jointly handle semantic, temporal, and spatial information, and require a design of efficient and interpretable algorithms that scale to high-dimensional large real-world datasets.
In this tutorial, we will present a comprehensive review of the state-of-the-art methods for spatio-temporal societal event forecasting. First, we will categorize the input OSI and the predicted societal events commonly researched in the literature. Then we will review methods for temporal and spatio-temporal societal event forecasting. Next, we will discuss the foundations of precursor identification with an introduction to various machine learning approaches that aim to discover precursors while forecasting events. Through the tutorial, we expect to illustrate the basic theoretical and algorithmic ideas and discuss specific applications in all the above settings.
Statistical Mechanics Methods for Discovering Knowledge from Modern Production Quality Neural Networks
There have long been connections between statistical mechanics and neural networks, but in recent decades these connections have withered. However, in light of recent failings of statistical learning theory and stochastic optimization theory to describe, even qualitatively, many properties of production-quality neural network models, researchers have revisited ideas from the statistical mechanics of neural networks. This tutorial will provide an overview of the area; it will go into detail on how connections with random matrix theory and heavy-tailed random matrix theory can lead to a practical phenomenological theory for large-scale deep neural networks; and it will describe future directions.
Finding nearest neighbors is an important topic that has attracted much attention over the years and has applications in many fields, such as market basket analysis, plagiarism and anomaly detection, community detection, ligand-based virtual screening, etc. As data are easier and easier to collect, finding neighbors has become a potential bottleneck in analysis pipelines. Performing pairwise comparisons given the massive datasets of today is no longer feasible. The high computational complexity of the task has led researchers to develop approximate methods, which find many but not all of the nearest neighbors. Yet, for some types of data, efficient exact solutions have been found by carefully partitioning or filtering the search space in a way that avoids most unnecessary comparisons.
In recent years, there have been several fundamental advances in our ability to efficiently identify appropriate neighbors, especially in non-traditional data, such as graphs or document collections. In this tutorial, we provide an in-depth overview of recent methods for finding (nearest) neighbors, focusing on the intuition behind choices made in the design of those algorithms and on the utility of the methods in real-world applications. Our tutorial aims to provide a unifying view of "neighbor computing" problems, spanning from numerical data to graph data, from categorical data to sequential data, and related application scenarios. For each type of data, we will review the current state-of-the-art approaches used to identify neighbors and discuss how neighbor search methods are used to solve important problems.
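The idea of exact search that filters the search space to avoid unnecessary comparisons can be illustrated with a simple size filter for Jaccard similarity on sets. The sketch below is an illustrative example chosen here, not a method attributed to the tutorial: since Jaccard(A, B) can never exceed min(|A|, |B|) / max(|A|, |B|), any pair whose size ratio falls below the threshold can be skipped without computing the similarity. Function names and the sample data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def brute_force_pairs(sets, threshold):
    # Baseline: compare every pair.
    return {
        (i, j)
        for i, j in combinations(range(len(sets)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    }


def size_filtered_pairs(sets, threshold):
    """Exact all-pairs search that skips pairs ruled out by set sizes alone."""
    order = sorted(range(len(sets)), key=lambda i: len(sets[i]))
    result = set()
    for pos, i in enumerate(order):
        size_i = len(sets[i])
        for j in order[pos + 1:]:
            size_j = len(sets[j])
            # Jaccard <= size_i / size_j here, and later sets are even larger,
            # so no remaining candidate for i can reach the threshold.
            if size_j > 0 and size_i / size_j < threshold:
                break
            if jaccard(sets[i], sets[j]) >= threshold:
                result.add((min(i, j), max(i, j)))
    return result


if __name__ == "__main__":
    data = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2, 3, 4, 5, 6, 7, 8}, {9}, {1, 2, 3, 5}]
    print(sorted(size_filtered_pairs(data, 0.7)))
```

The filtered version returns the same pairs as the brute-force baseline while computing fewer similarities; more aggressive filters build on the same principle of bounding the similarity cheaply before computing it.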