SIGKDD Dissertation Awards (1 winner and 2 runner-ups)
ACM SIGKDD dissertation awards recognize outstanding work done by graduate students in the areas of data science, machine learning and data mining.
- Relevance of the Dissertation to KDD
- Originality of the Main Ideas in the Dissertation
- Significance of Scientific Contributions
- Technical Depth and Soundness of Dissertation (including experimental methodologies, theoretical results, etc.)
- Overall Presentation and Readability of Dissertation (including organization, writing style and exposition, etc.)
Exploring and Making Sense of Large Graphs
Danai Koutra (student) and Christos Faloutsos (advisor) at Carnegie Mellon University
Abstract: Graphs represent information ranging from links between webpages, to connections between neurons in our brains, and often span billions of nodes. Within this deluge of data, how can we find its most important structures? How can we detect critical events, such as a computer system attack, or disease formation in the human brain?
This thesis contributes (i) scalable, principled algorithms that combine globality with locality to understand graphs, and (ii) applications in two areas:
- Single-Graph Exploration: We show how to interpretably summarize a graph with its important structures, and complement that with inference, which leverages little prior information and the network structure to efficiently learn information about all the entities.
- Mulltiple-Graph Exploration: We extend our summarization idea to temporal graphs for pattern discovery. We also claim that similarity is a subproblem in many applications with multiple graphs, and contribute methods for network alignment and similarity.
We have applied our methods to massive data, including a Web graph of 6.6 billion edges, a Twitter graph of 1.8 billion edges, and brain graphs with 90 million edges.
Runner-up: Mining Disparate Sources for Question Answering
Huan Sun (student) and Xifeng Yan (advisor) at University of California, Santa Barbara
Abstract: Today’s paradigm of information search is in the midst of a significant transformation. Question answering (QA) systems that can precisely answer user questions are becoming more and more desired, in contrast to traditional search engines only retrieving lengthy web pages. As demonstrated by exemplar intelligent systems including IBM Watson, Google Now, Apple Siri, Microsoft Cortana, and Amazon Echo, QA techniques enjoy tremendous potential in revolutionizing the way we interact with devices and data, e.g., by allowing voice commands to automate complicated tasks like planning a trip, and directly answering questions to understand domain- specific data such as medical forum posts, product reviews and online programming tutorials.
Towards realizing such great potential of question answering, we make two key observations on the status quo: First, knowledge is dispersed across disparate sources. As is known to all, the big data age contributes large-scale diversified information sources, such as structured knowledge bases (KBs), unstructured texts, and semi-structured tables. How to utilize knowledge dispersed in these disparate sources to answer questions becomes a challenge. Second, what current automated QA systems can achieve is still limited in many situations, e.g., when questions have complex language structures and when finding answers involves a lot of reasoning, which necessitates the exploitation of human intelligence for question answering and problem solving. Driven by these two observations, this dissertation systematically mines disparate sources including knowledge bases, texts, tables and human networks for question answering, by developing novel techniques on text mining, network analysis and human behavior understanding. Specifically, our research centers around:
- Text-based question answering. Interesting or important facts and statements often appear repeatedly in copious texts including news articles and Wikipedia-like pages. One main weakness of existing text-based QA systems is the lack of capability in making sense of strings. In contrast, our proposed text-based QA system mines possible answers from large- scale texts, and meanwhile employs knowledge bases to provide important information for detecting the true answer, which significantly improves QA performance by 18% ∼ 54% over various existing systems.
- Table-based question answering. Apart from texts, we observe that informative tabular data are also pervasive and valuable for QA. Therefore, we investigate an important yet largely under-addressed problem: Given millions of tables, how to precisely mine table cells to answer a user question? To attack this problem, we propose a novel table cell search framework, in which we develop chain representations for both table cells and questions, and further employ deep neural networks to semantically match them. Our table cell search framework is either comparable to or significantly outperforms state-of-the-art QA systems by at least 56.7%. Moreover, we empirically verify that tables supply rich knowledge that might not exist or is difficult to be identified in existing KBs.
- Expert-based question answering. The intelligence possessed by current machines is still limited in many aspects. Human intelligence, contributed by crowdsourcing platforms and collaborative networks, should be exploited to complement automated question answering and problem solving. We are among the first to quantitatively analyze task/question routing behaviors of experts in real collaborative networks. We formalize multiple routing patterns by taking into account both rational and random analysis of tasks, and present a generative model to combine them. Our method significantly outperforms all the alternatives with more than 75% accuracy gain under different quality measures. Moreover, we further show that our model can assist optimizing collaborative networks via hypothesis testing.
The developed methodologies and frameworks in this dissertation pave the path for an array of exciting research directions including QA in various domains like healthcare and business intelligence, via using disparate and complementary data sources, and combining human intelligence and machine intelligence for problem solving and decision making through QA interactions.
Runner-up: Scalable Multivariate Time Series Analysis
Mohammad Taha Bahadori (Student) and Yan Liu (Advisor) at University of Southern California
Abstract: Time series data have become ubiquitous in many applications such as climate science, social media, and health care. Analysis of large scale time series data collected from diverse applications has created new multi-faceted challenges and opportunities. In this thesis, we have studied the key challenges in large scale multivariate time series analysis and proposed novel and scalable solutions.
First, we tackle the challenge of modeling high-dimensional multi-modal correlations in the spatio-temporal data, as accurate modeling of correlations is the key to accurate predictive analysis. We cast the problem as a low-rank tensor learning problem with side information incorporated via a graph Laplacian regularization. For scalable estimation, we provide a fast greedy low-rank tensor learning algorithm.
To address the problem of modeling complex correlations in classification and clustering of time series, we propose the functional subspace clustering framework, which assumes that the time series lie on several subspaces with possible deformations. For estimation of the subspaces, we propose an efficient greedy variable selection algorithm.
Second, we observe that the performance of temporal dependency algorithms is severely degraded in presence of unobserved confounders. To address this challenge, we propose two solutions: (i) alleviating the impact of major latent confounders using sparse plus low-rank decomposition and (ii) eliminating the impact of all latent confounders using the prior information about the delays of the confounding paths.
Third, in many application domains, multivariate time series do not follow the commonly assumed multivariate Gaussian distribution. We propose two solutions to address this challenge: (i) a state space model based on generalized extreme value distribution to model the important case of extreme value time series and (ii) a semi-parametric approach using copulas for the general setting.
Finally, often in practice time series measurements are collected at irregular intervals, which violates the assumptions of many existing algorithms. To address this challenge, we propose a fast non-parametric extension for temporal dependency analysis algorithms that improves accuracy over the state of the art techniques.
All of the proposed algorithms are evaluated on multiple datasets from different applications including climate science, social networks, and health care.
Congratulations to all the outstanding students who were nominated and to the winners of this year.