KDD Papers

A Local Algorithm for Structure-Preserving Graph Cut

Dawei Zhou (Arizona State University);Si Zhang (Arizona State University);Mehmet Yigit Yildirim (Arizona State University);Scott Alcorn (Early Warnings LLC.);Hanghang Tong (Arizona State University);Hasan Davulcu (Arizona State University);Jingrui He (Arizona State University)


Nowadays, large-scale graph data is being generated in a variety of real-world applications, from social networks to co-authorship networks, from protein-protein interaction networks to road traffic networks. Many existing works on graph mining focus on the vertices and edges, with first-order Markov chain as the underlying model. They fail to explore the high-order network structures, which are of key importance in many high impact domains. For example, in bank customer personally identifiable information (PII) networks, the star structures often correspond to a set of synthetic identities; in financial transaction networks, the loop structures may indicate the existence of money laundering; in signed networks, the triangle structures play an essential role in the balance theory for edge sign prediction. In this paper, we focus on mining user specified high-order network structures, and aim to find a structure-rich sub-graph which does not break many such structures by separating the sub-graph from the rest. A key challenge associated with finding a structure-rich subgraph is the prohibitive computational cost. To address this problem, inspired by the family of local graph clustering algorithms for efficiently identifying a low conductance cut without exploring the entire graph, we propose to generalize the key idea to model high-order network structures. In particular, we start with a generic definition of high-order conductance and define the high-order diffusion core, which is based on a high-order random walk induced by any user-specified high-order network structures. Then we propose a novel High-Order Structure Preserving Local Clustering algorithm named HOSPLOC, which runs in a near linear time complexity with respect to the number of edges. It starts with a seed node and iteratively explores its neighborhood until it finds a sub-graph with a small high-order conductance. Furthermore, we analyze its performance in terms of both effectiveness and efficiency. The experimental results on both synthetic graphs and real graphs demonstrate the e