The KDD Test of Time Award recognizes outstanding papers from past KDD Conferences beyond the last decade that have had an important impact on the data mining research community.
The 2014 Test of Time award recognizes the following influential contributions to SIGKDD that have withstood the test of time:
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise [KDD 1996]
Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu
This paper introduced density-based clustering to the data mining community. It proposed the now well-known clustering algorithm DBSCAN, which finds clusters of arbitrary shape, is robust to noise, and scales well to large databases, if the intrinsic dimensionality is not too high so that spatial index structures can be effective in supporting range queries. Since its introduction, density-based clustering has become one of the prominent clustering paradigms. Many data mining textbooks cover the DBSCAN clustering algorithm, and several third party implementations of DBSCAN exist (e.g. in WEKA, ELKI, and GNU R). Since the publication of DBSCAN, density-based clustering has been extensively studied (as the citation counts for the paper support), and has been successfully used in many applications.
Original Abstract of the DBSCAN paper: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
The second test of time award goes to:
Integrating Classification and Association Rule Mining [KDD 1998]
Bing Liu, Wynne Hsu, Yiming Ma
This paper pioneered the research of using association rules for classification by integrating classification and association rule mining. It also proposed an efficient algorithm and built the first system (called CBA) for the purpose. This work triggered a large number of follow-up works and applications. In addition to classification, due to its ability to generate all rules, it enables the user to explore underlying data, and to find causes of problems and actionable knowledge to then solve the problems. Such diagnostic data mining is crucial for many engineering applications, but it is hard to do with other classification methods because they produce only a small model sufficient for classification while missing out a large number of regularities in data.
Original Abstract of the paper: Classification rule mining aims to discover a small set of rules in the database that forms an accurate classifier. Association rule mining finds all the rules existing in the database that satisfy some minimum support and minimum confidence constraints. For association rule mining, the target of discovery is not pre-determined, while for classification rule mining there is one and only one predetermined target. In this paper, we propose to integrate these two mining techniques. The integration is done by focusing on mining a special subset of association rules, called class association rules (CARs). An efficient algorithm is also given for building a classifier based on the set of discovered CARs. Experimental results show that the classifier built this way is, in general, more accurate than that produced by the state-of-the-art classification system C4.5. In addition, this integration helps to solve a number of problems that exist in the current classification systems.
The third SIGKDD test of time award goes to:
Maximizing the Spread of Influence through a Social Network [ KDD ‘03]
David Kempe, Jon Kleinberg, Eva Tardos
This paper considers questions involving the spread of information, innovations, and behaviors in social networks, providing algorithms with provable performance guarantees for a fundamental problem in this area: for promoting an innovation, what are the most effective starting points in the network? The paper identifies a technically rich structure inherent in the problem, and establishes a framework that has subsequently been used in areas ranging from social media and marketing to the diffusion of innovations and the study of inequality. At a mathematical level, it also establishes a link to the theory of submodular functions that has been central to subsequent research in this domain.
Original Abstract of this paper: Models for the processes by which ideas and influence propagate through a social network have been studied in a number of domains, including the diffusion of medical and technological innovations, the sudden and widespread adoption of various strategies in game-theoretic settings, and the effects of “word of mouth” in the promotion of new products. Recently, motivated by the design of viral marketing strategies, Domingos and Richardson posed a fundamental algorithmic problem for such social network processes: if we can try to convince a subset of individuals to adopt a new product or innovation, and the goal is to trigger a large cascade of further adoptions, which set of individuals should we target?
We consider this problem in several of the most widely studied models in social network analysis. The optimization problem of selecting the most influential nodes is NP-hard here, and we provide the first provable approximation guarantees for efficient algorithms. Using an analysis framework based on submodular functions, we show that a natural greedy strategy obtains a solution that is provably within 63% of optimal for several classes of models; our framework suggests a general approach for reasoning about the performance guarantees of algorithms for these types of influence problems in social networks.
We also provide computational experiments on large collaboration networks, showing that in addition to their provable guarantees, our approximation algorithms significantly out-perform node selection heuristics based on the well-studied notions of degree centrality and distance centrality from the field of social networks.