ACM Special Interest Group on Knowledge Discovery & Data Mining

KDD-2000

Sixth ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining
August 20-23, 2000
Boston, MA, USA

KDD-2000 Best Research Paper

Hancock: A Language for Extracting Signatures from Data Streams

Corinna Cortes (AT&T Labs-Research)

Kathleen Fisher (AT&T Labs-Research)

Daryl Pregibon (AT&T Labs-Research)

Anne Rogers (AT&T Labs-Research)

Frederick Smith (Cornell University)

Abstract:

Massive transaction streams present a number of opportunities for data mining techniques. Such transactions can represent calls on a telephone network, commercial credit card purchases, stock market trades, or HTTP requests to a web server. While historically such data have been collected for billing or security purposes, they are now being used to discover how customers or their intermediaries (called transactors) use the underlying services. For several years, we have been computing evolving profiles (called signatures) of the transactors in large data streams. The signature for each transactor captures the salient features of his transactions through time. Programs for processing signatures must be highly optimized because of the size of the data stream (several gigabytes per day) and the number of signatures to maintain (hundreds of millions). The original C programs to compute signatures often sacrificed readability for performance. Consequently, they were hard to verify and difficult to maintain. Hancock is a domain-specific language designed and implemented to express computationally efficient signature programs cleanly. In this paper, we describe the obstacles to computing signatures from massive transaction streams and explain how Hancock addresses these problems. For expository purposes, we present Hancock using a running example from the telecommunications industry; however, the language itself is general and applies equally well to other domains.
