ACM Special Interest Group on Knowledge
Discovery & Data Mining
KDD-2000
Sixth ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining
August 20-23, 2000
Boston, MA, USA
KDD-2000 Best Research Paper
Hancock: A Language for Extracting Signatures from Data Streams
Corinna Cortes (AT&T Labs-Research)
Kathleen Fisher (AT&T Labs-Research)
Daryl Pregibon (AT&T Labs-Research)
Anne Rogers (AT&T Labs-Research)
Frederick Smith (Cornell University)
Abstract:
Massive transaction streams present a number of opportunities for data mining
techniques. Such transactions can represent calls on a telephone network,
commercial credit card purchases, stock market trades, or HTTP requests to a
web server. While historically such data have been collected for billing or
security purposes, they are now being used to discover how customers or their
intermediaries (called transactors) use the underlying services. For several
years, we have been computing evolving profiles (called signatures) of the
transactors in large data streams. The signature for each transactor captures
the salient features of his transactions through time. Programs for processing
signatures must be highly optimized because of the size of the data stream
(several gigabytes per day) and the number of signatures to maintain (hundreds
of millions). The original C programs to compute signatures often sacrificed
readability for performance. Consequently, they were hard to verify and
difficult to maintain. Hancock is a domain-specific language designed and
implemented to express computationally efficient signature programs cleanly.
In this paper, we describe the obstacles to computing signatures from massive
transaction streams and explain how Hancock addresses these problems. For
expository purposes, we present Hancock using a running example from the
telecommunications industry; however, the language itself is general and
applies equally well to other domains.