SIGKDD

ACM Special Interest Group on Knowledge Discovery and Data Mining

SIGKDD Innovation Award

The award recognizes individuals for their outstanding technical contributions to the field of knowledge discovery in data and data mining that have had lasting impact in furthering the theory and/or development of commercial systems.




Past SIGKDD Innovation Award Recipients

2008 Dr. Raghu Ramakrishnan
Chief Scientist for Audience, and Research Fellow at Yahoo! Research
2007 Dr. Usama M. Fayyad
Chief Data Officer and Executive Vice President, Research & Strategic Data Solutions, Yahoo!
2006 Dr. Ramakrishnan Srikant
Research Scientist, Google
2005 Dr. Leo Breiman
Professor Emeritus, Berkeley
2004 Dr. Jiawei Han
Professor, Department of Computer Science, Univ. of Illinois at Urbana-Champaign
2003 Dr. Heikki Mannila
Professor, Helsinki University of Technology
2002 Dr. Jerome H. Friedman
Professor, Department of Statistics, Stanford University, and Leader, Computation Research Group, Stanford Linear Accelerator Center
2000 Dr. Rakesh Agrawal
IBM Fellow

Frequency of the Awards

Once a year.

Administration of the Awards Program

The SIGKDD Awards Committee, consisting of 3-5 prominent senior scientists in the field, will solicit nominations for recipient candidates, evaluate the nominations, and select winners.

The SIGKDD Chair will form the Awards Committee by inviting candidates. The SIGKDD Chair will appoint the Chair of the Awards Committee.

The terms of the Awards Committee will be the same as the term of the SIGKDD Chair.

Once formed, the Awards Committee will select winners of the Awards completely independently of SIGKDD Chair or SIGKDD Executive Committee.

The Award winner will be decided by a two-thirds majority vote of the Awards Committee.

There will be at most one individual or one group to receive either Award in any given year. (It is possible that in a given year, there may be no winner of either Award.)

The Awards Committee will solicit nominations for Award recipients 5 months before the SIGKDD Annual International Conference via the SIGKDD website, SIGKDD Annual Conference website, and the KDNuggets electronic newsletter.

Nominations, once made, may be re-considered for the subsequent two years; if the nominee does not win after the first three years, the nomination is discarded.

The deadline for the nominations will be 3 months before the SIGKDD Annual International Conference. (The Awards Committee will take 6 weeks to make its decisions.)

The winners will receive the Awards at the SIGKDD Annual International Conference. The winners will be announced on the SIGKDD Conference website and the SIGKDD website.

The Awards

Each Award carries a $2,500 monetary award and a plaque.

If the winner is a group of individuals, the group will receive $2,500 (not each individual). However, each individual will receive a plaque.

Exclusions

SIGKDD Chair and members of the SIGKDD Awards Committee are not eligible to be nominated for either Award.

2007 ACM SIGKDD Awards Committee

  • Ramasamy Uthurusamy (General Motors, USA) Chair
  • Jerome Friedman (Stanford University, USA)
  • Jiawei Han (University of Illinois Urbana Champaign, USA)
  • Vipin Kumar (University of Minnesota, USA)
  • Heikki Mannila (University of Helsinki, Finland)
  • Rajeev Motwani (Stanford University, USA)
  • Ramakrishnan Srikant (Google, USA)
  • Ian H. Witten (University of Waikato, New Zealand)
  • Xindong Wu (University of Vermont, USA)



Citations

2000 Innovation Award: Dr. Rakesh Agrawal

Rakesh Agrawal from IBM has received the first ACM SIGKDD Award for Innovation for his many research contributions, including his pioneering work on association rules, mining sequences and much more.

2002 Innovation Award: Dr. Jerome H. Friedman

Jerry Friedman has contributed a remarkable array of topics and methodologies to data mining and machine learning during the last 25 years.

In 1977, as leader of the numerical methods group at the Stanford Linear Accelerator Center (SLAC), he coauthored several algorithms for speeding up nearest-neighbor classifiers.

In the following seven years, he collaborated with Leo Breiman, Richard Olshen, and Charles Stone to produce a landmark work in decision tree methodology, "Classification and Regression Trees" (1984), and released the commercial product CART(R). This work introduced the Gini, twoing, and ordered twoing splitting rules, cost-complexity pruning, oblique splitters, the use of a misclassification cost matrix to influence the growing of trees, and the application of cross validation to decision trees. Part of this work was pre-figured in his 1977 paper on decision tree induction.

During this time, he also introduced Projection Pursuit Regression (PPR) for predictive modeling and interactive data visualization. Although PPR has had only a modest following, it was arguably the first instance of a feed-forward, single hidden layer, back propagation neural network with a remarkable twist: the activation function is itself estimated as part of the learning process and the number of hidden units to use is determined dynamically in a stagewise process (1974, 1981, 1987).

In 1991 Jerry extended recursive partitioning ideas to regression in his Multivariate Adaptive Regression Splines (MARS(tm)). In MARS, linear and logistic regressions are built up through searching for breakpoints in the predictor space. Variable selection, missing value handling, and variable transformation are all automated. MARS can be described as the first truly successful stepwise regression methodology. Richard DeVeaux, in a comparative study of MARS and Neural Networks (1993), found that MARS frequently outperformed neural networks in engineering applications and trained hundreds of times faster; similar findings have been reported by others more recently. In 1994, Jerry extended the MARS methodology to permit a dynamic spline version of discriminant analysis.

In the early 1990s Jerry focused on interactive data mining methods, introducing the Patient Rule Induction Method (PRIM, 1997), which he described as "Bump Hunting in High Dimensional Data." PRIM searches for data regions containing unusually high concentrations (or values) of a target variable and allows the analyst to interactively modify its rules and stretch or shrink the "boxes" defining the regions in question. PRIM has become one of the analytical methods of choice at Australia's CSIRO, a government-funded R&D and consulting lab with extensive data mining activity.

More recently, Jerry has focused on the study of boosting, both to understand why it is so successful and to develop improved boosting methodology. In a key article co-authored with Stanford statisticians Trevor Hastie and Rob Tibshirani, Jerry showed that boosting is a form of additive logistic regression and he identified the objective function that boosting seeks to maximize. He followed up with Stochastic Gradient Boosting, which generalizes boosting to a very large class of problems, eliminating the tendency of classical boosting to seriously mistrack when presented with mislabeled target data. In stochastic gradient boosting, small trees, very slow learning rates, mandatory sampling from the training data, and redefinition of the target variable are all combined to produce a remarkably fast and robust learner, capable of handling both regression and classification even under fairly adverse circumstances of dirty data. The methodology, called "MART" for Multiple Additive Regression Trees, includes visualization to convey the relationships between the target and predictors; it has been released commercially as TreeNet(tm).
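The ingredients listed above (small trees, a slow learning rate, sampling from the training data, and a redefined target) can be illustrated with a minimal sketch. It is not the MART or TreeNet implementation: it assumes squared-error loss, a single numeric predictor, and depth-one trees ("stumps"), and the names fit_stump, boost, and predict are hypothetical.

```python
import random

def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_rounds=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Stagewise additive model; each stump is fit to current residuals."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    n_sub = min(len(xs), max(2, int(subsample * len(xs))))
    for _ in range(n_rounds):
        # Sample from the training data without replacement (the "stochastic" part).
        idx = rng.sample(range(len(xs)), n_sub)
        sub_x = [xs[i] for i in idx]
        # Redefined target: residuals, i.e. the negative gradient of squared error.
        sub_r = [ys[i] - preds[i] for i in idx]
        stump = fit_stump(sub_x, sub_r)
        stumps.append(stump)
        # Slow learning: take only a small step in the direction of the new stump.
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)
```

Other loss functions change only how the per-round target is computed; the squared-error choice here is purely for brevity.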

Finally, Jerry has written a series of expository articles and a substantial book seeking to explain data mining to experienced data analysts and to relate machine learning to statistical foundations. Taken together, this list of new methodology, including CART, MARS, PRIM, PPR, and Gradient Boosting, constitutes one of the broadest ranges of contributions by any one person in the field.

2003 Innovation Award: Dr. Heikki Mannila

The winner of the SIGKDD 2003 Innovation Award is Professor Heikki Mannila, Professor (Helsinki University of Technology) and Research Director, HIIT Basic Research Unit, University of Helsinki & Helsinki University of Technology, Finland. The award carries with it a memorial plaque and a check for $2,500.

Professor Mannila has the rare virtue of being able to identify new problems, viewpoints, and concepts, and thereby taking the field forward. For example, he introduced the concept of "inductive databases" that integrate data mining and databases (CACM 96). This idea is gaining considerable momentum, especially in Europe. Another example is his KDD 96 paper where he elegantly showed that frequent itemsets, samples, and the data cube could all be viewed as instantiations of a general notion of condensed representations, and that this condensed representation could then be used to get approximate confidences of arbitrary boolean rules.

Equally impressive are Professor Mannila's contributions in providing a substantial and much needed theoretical foundation in a very young field. He has given strong theoretical results for many data mining problems, including association rules and frequent time sequences. An excellent example is his work on level-wise search and borders of theories in data mining (DMKD Journal 1997). In this work, he demonstrates how the size of the border is an important factor in the complexity of the level-wise algorithm, and also shows that the problem of computing the border is tightly connected to the well-known problem of computing transversals of hypergraphs.

Professor Mannila has also made many contributions on algorithms for solving data mining problems. His seminal KDD 94 paper identified the monotonicity property for pruning candidate itemsets that underlies most current association rule mining algorithms. His pioneering work in bio-informatics includes the development of novel algorithms for identifying block structures in the human genome, and methods for gene expression analysis for various types of cancer. Other recent highlights include topics on 0-1 data, global and local models, and a large variety of novel methods for analysis of time series and sequence data.
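The monotonicity property mentioned above states that every subset of a frequent itemset must itself be frequent, so a candidate with any infrequent subset can be discarded before its support is counted. The following is a minimal sketch of a level-wise search that uses this pruning rule; the function name frequent_itemsets and the in-memory list of transactions are illustrative assumptions, not code from the cited papers.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search returning {itemset: support count} for frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        candidates = set()
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step (monotonicity): every (k-1)-subset must be frequent.
            if all(frozenset(sub) in current for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count support only for the candidates that survived pruning.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result
```

For example, with the transactions {a,b,c}, {a,b}, {a,c}, {b,c}, {a,b,c} and a minimum support of 3, all single items and all pairs are frequent, while {a,b,c} is generated as a candidate but rejected after counting.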

Finally, the breadth of Prof Mannila's work is quite spectacular. He has over 130 articles in journals and refereed conferences covering such diverse topics as association rules, probabilistic modeling, inductive databases, similar time series, and bio-informatics.

Professor Mannila has been very active in the KDD community. He has served as Editor-in-Chief of the Data Mining and Knowledge Discovery journal since its creation. He is also an associate editor of ACM Transactions on Internet Technology, an action editor for Journal of Machine Learning Research, and an area editor for IEEE TKDE. He was the Program Co-Chair for KDD 97, SIAM 2002, and ECML/PKDD 2002; and an Area Chair for ICML 2001. He has also served on the KDD Steering Committee, and as SIGKDD Awards Committee Chair. Finally, he has co-authored the book "Principles of Data Mining".



2004 Innovation Award: Dr. Jiawei Han

The 2004 SIGKDD Innovation Award went to Jiawei Han, Professor, Computer Science, at Univ. of Illinois at Urbana-Champaign.

Dr. Han is widely regarded as a pioneering researcher in data mining and knowledge discovery, who has made many fundamental research contributions, including
  • Novel and efficient algorithms for frequent pattern mining, e.g., FP-tree and graph pattern mining algorithms
  • Attribute-oriented induction methods
  • Spatial data mining and clustering
  • Stream mining
  • Data warehousing
  • Innovative schemes to integrate OLAP, data warehouse and data mining

The impact of Dr. Han's work is well illustrated by his paper on mining frequent patterns without candidate generation. Unlike earlier approaches that first generated candidates and then counted their support, this radically different approach essentially merged candidate generation and counting. This is done using a novel database structure, the FP-tree, which condenses the set of transactions into a form that is more compact than the original representation and amenable to a new depth-first pattern search method called FP-Growth. FP-tree based approaches are among the leading state of the art techniques for frequent itemset mining, and the concepts have also proven useful for other patterns (sequences, episodes) and pattern types (maximal, closed).

He has published more than 100 research papers on data mining in leading database and data mining conferences and journals, such as SIGMOD, VLDB, KDD, ICDE, EDBT, TKDE, and TOIS. His contribution can be seen in almost every area of the field.

Because of his many seminal contributions, Dr. Han is a very highly cited author, with over 3,000 citations, according to Citeseer. This clearly indicates the quality of his work, his influence in the field, and his contributions to many topics of data mining.

Jiawei is dedicated not only to pure research but also to industrial applications, benchmarking and products. He was the founder and chief architect of DBMiner, one of the first generation data mining products. He also actively led several projects on industrial applications of data mining techniques, which showcase the value and potential of the data mining technology.



2005 Innovation Award: Dr. Leo Breiman

Dr. Leo Breiman is widely considered one of the founding fathers of modern machine learning and data mining. He has been actively contributing to these fields, as well as to statistics, for more than 30 years.

His best known contribution is his landmark work on decision trees (Classification and Regression Trees, 1984, known as CART(R)), written with Jerome Friedman, Richard Olshen, and Charles Stone.

Since 1984 he has become further associated with tree ensembles in the form of Bootstrap Aggregation (Bagging Predictors, 1996) and Random Forests (also trademarked) and has become one of the most widely cited authors in the field. These works alone would be sufficient to merit Leo substantial honors and indeed he has been elected as a member to the National Academy of Sciences. The citation of the National Academy reads, in part:

"Breiman has done fundamental work in stochastic processes, information theory, and mathematical statistics. He is a seminal thinker who has developed modern methods of classification and pattern recognition. He has made significant contributions to the practice of statistics bridging the gaps between that field, signal processing, and computer science."

Leo was born in New York City in 1928 and laid the foundations for his professional career as a mathematician with a degree in Physics from Caltech in 1949 and a PhD in mathematics from the University of California, Berkeley, in 1954. His first well-known paper, which proved the Shannon-McMillan-Breiman information theorem (1957), was followed by another body of work related to optimal gambling systems (1960). After spending seven years at UCLA as a mathematics professor teaching probability theory, he wrote the celebrated graduate textbook Probability (1968), which has been republished in the SIAM classics series. Leo reports that at this stage in his career he was tired of theoretical pursuits and wanted to get his hands dirty with real world data analysis.

He resigned his tenured full professorship and began a 13-year stint as a statistical consultant working on challenging problems in defense, traffic analysis, toxic substance detection, and air pollution. Soon realizing that classical statistical techniques were not adequate to the challenges, he began crafting the idea of classification via a series of yes/no questions, which put him on the road to CART. The ensuing CART monograph is a landmark for the wealth of insight and subtlety with which the subject is treated and is the source of many modern decision tree concepts:

  • cost-complexity pruning,
  • growing trees with asymmetric costs of misclassification,
  • surrogate splitters to handle missing values,
  • cross-validation for trees,
  • linear combination and Boolean splitting rules,
  • the distinction between classification trees and probability trees,
  • regression trees via least squares and least absolute deviations,
  • and proof that in infinite samples CART trees converge toward the minimal obtainable error rate as they are grown larger.

"Estimating optimal transformations in multiple regression" (1985) is now standard in several major statistics packages. This work initiated some ideas carried further in Hastie and Tibshirani's GAM (Generalized Additive Models, 1990). Over the next several years Leo took on a number of projects, not all of which have been published.

In the late 1980's he began tinkering with CART parallelization on a network of Sun computers and reported his somewhat disappointing results in a 1995 discussion paper. In 1992 he developed the first implementation of a CART for a vector of target variables.

He also wrote a definitive analysis of certain problems in US Census estimates ("The 1990 Census adjustment: Undercount or bad data?", 1994).

In 1995 he introduced a major improvement in stepwise regression, "Better Subset Regression Using the Nonnegative Garrote," which led to Tibshirani's now celebrated lasso. The method shrinks regression coefficients towards zero non-uniformly, allowing some coefficients to reach zero and thus accomplishing variable selection.

From this point on, Leo was involved in a variety of attempts to improve the performance of CART trees. A series of studies investigating the implications of instability in function approximation began with his award-winning paper "The Π-method for estimating multivariate functions from noisy data" (1991), followed by "The heuristics of instability in model selection" in 1996.

By this time he had hit on the idea of creating ensembles of trees via bootstrap re-sampling and he perfected the method (no pruning, atom size of 1) for his paper on Bagging (1996). This was followed by an extension of the bagger to mimic boosting (Adaptive Resampling and Combining, or Arcing Classifiers 1998) and an extension to online learning.

Since 1999 Breiman's name has become tightly linked with Random Forests, which further push the ideas developed for the bagger. Whereas in the bagger randomness is induced into a tree by the data sampling mechanism, in Random Forests the randomness is injected into the split selection itself. The resulting classifications and regressions are often considerably more accurate than the bagger and are competitive with the best methods now extant. Breiman also introduced novel post-processing in which he uses the Random Forest trees to generate a non-metric distance between any two records in a data set, thus supporting new ways to cluster data and identify anomalies and outliers.

Leo is more than just an academic researcher and his non-scientific activities have also been noteworthy. While a Professor at UCLA in the 1960's he took a year's sabbatical to work in Liberia for UNESCO, trekking through rain forests to help count the number of schools and school children in that country. In 1976 he served on the Santa Monica school board and developed ways to improve mathematics instruction. Finally, over the years he has hosted a total of 21 rural Mexican school children as "exchange students", providing them the opportunity to learn English in one-year stays in his home.

In summary, Leo Breiman has contributed some of the key ideas at the heart of today's data mining and has wielded immense influence in the field, doing this while also contributing to the welfare of many of the people in academia, industry, and everyday life with whom he has come in contact.

2006 Innovation Award: Dr. Ramakrishnan Srikant

Srikant identified novel pruning techniques and data structures that made the discovery of association rules feasible. He also generalized association rules along three orthogonal dimensions: discovering associations across different levels of a hierarchy over the items; discovering temporal associations ("sequential patterns"); and discovering associations over quantitative attributes.

In each case, Srikant invented pruning techniques and data structures that kept the execution times practical.

Srikant also showed how to push constraints over the set of items in the discovered associations into the mining algorithms. For this body of work, Srikant was awarded the prestigious Grace Murray Hopper Award in 2002, which is given to the outstanding young computer professional of the year.

Srikant has also been instrumental in developing new technologies for data mining that respect the privacy of individuals whose data is being mined. There have recently been growing concerns that data mining is too powerful and that it can impinge on consumers' privacy. The conventional wisdom has been that data mining and privacy are adversaries, and the only way to protect privacy was to restrict the use of data mining. Srikant cleverly resolved this contradiction by developing techniques for "privacy preserving data mining" that exploit the difference between the level where we care about privacy, i.e., individual data, and the level where we run data mining algorithms, i.e., aggregated data. User data is randomized to disallow recovery of anything meaningful at the individual level, while still allowing recovery of aggregate information to build mining models.
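The distinction between individual-level and aggregate-level information can be illustrated with a small sketch. It assumes a deliberately simple scheme that is not taken from the cited work: each user adds independent zero-mean uniform noise to a value before reporting it, and the miner estimates only the population mean. The function names and the synthetic age data are hypothetical.

```python
import random

def randomize(values, noise_scale, rng):
    """Perturb each individual value with independent zero-mean uniform noise."""
    return [v + rng.uniform(-noise_scale, noise_scale) for v in values]

def mean(values):
    return sum(values) / len(values)

def demo(n_users=10000, noise_scale=30, seed=42):
    rng = random.Random(seed)
    # Synthetic "true" ages; in practice the miner never sees these.
    true_ages = [rng.randint(18, 80) for _ in range(n_users)]
    reported = randomize(true_ages, noise_scale, rng)

    # Individual level: each reported value is far from the true one on average.
    avg_individual_error = mean([abs(t - r) for t, r in zip(true_ages, reported)])
    # Aggregate level: the zero-mean noise largely cancels out in the average.
    aggregate_error = abs(mean(true_ages) - mean(reported))
    return avg_individual_error, aggregate_error
```

Recovering a full distribution rather than a single mean requires a reconstruction step that this sketch omits.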

Srikant's publications have had significant impact on the research community evidenced by their very high citations. His VLDB '94 paper, describing the Apriori algorithm for mining association rules, was awarded the 10-year best paper award at the 2004 VLDB conference.

The commercial impact of Srikant's work is equally impressive. Srikant was a key architect and code contributor for IBM Intelligent Miner, a technically sophisticated data mining product. Association rules are now considered one of the three primary data mining techniques (along with classification and clustering), and are part of the standard feature list for data mining products.

Srikant has actively participated in the KDD community. He served as Program Co-Chair of SIGKDD 2001 and PAKDD 2004, Vice Chair (Data Mining Track) of WWW 2006, Deputy Chair (Data Mining Track) of WWW 2004, and Vice Chair of ICDM 2004. He is the Editor-in-Chief of SIGKDD Explorations, and Associate Editor of ACM Transactions on Internet Technology.

2007 Innovation Award: Dr. Usama M. Fayyad

Fayyad is recognized for his seminal work on the development of data mining and machine learning algorithms and their scalability to massive database systems, and for fundamental applications of data mining in scientific discovery and commercial database systems.

His contributions span fundamental technical innovation and significant large-scale applications of the technology in science data analysis, commercial practice, and commercial database systems.

Fayyad's early contributions include the theoretical analysis of decision tree learning algorithms and the invention of some of the fundamental algorithms in decision tree induction from large scale data. His algorithm for discretization of numerical attributes has been the state-of-the-art method in the machine learning and data mining communities for the past decade and remains so. His work on applications of data mining and statistical pattern recognition to massive scientific data sets in Astronomy, Planetary Geology, and remote sensing at NASA's Jet Propulsion Lab (JPL), California Institute of Technology, has led to significant scientific advances and new discoveries in those fields. He received a U.S. Government Medal from NASA for this work as well as the JPL Lew Allen Award for Research Excellence from Caltech, the highest honor granted to JPL scientists.

Fayyad's contributions to database systems involved inventing scalable data mining algorithms for massive databases, co-authoring new SQL Extensions and leading development work for embedding data mining algorithms inside the database engine of Microsoft's SQL Server 2000 system. The latest version of SQL Server 2006 still includes Fayyad's algorithms as well as derivatives and descendants of the core methodology he introduced.

Fayyad has played a leading innovative role in the development of the data mining industry by launching a startup company, Revenue Science Inc. (digiMine, Inc.), that developed an innovative business model around hosted on-demand applications of data mining, business intelligence, and targeting algorithms. His second start-up, DMX Group, was acquired by Yahoo! Inc. in 2004. At Yahoo!, as a member of the senior executive team and the industry's first Chief Data Officer, he presides over the world's largest data streams (processing over 25 terabytes of data per day). He also launched and oversees Yahoo! Research, which has the mission of inventing the new sciences underlying the data-rich areas of Internet, Microeconomics of the Web, and Search and Information Navigation over the world's largest collection of knowledge: the world-wide web.

Fayyad is co-editor of two influential books in data mining and knowledge discovery and has published over 100 technical articles in machine learning, artificial intelligence, data mining and databases. He is a prolific inventor with over 30 patents issued and over 50 filed patents in the areas of data mining, on-line marketing and the Internet.

Fayyad has actively participated in the KDD community. He served as Program Co-Chair of the First International Conference on Knowledge Discovery and Data Mining (KDD 1995), and served as general chair of KDD-96 and as first general chair when the conference moved to ACM SIGKDD in 1999. He is the founding Editor-in-Chief of the primary technical journal in the field, Data Mining and Knowledge Discovery, and remained as editor-in-chief for its first decade. He is founding Editor-in-Chief of ACM's SIGKDD Explorations, the official newsletter of the SIGKDD. He is a Fellow of the AAAI (Association for the Advancement of Artificial Intelligence) and a Fellow of the ACM and is the recipient of many industry awards.

2008 Innovation Award: Dr. Raghu Ramakrishnan

Ramakrishnan's contributions span foundational technical innovation on algorithmic and systems aspects of data mining.

His work on scalable data mining algorithms started with BIRCH, the first truly scalable clustering algorithm. BIRCH introduced the groundbreaking idea of a cluster feature, a concise summary of a cluster, which was then used in many subsequent clustering algorithms as an integral component.

Because of its novelty and importance, the BIRCH paper is one of the most highly cited data mining papers of the last decade. Ramakrishnan later extended this work into a clustering framework for arbitrary metric spaces. He also worked on scalable algorithms for decision tree construction that are still considered state-of-the-art today. BIRCH is also the first true data stream mining algorithm: it constructs a clustering model in a single scan over the data with limited memory. Such algorithms for mining data streams have become a very important area of research in the data mining community over the last decade.
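A cluster feature is commonly described as the triple (N, LS, SS): the number of points, their per-dimension linear sum, and their sum of squared norms. The sketch below assumes that formulation and shows why it suits a single scan with limited memory: points can be absorbed one at a time, two summaries can be merged by addition, and the centroid and radius are derived without revisiting the data. The class name ClusterFeature is illustrative, and the sketch omits the tree structure that BIRCH builds over such summaries.

```python
import math
from dataclasses import dataclass, field

@dataclass
class ClusterFeature:
    n: int = 0                                       # N: number of points
    linear_sum: list = field(default_factory=list)   # LS: per-dimension sum
    square_sum: float = 0.0                          # SS: sum of squared norms

    def add_point(self, point):
        """Absorb one point; the point itself need not be stored afterwards."""
        if not self.linear_sum:
            self.linear_sum = [0.0] * len(point)
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.square_sum += sum(x * x for x in point)

    def merged(self, other):
        """Additivity: the summary of a union is the sum of the summaries."""
        if self.n == 0 or other.n == 0:
            src = other if self.n == 0 else self
            return ClusterFeature(src.n, list(src.linear_sum), src.square_sum)
        return ClusterFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.square_sum + other.square_sum,
        )

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root-mean-square distance of the summarized points to the centroid."""
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # guard against rounding error
```

For instance, summarizing the points (0, 0) and (2, 0) gives a centroid of (1, 0) and a radius of 1; merging that summary with the summary of (1, 3) yields the same result as summarizing all three points directly.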

Further, Ramakrishnan developed a general framework for incrementally mining evolving data and created a framework for measuring change in data streams, again, visionary research topics that have generated much follow-up work since then. His work also introduced a new construct for analysis of ordered data, reflected in the inclusion of WINDOW functions in the SQL language.

Ramakrishnan’s work includes important contributions to data anonymization, and applying the multi-dimensional model from OLAP to develop a framework for exploratory data mining.

In addition to his academic research at the University of Wisconsin-Madison, Ramakrishnan has been active in applying data mining in industry. From 2000 to 2003, he was CTO and chairman of QUIQ, a company that developed technology for mass collaboration, a visionary concept that has gained widespread acceptance with the arrival of Web 2.0; the QUIQ-powered Ask Jeeves AnswerPoint question-answering portal was the forerunner of similar portals from Amazon, LinkedIn and Yahoo!.

As Chief Scientist for Audience at Yahoo! he has led the research on content optimization, i.e., the task of algorithmically selecting the right content to display on a page when a user visits a web portal. This technology is already having a significant impact in practice. At Yahoo!, Ramakrishnan is also leading the research in cloud computing to develop a family of data hosting and analysis services, which, among other applications, will make it much easier to do data mining on the massive datasets seen at web-scale.

Ramakrishnan was Program Co-Chair of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD 2000), and served as an Editor-in-Chief of the primary technical journal in the field, Data Mining and Knowledge Discovery.

He is Chair of ACM SIGMOD, on the Board of Directors of ACM SIGKDD, and on the Board of Trustees of the VLDB Endowment. He is also a Fellow of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE).

He has received several awards, including the ACM SIGMOD Contributions Award, a Distinguished Alumnus Award from IIT Madras, a Packard Foundation Fellowship in Science and Engineering, and an NSF Presidential Young Investigator Award.