Dr. Leo Breiman is widely considered one of the founding fathers of modern machine learning and data mining. He has been actively contributing to these fields, as well as to statistics, for more than 30 years.
His best known contribution is his landmark work on decision trees (Classification and Regression Trees, 1984, known as CART(R)), written with Jerome Friedman, Richard Olshen, and Charles Stone.
Since 1984 he has become further associated with tree ensembles in the form of Bootstrap Aggregation (Bagging Predictors, 1996) and Random Forests (also trademarked), and he has become one of the most widely cited authors in the field. These works alone would be sufficient to merit Leo substantial honors, and indeed he has been elected a member of the National Academy of Sciences. The citation of the National Academy reads, in part:
"Breiman has done fundamental work in stochastic processes, information theory, and mathematical statistics. He is a seminal thinker who has developed modern methods of classification and pattern recognition. He has made significant contributions to the practice of statistics bridging the gaps between that field, signal processing, and computer science."
Leo was born in New York City in 1928 and laid the foundations for his professional career as a mathematician with a degree in physics from Caltech in 1949 and a PhD in mathematics from the University of California, Berkeley, in 1954. His first well-known paper, which proved the Shannon-McMillan-Breiman information theorem (1957), was followed by another body of work on optimal gambling systems (1960). After spending seven years at UCLA as a mathematics professor teaching probability theory, he wrote the celebrated graduate textbook Probability (1968), which has been republished in the SIAM classics series. Leo reports that at this stage in his career he was tired of theoretical pursuits and wanted to get his hands dirty with real-world data analysis.
He resigned his tenured full professorship and began a 13-year stint as a statistical consultant working on challenging problems in defense, traffic analysis, toxic substance detection, and air pollution. Soon realizing that classical statistical techniques were not adequate to these challenges, he began crafting the idea of classification via a series of yes/no questions, which put him on the road to CART. The ensuing CART monograph is a landmark for the wealth of insight and subtlety with which the subject is treated and is the source of many modern decision tree concepts:
- pruning,
- growing trees with asymmetric costs of misclassification,
- surrogate splitters to handle missing values,
- cross-validation for trees,
- linear combination and Boolean splitting rules,
- the distinction between classification trees and probability trees,
- regression trees via least squares and least absolute deviations,
- and a proof that, as the sample grows infinite, CART trees converge toward the minimal attainable error rate as they are grown larger.
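The yes/no-question view of classification survives directly in modern libraries. As a rough sketch (using scikit-learn's CART-style trees rather than the original CART software; the dataset and penalty value are illustrative choices), growing a full tree and then pruning it by cost complexity looks like:

```python
# Sketch: classification as a series of yes/no questions, grown into a
# full tree and then pruned. Uses scikit-learn's CART-style trees; the
# data and the alpha value are illustrative, not from the CART monograph.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow the tree to full depth, then fit a pruned version: ccp_alpha is
# the cost-complexity penalty on tree size.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

# Pruning trades a little training accuracy for a smaller, more stable tree.
print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)
```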
"Estimating optimal transformations in multiple regression" (1985) is now standard in several major statistics packages. This work initiated some ideas carried further in Hastie and Tibshirani's GAM (Generalized Additive Models, 1990). Over the next several years Leo took on a number of projects, not all of which have been published.
In the late 1980's he began tinkering with CART parallelization on a network of Sun computers and reported his somewhat disappointing results in a 1995 discussion paper. In 1992 he developed the first implementation of CART for a vector of target variables.
He also wrote a definitive analysis of certain problems in US Census estimates ("The 1990 Census adjustment: Undercount or bad data?", 1994).
In 1995 he introduced a major improvement on stepwise regression, "Better Subset Regression Using the Nonnegative Garrote," which led to Tibshirani's now celebrated lasso. The method shrinks regression coefficients toward zero non-uniformly, allowing some coefficients to reach exactly zero and thus accomplishing variable selection.
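The shrink-and-select behaviour the garrote pioneered is easiest to see in its descendant, the lasso. A minimal sketch (scikit-learn's Lasso on synthetic data; the penalty value and data-generating setup are illustrative assumptions):

```python
# Sketch of shrinkage with built-in variable selection, as in the lasso,
# a descendant of the nonnegative garrote. scikit-learn's Lasso; the
# synthetic data and alpha are illustrative choices.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first three predictors matter; the other seven are noise.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# The L1 penalty drives irrelevant coefficients exactly to zero,
# so variable selection falls out of the shrinkage itself.
print(np.count_nonzero(ols.coef_), np.count_nonzero(lasso.coef_))
```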
From this point on, Leo was involved in a variety of attempts to improve the performance of CART trees. A series of studies investigating the implications of instability in function approximation began with his award-winning paper "The Π-method for estimating multivariate functions from noisy data" (1991), followed by "The heuristics of instability in model selection" in 1996.
By this time he had hit on the idea of creating ensembles of trees via bootstrap re-sampling, and he perfected the method (no pruning, atom size of 1) for his paper on Bagging (1996). This was followed by an extension of the bagger to mimic boosting (Adaptive Resampling and Combining, or Arcing Classifiers, 1998) and an extension to online learning.
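In modern terms, the bagger draws bootstrap samples of the training set, grows an unpruned tree on each, and classifies by majority vote. A sketch using scikit-learn, whose BaggingClassifier defaults to fully grown decision trees as base learners (the data set and ensemble size here are illustrative, not from the 1996 paper):

```python
# Sketch of bagging: bootstrap-resample the training set, grow an
# unpruned tree on each replicate, aggregate by majority vote.
# scikit-learn's BaggingClassifier; data and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
# Default base learner is an unpruned decision tree, as in the bagger.
bagger = BaggingClassifier(n_estimators=100, random_state=0)

single_score = cross_val_score(single_tree, X, y).mean()
bagged_score = cross_val_score(bagger, X, y).mean()
print(single_score, bagged_score)
```

Averaging over bootstrap replicates is what tames the instability of individual trees that Breiman had documented in his instability papers.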
Since 1999 Breiman's name has become tightly linked with Random Forests, which further push the ideas developed for the bagger. Whereas in the bagger randomness is induced into a tree by the data sampling mechanism, in Random Forests the randomness is injected into the split selection itself. The resulting classifications and regressions are often considerably more accurate than the bagger and are competitive with the best methods now extant. Breiman also introduced novel post-processing in which he uses the Random Forest trees to generate a non-metric distance between any two records in a data set, thus supporting new ways to cluster data and identify anomalies and outliers.
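The proximity idea can be sketched concretely: two records are "close" when they land in the same terminal node of many trees. Below is a minimal illustration with scikit-learn's forest (not Breiman's original Fortran code); the dataset and parameters are illustrative assumptions.

```python
# Sketch of Random Forest proximities: randomness enters each split via
# max_features (a random subset of predictors per split), and the
# fraction of trees in which two records share a leaf gives a proximity;
# 1 - proximity is the non-metric distance used for clustering and
# outlier detection. scikit-learn's forest; settings are illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0).fit(X, y)

# leaves[i, t] = index of the terminal node that record i reaches in tree t.
leaves = forest.apply(X)

# Proximity = fraction of trees in which two records share a terminal node.
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
distance = 1.0 - proximity  # non-metric distance for clustering/outliers
```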
Leo is more than just an academic researcher and his non-scientific activities have also been noteworthy. While a Professor at UCLA in the 1960's he took a year's sabbatical to work in Liberia for UNESCO, trekking through rain forests to help count the number of schools and school children in that country. In 1976 he served on the Santa Monica school board and developed ways to improve mathematics instruction. Over the years, he has hosted a total of 21 rural Mexican school children as "exchange students", providing them the opportunity to learn English in one-year stays in his home.
Leo Breiman has contributed some of the key ideas at the heart of today's data mining and has wielded immense influence in the field, while also contributing to the welfare of many of the people in academia, industry, and everyday life with whom he has come in contact.