Tej Anand, AT&T GIS
Dr. Gholamreza Nakhaeizadeh, Daimler-Benz
Evangelos Simoudis, IBM, co-chair
Gregory Piatetsky-Shapiro, GTE Laboratories, co-chair
Ralphe Wiggins, Information Harvesting
Kamran Parsaye, Information Discovery
Mario Schkolnick, SGI
Commercial Knowledge Discovery Applications
The objective of most commercial knowledge discovery endeavors is the development of knowledge discovery applications rather than a one-time discovery of some interesting insight. Knowledge discovery applications are business applications that possess the following characteristics:
1. Knowledge encoded in the application is derived from enterprise-wide data. Often this knowledge is validated and supplemented by subject matter experts.
2. Output produced by the application is actionable. Output is actionable if it is in the business user's vocabulary, its intent can be determined by the business user without complicated inference, and it can be used directly (without any transformation) for the business user's task.
3. Knowledge encoded in the application can be maintained and enhanced by the business user. In addition, it should be possible to modify this knowledge without re-compiling the application.
Knowledge discovery applications are designed for knowledge workers within an organization. They emphasize ease of use and explicit representations of underlying knowledge. In the following we will describe the activities that an organization needs to invest in to develop knowledge discovery applications.
Building a Data Warehouse
For most organizations data is distributed across numerous "operational systems". These operational systems are designed for efficient retrieval and update. In order to develop knowledge discovery applications an organization needs to bring together data from the disparate operational systems into a single logical enterprise wide data model. Such a consolidated system is referred to as a data warehouse. Consolidation of data involves numerous data transformation operations: conditioning, scrubbing, householding, etc. Building a data warehouse is a necessary "pre-processing" activity for developing knowledge discovery applications.
Knowledge Discovery
This activity usually starts with understanding the task (or the outcome variables) and understanding the universe (data populated in the data warehouse). Based on this understanding, clustering and visualization tools can be used to segment the universe into subsets. This is followed by selecting the independent variables for model development. Several information theoretic measures and results from learnability theory can be employed here to select an optimum set of independent variables. This is followed by selecting a modeling methodology. Based on the data types and the distributions of the independent and outcome variables certain modeling methodologies can be ruled out. Finally a model is developed and evaluated. Evaluation of the model leads to iterations of one or more steps in this activity. This activity is extremely dependent on making explicit data abstraction within the data warehouse.
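As an illustration of the variable selection step described above, the following minimal sketch ranks candidate independent variables by their mutual information with the outcome variable. Mutual information is only one of several possible information theoretic measures; the function names, the column names, and the toy records are assumptions made for this example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X;Y), in bits, between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi

def rank_variables(rows, outcome):
    """Rank every column except `outcome` by mutual information with it."""
    names = [k for k in rows[0] if k != outcome]
    ys = [r[outcome] for r in rows]
    scores = {
        name: mutual_information([r[name] for r in rows], ys)
        for name in names
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: "plan" determines the outcome, "region" does not.
rows = [
    {"region": "north", "plan": "basic", "churned": "yes"},
    {"region": "north", "plan": "premium", "churned": "no"},
    {"region": "south", "plan": "basic", "churned": "yes"},
    {"region": "south", "plan": "premium", "churned": "no"},
]
ranking = rank_variables(rows, "churned")
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

A ranking of this kind only scores variables one at a time; it does not by itself yield an optimum set, since it ignores redundancy and interactions between variables.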
Designing Actionable Output
This is the "post processing" activity. What should the results of knowledge discovery, or the output of the knowledge discovery application, be? We find three kinds of actionable output: information frames are a useful metaphor for knowledge discovery applications intended for post-hoc analysis, active database triggers are useful for exception reporting of key operational business indicators, and agents are useful for automatically implementing normal (i.e. non-exceptional) business operations.
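To make the idea of an active database trigger for exception reporting concrete, here is a small sketch using SQLite through Python's standard sqlite3 module. The table names, the indicator (daily revenue), and the threshold of 1000 are all hypothetical; in a real application the rule would come from the discovered knowledge rather than being hard-coded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_sales (store TEXT, day TEXT, revenue REAL);
CREATE TABLE alerts (store TEXT, day TEXT, revenue REAL);

-- Hypothetical rule: revenue below 1000 counts as an exception.
CREATE TRIGGER low_revenue AFTER INSERT ON daily_sales
WHEN NEW.revenue < 1000
BEGIN
    INSERT INTO alerts VALUES (NEW.store, NEW.day, NEW.revenue);
END;
""")

# Ordinary operational inserts; the trigger fires inside the database.
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [
        ("A", "1995-08-01", 2500.0),
        ("B", "1995-08-01", 400.0),
        ("A", "1995-08-02", 900.0),
    ],
)

alerts = conn.execute(
    "SELECT store, day, revenue FROM alerts ORDER BY day, store"
).fetchall()
for row in alerts:
    print(row)
```

The business user only sees the rows in the alerts table, which are already expressed in terms of stores, days, and revenue.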
Updating Discovered Knowledge
There are two ways of fulfilling the maintenance and enhancement criteria of knowledge discovery applications: for post-hoc analysis applications, the knowledge is represented as an abstract causal model that a user can interact with and modify; for knowledge discovery applications based on time series data, the model is recomputed based on new data.
Position statement
As somebody who does "application oriented research", I think the main reason that makes KDD interesting and applicable to industry is the second "D" (Discovery) in KDD. The second "D" brings a lot of new problems but also brings new chances, and I think that the success of industrial KDD projects depends strongly on how well we can solve the problems caused by the presence of this second "D". Of course the classical criteria like understanding the problem, management support etc. are important as well, but they are not KDD specific. They are important for the success of every project.
Generally, I distinguish strongly between "KD in DATA" and "KD in DATABASES". As a matter of fact, knowledge discovery in data (not in large databases) is not new and has a history which is at least as old as the history of Statistics and AI. The contributions of Fisher to discriminant analysis in the 1930s and the work of Hunt and his colleagues on concept learning in the 1960s are classical historical examples of KD in data. In recent years both statisticians and AI researchers have developed new algorithms which can be regarded as very useful instruments for knowledge discovery in data (but not always in databases).
Nowadays, however, the main sources of industrial data are large databases. This fact leads to two major problems:
1. The problem of the huge amount of data.
2. The problem of efficient access to data.
Although the second problem is important as well, I am going to comment here only on the first one because, as the co-ordinator and a partner of the project StatLog, I have some experience with this problem. The main goal of StatLog was to evaluate the performance of different supervised learning algorithms using real-world large-scale datasets. The largest dataset we had in StatLog included about 60,000 examples and 50 attributes. We determined that most of the applied supervised learning algorithms could not handle such datasets in a reasonable time with reasonable accuracy. Meanwhile at Daimler-Benz we are confronted with still bigger databases (one of them includes information on 7 million cars). In my opinion, two approaches can be followed to solve this problem:
a) Re-implementation of the learning algorithms to make them suitable for handling huge amounts of data.
b) Reducing the volume of data and attributes.
Part b) is closely related to pre-processing, which is a topic of the panel, and shows very well the importance of this topic not only for "KD in data" but especially for "KD in databases". Besides Statistics, windowing, CBR and IBL approaches can contribute to solving this problem. Considering expert and domain knowledge is useful as well.
When I put together my StatLog-experiences and the experiences of a current KDD-project at Daimler-Benz, I would conclude that:
1. At present the majority of real-world KDD problems cannot be solved by available KDD approaches. The available approaches can do only KD in data but not in databases. Developing KDD-specific approaches needs more research. Providing such approaches would be a challenging but very interesting task for statisticians, ML researchers and database specialists.
2. Pre-processing is a useful instrument in developing applicable KDD approaches, especially in reducing the amount of data and attributes.
3. Post-processing is, in the long term, useful as well. In my opinion, in the current situation it is not very important. In a KDD research project, I would assign only 10% of my budget to post-processing.
Position statement
The number of references to knowledge discovery and data mining in the technical, scientific, and popular press is a testimony to the field's popularity. The offering of commercial data mining products and services by both startup and established information technology companies offers further proof. During this period, when expectations of data mining are inflating and the risk of failure is rather high since data mining systems have not clearly and repeatedly demonstrated their value, it is important to identify and examine the issues which must be addressed to ensure the field's commercial success.
Knowledge discovery and data mining is not just predictive modeling. It includes database segmentation, link analysis, and deviation detection operations, many of which must be applied cooperatively to address specific problems such as identifying prospects for targeted marketing. Many of the commercially available data mining systems, and early reported applications, include only predictive modeling techniques and attempt to solve problems solely through their use. As a result, it is often necessary to reformulate several problems as classification operations, many times unsuccessfully. While it may be possible to develop feasibility-study data mining systems in this way, this approach risks creating prototype systems which cannot grow into full applications.
The definition of data mining and KDD refers to information extraction from large databases. Yet the majority of the research prototypes and commercially available data mining systems are only being tested against very small data sets, usually a few thousand records. Many of the existing commercial data mining systems cannot deal with data of the size (both in number of records and number of attributes per record) that is presently found in real-world databases. As a result, it is not known whether these data mining systems scale and how well they scale. In several industries, organizations maintain databases that have grown to several hundreds of gigabytes in size.
Extracting information from large databases implies that KDD systems must be capable of interacting directly with database management systems (DBMS) rather than requiring their users to extract data from the DBMS and maintain these extracts outside the database. Such interaction will not only facilitate the data mining operation but will also facilitate the management of the data. Otherwise, as several users may be mining data from the same database, creating such extracts becomes impractical and inefficient. If, in addition, the data mining operation is performed within the database rather than in the application's memory space, then issues of scale will be easier to address. Unfortunately, at present not all database management systems can support such processing.
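The difference between mining inside the database and mining an extract can be sketched as follows. In this hedged example, which uses SQLite through Python's standard sqlite3 module, the DBMS scans the table and returns only one summary row per segment, so the application never holds the individual records. The table, its columns, and the data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (segment TEXT, responded INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("retail", 1), ("retail", 0), ("retail", 1),
        ("business", 0), ("business", 0), ("business", 1),
    ],
)

# The scan and the aggregation both happen inside the DBMS; only the
# per-segment counts cross into the application's memory space.
query = """
SELECT segment, COUNT(*) AS n, SUM(responded) AS responders
FROM customers
GROUP BY segment
ORDER BY segment
"""
summary = {seg: (n, r) for seg, n, r in conn.execute(query)}
response_rate = {seg: r / n for seg, (n, r) in summary.items()}

for seg, rate in response_rate.items():
    print(f"{seg}: {rate:.2f}")
```

The size of the result depends on the number of distinct segments, not on the number of customer records, which is what makes this style of interaction attractive for very large tables.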
Commercial Knowledge Discovery Applications: What makes them a success
The growing popularity of knowledge discovery in databases as a research topic (as evidenced by this conference) and the growing number of companies offering tools for data mining and knowledge discovery stem from the real need to cope with mountains of data. At the same time there is an apparent scarcity of reported successful applications of KDD.
I see several reasons for this:
1) Success depends first on selecting a proper application. Such an application should have a need for discovery of "actionable" findings in data. The findings may have the form of classification models (rules, neural networks, etc.) which can be used to predict, e.g., various customer behaviors, but they can also be selected changes and deviations, dependency networks, new clusters, etc. There should be some domain knowledge to guide the discovery, but it should be incomplete (otherwise there is no need to discover anything). If the application deals with the analysis of personal data, it should consider the privacy issues -- will data subjects object to access and analysis of their data? Are there legal issues?
2) Success depends next on matching the application with the proper discovery methods. Many methods and approaches exist [see http://info.gte.com/~kdd]. Latest research suggests that no single method is universally superior -- several should be tried. Another important issue is interfacing the KDD engine to an external DBMS server vs. building specialized data management inside the KDD system. Using the first (external DBMS) approach not only makes it easier to interface to existing systems, but also makes application to very large databases feasible. The second (internal DBMS) approach, while making discovery simpler for small and medium size databases, faces serious scaling problems when dealing with very large (Gigabyte-size) databases.
While the discovery engine is central in a KDD system, many other modules for pre- and post- processing are needed. Data needs to be merged, scaled, uninteresting fields removed, formats may need to be changed for different tools, etc. The discovered findings also need to be put in a form that is suitable for action or for presentation to users.
3) The transition of research KDD systems to application is facing the same stumbling blocks as all transfers of research to application, including:
a) There should be a strong business need for the solution -- i.e. application pull, not technology push.
b) Simpler solutions have been tried and found inadequate.
c) There is strong organizational support and a champion.
d) The KDD system needs to integrate with other existing systems.
e) There should be a plan for system maintenance and lifecycle support.
4) Existing KDD-related publications reach only a small percentage of those who work on data mining in the broad sense, defined as "getting benefit from analysis of large databases". This broad definition encompasses many activities not "traditionally" considered part of KDD, such as OLAP (On-Line Analytical Processing) and database marketing, which have been very successful and reported their successes.
5) Some applications, especially those that deal with predicting market behaviour, will not be reported especially if they are successful. E.g. the mutual funds that use neural networks for stock selection (and many exist!) naturally do not want to disclose any details. A similar situation exists with many existing applications for fraud detection.
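The pre-processing steps mentioned under point 2 above (merging data, removing uninteresting fields, scaling) can be illustrated with a minimal sketch. The helper names, the record layout, and the choice of min-max scaling are assumptions made for this example, not a description of any particular KDD system.

```python
def merge_on_key(left, right, key):
    """Join two lists of dict records on `key`, keeping matching records only."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def drop_fields(rows, fields):
    """Remove fields judged uninteresting for discovery."""
    return [{k: v for k, v in r.items() if k not in fields} for r in rows]

def min_max_scale(rows, field):
    """Rescale a numeric field to the range [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [
        {**r, field: (r[field] - lo) / span if span else 0.0}
        for r in rows
    ]

# Hypothetical records from two source systems.
accounts = [
    {"id": 1, "balance": 100.0, "note": "x"},
    {"id": 2, "balance": 300.0, "note": "y"},
    {"id": 3, "balance": 200.0, "note": "z"},
]
profiles = [{"id": 1, "age": 30}, {"id": 2, "age": 50}]

merged = merge_on_key(accounts, profiles, "id")
cleaned = drop_fields(merged, {"note"})
prepared = min_max_scale(cleaned, "balance")
print(prepared)
```

Each step returns new records rather than modifying its input, so the same source data can be prepared differently for different discovery tools.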
1. In-depth discussions with the user to discover the nature and purpose of the application. The most important step in this process is selecting one or more dependent variables that truly provide actionable information to the business. Coupled with this is a thorough discussion and acquisition of all of the variables (database columns) that may enhance the model. Some types of problems may require only 50 variables, others may require 500 or 1000 variables.
2. The KDD tool must be able to handle a large number of data rows in addition to data variables. Some applications may produce useful results with only a few tens of thousands of rows. Many real business problems require more than 500,000 rows just to prevent overfitting. The largest problem we have tackled, so far, had 50,000,000 rows. The only issues standing in the way of larger problems are CPU time on faster CPUs and adequate data storage.
3. The KDD tool must be easy to use. That means that the interface should provide a view of the data relationships, the quality of the predictions, and the bases for the models directly and intuitively. The user should not need a week-long course in the underlying technologies and should not have to be trained in how the interface works. The people who have the strongest interest in KDD are managers with profit and loss responsibility who have a great interest in gaining a basis of trust in the results but who usually have little interest in understanding the modeling technology.
4. The KDD tool must be truly at ease with categorical and numerical variables, with noisy and missing data, and with highly nonlinear relationships. That is what the data is like that needs analysis.
The Secrets to Success in Data Mining and the Necessity for Revising some Perspectives
First, whoever thought that one would readily discuss all the secrets to success in data mining within a non-paying seminar? One may one day do this after the Coca-Cola company tells us their formula, or GTE discusses the secrets of making free phone calls. In practice, the real secrets are discussed in meetings with purchase orders on the table.
However, more seriously, and for the sake of generating interesting discussion within the field, I can discuss some of the basic steps necessary for paving the way towards success -- total victory being the second half of the story.
To begin with, I would like to suggest that we must throw away many of the current notions that we have about databases even before we begin to think of data mining. Some of these ideas were discussed in our last book [1]; some other key issues are:
1. Normalization theory: I am not comfortable with the validity or applicability of this theory to decision support systems (DSS). These ideas were invented by Codd even before on-line transaction processing (OLTP) systems were widespread. They do NOT apply to DSS -- and when you cannot get to your data you cannot mine it the right way. Yes, in many data warehousing projects normal forms are used because the DBAs were taught them in schools and because no other theory is present. The use of normalization theory often results in an abnormal DSS. When your DBA says he has 10 years of experience and wants to design your warehouse, it is time to worry. I have discussed a set of new ideas with a few people, and a revision of the normal forms will be forthcoming.
2. Waterfall Design and Sandwich Paradigm: The linear waterfall approach of requirements, design, implementation, test, and maintenance simply does not work in DSS. Again, many software engineers have been taught this in schools -- so that is what they do. The time-lag between academia and practice is becoming crucial. I have elsewhere discussed Concentric Design that is more suited to DSS applications. This is then extended to the Sandwich Paradigm that sandwiches warehousing between two layers of data mining, see references 1, 2, and 3 below.
3. Whales in Swimming Pools: Most people do not buy enough hardware for a large scale relational DSS, and do not use the right data structures. Hence they can perform index-based retrieval quickly (e.g., given a name, find a person's age) but cannot get enough performance even for simple pie-charts (e.g., average age of people by occupation), since this requires a full scan of an entire table. Hence a whale of a database is put on disk, but has little room to maneuver. A few additional steps can help a lot here; see reference 4 for some early ideas.
4. Data Quality: In many cases, this will quickly become an issue if the data mining and rule discovery system is not used to check the quality of the data first. Most legacy systems have been subject to liberties taken by various COBOL programmers for at least a decade. But data quality will become a big issue only if one does not plan and budget for dealing with it in a warehousing project. See reference 6 below for more on this topic.
5. Good-bye to Sampling: Relying on a sample of data for analysis is a 1930's approach. You don't know what you are missing when you sample. The Central Limit Theorem runs out of steam when we are looking at semi-sparse databases with very large numbers of non-numeric values (e.g., 50,000 products, 5,000 stores and 1,000,000 customers). The entire warehouse must be analyzed, without sampling. This is one of the reasons why some of the "first generation" neural nets and some statistical approaches have deep problems in dealing with large warehouses.
6. Good-bye AI, Hello Database: For data mining to succeed, we must avoid the semi-emotional and elitist scenarios of AI in the 1980's. The down-to-earth approach of databases is what is needed. Data mining will best succeed as a follow-on to the data warehousing phenomenon.
Now that these issues have been discussed, let me add that in practice one can avoid most of the pitfalls with some careful thinking, some good design and a good piece of commercial software. Our practical experience with mining large databases has been very successful over the past few years, and the fear of mining a multi-gigabyte database is all gone--the next milestone will be terabyte databases. We routinely get better than 10 to 1 ratios of return on investment in most data mining projects with our IDIS: The Information Discovery System--and the long term benefit to organizations is even higher. We now have the software and the techniques for succeeding in data mining; our current challenge is to get people to build their warehouses the right way.
References:
1. Parsaye, K. and Chignell, M. Intelligent Database Tools and Applications, John Wiley, 1993.
2. Parsaye, K. and Chignell, M. "Concentric Design for Decision Support", Database Programming & Design, May 1993.
3. Parsaye, K. "The Sandwich Paradigm", Database Programming & Design, April 1993.
4. Parsaye, K. "The Discovery Machine", KDD-89 workshop, August 1989.
5. Parsaye, K. "Large Scale Data Mining in Parallel", DBMS Magazine, February 1995.
6. Parsaye, K. and Chignell, M. "Quality Unbound", Database Programming & Design, January 1995.
What Makes KDD Applications Successful: Position Statement
For some time now, KDD applications have been used in commercial environments. Their use has been more successful in situations where the amount of data that is required to produce a solution is either small or can be reduced through the use of statistical techniques. Frequently, the performance of the data mining algorithms that lie at the core of these applications is non-linear in the size of the input data set. Moreover, in order to use these applications, users must have good knowledge of the characteristics of the algorithms or a reasonable background in statistical techniques. These characteristics have effectively limited the use of these tools in this marketplace.
The success of KDD applications in the commercial marketplace will depend on the ability of the KDD community to deliver technology that can be made usable by a large segment of the potential market and that has the performance required by commercial applications. These requirements can be summarized as follows:
1) The applications must be easy to use. By this I mean more than a friendly GUI (this always helps!). Users should be able to use these applications without any knowledge of how their internals work, or any knowledge other than their understanding of the business processes for which these applications are being used. When this happens, the potential market for KDD applications will change from that of a reduced set of savvy analysts to a very large group of business people (executives, buyers, store managers, marketing specialists, etc.).
2) The data mining algorithms used in these applications must be capable of handling large databases. To do this, their performance must be as close to linear (with the size of the data) as possible. Having such characteristics effectively opens up the possibility of using large scale parallelism to deal with large collections of data.
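One way to see why near-linear algorithms lend themselves to parallelism is a single-pass computation whose partial results can be merged. The sketch below, with hypothetical data and function names, counts (attribute value, outcome) pairs per partition in one linear pass and then combines the partial counts. Here the partitions are processed one after another; the assumption is that in a real system each partition could be handled by a separate processor.

```python
from collections import Counter
from functools import reduce

def count_partition(rows):
    """One linear pass over a partition: count (attribute value, outcome) pairs."""
    counts = Counter()
    for value, outcome in rows:
        counts[(value, outcome)] += 1
    return counts

def merge_counts(a, b):
    """Combine partial counts; the order of merging does not matter."""
    return a + b

# Hypothetical data split into two partitions.
partitions = [
    [("gold", "buy"), ("gold", "buy"), ("basic", "skip")],
    [("basic", "buy"), ("gold", "skip"), ("basic", "skip")],
]

partial = [count_partition(p) for p in partitions]
total = reduce(merge_counts, partial, Counter())

for key, n in sorted(total.items()):
    print(key, n)
```

Because the merged result equals what a single pass over all the data would produce, the work per partition stays proportional to the partition's size, and adding partitions does not change the answer.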