Data mining is an interdisciplinary field at the intersection of artificial intelligence, machine learning, statistics, and database systems, and we believe that different educators will emphasize different topics in their courses. Thus we divided this curriculum proposal into two parts. The first part, titled Foundations, contains basic material that we believe should be covered in any introductory course on data mining. The second part, called Advanced Topics, is a comprehensive collection of material that can be sampled to complete an introductory course, or selections of which can form the basis for an advanced course in data mining.
We believe that the teaching of data mining should concentrate on long-lasting scientific principles and concepts of the field. Thus, instead of covering the latest details of the most recent research, we designed the basic material to lay a solid foundation that opens the door to exploring more advanced material.
The core endeavor in data mining is to extract knowledge from data; this knowledge is captured in a human-understandable structure. The discovery of structure in data is a multifaceted problem that includes the following components:
The partial list below demonstrates that data mining involves many problems and many notions that have historically been studied in isolation. This necessitates healthy coverage of a wide range of areas within the proposed curriculum.
Database and Data Management Issues:
Where does the data reside? How is it to be accessed? What forms of sampling are needed, possible, and appropriate? What are the implications of the database or data warehouse structure and constraints on data movement and data preparation?
What are the required data transformations before a chosen algorithm or class of algorithms can be applied to the data? What are effective methods for reducing the dimensionality of the data so the algorithms can work efficiently? How are missing data items to be modelled? What transformations properly encode a priori knowledge of the problem?
Choice of Model and Statistical Inference Considerations:
What are the appropriate choices to ensure proper statistical inference? What are valid approximations? What are the implications of the inference methods on the expected results? How is the resulting structure to be evaluated and validated?
What makes the derived structure interesting or useful? How do the goals of the particular data mining activity influence the choice of algorithms or techniques to be used?
Algorithmic Complexity Considerations:
What choice of algorithms is appropriate given the size and dimensionality of the data? What are the computational resource constraints? What are the requirements on the accuracy of the resulting models? What are the scalability considerations and how should they be addressed?
Post-processing of Discovered Structure:
How are the results to be used? What are the requirements for use at prediction time? What are the transformation requirements at model application time? How are changes in the data or underlying distributions to be managed?
Visualization and Understandability:
What are the constraints on the discovered structure from the perspective of understandability by humans? What are effective visualization techniques for the resulting structure? How can data be effectively visualized in the context of, or with the aid of, the discovered structures?
Maintenance, Updates, and Model Life Cycle Considerations:
When are models to be changed or updated? How must the models change as the utility metrics in the application domain change? How are the resulting predictions or discovered structure integrated with application domain metrics and constraints?
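As one concrete illustration of the sampling question raised under Database and Data Management Issues, the following is a minimal sketch of reservoir sampling. It draws a uniform random sample of fixed size in a single pass over data that is too large to move or whose length is not known in advance. The function name reservoir_sample and its parameters are hypothetical and chosen for illustration only; the proposal does not prescribe this or any other technique.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream.

    Hypothetical helper for illustration. The stream is read once and
    only k items are held in memory at any time.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 records from a stream of 1,000,000 without materializing it.
    print(reservoir_sample(range(1_000_000), 5, seed=42))
```

A single-pass sample of this kind is one possible answer when constraints on data movement rule out copying the full data set; whether uniform sampling is appropriate still depends on the goals of the particular data mining activity.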