Curriculum Design Philosophy
Data mining is an interdisciplinary field at the intersection of
artificial intelligence, machine learning, statistics, and database
systems, and we believe that different educators will emphasize
different topics in their courses. Thus we divided this curriculum
proposal into two parts. The first part, titled Foundations,
contains basic material that we believe should be covered in any
introductory course on data mining. The second part, called
Advanced Topics, is a comprehensive collection of material that can
be sampled to complete an introductory course, or from which
selections can form the basis for an advanced course in data mining.
We believe that the teaching of data mining should concentrate on
long-lasting scientific principles and concepts of the field.
Thus, instead of covering the latest details of the most recent
research, we designed the basic material to lay a solid foundation
that opens the door to exploring more advanced material.
The core endeavor in data mining is to extract knowledge from data;
this knowledge is captured in a human-understandable
structure. The discovery of structure in data is a
multifaceted problem that includes the following components:
- Database and Data Management Issues:
- Where does the data
reside? How is it to be accessed? What forms of sampling are needed,
possible, and appropriate? What are the implications of the
database or data warehouse structure and constraints on data
movement and data preparation?
- Data Preprocessing:
- What are the required data transformations
before a chosen algorithm or class of algorithms can be applied to
the data? What are effective methods for reducing the dimensionality
of the data so the algorithms can work efficiently? How are
missing data items to be modelled? What transformations properly
encode a priori knowledge of the problem?
- Choice of Model and Statistical Inference Considerations:
- What
are the appropriate choices to ensure proper statistical inference?
What are valid approximations? What are the implications of the
inference methods on the expected results? How is the resulting
structure to be evaluated and validated?
- Interestingness Metrics:
- What makes the derived structure
interesting or useful? How do the goals of the particular
data mining activity influence the choice of algorithms or
techniques to be used?
- Algorithmic Complexity Considerations:
- Which algorithms
are appropriate given the size and dimensionality of the data? What
are the computational resource constraints? What accuracy is required
of the resulting models? What are the scalability considerations and
how should they be addressed?
- Post-processing of Discovered Structure:
- How are the results to
be used? What are the requirements for use at prediction time? What
are the transformation requirements at model application time? How
are changes in the data or underlying distributions to be managed?
- Visualization and Understandability:
- What are the constraints
on the discovered structure from the perspective of
understandability by humans? What are effective visualization
techniques for the resulting structure? How can data be effectively
visualized in the context of, or with the aid of, the discovered
structures?
- Maintenance, Updates, and Model Life Cycle Considerations:
- When
are models to be changed or updated? How must the models change as
the utility metrics in the application domain change? How are the
resulting predictions or discovered structure integrated with
application domain metrics and constraints?
The partial list above demonstrates that data mining involves many
problems and many notions that have historically been studied in
isolation. This necessitates broad coverage of a wide range of
areas within the proposed curriculum.
Gabor Melli
2010-06-01