
Foundations (Course I)

  1. Introduction. Basic concepts of data mining, including motivation and definition; the relationships of data mining with database systems, statistics, and machine learning; the different kinds of data repositories on which data mining can be performed; the different kinds of patterns and knowledge to be mined; the concept of interestingness; and current trends and developments in data mining. The material can be introduced through a few case studies.

    1. Concepts of data mining: motivation, definition, the relationships of data mining with database systems, statistics, machine learning, and information retrieval.

    2. Knowledge discovery process: an overview of the knowledge discovery in databases (KDD) process, with emphasis on its iterative and interactive nature.

    3. Mining on different kinds of data: relational, transactional, object-relational, heterogeneous, spatiotemporal, text, multimedia, Web, stream, mobile, and so on.

    4. Mining for different kinds of knowledge: classification, regression, clustering, frequent patterns, discriminant, outliers, and so on.

    5. Evaluation of knowledge: interestingness or quality of knowledge, including accuracy, utility (such as support), and relevance (such as correlation).

    6. Applications of data mining: market analysis, scientific and engineering process analysis, bioinformatics, homeland security, and so on.

  2. Data Preprocessing. This unit covers (1) why data should be preprocessed, (2) basic data cleaning techniques, (3) data integration and transformation, and (4) data reduction methods. In particular, the following topics will be covered.

    1. Descriptive data summarization: basic techniques for summarizing and describing data, covering (1) computing measures of central tendency, such as the mean and mode, (2) computing measures of data dispersion, such as quantiles, boxplots, variance, standard deviation, and outliers, and (3) graphic display of basic statistical descriptions, such as histograms, scatter plots, boxplots, quantile-quantile plots, and local regression curves.

    2. Data cleaning methods: Basic techniques for handling missing values, noisy data, and inconsistent data, including typical binning, clustering, and regression methods for data cleaning.

    3. Data integration and transformation methods: data smoothing, data aggregation, data generalization, normalization, and attribute (or feature) construction.

    4. Basic data reduction methods: binning (histograms), sampling, and data cube aggregation.

    5. Discretization and concept hierarchy generation: methods for numeric data (including binning, clustering, and histogram analysis) and for categorical data (automatic generation of concept hierarchies).
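
As a rough illustration of the descriptive summarization and binning topics in this unit, the following Python sketch computes a few summary statistics and assigns values to equal-width bins. The function names `summarize` and `equal_width_bins` are hypothetical, chosen only for this example, and equal-width binning is just one of several binning schemes a course might present.

```python
from statistics import mean, multimode, quantiles


def summarize(values):
    """Return basic central-tendency and dispersion measures (illustrative only)."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, q2, q3 = quantiles(values, n=4)
    return {
        "mean": mean(values),
        "modes": multimode(values),  # all most-frequent values
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": q3 - q1,  # interquartile range, a simple dispersion measure
    }


def equal_width_bins(values, k):
    """Map each value to one of k equal-width bins, numbered 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid a zero width when all values are equal
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]
```

For example, `summarize([4, 8, 15, 16, 23, 42, 8])` reports the quartiles of the sample, and `equal_width_bins([0, 5, 10], 2)` places the first value in bin 0 and the other two in bin 1.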

  3. Data Warehousing and OLAP for Data Mining. This unit introduces the concept of a data warehouse and its associated dimensional data model. It then introduces basic OLAP-style analysis on the data cube.

    1. Concept and architecture of data warehouse

    2. The dimensional data model: including dimensions and measures; star schema, snowflake schema, and fact constellations; data cube concept; concept hierarchies in the cube.

    3. OLAP operations: operations in the multidimensional data model (drill-down, roll-up, slice and dice, pivot).

  4. Association, correlation, and frequent pattern analysis. This unit covers the concepts and techniques for association, correlation, and frequent pattern analysis, including the following topics.

    1. Basic concepts: frequent patterns, associations, support and confidence of association rules, correlation measures, other objective functions or measures, and a typical application scenario: market basket analysis.

    2. Frequent pattern mining methods: (1) the Apriori algorithm, (2) improvements to Apriori, (3) mining for max-patterns, closed patterns, and top-k patterns.

    3. Mining various kinds of frequent patterns: (1) multilevel and multidimensional association rules, (2) quantitative association rules, and (3) correlation analysis.

    4. Applications of association rules: (1) Web log analysis, (2) use of association rules as classifiers.
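
The support and confidence measures named in this unit can be sketched in a few lines of Python. This is a minimal illustration, assuming transactions are represented as Python sets of item names; the helper names `support` and `confidence` and the sample baskets are made up for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # rule never applies; treated as zero confidence in this sketch
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


# Hypothetical market-basket data, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
```

With these baskets, `support({"bread", "milk"}, baskets)` is 0.5, and the rule diapers -> beer has confidence 2/3, because two of the three baskets containing diapers also contain beer.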

  5. Classification. This unit covers the concepts and techniques for classification analysis, including the following topics.

    1. Basic concepts: classification

    2. Evaluation of classification: (1) evaluation metrics, (2) validation for model selection, (3) overfitting, (4) comparing classifiers based on cost-benefit and ROC curves

    3. Bayesian classification: (1) foundation: Bayes' theorem, (2) naive Bayesian classification methods

    4. Decision tree and decision rule induction: (1) attribute selection and reduction, (2) basic top-down classification-tree induction schema, (3) pre- and post-pruning of uninformative subtrees, (4) extraction of rules from classification trees, (5) decision rule induction.

    5. Linear models for classification: (1) linear discriminant analysis, (2) classification by SVM (Support Vector Machine) analysis

    6. Basic concepts of nonlinear classification: (1) neural network, (2) SVM with nonlinear kernels

    7. Classification by lazy evaluation: (1) k-nearest neighbor classifier: basic idea and error bounds, (2) locally weighted learning

    8. Ensemble classifiers: basic ideas of why ensemble construction helps; basics of weighted voting, bagging, and boosting.
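
To make the lazy-evaluation topic in this unit concrete, here is a minimal k-nearest-neighbor sketch in Python. It assumes Euclidean distance and an unweighted majority vote, which is only one common variant; the function name `knn_predict` and the training data format are assumptions made for this example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest neighbors.

    `train` is a list of (feature_tuple, label) pairs.
    """
    # No model is built in advance: all work happens at query time ("lazy").
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training data, for illustration only.
train = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"),
]
```

A query near the origin, such as `knn_predict(train, (0.2, 0.1))`, is labeled "a", while a query at `(5, 5.5)` is labeled "b".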

  6. Cluster and Outlier Analysis. This unit covers the concepts and techniques for cluster and outlier analysis, including the following topics.

    1. Concept of cluster analysis

    2. Types of data and dissimilarity computation: interval-scaled variables, binary variables, nominal, ordinal, and ratio-scaled variables, and variables of mixed types.

    3. A categorization of major clustering methods

    4. Partition-based clustering: k-means and k-medoids algorithms, and scalable partitioning methods.

    5. Hierarchical clustering: agglomerative and divisive hierarchical clustering methods, micro-clusters: integrated and scalable hierarchical clustering methods.

    6. Density-based clustering: concept of density-based clustering, scalable mining of clustering structures, clustering based on density distribution functions.

    7. Model-based clustering: (1) the EM algorithm, (2) neural network approach (self-organizing maps, SOM)

    8. Outlier analysis: Concepts and basic outlier detection methods.
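
The partition-based clustering topic in this unit can be illustrated with a small k-means sketch in Python. This is a simplified version under stated assumptions: the caller supplies the starting centroids, distances are Euclidean, and an empty cluster keeps its previous centroid. The names `assign` and `kmeans` are hypothetical.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid-update steps until the centroids stop moving."""
    centroids = [tuple(float(x) for x in c) for c in centroids]
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New centroid is the coordinate-wise mean of the cluster members.
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # empty cluster: keep the old centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)
```

For instance, clustering the points `(0, 0)`, `(0, 1)`, `(10, 10)`, `(10, 11)` from starting centroids `(0, 0)` and `(10, 10)` settles on centroids `(0.0, 0.5)` and `(10.0, 10.5)`.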

  7. Mining Time-Series and Sequence Data. This unit covers the techniques for mining time-series and sequence data, with the following topics.

    1. Regression analysis: (1) simple and multiple linear regression, (2) nonlinear regression, (3) logistic regression, (4) regression trees, (5) regression using support vector machines, (6) other regression models.

    2. Trend analysis: A statistical approach

    3. Sequential pattern mining: Mining different kinds of sequential patterns, sequential pattern mining methods, constraint-based sequential pattern mining, closed sequential patterns, from sequential patterns to partially ordered patterns.

  8. Text Mining and Web Mining. This unit covers the techniques for mining text and Web data, including the following topics.

    1. Mining text databases: (1) Text data analysis and information retrieval, (2) keyword-based association analysis, (3) document classification, (4) text clustering analysis

    2. Mining the World-Wide Web: (1) mining the Web's link structures to identify authoritative Web pages, (2) automatic classification of Web documents, (3) construction of a multilayered Web information base, (4) mining social networks, (5) Web resource discovery, (6) Web usage mining.

  9. Visual Data Mining. This unit covers visual data mining techniques, including the following topics.

    1. Data visualization

    2. Visualization of data mining results

    3. Visual data mining: visual classifier, projection pursuits, class-preserving projections, visualizing class-structure of high-dimensional data, class tours

  10. Data Mining: Industry efforts and social impacts

    1. Social impact of data mining
    2. Data mining and privacy
    3. Standardization efforts
    4. Data mining system products


Gabor Melli 2010-06-01