- Introduction.
Basic concepts of data mining, including motivation, definition, the
relationships of data mining with database systems, statistics,
machine learning, different kinds of data repositories on which data
mining can be performed, different kinds of patterns and knowledge to
be mined, the concept of interestingness, and the current trends and
developments of data mining. The material can probably be
introduced by showing a few case studies.
- Concepts of data mining: motivation, definition,
the relationships of data mining with database systems,
statistics, machine learning, and information retrieval.
- Knowledge discovery process: An overview of the
Knowledge Discovery Process. Emphasis on the iterative and
interactive nature of the KDD Process.
- Mining on different kinds of data: relational,
transactional, object-relational, heterogeneous, spatiotemporal,
text, multimedia, Web, stream, mobile, and so on.
- Mining for different kinds of knowledge:
classification, regression, clustering, frequent patterns,
discriminant, outliers, and so on.
- Evaluation of knowledge: interestingness or quality
of knowledge, including accuracy, utility (such as support), and
relevance (such as correlation).
- Applications of data mining: market analysis,
scientific and engineering process analysis, bioinformatics,
homeland security, and so on.
- Data Preprocessing. This unit will cover the following
topics: (1) why preprocess the data? (2) basic data cleaning
techniques, (3) data integration and transformation, and (4) data
reduction methods. In particular, the following topics will be
covered.
- Descriptive data summarization: This unit covers
basic techniques for summarizing and describing data. It will
cover: (1) computing the measures of central tendency such as mean
and mode, (2) computing the measures of data dispersion such as
quantiles, boxplots, variances, standard deviation, and outliers,
and (3) graphic display of basic statistical descriptions, such as
histogram, scatter plot, boxplot, quantile-quantile plot, and local
regression curves.
- Data cleaning methods: Basic techniques for
handling missing values, noisy data, and inconsistent data,
including typical binning, clustering, and regression methods for
data cleaning.
- Data integration and transformation methods: This
includes data smoothing, data aggregation, data generalization,
normalization, and attribute (or feature) construction.
- Basic data reduction methods: It introduces
binning (histograms), sampling, and data cube aggregation.
- Discretization and concept hierarchy generation: It
covers discretization and concept hierarchy generation for numeric
data (including binning, clustering, histogram analysis), and for
categorical data (automatic generation of concept hierarchies).
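As an illustration of the transformation and discretization topics above, here is a minimal sketch of min-max normalization and equal-width binning. The function names and the sample values are assumptions made for this example; they are not part of the course material.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map everything to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins, numbered 0 .. k-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


# Hypothetical attribute values, for illustration only.
ages = [13, 15, 16, 19, 20, 21, 25, 30, 35, 70]
normalized_ages = min_max_normalize(ages)
age_bins = equal_width_bins(ages, 3)
```

Equal-width binning is sensitive to extreme values such as the 70 above, which is one reason a course may also cover equal-frequency binning and clustering-based discretization.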
- Data Warehousing and OLAP for Data Mining. This
unit introduces the concept of a data warehouse and its associated
dimensional data model. It then introduces basic OLAP-style
analysis on the data cube.
- Concept and architecture of data warehouse
- The dimensional data model: including dimensions
and measures; star schema, snowflake schema, and fact
constellations; data cube concept; concept hierarchies in the
cube.
- OLAP Operations. OLAP operations in the
multidimensional data model (drill-down, roll-up, slice and dice,
pivot)
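The roll-up and slice operations listed above can be sketched on a tiny fact table. This is only a toy model under stated assumptions: the rows, the dimension names (city, country, quarter), and the helper names are hypothetical, and a real OLAP engine would precompute and index the cube rather than scan rows.

```python
from collections import defaultdict

# Hypothetical fact table: each row has dimension values and one measure.
facts = [
    {"city": "Vancouver", "country": "Canada", "quarter": "Q1", "sales": 100},
    {"city": "Vancouver", "country": "Canada", "quarter": "Q2", "sales": 150},
    {"city": "Toronto", "country": "Canada", "quarter": "Q1", "sales": 200},
    {"city": "Chicago", "country": "USA", "quarter": "Q1", "sales": 300},
]


def roll_up(rows, group_by, measure):
    """Aggregate (sum) the measure over the listed dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[dim] for dim in group_by)
        totals[key] += row[measure]
    return dict(totals)


def slice_rows(rows, dimension, value):
    """Keep only the rows where one dimension has a fixed value."""
    return [row for row in rows if row[dimension] == value]


# Roll up from city level to country level along the location hierarchy.
sales_by_country = roll_up(facts, ["country"], "sales")
# Slice on quarter = Q1, then view sales by city.
q1_sales_by_city = roll_up(slice_rows(facts, "quarter", "Q1"), ["city"], "sales")
```

Drill-down is the reverse of the roll-up shown here: grouping by a finer level of the hierarchy, such as city instead of country.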
- Association, correlation, and frequent pattern
analysis. This unit covers the concepts and techniques for
association, correlation, and frequent pattern analysis, including
the following topics.
- Basic concepts: frequent patterns, associations,
support and confidence of association rules, correlation measure,
other objective functions or measures, a typical application
scenario: market basket analysis.
- Frequent pattern mining methods: (1) the Apriori
algorithm, (2) improvements to Apriori, (3) mining for max-patterns,
closed patterns, and top-k patterns.
- Mining various kinds of frequent patterns: (1)
multilevel and multidimensional association rules, (2)
quantitative association rules, and (3) correlation analysis.
- Applications of association rules: (1) Web log
analysis, (2) usage of association rules as classifiers
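To make the support and confidence measures from the basic concepts concrete, the following sketch computes them for a market basket scenario. The baskets and function names are invented for illustration; the definitions used are the usual ones (support as the fraction of transactions containing an itemset, confidence as the support of the whole rule divided by the support of its antecedent).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) for the rule antecedent => consequent."""
    ante = support(transactions, antecedent)
    if ante == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante


# Hypothetical market baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

bread_milk_support = support(baskets, {"bread", "milk"})
beer_to_diapers = confidence(baskets, {"beer"}, {"diapers"})
```

An Apriori-style miner would call a support routine like this level by level, extending only those itemsets whose subsets are all frequent.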
- Classification. This unit covers
the concepts and techniques for classification analysis, including
the following topics.
- Basic concepts: classification
- Evaluation of classification: (1)
evaluation metric, (2) validation for model selection, (3)
overfitting, (4) comparing classifiers based on cost-benefit and ROC
curves
- Bayesian classification: (1) foundation: Bayes
theorem, (2) Naive Bayesian classification methods
- Decision tree and decision rule induction: (1)
attribute selection and reduction, (2) basic top-down
classification-tree induction schema, (3) pre/post-pruning
uninformative subtrees, (4) extraction of rules from
classification trees, (5) decision rule induction.
- Linear models for classification: (1) linear
discriminant analysis, (2) classification by SVM (Support Vector
Machine) analysis
- Basic concepts of nonlinear classification: (1)
neural network, (2) SVM with nonlinear kernels
- Classification by lazy evaluation: (1) k-nearest
neighbor classifier: basic idea and error bounds, (2) locally
weighted learning
- Ensemble classifiers: Basic ideas of why ensemble
construction helps; basics of weighted voting, bagging, and boosting.
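As one small example from this unit, the k-nearest neighbor classifier can be sketched in a few lines. This is a minimal sketch assuming numeric features, Euclidean distance, and unweighted majority voting; the training points and the function name are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for query by majority vote among its k nearest neighbors.

    train is a list of (feature_tuple, label) pairs.
    """
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set, for illustration only.
training_set = [
    ((1.0, 1.0), "a"), ((1.0, 2.0), "a"), ((2.0, 1.0), "a"),
    ((8.0, 8.0), "b"), ((8.0, 9.0), "b"), ((9.0, 8.0), "b"),
]

label_near_a = knn_predict(training_set, (1.5, 1.5))
label_near_b = knn_predict(training_set, (8.5, 8.5))
```

The classifier is "lazy" in that nothing is learned up front: all work happens at query time, which is why scaling it usually requires an index over the training data.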
- Cluster and Outlier Analysis. This
unit covers the concepts and techniques for cluster and outlier
analysis, including the following topics.
- Concept of cluster analysis
- Types of data for dissimilarity computation:
Interval-scaled variables, binary variables, nominal, ordinal, and
ratio-scaled variables, and variables of mixed types.
- A categorization of major clustering methods
- Partition-based clustering: k-means and k-medoids
algorithms, and scalable partitioning methods.
- Hierarchical clustering: agglomerative and divisive
hierarchical clustering methods, micro-clusters: integrated and
scalable hierarchical clustering methods.
- Density-based clustering: concept of density-based
clustering, scalable mining of clustering structures, clustering
based on density distribution functions.
- Model-based clustering:
(1) The EM Algorithm, (2) neural network approach (SOM)
- Outlier analysis: Concepts and basic outlier detection methods.
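The k-means algorithm named under partition-based clustering can be sketched as follows. This is a simplified version under stated assumptions: points are numeric tuples, distance is Euclidean, and the initial centroids are simply the first k points (real implementations usually pick them randomly or with a seeding heuristic). The function name and sample points are hypothetical.

```python
import math


def k_means(points, k, iterations=100):
    """Cluster points into k groups; returns (centroids, clusters)."""
    centroids = [tuple(p) for p in points[:k]]  # naive initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two well-separated groups.
sample_points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
final_centroids, final_clusters = k_means(sample_points, 2)
```

k-medoids follows the same assign-and-update pattern but restricts each cluster center to an actual data point, which makes it less sensitive to outliers.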
- Mining Time-Series and Sequence Data. This unit covers
the techniques for mining time-series and sequence data, with the
following topics.
- Regression analysis: (1) simple and multiple linear
regression, (2) nonlinear regression, (3) logistic regression, (4)
regression trees, (5) regression using Support Vector Machine, (6)
other regression models.
- Trend analysis: A statistical approach
- Sequential pattern mining: Mining different kinds
of sequential patterns, sequential pattern mining methods,
constraint-based sequential pattern mining, closed sequential
patterns, from sequential patterns to partially ordered patterns.
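For the regression topic in this unit, simple linear regression by ordinary least squares can be sketched directly from its closed-form solution. The function name and the sample series are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical time index and observed values lying exactly on y = 2x + 1.
time_steps = [1, 2, 3, 4]
observations = [3, 5, 7, 9]
slope, intercept = fit_line(time_steps, observations)
```

Fitting such a line against the time index is also one simple way to estimate the long-term trend component in trend analysis.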
- Text Mining and Web Mining. This unit covers the
techniques for mining text and Web data, including the following
topics.
- Mining text databases: (1) Text data analysis and
information retrieval, (2) keyword-based association analysis, (3)
document classification, (4) text clustering analysis
- Mining the World-Wide Web: (1) Mining the Web's
link structures to identify authoritative Web pages, (2) automatic
classification of Web documents, (3) construction of a
multilayered Web information base, (4) mining social networks, (5)
Web resource discovery, (6) Web usage mining.
- Visual Data Mining. This unit covers the
visual data mining techniques, including the following topics.
- Data visualization
- Visualization of data mining results
- Visual data mining: visual classifier, projection
pursuits, class-preserving projections, visualizing
class-structure of high-dimensional data, class tours
- Data Mining: Industry efforts and social impacts
- Social impact of data mining
- Data mining and privacy
- Standardization efforts
- Data mining system products
Gabor Melli
2010-06-01