Conversation Topics for KDD2016
Cluster Analysis
Curated by: Ian Davidson
Cluster analysis, or clustering, aims to take a collection of objects and divide them into a number of different groups such that instances in the same group (cluster) are similar to each other and dissimilar to those in other groups/clusters. It is extensively used in many domains, including image analysis, information retrieval, and bioinformatics. Clustering is traditionally exploratory: it takes no human guidance and aims to uncover the underlying structure in the data. Recent innovations include adding supervision (semi-supervised clustering), constraints (constrained clustering), and extensions to handle complex data such as graphs, evolving data, and multi-view data.
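As a concrete, if toy, illustration of the "similar within, dissimilar between" idea, here is a minimal sketch of the classic k-means loop (alternating assignment and centroid-update steps). The function name and the data are our own, not from any particular library:

```python
# Minimal k-means sketch on 1-D data (illustrative only).
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[i].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
print(kmeans(data, 2))  # two centers, one near 1.0 and one near 9.0
```

Real clustering libraries add smarter initialization and convergence checks, but the two-step structure is the same.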
Frequent Pattern Mining
Curated by: Xifeng Yan
Finding frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data. It also helps in data indexing, classification, clustering, and other data mining tasks. Frequent pattern mining is an important data mining task and a focused theme in data mining research. Abundant literature has been dedicated to this research, and tremendous progress has been made, ranging from efficient and scalable algorithms for frequent itemset mining in transaction databases to numerous research frontiers, such as sequential pattern mining, structured pattern mining, correlation mining, associative classification, and frequent pattern-based clustering, as well as their broad applications.
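To make "frequent itemset mining in transaction databases" concrete, the sketch below counts itemsets up to size two and keeps those meeting a minimum support. This is a brute-force illustration of the counting step, not the pruning-based Apriori or FP-growth algorithms; the names and example baskets are our own:

```python
# Brute-force frequent-itemset counting (illustrative sketch).
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    counts = {}
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(set(t)), size):
                counts[itemset] = counts.get(itemset, 0) + 1
    # keep only itemsets that occur in at least min_support transactions
    return {s: c for s, c in counts.items() if c >= min_support}

baskets = [["milk", "bread"], ["milk", "bread", "eggs"],
           ["bread"], ["milk", "eggs"]]
print(frequent_itemsets(baskets, min_support=2))
```

Scalable algorithms exploit the fact that every subset of a frequent itemset must itself be frequent, which lets them prune the search instead of enumerating everything as above.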
Outlier and Anomaly Detection
Curated by: Varun Chandola and Vipin Kumar
Anomalies are the unusual, unexpected, surprising patterns in the observed world. Identifying, understanding, and predicting anomalies from data form one of the key pillars of modern data mining. Effective detection of anomalies allows extracting critical information from data, which can then be used for a variety of applications, such as to stop malicious intruders, detect and repair faults in complex systems, and better understand the behavior of natural, social, and engineered systems.
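One of the simplest detection schemes, useful as a mental model, flags points that lie many standard deviations from the mean. The sketch below (our own naming and made-up sensor readings) shows the idea; real systems use far richer models of "normal" behavior:

```python
# z-score anomaly detection sketch: flag points far from the mean.
def zscore_outliers(values, threshold=3.0):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    # a point is anomalous if it lies more than `threshold` std-devs out
    return [v for v in values if abs(v - mean) > threshold * std]

readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0]
print(zscore_outliers(readings, threshold=2.0))  # flags the 25.0 spike
```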
Dimensionality Reduction
Curated by: Aarti Singh
In modern datasets, the dimensionality of the input data is typically too large to measure, store, compute, transmit, visualize, or interpret. This necessitates dimensionality reduction methods that can identify a few of the most relevant dimensions. Dimensionality reduction methods can be categorized into feature selection methods, which aim to select a subset of the given features (aka coordinates, attributes, or dimensions) that are most relevant, and feature extraction methods, which aim to identify a few transformations of the given features that are most relevant. Feature extraction methods can yield more parsimonious representations than feature selection methods; however, the latter lead to interpretable solutions, e.g., which genes are most representative (or predictive of a disease), instead of transformations of gene expressions that are most representative (or predictive).
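A minimal example of the feature-selection flavor is ranking features by how much they actually vary, since a near-constant feature carries little information. This variance filter (our own sketch; real pipelines use relevance-to-target criteria too) returns the indices of the k most variable columns:

```python
# Variance-based feature selection sketch: keep the k most variable columns.
def top_variance_features(rows, k):
    n_feats = len(rows[0])
    def var(col):
        vals = [r[col] for r in rows]
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals) / len(vals)
    # rank column indices by variance, highest first
    return sorted(range(n_feats), key=var, reverse=True)[:k]

data = [[1.0, 5.0, 0.1],
        [2.0, 5.0, 0.1],
        [3.0, 5.0, 0.2],
        [4.0, 5.0, 0.1]]
print(top_variance_features(data, 1))  # column 0 varies the most
```

Because the output is a set of original column indices, the result stays interpretable, exactly the advantage of selection over extraction noted above.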
Recommender Systems
Curated by: Yehuda Koren
Recommender systems assist users in selecting products or services most suitable to their tastes and needs. With the rapid growth of web content supply and of online item catalogs, the personalized advice offered by recommenders is vital. This, together with the widening availability of user data, has contributed to a vast interest in recommendation technologies.
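The classic collaborative-filtering intuition is that users with similar rating histories will like similar items. The toy sketch below (made-up users, items, and function names) finds the most similar user by cosine similarity over rating vectors; production recommenders rely on far more sophisticated models, such as matrix factorization:

```python
# User-based collaborative filtering sketch: find the nearest "taste neighbor".
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# rows = users, columns = items; 0 means "not rated"
ratings = {
    "alice": [5, 4, 0, 1],
    "bob":   [5, 5, 0, 1],
    "carol": [1, 0, 5, 4],
}

def most_similar(user):
    others = [u for u in ratings if u != user]
    return max(others, key=lambda u: cosine(ratings[user], ratings[u]))

print(most_similar("alice"))  # bob rates items much like alice
```

Items highly rated by the neighbor but unseen by the user then become recommendation candidates.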
Graph Mining and Social Networks
Curated by: Christos Faloutsos
Have you ever wondered how Google finds the best page for your question? How would you spot the most important people on Facebook? How would you spot fake followers on Twitter? In a who-contacts-whom network, which are the best nodes to immunize, to stop a flu epidemic?
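The first of those questions is famously answered by PageRank: a node is important if important nodes link to it. The power-iteration sketch below (the three-page "web" is made up) captures the idea in a few lines:

```python
# Minimal power-iteration PageRank sketch.
def pagerank(links, damping=0.85, iters=50):
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # every node keeps a small "teleport" share...
        new = {v: (1 - damping) / n for v in nodes}
        for v, outs in links.items():
            if outs:
                # ...and passes the rest of its rank along its out-links
                share = damping * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:  # dangling node: spread its rank evenly
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # "c" collects the most incoming rank
```

The same random-walk viewpoint underlies centrality measures used to spot influential people and anomalous accounts in social networks.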
Deep Learning
Curated by: Wei Fan
Deep learning attempts to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations. It is part of a broader family of machine learning methods based on learning representations of data. An observation, for example an image, can be represented in many ways such as a vector of intensity values per pixel, or in a more abstract way as a set of edges, regions of particular shape, etc.
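The phrase "multiple processing layers composed of non-linear transformations" can be made concrete in a few lines. The toy network below (weights chosen by hand purely for illustration) stacks two fully connected layers with a ReLU non-linearity between them, turning a raw input vector into a more abstract one:

```python
# Toy two-layer network sketch: compose linear maps with a non-linearity.
def relu(x):
    return [max(0.0, v) for v in x]

def dense(x, weights, bias):
    # one fully connected layer: y[j] = sum_i x[i] * weights[j][i] + bias[j]
    return [sum(xi * w for xi, w in zip(x, row)) + b
            for row, b in zip(weights, bias)]

# layer 1 maps a 3-d "pixel" vector to a 2-d hidden representation
w1, b1 = [[1.0, -1.0, 0.5], [0.5, 0.5, 0.5]], [0.0, -0.5]
# layer 2 maps the hidden representation to a single output
w2, b2 = [[1.0, 1.0]], [0.0]

x = [1.0, 2.0, 3.0]
h = relu(dense(x, w1, b1))   # hidden features (edge-like abstractions)
y = dense(h, w2, b2)
print(y)
```

In real deep learning the weights are of course learned from data by backpropagation rather than set by hand, but the layered composition is exactly this.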
Big Data
Curated by: H V Jagadish
The development of massively distributed computing infrastructures has changed the economics of data management, and made it possible to apply sophisticated data distillation and learning methods to datasets of unprecedented scale, diversity, and freshness; a technical and social phenomenon that has been dubbed Big Data. The sheer size of the data, of course, is a major challenge, and is the one that is most easily recognized. However, there are others. Industry analysis companies like to point out that there are challenges not just in Volume, but also in Variety and Velocity, and that companies should not focus on just the first of these.
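The programming style behind those distributed infrastructures can be shown with a toy map/shuffle/reduce word count. Everything here runs in one process and the names are our own, but the three phases mirror how frameworks split work across machines:

```python
# Toy MapReduce-style word count (single-process sketch of the pattern).
from collections import defaultdict

def map_phase(chunk):
    # each worker emits (key, 1) pairs for its chunk of the input
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # group all values by key, as the framework would between phases
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # each reducer aggregates the values for its keys
    return {k: sum(vs) for k, vs in groups.items()}

chunks = ["big data big", "data velocity"]
pairs = [p for c in chunks for p in map_phase(c)]
print(reduce_phase(shuffle(pairs)))
```

The value of the pattern is that map and reduce tasks are independent, so adding machines addresses Volume directly; Variety and Velocity, as noted above, need more than raw parallelism.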
Time Series and Stream Mining
Curated by: Eamonn Keogh
It is in the nature of humans to measure things, and (with rare exceptions) things change over time. A familiar example is a heartbeat, which represents the change in the heart's electrical activity. A collection of such temporal measurements is called a “time series”. Other familiar examples include a politician’s popularity waxing and waning, or the temperature rising and falling over the short term (each day), the medium term (each year), and the long term (climate-change drift).
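A first operation almost every time-series pipeline performs is smoothing, to separate slow trends from fast fluctuations. The sliding-window moving average below (our own sketch, with a made-up oscillating signal) is the simplest version:

```python
# Moving-average smoothing sketch: average each length-`window` slice.
def moving_average(series, window):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

signal = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]   # a fast oscillation
print(moving_average(signal, 2))          # the window averages it out
```

With a short window this tracks daily-scale wiggles; with a long window it exposes the slow drift, which is exactly the short/medium/long-term decomposition described above.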
Mining Rich Data Types
Curated by: Huan Liu
The very first issue in data mining and knowledge discovery is to handle data properly, and it is essential to take different data types into account. Rich data types can be categorized into non-dependency and dependency data. Non-dependency data is the most commonly encountered type; it refers to data without specified dependencies between data instances. In other words, data instances are, or are assumed to be, independent and identically distributed.
Privacy-Preserving Data Mining
Curated by: Chris Clifton
The ever-increasing collection of personal data, and the growing capabilities to analyze that data, pose increased risks to personal privacy. This has long been a concern for the SIGKDD community; in 2003 there was actually a Data Mining Moratorium Act proposed in the U.S. Senate (for more details, see the SIGKDD response). While there have been examples of data mining that people feel are privacy violations, such as Target’s pregnancy prediction, most privacy problems have come from security failures leading to data breaches, rather than the data analysis itself.
Matrix Methods and Optimization
Curated by: Tao Li
The field of data mining increasingly adapts methods and algorithms from advanced matrix computations, graph theory and optimization. In these methods, the data is described using matrix representations (graphs are represented by their adjacency matrices) and the data mining problem is formulated as an optimization problem with matrix variables. With these, the data mining task becomes a process of minimizing or maximizing a desired objective function of matrix variables.
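As a small worked instance of "minimizing an objective function of matrix variables", the sketch below fits x to minimize ||Ax - b||^2 by gradient descent, using the gradient 2 A^T (Ax - b). The matrices and step size are chosen purely for illustration:

```python
# Gradient descent on the matrix objective f(x) = ||Ax - b||^2 (sketch).
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def grad_step(A, x, b, lr):
    r = [yi - bi for yi, bi in zip(matvec(A, x), b)]  # residual Ax - b
    # gradient of f is 2 * A^T r
    g = [2 * sum(A[i][j] * r[i] for i in range(len(A)))
         for j in range(len(x))]
    return [xi - lr * gi for xi, gi in zip(x, g)]

A = [[1.0, 0.0], [0.0, 2.0]]
b = [1.0, 4.0]
x = [0.0, 0.0]
for _ in range(200):
    x = grad_step(A, x, b, lr=0.1)
print([round(v, 3) for v in x])  # converges to [1.0, 2.0]
```

Many data mining formulations (low-rank factorization, spectral clustering, and so on) have exactly this shape: a matrix-valued objective driven to an optimum by iterative updates.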
Semi-Supervised Learning
Curated by: Jerry Xiaojin Zhu
Semi-supervised learning uses both labeled and unlabeled data to improve supervised learning. The goal is to learn a predictor that predicts future test data better than the predictor learned from the labeled training data alone. Semi-supervised learning is motivated by its practical value in learning faster, better, and cheaper. In many real-world applications, it is relatively easy to acquire a large amount of unlabeled data x.
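One of the simplest semi-supervised schemes is self-training: fit on the labeled data, confidently pseudo-label some unlabeled points, and refit. The sketch below does one round with a 1-D nearest-centroid learner; the setup, confidence rule, and data are all our own illustrative choices:

```python
# One-round self-training sketch with a nearest-centroid base learner.
def centroid(points):
    return sum(points) / len(points)

def self_train(labeled, unlabeled, confidence=2.0):
    classes = {y for _, y in labeled}
    # fit: one centroid per class from the labeled data alone
    cents = {c: centroid([x for x, y in labeled if y == c]) for c in classes}
    # pseudo-label unlabeled points that sit clearly nearer one centroid
    for x in unlabeled:
        dists = sorted((abs(x - m), c) for c, m in cents.items())
        if dists[1][0] / max(dists[0][0], 1e-9) >= confidence:
            labeled = labeled + [(x, dists[0][1])]
    # refit on the enlarged training set
    return {c: centroid([x for x, y in labeled if y == c]) for c in classes}

labeled = [(0.0, "neg"), (10.0, "pos")]
unlabeled = [1.0, 2.0, 9.0, 5.1]
print(self_train(labeled, unlabeled))
```

Note that the ambiguous point 5.1 is left out: declining to guess on low-confidence points is what keeps self-training from reinforcing its own mistakes.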
Data Reliability and Truthfulness
Curated by: Jiawei Han and Jing Gao
The data reliability issue poses great difficulty for many decision making tasks when the data contains inconsistent, inaccurate, or even false information that could mislead the decisions and eventually result in substantial losses. Unfortunately, we cannot expect real-world data to be clean and accurate; instead, data inconsistency, ambiguity, and uncertainty widely exist. Such ubiquitous veracity problems motivate numerous efforts towards improving information quality, trustworthiness, and reliability.
Large Scale Machine Learning Systems
Curated by: Eric P. Xing and Qirong Ho
The rise of Big Data requires complex Machine Learning models with millions to billions of parameters, which promise adequate capacity to digest massive datasets and offer powerful predictive analytics (such as high-dimensional latent features, intermediate representations, and decision functions) thereupon. In turn, this has led to new demands on Machine Learning (ML) systems to learn such models at scale. In order to support the computational needs of ML algorithms at such scales, an ML system often needs to operate on distributed clusters with 10s to 1000s of machines; however, implementing algorithms and writing systems software for such distributed clusters demands significant design and engineering effort.
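A core pattern in such systems is data parallelism: each worker computes a gradient on its shard of the data, and a coordinator (a "parameter server" in many designs) averages the gradients and updates the shared model. The single-process sketch below mimics that structure on a one-parameter model; all names and data are our own:

```python
# Toy data-parallel SGD step, mimicking the parameter-server pattern.
def local_gradient(w, shard):
    # gradient of mean squared error for the 1-parameter model y = w * x
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def distributed_step(w, shards, lr=0.05):
    grads = [local_gradient(w, s) for s in shards]  # run on workers, in parallel
    avg = sum(grads) / len(grads)                   # aggregated by the server
    return w - lr * avg                             # updated model, broadcast back

# two workers, each holding a shard of data generated by y = 2x
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(100):
    w = distributed_step(w, shards)
print(round(w, 3))  # recovers w = 2.0
```

The engineering difficulty the paragraph describes lives in what this sketch hides: network communication, stragglers, fault tolerance, and consistency of the shared parameters across 10s to 1000s of machines.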
Classification
Curated by: Aidong Zhang
Classification – assigning labels to objects – is one of the cornerstone tasks in data mining. Many day-to-day activities, some so involuntary that we don’t even realize doing them, are classification tasks – “identifying your car in the parking lot” or “recognizing a family member in a crowd”. These seemingly simple tasks for humans, however, are extremely difficult for computers and form the core of many AI problems.
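A one-line classifier makes the task concrete: given labeled examples, assign a new object the label of its nearest example in feature space. The features, labels, and data below are invented for illustration:

```python
# Nearest-neighbor classification sketch (1-NN on made-up car data).
def classify(example, training):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # label of the closest training example wins
    return min(training, key=lambda t: dist(example, t[0]))[1]

# (length, width) in meters, with a vehicle-type label
cars = [((4.5, 1.8), "sedan"), ((4.4, 1.9), "sedan"),
        ((5.9, 2.5), "truck"), ((6.1, 2.4), "truck")]
print(classify((4.6, 1.8), cars))  # labeled "sedan"
```

The hard part, and the reason "your car in the parking lot" is an AI problem, is not this decision rule but obtaining features under which similar-looking objects actually end up close together.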