KDD 2005 - Randomized Algorithms for Matrices and Massive Data Sets: Aug 21-24, Chicago, IL. USA

Tutorial: Randomized Algorithms for Matrices and Massive Data Sets

Presenters:

Petros Drineas
drinep at cs rpi edu

Michael W. Mahoney
michael.mahoney at yale edu

Abstract:

Extremely large matrices and, more generally, extremely large tensors (multi-mode arrays) arise in numerous applications of Information Retrieval and Data Mining. In these applications, a large collection of $m$ objects, e.g., documents, genomes, images, or web pages, is implicitly represented as a set of points in an $n$-dimensional Euclidean space, where $n$ is the number of features that describe an object. Thus, this collection may be represented by an $m$-by-$n$ matrix $A$, the rows of which are the object vectors and the columns of which are the feature vectors. Notice that if the object vectors evolve in a structured way over time -- consider, for example, the evolution of the web graph over time -- then a third dimension (time) might be added. Algorithms that efficiently operate on linear algebraic objects such as matrices and tensors are of particular interest and importance for efficiently mining data in today's highly networked and cooperative environments.

The main theme of the tutorial will be on sampling techniques that extract structure from large matrices and tensors by reading only a few elements and/or a few rows and/or a few columns of the matrix or tensor. These elements/rows/columns are sampled from the data matrices or tensors probabilistically, in one or more sequential passes through the data. The tutorial will cover both theoretical foundations of these techniques, as well as their applications in the context of important KDD research topics. Applications of these techniques include explaining the success of LSI for large document corpora, nearest neighbor queries, ``sketching approaches'' for matrices and tensors, speeding up kernel computations, recommendation systems and collaborative filtering, and a large class of algorithmic problems in the framework of the streaming and the pass-efficient model. Empirical work that will be discussed includes not only application of these methods to web modeling and identification of communities in networked systems, but also scientific and economic applications to data sets in bioinformatics and recommendation system analysis.

Webmaster: Michal Sabala