KDD Papers

Randomization or Condensation?: Linear-Cost Matrix Sketching Via Cascaded Compression Sampling

Kai Zhang (Lawrence Berkeley National Laboratories);Chuanren Liu (Drexel University);Jie Zhang (Fudan University);Hui Xiong (Rutgers University);Eric Xing (Carneigie Mellon University);Jieping Ye (University of Michigan)


The problem of matrix decomposition is to learn compact representations of a matrix while simultaneously preserving most of its properties, which is a fundamental building block in modern scientific computing and big data applications. Currently, even state-of-the-art solutions still require the use of the entire input matrix in generating desired factorizations, causing a major computational and memory bottleneck. In this paper, we uncover an interesting theoretic connection between matrix low-rank decomposition and \emph{ lossy data compression}, based on which a cascaded compression sampling framework is devised to approximate an $m\times n$ matrix in only $\mathcal{O}{(m+n)}$ time and space. Indeed, the proposed method accesses only a small number of matrix rows and columns, which significantly improves the memory footprint. Meanwhile, by sequentially teaming two rounds of approximation procedures and upgrading the sampling strategy from a uniform probability to more sophisticated, encoding-orientated sampling, significant algorithmic boosting is achieved to uncover more granular structures in the data. Empirical results on a wide spectrum of real-world, large-scale matrices show that by taking only linear time and space, the accuracy of our method rivals those state-of-the-art randomized algorithms consuming a quadratic, $\mathcal{O}(mn)$, amount of resources.