Saket Sathe (IBM T. J. Watson Research Center); Charu Aggarwal (IBM T. J. Watson Research Center)
Random forests are among the most successful methods used in data mining because of their extraordinary accuracy and effectiveness. However, their use is primarily limited to multidimensional data because they sample features from the original data set. In this paper, we propose a method for extending random forests to work with any arbitrary set of data objects, as long as similarities can be computed among the data objects. Furthermore, since computing similarities between all $O(n^2)$ pairs of $n$ objects might be expensive, our method computes only a very small fraction of the $O(n^2)$ pairwise similarities to construct the forests. Our results show that the proposed similarity forest approach is extremely efficient and is also very accurate on a wide variety of data sets. Therefore, this paper significantly extends the applicability of random forest methods to arbitrary data domains. Furthermore, the approach even outperforms traditional random forests on multidimensional data. In many cases, the similarity matrices learned from arbitrary applications are noisy because of the difficulty in estimating similarity values between pairs of objects. Similarity forests remain very robust to such errors when used for classification. In many practical settings, the similarity values between objects are also incompletely specified because of the difficulty in collecting such values. In such cases, the similarity forest approach can be naturally extended to a partially specified similarity matrix.