Linearized GMM Kernels and Normalized Random Fourier Features
Ping Li (Rutgers University)
The method of ``random Fourier features (RFF)'' has become a popular tool for approximating the ``radial basis function (RBF)'' kernel. The variance of RFF, however, is actually very large. Interestingly, as we demonstrate theoretically, this variance can be substantially reduced by a simple normalization step. We name the improved scheme the ``normalized RFF (NRFF)'', and we provide a technical proof of the theoretical variance of NRFF, as validated by simulations.
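To make the two feature maps concrete, here is a minimal NumPy sketch; the names `rff` and `nrff`, the dimensions, and the bandwidth choice are our illustration, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 4096, 1.0               # input dim, number of samples, RBF bandwidth

# The same projections W and phases b must be shared by all data points.
W = rng.normal(0.0, 1.0 / sigma, size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def rff(x):
    """Standard RFF map: E[rff(x) @ rff(y)] = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.sqrt(2.0 / D) * np.cos(x @ W + b)

def nrff(x):
    """Normalized RFF: rescale the RFF feature vector to unit Euclidean norm."""
    z = rff(x)
    return z / np.linalg.norm(z)

x = rng.normal(size=d)
y = x + 0.3 * rng.normal(size=d)
true_k = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma**2))
# For large D, both inner products approximate true_k; across repeated
# draws of (W, b), the normalized estimate has noticeably lower variance.
```

The normalization costs one extra pass per vector, yet it removes much of the estimator's fluctuation, which is the source of the variance reduction the abstract refers to.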
We also propose the ``generalized min-max (GMM)'' kernel as a measure of data similarity, applicable to data vectors with both positive and negative entries. GMM is positive definite, as there is an associated hashing method, named ``generalized consistent weighted sampling (GCWS)'', which linearizes this (nonlinear) kernel. We provide an extensive empirical evaluation of the RBF and GMM kernels on more than 50 datasets. For a majority of the datasets, the (tuning-free) GMM kernel outperforms the best-tuned RBF kernel.
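The GMM kernel is simple to compute: split each vector into its positive and negative parts, then take the ratio of the coordinate-wise minima to the coordinate-wise maxima. A minimal sketch (the function name `gmm_kernel` is ours):

```python
import numpy as np

def gmm_kernel(x, y):
    """Generalized min-max (GMM) kernel for general real-valued vectors.
    Each vector is first split into its positive and negative parts, so the
    standard min-max kernel can be applied to the nonnegative result."""
    def split(v):
        v = np.asarray(v, dtype=float)
        return np.concatenate([np.maximum(v, 0.0), np.maximum(-v, 0.0)])
    xs, ys = split(x), split(y)
    return np.sum(np.minimum(xs, ys)) / np.sum(np.maximum(xs, ys))

# Identical vectors give GMM = 1; differing magnitudes or signs reduce it.
gmm_kernel([1.0, -2.0, 3.0], [1.0, -2.0, 3.0])   # -> 1.0
```

Note that the kernel has no tuning parameter, which is what makes the "tuning-free" comparison against the best-tuned RBF kernel meaningful.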
We also conduct extensive experiments comparing the linearized RBF kernel using NRFF hashing with the linearized GMM kernel using GCWS hashing. We observe that, in order to reach a comparable classification accuracy, GCWS typically requires substantially fewer samples than NRFF, even on datasets where the original RBF kernel outperforms the original GMM kernel. Since the training, storage, and processing costs are all directly proportional to the sample size, our experiments help demonstrate that GCWS would be a much more practical scheme for large-scale machine learning.
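The GCWS samples used in such comparisons can be sketched as follows, assuming GCWS amounts to consistent weighted sampling (Ioffe's construction) applied to the positive/negative split of each vector; the helper `gcws_hash` and the small collision-rate check are our illustration, not the paper's implementation:

```python
import numpy as np

def gcws_hash(v, rng):
    """One GCWS sample: consistent weighted sampling on the positive/negative
    split of v. Two vectors hashed with the SAME randomness collide with
    probability equal to their GMM kernel similarity."""
    s = np.concatenate([np.maximum(v, 0.0), np.maximum(-v, 0.0)])
    idx = np.flatnonzero(s)
    r = rng.gamma(2.0, 1.0, size=s.size)       # shared random variates
    c = rng.gamma(2.0, 1.0, size=s.size)
    beta = rng.uniform(0.0, 1.0, size=s.size)
    t = np.floor(np.log(s[idx]) / r[idx] + beta[idx])
    y = np.exp(r[idx] * (t - beta[idx]))
    a = c[idx] / (y * np.exp(r[idx]))
    k = np.argmin(a)
    return int(idx[k]), int(t[k])

x = np.array([1.0, -2.0, 3.0])
y = np.array([1.0, -2.0, 2.5])                  # GMM(x, y) = 5.5 / 6
K = 2000
matches = sum(
    gcws_hash(x, np.random.default_rng(k)) == gcws_hash(y, np.random.default_rng(k))
    for k in range(K)
)
est = matches / K                               # collision rate ~ GMM(x, y)
```

Each sample yields a discrete (index, level) pair rather than a dense real-valued feature, which is one reason the per-sample storage and processing costs stay low.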
The empirical success of GCWS (compared to NRFF) can also be explained theoretically, in at least two ways. Firstly, the relative variance (normalized by the squared expectation) of GCWS is substantially smaller than that of NRFF, except in the very-high-similarity region (where the variances of both methods are close to zero). Secondly, under a general model assumption on the data, we can show analytically that GCWS exhibits much smaller variance than NRFF for estimating the same object (e.g., the RBF kernel), except in the high-similarity region.