Recent statistical evidence has shown that a regression model by incorporating the interactions among the original covariates (features) can significantly improve the interpretability for biological data. One major challenge is the exponentially expanded feature space when adding high-order feature interactions to the model. To tackle the huge dimensionality, Hierarchical Sparse Models (HSM) are developed by enforcing sparsity under heredity structures in the interactions among the covariates. However, existing methods only consider pairwise interactions, making the discovery of important high-order interactions a non-trivial open problem. In this paper, we propose a Generalized Hierarchical Sparse Model (GHSM) as a generalization of the HSM models to learn arbitrary-order inter-actions. The GHSM applies the l1 penalty to all the model coefficients under a constraint that given any covariate, if none of its associated kth-order interactions contribute to the regression model, then neither do its associated higher-order interactions. The resulting objective function is non-convex with a challenge lying in the coupled variables appearing in the arbitrary-order hierarchical constraints and we devise an efficient optimization algorithm to directly solve it. Specifically, we decouple the variables in the constraints via both the GIST and ADMM methods into three subproblems, each of which is proved to admit an efficiently analytical solution. We evaluate the GHSM method in both synthetic problem and the antigenic sites identification problem for the flu virus data, where we expand the feature space up to the 5th-order interactions. Empirical results demonstrate the effectiveness and efficiency of the proposed method and the learned high-order interactions have meaningful synergistic covariate patterns in the virus antigenicity.

Filed under: Dimensionality Reduction | Frequent Pattern Mining