KDD Papers

Functional Annotation of Human Protein Coding Isoforms via Non-convex Multi-Instance Learning

Tingjin Luo (National University of Defense Technology); Weizhong Zhang (Zhejiang University); Shuang Qiu (University of Michigan); Yang Yang (Beihang University); Dongyun Yi (National University of Defense Technology); Guangtao Wang (University of Michigan); Jieping Ye (University of Michigan); Jie Wang (University of Michigan)


Functional annotation of human genes is fundamentally important for understanding the molecular basis of various genetic diseases. A major challenge in determining the functions of human genes lies in the functional diversity of proteins: a gene can perform different functions because it may have multiple protein coding isoforms (PCIs). Therefore, differentiating the functions of PCIs can significantly deepen our understanding of the functions of genes. However, due to the lack of isoform-level gold standards (ground-truth annotations), many existing functional annotation approaches are developed at the gene level. In this paper, we propose a novel approach to differentiate the functions of PCIs by integrating sparse simplex projection, a nonconvex sparsity-inducing regularizer, with the framework of multi-instance learning (MIL). Specifically, we label the genes that are annotated with the function under consideration as positive bags and the genes without the function as negative bags. Then, by sparse projections onto the simplex, we learn a mapping that embeds the original bag space into a discriminative feature space. Our framework is flexible enough to incorporate various smooth and nonsmooth loss functions, such as the logistic loss and the hinge loss. To solve the resulting highly nontrivial nonconvex and nonsmooth optimization problem, we further develop an efficient block coordinate descent algorithm. Extensive experiments on human genome data demonstrate that the proposed approaches significantly outperform the state-of-the-art methods in terms of both functional annotation accuracy of human PCIs and efficiency.
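
To make the idea of a sparse projection onto the simplex concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the constraint set is the probability simplex restricted to at most k nonzero entries, and it uses the well-known greedy approach of keeping the k largest entries and projecting them onto the simplex. The function names project_simplex and project_sparse_simplex, the example vector, and the choice k = 2 are all hypothetical; the abstract does not specify the exact formulation.

```python
def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumsum = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        # The condition holds for a prefix of the sorted entries;
        # the last j that satisfies it determines the threshold.
        if uj - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def project_sparse_simplex(v, k):
    """Assumed setting: project v onto the simplex with at most k nonzeros.

    Greedy sketch: keep the k largest entries, project those onto the
    simplex, and set every other entry to zero.
    """
    top = sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k]
    projected = project_simplex([v[i] for i in top])
    w = [0.0] * len(v)
    for i, p in zip(top, projected):
        w[i] = p
    return w


# Hypothetical scores for four instances in one bag; keep at most two.
w = project_sparse_simplex([0.5, 0.1, 0.9, 0.3], 2)
print(w)  # approximately [0.3, 0.0, 0.7, 0.0]
```

In a multi-instance reading, such a vector could serve as nonnegative weights that select a few instances within a bag, although how the paper uses the projection inside its mapping is not stated in the abstract.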