A Practical Algorithm for Solving the Incoherence Problem of Topic Models In Industrial Applications
Amr Ahmed (Google); James Long (Google); Dan Silva (Google); Yuan Wang (Google)
Abstract
Topic models are often applied in industrial settings to discover user profiles from activity logs, where documents correspond to users and words to complex objects such as web sites and installed apps. Standard topic models ignore the content-based similarity structure between these objects, largely because the Dirichlet prior cannot capture such side information about word-word correlation. Several approaches have been proposed to replace the Dirichlet prior with more expressive alternatives. However, this added expressivity comes at a heavy premium: inference becomes intractable and sparsity is lost, which renders these alternatives unsuitable for industrial-scale applications. In this paper, we take a radically different approach to incorporating word-word correlation in topic models by applying this side information at the posterior level rather than at the prior level. We show that this choice preserves sparsity and results in a graph-based sampler for LDA whose computational complexity is asymptotically on par with state-of-the-art alias-based samplers for LDA [6]. We illustrate the efficacy of our approach on real industrial datasets that span up to billions of users, tens of millions of words, and thousands of topics. To the best of our knowledge, our approach provides the first practical and scalable solution to this important problem.