A Practical Algorithm for Solving the Incoherence Problem of Topic Models In Industrial Applications

Amr Ahmed (Google); James Long (Google); Dan Silva (Google); Yuan Wang (Google)

Abstract

Topic models are often applied in industrial settings to discover user profiles from activity logs, where documents correspond to users and words to complex objects such as web sites and installed apps. Standard topic models ignore the content-based similarity structure between these objects, largely because the Dirichlet prior cannot capture such side information about word-word correlation. Several approaches have been proposed to replace the Dirichlet prior with more expressive alternatives. However, this added expressivity comes at a heavy premium: inference becomes intractable and sparsity is lost, which renders these alternatives unsuitable for industrial-scale applications. In this paper, we take a radically different approach to incorporating word-word correlation in topic models by applying this side information at the posterior level rather than at the prior level. We show that this choice preserves sparsity and results in a graph-based sampler for LDA whose computational complexity is asymptotically on par with state-of-the-art alias-based samplers for LDA [6]. We illustrate the efficacy of our approach on real industrial datasets that span up to billions of users, tens of millions of words, and thousands of topics. To the best of our knowledge, our approach provides the first practical and scalable solution to this important problem.

KDD Papers