Learning large-scale Latent Dirichlet Allocation (LDA) models is beneficial for many applications that involve large collections of documents. Recent work has been focusing on developing distributed algorithms in the batch setting, while leaving stochastic methods behind, which can effectively explore statistical redundancy in big data and thereby are complementary to distributed computing. The distributed stochastic gradient Langevin dynamics (DSGLD) represents one attempt to combine stochastic sampling and distributed computing, but it suffers from drawbacks such as excessive communications and sensitivity to partitioning of datasets across nodes. DSGLD is typically limited to learn small models that have about 10^3 topics and 10^3 vocabulary size.

In this paper, we present embarrassingly parallel SGLD (EPS-GLD), a novel distributed stochastic gradient sampling method for topic models. Our sampler is built upon a divide-and-conquer architecture which enables us to produce robust and asymptotically exact samples with less communication overhead than DSGLD. We further propose several techniques to reduce the overhead in I/O and memory usage. Experiments on Wikipedia and ClueWeb12 documents demonstrate that, EPSGLD can scale up to large models with 10^10 parameters (i.e., 10^5 topics, 10^5 vocabulary size), four orders of magnitude larger than DSGLD, and converge faster.

Filed under: Big Data | Dimensionality Reduction | Mining Rich Data Types