KunPeng: Parameter Server based Distributed Learning Systems and Its Applications in Alibaba and Ant Financial
Jun Zhou (Ant Financial Group);Xiaolong Li (Ant Financial Group);Peilin Zhao (Ant Financial Group);Chaochao Chen (Ant Financial Group);Longfei Li (Ant Financial Group);Xinxing Yang (Ant Financial Group);Qing Cui (Alibaba Cloud);Jin Yu (Alibaba Cloud);Xu Chen (Alibaba Cloud);Yi Ding (Alibaba Cloud);Yuan Qi (Ant Financial Group)
In recent years, due to the emergence of Big Data (terabytes or petabytes) and Big Model (tens of billions of parameters), there has been an ever-increasing need of parallelizing machine learning (ML) algorithms in both academia and industry. Although there are some existing distributed computing systems, like Hadoop and Spark, for parallelizing ML algorithms, they only provide synchronous and coarse-grained operators (e.g., Map, Reduce, and Join, etc.), which may hinder developers from implementing more efficient algorithms. This motivated us to design a universal distributed platform termed KunPeng, that combines both distributed systems and parallel optimization algorithms to deal with the complexities that arise from large-scale ML. Specifically, KunPeng not only encapsulates the characteristics of data/model parallelism, load balancing, model sync-up, sparse representation, industrial fault-tolerance, etc., but also provides easy-to-use interface to empower users to focus on the core ML logics. Empirical results on terabytes of real datasets with billions of samples and features demonstrate that, such a design brings compelling performance improvements on ML programs ranging from Follow-the-Regularized-Leader Proximal algorithm to Sparse Logistic Regression and Multiple Additive Regression Trees. Furthermore, KunPeng’s encouraging performance is also shown for several real-world applications including the Alibaba’s Double 11 Online Shopping Festival.