Frontiers of Data and Computing, 2019, Vol. 1, Issue (1): 63-72. doi: 10.11871/jfdc.issn.2096.742X.2019.01.007

Special Issue: Data and Computing Platforms

Angel +: A Large-Scale Machine Learning Platform on Angel

Zhipeng Zhang¹, Jiawei Jiang², Lele Yu², Bin Cui¹

  1. Department of Computer Science & Key Laboratory of High Confidence Software Technologies (MOE), Peking University, Beijing 100871, China
  2. Tencent, Beijing 100193, China
  • Received: 2019-08-15 Online: 2019-01-20 Published: 2019-10-09

Abstract:

[Objective] In the big data era, real-world data has become increasingly complex, sparse, and high-dimensional. In response, modern machine learning (ML) models have grown deeper and more complicated, which raises challenges for the design of distributed ML systems. Although researchers have developed many efficient centralized ML systems such as TensorFlow, PyTorch, and XGBoost, these systems suffer from two problems: (1) they do not integrate well with existing big data systems, and (2) they are not general enough, since each is usually designed for specific ML models. [Methods] To tackle these challenges, we introduce Angel +, a large-scale ML platform based on parameter servers. [Results] With the power of parameter servers, Angel + can efficiently support existing big data systems and ML systems, neither breaking the core of big data systems such as Apache Spark nor degrading the computation efficiency of current ML frameworks such as PyTorch. Furthermore, Angel + provides algorithms such as model averaging, gradient compression, and heterogeneity-aware stochastic gradient descent to deal with the huge communication cost and the straggler problem in distributed training. [Conclusions] We also enhance the usability of Angel + by providing efficient implementations of many ML models. Extensive experiments demonstrate the superiority of Angel +.

Key words: machine learning platform, parameter servers, big data systems, distributed machine learning systems