Angel⁺: A Large-Scale Machine Learning Platform on Angel

doi: 10.11871/jfdc.issn.2096.742X.2019.01.007

Frontiers of Data and Computing, 2019, Vol. 1, Issue (1): 63-72.

Special issue: "Data and Computing Platforms"

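Zhipeng Zhang¹, Jiawei Jiang², Lele Yu², Bin Cui¹

1. Department of Computer Science & Key Laboratory of High Confidence Software Technologies (MOE), Peking University, Beijing 100871, China
2. Tencent, Beijing 100193, China

Received: 2019-08-15 Online: 2019-01-20 Published: 2019-10-09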
About the authors:
Zhang Zhipeng, born in 1993, is a PhD candidate at Peking University. His research interests include distributed machine learning and big data analytics.
Contributions to this paper: system implementation and paper writing.
E-mail: zhangzhipeng@pku.edu.cn

Cui Bin, born in 1975, is a Changjiang Scholar Distinguished Professor at Peking University and deputy head of its Department of Computer Science. He received his PhD from the National University of Singapore in 2004. His research interests include database system architectures, query and index techniques, big data analytics, and distributed machine learning systems. He is a Distinguished Member of the China Computer Federation and currently serves as secretary-general of its Technical Committee on Databases.
Contributions to this paper: system framework design and paper organization.
E-mail: bin.cui@pku.edu.cn
Funding:
National Key Research and Development Program of China (2018YFB1004403); National Natural Science Foundation of China (61832001)


Abstract:

[Objective] In the big data era, real-world data has become high-dimensional and sparse, and machine learning (ML) models have become correspondingly complex and high-dimensional, which poses many challenges for distributed ML systems. Although researchers have developed many high-performance ML systems, such as TensorFlow, PyTorch and XGBoost, these systems suffer from two problems: (1) they do not integrate well with existing big data systems; (2) they are not general enough, since each is usually designed for a specific class of ML algorithms. [Methods] To address these two problems, this paper introduces Angel⁺, a distributed ML platform based on the parameter server architecture. [Results] Relying on the parameter server's ability to handle high-dimensional models, Angel⁺ gives big data systems such as Apache Spark the ability to train very large ML models efficiently without intrusive changes to their core, and it runs existing distributed ML systems such as PyTorch without degrading their computational efficiency. In addition, to address the high communication cost and the straggler problem in distributed training, Angel⁺ provides model averaging, gradient compression and heterogeneity-aware stochastic gradient descent. [Conclusions] We have implemented many efficient, easy-to-use ML models on Angel⁺, and our experiments confirm the efficiency of the platform.

Key words: distributed machine learning platform, parameter servers, big data systems, distributed machine learning systems
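As a rough illustration of one technique the abstract names, model averaging, the sketch below lets each worker run local SGD on its own data shard and then averages the resulting parameter vectors. It is a minimal single-process simulation under stated assumptions, not the Angel⁺ implementation: the function names (`local_sgd`, `model_average`, `train`), the least-squares objective, and the toy data are all hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (feature vector, label)


def local_sgd(weights: List[float], shard: List[Sample],
              lr: float = 0.1, epochs: int = 50) -> List[float]:
    """Plain SGD for least-squares regression on one worker's shard."""
    w = list(weights)  # copy so the shared model is not mutated
    for _ in range(epochs):
        for x, y in shard:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def model_average(models: List[List[float]]) -> List[float]:
    """Coordinate-wise mean of the workers' parameter vectors."""
    n = len(models)
    return [sum(coords) / n for coords in zip(*models)]


def train(shards: List[List[Sample]], dim: int, rounds: int = 5) -> List[float]:
    """Each round: every (simulated) worker trains locally, then models are averaged."""
    w = [0.0] * dim
    for _ in range(rounds):
        local_models = [local_sgd(w, shard) for shard in shards]
        w = model_average(local_models)
    return w


if __name__ == "__main__":
    # Hypothetical toy data generated by y = 2*x0 - 1*x1, split over two workers.
    shards = [
        [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)],
        [([2.0, 0.0], 4.0), ([0.0, 2.0], -2.0)],
    ]
    print(train(shards, dim=2))
```

Because workers exchange whole models once per round rather than a gradient per mini-batch, this scheme trades some statistical efficiency for fewer communication rounds, which is the cost the abstract says model averaging is meant to reduce.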

Cite this article: Zhipeng Zhang, Jiawei Jiang, Lele Yu, Bin Cui. Angel⁺: A Large-Scale Machine Learning Platform on Angel[J]. Frontiers of Data and Computing, 2019, 1(1): 63-72.
