Frontiers of Data and Computing ›› 2019, Vol. 1 ›› Issue (1): 63-72.

doi: 10.11871/jfdc.issn.2096.742X.2019.01.007

Special topic: Special Issue on "Data and Computing Platforms"


Angel +: A Distributed Machine Learning Platform Based on Angel

张智鹏1,江佳伟2,余乐乐2,崔斌1   

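  1. Department of Computer Science and Technology, Key Laboratory of High Confidence Software Technologies (Ministry of Education), Peking University, Beijing 100871, China
  2. Tencent, Beijing 100193, China
  • Received: 2019-08-15 Online: 2019-01-20 Published: 2019-10-09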
  • About the authors:
    Zhang Zhipeng, born in 1993, is a PhD candidate at Peking University. His research interests include distributed machine learning and big data analytics.
    In this paper he undertook the following tasks: system implementation and paper writing.
    E-mail: zhangzhipeng@pku.edu.cn
    Cui Bin, born in 1975, is a Changjiang Scholar Distinguished Professor and deputy head of the Department of Computer Science at Peking University. He obtained his PhD from the National University of Singapore in 2004. His research interests include database system architectures, query and index techniques, big data management and analytics, and distributed machine learning systems. He is a distinguished member of the China Computer Federation (CCF) and currently serves as secretary general of its Technical Committee on Databases.
    In this paper he undertook the following tasks: system framework design and organization of the paper.
    E-mail: bin.cui@pku.edu.cn
  • Supported by:
    National Key Research and Development Program of China (2018YFB1004403); National Natural Science Foundation of China (61832001)

Angel +: A Large-Scale Machine Learning Platform on Angel

Zhipeng Zhang1, Jiawei Jiang2, Lele Yu2, Bin Cui1

  1. Department of Computer Science & Key Laboratory of High Confidence Software Technologies (MOE), Peking University, Beijing 100871, China
  2. Tencent, Beijing 100193, China
  • Received: 2019-08-15 Online: 2019-01-20 Published: 2019-10-09

Abstract (from the Chinese):

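[Objective] With the arrival of the big data era, data have become high-dimensional and sparse, and machine learning models have become complex and high-dimensional, which poses many challenges for distributed machine learning systems. Although researchers have developed many high-performance machine learning systems, such as TensorFlow, PyTorch, and XGBoost, these systems have two problems: (1) they do not integrate well with existing big data systems; (2) they are not general enough, since each is typically designed for one class of machine learning algorithms. [Methods] To address these two challenges, this paper introduces Angel +, a distributed machine learning platform based on the parameter server architecture. [Results] Angel + efficiently supports existing big data systems and machine learning systems. Relying on the parameter server's ability to handle high-dimensional models, Angel + can non-intrusively give big data systems (such as Apache Spark) the ability to train very large machine learning models efficiently, and can run existing distributed machine learning systems (such as PyTorch) efficiently. In addition, to address the high communication overhead and the straggler problem in distributed machine learning, Angel + provides model averaging, gradient compression, and heterogeneity-aware stochastic gradient descent. [Conclusions] Using Angel +, the authors have developed many efficient, easy-to-use machine learning models and verified the efficiency of the Angel + platform through experiments.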

Key words: distributed machine learning platform, parameter server, big data processing systems, distributed machine learning systems

Abstract:

[Objective] In the big data era, real-world data have become far more complex, sparse, and high-dimensional. Accordingly, modern machine learning (ML) models are designed to be deep and complicated, which raises challenges for the design of distributed ML systems. Although researchers have developed many efficient centralized ML systems, such as TensorFlow, PyTorch, and XGBoost, these systems suffer from two problems: (1) they do not integrate well with existing big data systems, and (2) they are not general enough, since they are usually designed for specific ML models. [Methods] To tackle these challenges, we introduce Angel +, a large-scale ML platform based on parameter servers. [Results] With the power of parameter servers, Angel + can efficiently support existing big data systems and ML systems, neither breaking the core of big data systems such as Apache Spark nor degrading the computational efficiency of current ML frameworks such as PyTorch. Furthermore, Angel + provides algorithms such as model averaging, gradient compression, and heterogeneity-aware stochastic gradient descent to deal with the high communication cost and the straggler problem in distributed training. [Conclusions] We also enhance the usability of Angel + by providing efficient implementations of many ML models, and we conduct extensive experiments to demonstrate the superiority of Angel +.

Key words: machine learning platform, parameter servers, big data systems, distributed machine learning systems
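
The abstract names three ideas without showing them: a parameter server that workers pull from and push to, gradient compression, and model averaging. The following is a minimal single-process sketch of those ideas in Python. It is not Angel +'s API: `ToyParameterServer`, `top_k_compress`, and `average_models` are hypothetical names introduced here, and top-k sparsification is only one assumed form of gradient compression, chosen for brevity.

```python
from typing import Dict, Iterable, List


class ToyParameterServer:
    """Hypothetical in-process stand-in for one parameter server shard."""

    def __init__(self, dim: int) -> None:
        self.weights: List[float] = [0.0] * dim

    def pull(self, indices: Iterable[int]) -> Dict[int, float]:
        # A worker fetches only the coordinates its sparse data touches.
        return {i: self.weights[i] for i in indices}

    def push(self, grads: Dict[int, float], lr: float) -> None:
        # Apply a sparse gradient step: w[i] -= lr * g[i].
        for i, g in grads.items():
            self.weights[i] -= lr * g


def top_k_compress(grads: Dict[int, float], k: int) -> Dict[int, float]:
    # Assumed compression scheme: keep the k largest-magnitude entries.
    kept = sorted(grads.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]
    return dict(kept)


def average_models(models: List[List[float]]) -> List[float]:
    # Model averaging: element-wise mean of the workers' local weights.
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


# Illustrative usage: one worker pushes a compressed sparse gradient.
server = ToyParameterServer(dim=6)
local_grad = {0: 0.5, 3: -2.0, 5: 0.1}
server.push(top_k_compress(local_grad, k=2), lr=0.1)
```

Sending only the touched or largest coordinates is what lets a high-dimensional, sparse model be updated without shipping the full weight vector on every step; averaging local models trades update frequency for fewer communication rounds.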