COS: Measuring the Efficiency of Distributed Big Data Processing System

doi:10.11871/jfdc.issn.2096-742X.2020.01.008

Frontiers of Data and Computing, 2020, Vol. 2, Issue 1: 93-104.

Special topic: Special Issue on High-Performance and High-Throughput Computing and Applications


Li Xiaohan, Chen Wenguang

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Received: 2019-10-28  Online: 2020-02-20  Published: 2020-03-28
Corresponding author: Chen Wenguang

About the authors:
Li Xiaohan is a Master's student at the Institute of High-Performance Computing, Department of Computer Science and Technology, Tsinghua University. Her research interests are big data processing and parallel programming.
Contributions to this article: single-machine implementation, performance experiments on the distributed system, and article writing.
E-mail: xh-li18@mails.tsinghua.edu.cn
Chen Wenguang is a Professor in the Department of Computer Science and Technology, Tsinghua University. His research interests are parallel computing and distributed systems.
Contributions to this article: constructive guidance on metric design and article organization.

Abstract:

[Objective] Distributed computing systems are widely used in big data processing. They are designed and implemented with a focus on scalability: a scalable system can hold and process a growing amount of data by adding resources, without modifying the system itself. This scalability, however, often comes at a high cost in absolute performance, which receives little attention. We aim to offer a reasonable and up-to-date metric for evaluating the performance of distributed systems. [Methods] We discuss the performance of distributed systems by comparing distributed and single-machine implementations of the same task, and propose a metric called COS (Configuration that Outperforms a Single machine). The COS of a system on a given problem is the number of machines the system needs in order to outperform a competent single-machine implementation. With limited hardware resources, the COS of a distributed system is usually too large to measure directly, so we also define a parameterized form: COS(n) equals n multiplied by the ratio of the time used on n machines to the time used on a single machine. COS(n) indicates the performance and cost lost in a cluster system. We implemented two classic machine learning algorithms, k-means clustering and logistic regression, on a single machine with multi-threading, SIMD vectorization, and NUMA-aware memory allocation and access, to serve as a reference point for the performance of distributed systems. [Results] Taking Apache Spark as the system under comparison, our experiments show that, whether it is used through its native API or through an optimized machine learning library such as MLlib, it needs from several times to hundreds of times as many machines to match the performance of our single-machine multi-threaded implementation. [Limitations] The comparison between a single machine and a cluster is not entirely fair, because the overheads of a cluster are unavoidable. [Conclusions] The COS metric can nevertheless reveal problems in distributed systems such as poor absolute performance and insufficient use of hardware capabilities.

Key words: parallel computing, big data, multi-thread, k-means, logistic regression
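The COS(n) definition in the abstract can be illustrated with a minimal sketch. It assumes that "time on a single machine" means the running time of the optimized single-machine implementation. The function name `cos_n`, its parameter names, and the timings in the example are hypothetical and are not taken from the article.

```python
def cos_n(n: int, time_on_n_machines: float, time_on_single_machine: float) -> float:
    """COS(n) = n * (time on n machines) / (time on a single machine).

    Assumption: time_on_single_machine is the running time of the
    single-machine reference implementation, in the same unit as
    time_on_n_machines.
    """
    if n < 1:
        raise ValueError("n must be a positive number of machines")
    if time_on_n_machines < 0 or time_on_single_machine <= 0:
        raise ValueError("times must be non-negative, and the reference time positive")
    return n * time_on_n_machines / time_on_single_machine


# Hypothetical timings, for illustration only: a cluster of 4 machines
# takes 120 s on a task that the single-machine program finishes in 10 s.
example = cos_n(4, 120.0, 10.0)
print(example)  # 48.0
```

Under this reading, a COS(n) value of 48 would mean the cluster spends 48 times as much machine time as the single-machine program on the same task. A value close to 1 would mean the n machines are used about as efficiently as the single machine.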

Li Xiaohan, Chen Wenguang. COS: Measuring the Efficiency of Distributed Big Data Processing System[J]. Frontiers of Data and Computing, 2020, 2(1): 93-104.

Figures and tables (10): Figures 1-8, Tables 1 and 2

Table 1
Item	Details
Number of nodes	4
Node model	Intel(R) Xeon(R) CPU E5-2680 v4 @2.40GHz
Cores per machine	28
Spark version	2.2.0
Network	Gigabit Ethernet

Table 2
	Apache Spark		MLlib
n	k-means	logistic regression	k-means	logistic regression
1	20.22	724.53	10.85	307.08
2	89.50	808.85	6.49	128.72
3	103.63	897.71	8.12	9.51
4	117.18	915.94	7.32	12.69
