数据与计算发展前沿 ›› 2020, Vol. 2 ›› Issue (1): 93-104.doi: 10.11871/jfdc.issn.2096-742X.2020.01.008

• 专刊:高性能与高通量计算及应用 • 上一篇    下一篇

COS:度量分布式大数据处理系统的效率

李晓涵,陈文光()   

  1. 清华大学计算机科学与技术系,北京 100084
  • 收稿日期:2019-10-28 出版日期:2020-02-20 发布日期:2020-03-28
  • 通讯作者: 陈文光 E-mail:cwg@tsinghua.edu.cn
  • 作者简介:李晓涵 ,清华大学计算机系高性能计算研究所,研究生,主要研究方向为大数据处理、并行程序设计。
    主要贡献:完成单机程序实现、分布式系统性能测试、论文撰写等工作。
    Li Xiaohan is currently a Master student of Institute of High-Performance Computing, Department of Computer Science and Technology, Tsinghua University. Her research interests are big data processing and parallel programming.
    Undertaking the following tasks in this article: Algorithm implementation, experiments on clusters, and article writing.
    E-mail: xh-li18@mails.tsinghua.edu.cn|陈文光 ,清华大学计算机系,教授,主要研究方向为并行计算和分布式系统。
    主要贡献:在指标设计、论文组织等方面给予建设性指导。
    Chen Wenguang is a Professor in Department of Computer Science and Technology, Tsinghua University. His research interests are parallel computing and distributed systems.
    Undertaking the following tasks in this article: Supervising on metric design and article organization.

COS: Measuring the Efficiency of Distributed Big Data Processing System

Li Xiaohan,Chen Wenguang()   

  1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received:2019-10-28 Online:2020-02-20 Published:2020-03-28
  • Contact: Chen Wenguang E-mail:cwg@tsinghua.edu.cn

摘要:

【目的】在大数据处理领域,分布式计算系统得到广泛应用,它们的可扩展性得到重点关注,但其绝对性能往往没有得到重视。我们希望提出科学合理、与时俱进的度量标准,对分布式系统的性能进行评估。【方法】本文通过对比特定任务的单机实现和分布式实现来讨论分布式系统的性能,提出COS(Configuration that Outperforms a Single machine)这一指标,来衡量分布式系统在达到单台机器的性能时,需要的硬件资源数量。我们选取k-means聚类和逻辑回归两个经典机器学习算法,对其进行单机多线程实现,并通过向量化计算、优化内存分配与访问等方式对性能进行了优化,为分布式多机系统的性能提供参考。【结果】以Apache Spark作为对标系统,实验发现无论是使用其原生编程接口,还是经过悉心优化的机器学习库,都要使用数倍甚至数百倍的机器,才能达到单机多线程实现的性能。【局限】分布式系统与单机实现进行性能对比并不是完全公平的,分布式系统的额外开销客观存在。【结论】但COS指标仍能反映分布式系统存在的绝对性能较差、没有充分利用硬件优势等问题。

关键词: 并行计算, 大数据, 多线程, k-means, 逻辑回归

Abstract:

[Objective] Distributed computing systems are used widely in the field of big data processing. They are designed and implemented with a focus on scalability. With good scalability, a system can hold and process a growing amount of data by adding resources without modifying the system itself while sacrificing the absolute performance of a single machine at huge expenses. We want to offer a reasonable and modern metric to evaluate the performance of distributed systems. [Methods] In this article, we discuss the performance of distributed systems by comparing them with the same task on a single machine with the proposed metric, COS, or the Configuration that Outperforms a Single machine. The COS of a system on a given problem is the number of machines required when the system outperforms a competent single-machine implementation. Given a limited hardware resources, COS of a distributed system is usually too large to measure. So, we offer another metric by giving a parameter n to COS. COS(n) equals to n multiplied by the time used on n machines over that on a single machine. COS(n) indicates the performance and expense loss in a cluster system. We implemented two classic machine learning algorithms, k-means clustering and logistic regression, on a single machine with multi-threading, SIMD support and NUMA-aware memory control. [Results] Our experiments show that by using Apache Spark, with no matter its native API or optimized machine learning library like MLlib, it needs tens to hundreds of machines to achieve the same performance as we did on a single machine. [Limitations] The comparison between a single machine and a cluster is not entirely fair, for overheads in a cluster is unavoidable. [Conclusions] This COS metric can still reflect the problems of poor absolute performance and insufficient utilization of hardware advantages in distributed systems.

Key words: parallel computing, big data, multi-thread, k-means, logistic regression