%A Li Xiaohan, Chen Wenguang %T COS: Measuring the Efficiency of Distributed Big Data Processing System %0 Journal Article %D 2020 %J Frontiers of Data and Computing %R 10.11871/jfdc.issn.2096-742X.2020.01.008 %P 93-104 %V 2 %N 1 %U {http://www.jfdc.cnic.cn/CN/abstract/article_36.shtml} %8 2020-02-20 %X

[Objective] Distributed computing systems are widely used in big data processing. They are designed and implemented with a focus on scalability: a scalable system can hold and process a growing amount of data by adding resources, without modifying the system itself, but it often sacrifices the absolute performance of a single machine at considerable expense. We aim to offer a reasonable and modern metric for evaluating the performance of distributed systems. [Methods] In this article, we evaluate the performance of distributed systems by comparing them against the same task on a single machine, using the proposed metric COS, the Configuration that Outperforms a Single machine. The COS of a system on a given problem is the number of machines required for the system to outperform a competent single-machine implementation. Given limited hardware resources, the COS of a distributed system is often too large to measure directly, so we offer a parameterized variant, COS(n), defined as n multiplied by the time taken on n machines divided by the time taken on a single machine. COS(n) indicates the performance and cost loss of a cluster system. We implemented two classic machine learning algorithms, k-means clustering and logistic regression, on a single machine with multi-threading, SIMD support, and NUMA-aware memory control. [Results] Our experiments show that Apache Spark, whether using its native API or an optimized machine learning library such as MLlib, needs tens to hundreds of machines to achieve the performance of our single-machine implementation. [Limitations] The comparison between a single machine and a cluster is not entirely fair, since overheads in a cluster are unavoidable. [Conclusions] Even so, the COS metric reflects the poor absolute performance of distributed systems and their insufficient utilization of hardware advantages.
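
A compact formalization of the metric described in the abstract, written as a sketch from the prose definition; the notation T_cluster(n) and T_single is introduced here for clarity and is not necessarily the paper's own:

    \mathrm{COS}(n) \;=\; n \cdot \frac{T_{\text{cluster}}(n)}{T_{\text{single}}},
    \qquad
    \mathrm{COS} \;=\; \min\{\, n \;:\; T_{\text{cluster}}(n) \le T_{\text{single}} \,\}

where T_cluster(n) is the wall-clock time of the workload on n machines and T_single is the time of a competent single-machine implementation; COS(n) = 1 would mean the cluster is exactly as efficient per machine as the single machine, while larger values indicate proportional loss.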
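
As a concrete illustration, the following is a minimal Python sketch of how COS(n) and COS could be computed from measured wall-clock times under the definition above. The function names and the example numbers are purely illustrative assumptions, not measurements or tooling from the paper:

    def cos_n(n, cluster_time_s, single_machine_time_s):
        """COS(n): n times the ratio of cluster time on n machines to the
        time of a competent single-machine implementation of the same task."""
        return n * cluster_time_s / single_machine_time_s

    def cos(cluster_times_by_n, single_machine_time_s):
        """COS: the smallest measured cluster size whose wall-clock time
        matches or beats the single machine; None if no measured size does."""
        for n in sorted(cluster_times_by_n):
            if cluster_times_by_n[n] <= single_machine_time_s:
                return n
        return None

    # Illustrative numbers only (not results from the paper): a job taking
    # 200 s on 16 machines and 400 s on the single machine gives COS(16) = 8.
    print(cos_n(16, 200.0, 400.0))  # -> 8.0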