Big Data 3.0—The Key Technologies of Big Data in Post-Hadoop Era

doi:10.11871/jfdc.issn.2096.742X.2019.01.010

Abstract

[Objective] As cloud computing and new hardware technologies are rapidly adopted by industry, more and more users complain that the Hadoop architecture is highly complex, neither mature nor stable, and not flexible enough for cloud computing. Transwarp redesigned the big data software stack to make big data technology easier and more effective to use. [Methods] The new stack includes a new resource management and scheduling layer, which can manage tasks with different kinds of life cycles; a new storage management layer, which serves as a new distributed storage system and allows storage plugins for different data types to be added or removed; and a unified DAG-based computing engine, which can be used for data warehousing, stream computing, graph computing, and other workloads. A development interface supporting SQL and Python is designed to reduce coding complexity for developers. [Results] By using Kubernetes for resource management, big data technology can work well with cloud computing. In addition, with these technologies, applications and big data system software can work well together on one unified platform. [Conclusions] The redesigned big data system stack not only solves the technical issues related to Hadoop, but also makes big data system software work well with cloud computing and new hardware, which points to a research direction for big data technology in the future.

Key words: big data, cloud computing, DAG, stream computing, Kubernetes, multi-tenancy, unified storage engine

Wanggen Liu, Yuanhao Sun. Big Data 3.0—The Key Technologies of Big Data in Post-Hadoop Era[J]. Frontiers of Data and Computing, 2019, 1(1): 94-104.

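Technical aspect	MPP	DAG
SQL compilation and optimization	Relies on the SQL capabilities of the single-node database	Self-developed SQL compiler
Data storage	Shared-nothing architecture	Shared distributed storage architecture
Metadata	Limited metadata, which makes global optimization of computing tasks difficult	Global metadata, which allows better coordination of data communication and task start/stop among executors
Intra-shard performance	The local database executes quickly; in theory this is the upper bound for DAG	Performance can be optimized through executor design, codegen, and similar techniques
Fault tolerance	Relies on each database to complete its share of the partitioned task, so fault tolerance is insufficient	Data storage is shared, so tasks can be designed to be simple and idempotent, which gives better fault tolerance
Data communication performance	Relies on data distribution to reduce communication overhead, and is therefore inflexible	Relies on global metadata to reduce communication overhead, and is therefore more flexible
Core strengths	Mature optimizer; better local execution performance	Greater flexibility and fault tolerance; better at reducing data communication cost
Weaknesses	Overall performance depends on workload characteristics and data distribution; the scalability of some MPP systems still needs improvement	SQL, transaction, and optimizer support still need continued improvement; performance is close to that of MPP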
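
The fault-tolerance comparison above can be illustrated with a minimal sketch. It is not Transwarp's engine: the names run_dag, tasks, and deps are hypothetical, and the scheduler is reduced to a single process that uses Python's standard graphlib module (Python 3.9 or later). It only shows why idempotent tasks over shared inputs are easy to retry: a failed task is simply run again with the same inputs, and upstream results are not recomputed.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps, max_attempts=3):
    """Run tasks in dependency order, retrying a task when it fails.

    tasks: dict mapping task name -> callable(inputs) -> result
    deps:  dict mapping task name -> set of upstream task names
    Assumption: every task is idempotent, so re-running it is safe.
    """
    results = {}   # stands in for shared storage of task outputs
    attempts = {}  # number of attempts each task needed
    for name in TopologicalSorter(deps).static_order():
        inputs = {up: results[up] for up in deps.get(name, ())}
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                results[name] = tasks[name](inputs)
                break
            except RuntimeError:
                # Simulated executor failure: retry the same task only.
                if attempt == max_attempts:
                    raise
    return results, attempts


# Hypothetical three-stage job: scan -> filter -> sum.
failures = {"remaining": 1}


def flaky_filter(inputs):
    # Fails once, then succeeds, to mimic a lost executor.
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise RuntimeError("simulated executor failure")
    return [x for x in inputs["scan"] if x % 2 == 0]


tasks = {
    "scan": lambda inputs: [1, 2, 3, 4, 5, 6],
    "filter": flaky_filter,
    "sum": lambda inputs: sum(inputs["filter"]),
}
deps = {"scan": set(), "filter": {"scan"}, "sum": {"filter"}}

results, attempts = run_dag(tasks, deps)
print(results["sum"], attempts)
```

In this sketch the filter stage fails once and is retried, while scan runs only once because its output is already available to the retry. A real engine would distribute tasks across executors and persist intermediate data, which this example does not attempt.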