Big Data 3.0: The Key Technologies of Big Data in the Post-Hadoop Era

doi:10.11871/jfdc.issn.2096.742X.2019.01.010

Frontiers of Data and Computing, 2019, Vol. 1, Issue (1): 94-104.

Special topic: "Data and Computing Platforms" special issue

Liu Wanggen, Sun Yuanhao

Transwarp Technology (Shanghai) Co., Ltd., Shanghai 200233, China

Received: 2019-08-15 Online: 2019-01-20 Published: 2019-10-09
About the authors: Liu Wanggen, born in 1985, is R&D director and chief architect of Transwarp Technology (Shanghai) Co., Ltd. He is responsible for the forward-looking planning and R&D management of Transwarp's big data technology, and he led the team that developed the first product worldwide to pass the TPC-DS benchmark test with official certification. His research currently focuses on next-generation big data architecture, distributed database technology, and container clouds, and he explores engineering techniques such as delivering big data as PaaS and as services. This paper mainly summarizes his team's big data R&D work over the past few years. He was responsible for the design and verification of the overall architecture and for carrying out the R&D plan, and he did the main writing of the article.
E-mail: wayne.liu@transwarp.io
Sun Yuanhao, born in 1976, is the founder of Transwarp Technology (Shanghai) Co., Ltd. He has been evangelizing big data technology since 2009 and created the first Hadoop distribution in China. In 2013 he founded Transwarp and began R&D on a new generation of big data technology. He is responsible for Transwarp's overall technology strategy. He was the first to propose the overall architecture of the new-generation big data platform described in this paper, and he actively drove its commercial validation and deployment.
E-mail: yuanhao.sun@transwarp.io

Abstract

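Summary:

[Objective] The first generation of big data architectures, represented by Hadoop, is overly complex, underperforms, and does not integrate well with cloud computing, so Transwarp redesigned the big data technology stack. [Methods] A resource scheduling layer was designed to manage services and tasks with different life cycles; a unified storage management layer was abstracted, into which different storage engines can be plugged to meet the needs of different data types; a unified DAG-based computing engine supports multiple computing workloads; and the development layer provides standard SQL and Python interfaces. [Results] Kubernetes is used to manage data services uniformly, and container technology provides better multi-tenancy. This bridges big data systems and business applications, better supporting both the turning of data into business services and the turning of business activity into data. The design has also been validated in large-scale commercial deployments. [Conclusions] The redesigned architecture not only resolves the technical problems of first-generation big data implementations but also integrates better with cloud computing and new hardware, and it can represent the direction in which the next-generation basic big data technology stack is developing.

Keywords: big data, cloud computing, DAG, real-time computing, Kubernetes, multi-tenancy, unified storage management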

Abstract:

[Objective] As cloud computing and new hardware technologies are rapidly adopted by industry, more and more users complain that the Hadoop architecture is highly complex, neither mature nor stable, and not flexible enough for cloud computing. Transwarp redesigned the big data software stack so that users can use big data technology better and more easily. [Methods] The new stack includes a new resource management and scheduling layer, which manages tasks with different kinds of life cycles; a new storage management layer, which allows storage plugins for different data types to be added or removed and acts as a new distributed storage system; and a unified DAG-based computing engine, which can be used for data warehousing, stream computing, graph computing, and other workloads. A development interface supporting SQL and Python is provided to reduce coding complexity for developers. [Results] By using Kubernetes for resource management, big data technology can finally work well with cloud computing. In addition, with these technologies applications can work well with big data system software on one unified platform. [Conclusions] By refining the big data system stack, we not only solved the technical issues related to Hadoop but also made big data system software work well with cloud computing and new hardware, which indicates the research direction of big data technology in the future.

Keywords: big data, cloud computing, DAG, stream computing, Kubernetes, multi-tenancy, unified storage engine
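
The pluggable storage management layer described in the abstract can be pictured with a small sketch. The abstract gives no API details, so everything below is an assumption made for illustration: the `StorageEngine` interface, the two toy engines, and the `StorageManager` method names (`plug`, `unplug`, `create_table`) are hypothetical and are not Transwarp's actual interfaces. The sketch only shows the general idea of one entry point routing each table to whichever engine suits its data type.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, dict]  # (key, record): a hypothetical row shape for illustration


class StorageEngine(ABC):
    """Hypothetical plug-in interface; not the product's actual API."""

    @abstractmethod
    def write(self, table: str, rows: Iterable[Row]) -> None: ...

    @abstractmethod
    def scan(self, table: str) -> List[Row]: ...


class KeyValueEngine(StorageEngine):
    """Keeps only the latest record per key (suits mutable, keyed data)."""

    def __init__(self) -> None:
        self._tables: Dict[str, Dict[str, dict]] = {}

    def write(self, table: str, rows: Iterable[Row]) -> None:
        self._tables.setdefault(table, {}).update(rows)

    def scan(self, table: str) -> List[Row]:
        return sorted(self._tables.get(table, {}).items(), key=lambda kv: kv[0])


class AppendLogEngine(StorageEngine):
    """Keeps every record in arrival order (suits event or log data)."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[Row]] = {}

    def write(self, table: str, rows: Iterable[Row]) -> None:
        self._tables.setdefault(table, []).extend(rows)

    def scan(self, table: str) -> List[Row]:
        return list(self._tables.get(table, []))


class StorageManager:
    """Single entry point that routes each table to the engine it was created on."""

    def __init__(self) -> None:
        self._engines: Dict[str, StorageEngine] = {}
        self._table_engine: Dict[str, str] = {}

    def plug(self, name: str, engine: StorageEngine) -> None:
        self._engines[name] = engine

    def unplug(self, name: str) -> None:
        # Refuse to remove an engine that still backs a table.
        if name in self._table_engine.values():
            raise ValueError(f"engine {name!r} still backs existing tables")
        del self._engines[name]

    def create_table(self, table: str, engine: str) -> None:
        if engine not in self._engines:
            raise KeyError(f"no such engine: {engine}")
        self._table_engine[table] = engine

    def write(self, table: str, rows: Iterable[Row]) -> None:
        self._engines[self._table_engine[table]].write(table, rows)

    def scan(self, table: str) -> List[Row]:
        return self._engines[self._table_engine[table]].scan(table)


manager = StorageManager()
manager.plug("kv", KeyValueEngine())
manager.plug("log", AppendLogEngine())
manager.create_table("users", engine="kv")
manager.create_table("clicks", engine="log")

# The same write/scan calls behave differently depending on the engine behind the table.
manager.write("users", [("u1", {"name": "a"}), ("u1", {"name": "b"})])
manager.write("clicks", [("u1", {"page": "/"}), ("u1", {"page": "/x"})])
users = manager.scan("users")
clicks = manager.scan("clicks")
```

In this sketch callers never address an engine directly, which is what would let an engine be swapped per table without changing the code that reads and writes data.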

Wanggen Liu, Yuanhao Sun. Big Data 3.0: The Key Technologies of Big Data in Post-Hadoop Era[J]. Frontiers of Data and Computing, 2019, 1(1): 94-104.

Figures/Tables (10): Figures 1-8, Tables 1-2


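Aspect	MPP	DAG
SQL compilation and optimization	Relies on the SQL capability of single-node databases	Self-developed SQL compiler
Data storage	Shared-nothing architecture	Shared distributed storage architecture
Metadata	Relatively limited metadata, which makes global optimization of computing tasks difficult	Global metadata, which allows better coordination of data communication between executors and of task start and stop
Intra-shard performance	The local database executes fast; in theory this is the upper bound for DAG	Can be optimized through the executor, code generation (Codegen), and similar techniques
Fault tolerance	Relies on the individual databases to complete the partitioned tasks, so fault tolerance is weak	Shared data storage lets tasks be designed to be simple and idempotent, giving better fault tolerance
Data communication performance	Relies on data distribution to reduce communication overhead, so it is inflexible	Relies on global metadata to reduce communication overhead, so it is more flexible
Core strengths	Mature optimizer; better local execution performance	Greater flexibility and fault tolerance; better at reducing data communication cost
Weaknesses	Overall performance depends on workload characteristics and data distribution; the scalability of some MPP systems still needs improvement	SQL, transaction, and optimizer support still need continued improvement; performance is close to that of MPP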
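
The fault-tolerance row of the table can be illustrated with a minimal sketch of a DAG runner. It is a toy under stated assumptions, not the engine described in the paper: the `run_dag` function, the task names, the retry limit, and the use of a plain dictionary as the "shared storage" are all hypothetical. It uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) to order tasks, and it relies on each task being idempotent, meaning a task always writes its result under the same key, so a failed task can simply be run again.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set


def run_dag(tasks: Dict[str, Callable[[dict], None]],
            deps: Dict[str, Set[str]],
            shared_store: dict,
            max_attempts: int = 3) -> List[str]:
    """Run tasks in dependency order, retrying a failed task in place.

    `deps` maps each task name to the set of tasks it depends on.
    Retrying is safe only because every task is assumed to be idempotent.
    """
    completed: List[str] = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name](shared_store)
                completed.append(name)
                break
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up after the last allowed attempt
    return completed


attempts = {"filter": 0}  # counts calls so the example can simulate one failure


def scan(store: dict) -> None:
    store["scan"] = [1, 2, 3, 4, 5, 6]


def filter_even(store: dict) -> None:
    attempts["filter"] += 1
    if attempts["filter"] == 1:
        raise RuntimeError("simulated executor failure")
    # Idempotent: always overwrites the same key from the same input.
    store["filter"] = [x for x in store["scan"] if x % 2 == 0]


def aggregate(store: dict) -> None:
    store["aggregate"] = sum(store["filter"])


store: dict = {}
order = run_dag(
    tasks={"scan": scan, "filter": filter_even, "aggregate": aggregate},
    deps={"scan": set(), "filter": {"scan"}, "aggregate": {"filter"}},
    shared_store=store,
)
```

Because the intermediate results live in the shared store and not inside a single worker, the retried `filter` step reads the same input as the failed attempt and the downstream `aggregate` step is unaffected by the failure.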
