Highlights

    The Data and Computing Platform Is An Important Infrastructure Which Drives Modern Scientific Research Development
    Fangyu Liao, Xuehai Hong, Yang Wang, Dawei Chu
    Frontiers of Data and Computing, 2019, 1(1): 2-10. DOI: 10.11871/jfdc.issn.2096.742X.2019.01.002

    [Objective] To demonstrate the driving role of data and computing in scientific research, this paper studies the development and essence of data, computing, and scientific research. [Methods] The paper expounds the essence of data technology and computing technology, and uses several typical cases to illustrate that data and computing, as keys to scientific activity, greatly expand the depth and breadth of scientific and technological innovation. [Results] Data are quantitative or qualitative records of natural and social phenomena and of scientific experiments, and they are an important foundation for scientific research. Data technology refers to a series of scientific and technological activities such as collecting, classifying, transporting, storing, analyzing, and visualizing data; its goal is to turn data into the information, knowledge, and patterns through which human beings understand the natural world and human society. Since the invention of the von Neumann computer, the rapid development of computing technology driven by Moore's law has made research on, and application of, data technology and artificial intelligence increasingly active. The collaborative progress of data technology, artificial intelligence, and computing technology has brought a leap forward in human understanding of natural and social knowledge and patterns. [Conclusions] Data technology, computing technology, and artificial intelligence provide the basic technology platform for building the "digital twin" of the "human-machine-object" ternary fusion. Advanced data and computing technology, represented by big data and artificial intelligence, will integrate theory, experiment, and simulation into a new scientific research paradigm. Over the past thousand years of scientific development, people, capital, tools (scientific instruments), and methods (theories) have been the necessary input elements of scientific research. In recent decades, the computer has helped scientists carry out a great deal of computational work, becoming one of the tools among these input elements. With the rapid development of data and computing technology, however, it no longer plays a merely auxiliary, supporting role: it can rely on its own logical methods to carry out scientific research within the "digital twin" of "human-machine-object" integration. Data and computing technology should therefore be regarded as a new and indispensable input element of scientific research.

    A Brief Review of Theory and Systematic Technologies for Big Data
    Qiangsheng Hua, Zhigao Zheng, Zhenyu Hu, Zhiman Zhong, Changfu Lin, Feng Zhao, Hai Jin, Xuanhua Shi
    Frontiers of Data and Computing, 2019, 1(1): 22-34. DOI: 10.11871/jfdc.issn.2096.742X.2019.01.004

    [Objective] This article gives a brief review of big data theory and systems, covering the research background, the technical architecture, and the key technologies, and then estimates future research directions. [Methods] After a brief introduction to big data processing theory, the paper introduces the key technologies of big data systems from three aspects: data-parallel processing methods, Resource Description Framework (RDF) graph data query and matching, and big data analysis technologies. A minimal sketch of the data-parallel pattern follows this abstract. [Results] The speed of data generation will accelerate further in the near future, so how to quickly process data on the edge side will become a research trend. [Conclusions] New technologies for big data theory and systems still warrant further attention, especially research on data processing with edge computing and fog computing in the era of the Internet of Things.
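
    To make the data-parallel idea concrete, here is a minimal, hypothetical Python sketch of the partition-map-reduce pattern the survey covers; it illustrates the general technique only and is not code from any system reviewed in the article.

```python
# Data-parallel processing in miniature: partition the input, process
# partitions independently (map), then combine partial results (reduce).
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Map step: count words in one partition of the corpus."""
    return Counter(chunk.split())

def parallel_word_count(corpus, workers=4):
    # Partition the corpus into roughly equal chunks, one per worker.
    lines = corpus.splitlines()
    step = max(1, len(lines) // workers)
    chunks = ["\n".join(lines[i:i + step]) for i in range(0, len(lines), step)]
    with Pool(workers) as pool:
        partials = pool.map(count_words, chunks)   # map phase
    return sum(partials, Counter())                # reduce phase

if __name__ == "__main__":
    text = "big data systems\nbig data theory\nedge computing"
    print(parallel_word_count(text, workers=2))
```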

    Application-driven Big Data and Artificial Intelligence Integration Platform Construction
    Bo Kang, Zijun Xia, Xiangfei Meng
    Frontiers of Data and Computing, 2019, 1(1): 35-45. DOI: 10.11871/jfdc.issn.2096.742X.2019.01.005

    [Objective] To provide references for computational innovation, an integration platform for big data and artificial intelligence analysis and application, driven by industrial needs, is proposed to promote both the intelligentization of traditional industry and the industrialization of intelligent technology. [Methods] Based on an integrated understanding of data features and platform requirements in industry-oriented application scenarios, the application-driven platform hierarchy in a supercomputer center is designed as a fused architecture of supercomputing, big data, cloud computing, artificial intelligence, and the Internet of Things, covering physical facilities, system software, and the management system. The supercomputer center integrates the service-related hardware facilities for big data, supercomputing, and cloud computing to realize data sharing, high-performance processing, and data security control. By eliminating the differences between various data sources, the platform provides a unified, standard data access interface for upper-layer applications (sketched below), which promotes the standardization of big data processing in related industries for resource and data sharing. As an important field of big data application, the high-efficiency industrial big data application platform combines with the industrial cloud platform to realize data collection, transmission, collaboration, and application by integrating physical devices, virtual networks, and big data analysis methods. The characteristics of industrial big data and artificial intelligence call for innovative applications that support production tasks such as design, production, sales, operation, and maintenance. [Results] The platform has achieved typical applications in industrial fields such as equipment manufacturing, networked vehicles, and medical health, showing good applicability. In manufacturing, the platform serves as a tool for supplier quality management and control, carrying out anomaly inspection and prediction for parts and components and achieving management control over the entire product chain. For networked vehicles, by collecting vehicle driving data and applying deep learning modeling, it is possible to analyze the safety of autonomous driving and of driving behavior. In disease screening, big data and artificial intelligence analysis of radiological images, pathology images, and electronic medical records can help doctors complete both repetitive and complex analysis tasks. [Limitations] For a public open platform providing services, institutional credibility and data security are important issues to be solved in the next step. [Conclusions] The application-driven big data and artificial intelligence integration platform is an important part of social development and of a government-controllable intelligent-industry development ecology, and it further addresses the insufficient innovation capability of China's intelligent industry.
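
    As an illustration of the unified data access interface described above, the following hypothetical Python sketch shows one way such a facade could hide source differences from upper-layer applications; all class and method names are invented for the example and do not reflect the platform's actual API.

```python
# A facade over heterogeneous data sources: applications read through
# one interface regardless of which backend holds the data.
from abc import ABC, abstractmethod

class DataSource(ABC):
    @abstractmethod
    def read(self, query: str) -> list:
        """Return records matching a source-specific query."""

class HDFSSource(DataSource):
    def read(self, query: str) -> list:
        return [f"hdfs record for {query}"]    # stand-in for a real scan

class RDBMSSource(DataSource):
    def read(self, query: str) -> list:
        return [f"sql row for {query}"]        # stand-in for a real SELECT

class UnifiedAccess:
    """Single entry point that hides source differences from applications."""
    def __init__(self):
        self._sources: dict[str, DataSource] = {}

    def register(self, name: str, source: DataSource) -> None:
        self._sources[name] = source

    def read(self, name: str, query: str) -> list:
        return self._sources[name].read(query)

access = UnifiedAccess()
access.register("lake", HDFSSource())
access.register("warehouse", RDBMSSource())
print(access.read("lake", "sensor_42"))
```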

    GPVis: A Scientific Visualization System for Large Scale Data
    Guihua Shan, Jun Liu, Guan Li, Yang Gao, Tao Xu, Dong Tian
    Frontiers of Data and Computing, 2019, 1(1): 46-62. DOI: 10.11871/jfdc.issn.2096.742X.2019.01.006

    [Objective] To solve a series of problems brought about by large-scale scientific data visualization, and to provide a flexible and scalable scientific data visualization framework, this paper proposes GPVis, a scientific visualization system for large-scale data. [Methods] We analyze the challenges and opportunities faced by scientific data visualization at both the method and the tool level. A new visual computing and service framework, GPVis, is proposed, built on technologies such as data pre-organization, graphics rendering, high-performance computing, human-computer interaction, and VR/AR. [Results] For common visualization methods, this paper proposes several visualization processing models for the GPVis framework, and it presents application cases of the system in typical fields together with their specific implementation methods and results. In these cases, the different visualization needs of scientific researchers for data analysis were met. [Limitations] GPVis needs more intelligence for data analysis, which leads us to incorporate artificial intelligence technology into future development and to introduce more natural human-computer interaction methods. [Conclusions] GPVis provides a powerful and scalable platform framework for large-scale scientific data visualization, enabling flexible component design for different data types and application requirements; a sketch of this composition style follows. As the system evolves toward more complete framework functions and visualization algorithms, it will be applied to more scientific fields.
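
    To illustrate the kind of flexible component design the conclusions mention, here is a minimal, hypothetical pipeline sketch in Python (reader, filter, renderer); the names and structure are assumptions made for illustration and do not reflect GPVis internals.

```python
# A component-based visualization pipeline in miniature: stages can be
# swapped per data type and application requirement.
from typing import Callable, Iterable

Stage = Callable[[Iterable[float]], list[float]]

def reader(path: str) -> list[float]:
    # Stand-in for loading a large scientific dataset from `path`.
    return [0.1, 0.5, 0.9, 1.3]

def threshold_filter(cutoff: float) -> Stage:
    def stage(data):
        return [v for v in data if v >= cutoff]
    return stage

def render(data: Iterable[float]) -> None:
    # Stand-in for GPU rendering: draw a crude text bar per value.
    for v in data:
        print("#" * int(v * 10))

def run_pipeline(path: str, stages: list[Stage]) -> None:
    data = reader(path)
    for stage in stages:          # components compose in sequence
        data = stage(data)
    render(data)

run_pipeline("field.dat", [threshold_filter(0.5)])
```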

    Angel+: A Large-Scale Machine Learning Platform on Angel
    Zhipeng Zhang, Jiawei Jiang, Lele Yu, Bin Cui
    Frontiers of Data and Computing, 2019, 1(1): 63-72. DOI: 10.11871/jfdc.issn.2096.742X.2019.01.007

    [Objective] Real-world data is becoming more complex, sparse, and high-dimensional under the shock of big data, and modern ML models are accordingly deep and complicated, which raises challenges for designing a distributed machine learning (ML) system. Although researchers have developed many efficient centralized ML systems such as TensorFlow, PyTorch, and XGBoost, these systems suffer from two problems: (1) they cannot integrate well with existing big data systems, and (2) they are not general enough, being usually designed for specific ML models. [Methods] To tackle these challenges, we introduce Angel+, a large-scale ML platform based on parameter servers. [Results] With the power of parameter servers, Angel+ can efficiently support existing big data systems and ML systems without breaking the core of big data systems such as Apache Spark, and without degrading the computation efficiency of ML frameworks such as PyTorch. Furthermore, Angel+ provides algorithms such as model averaging (sketched below), gradient compression, and heterogeneity-aware stochastic gradient descent to deal with the huge communication cost and the straggler problem in distributed training. [Conclusions] We also enhance the usability of Angel+ by providing efficient implementations of many ML models, and we conduct extensive experiments to demonstrate the superiority of Angel+.
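
    To make the model-averaging idea concrete, the following single-process Python sketch imitates a parameter server that averages weights pushed by workers after local SGD; it is a toy illustration of the technique, not Angel+ code.

```python
# Parameter-server model averaging in miniature: each worker trains a
# local copy on its data shard, pushes weights, and the server averages.
import numpy as np

class ParameterServer:
    def __init__(self, dim: int):
        self.weights = np.zeros(dim)

    def average(self, worker_weights: list) -> None:
        # Model averaging: replace global weights with the workers' mean.
        self.weights = np.mean(worker_weights, axis=0)

def local_sgd(weights, X, y, lr=0.1, steps=10):
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(100, 3)), np.array([1.0, -2.0, 0.5])
y = X @ true_w
shards = np.array_split(np.arange(100), 4)      # one data shard per worker

ps = ParameterServer(dim=3)
for _ in range(20):
    pushed = [local_sgd(ps.weights, X[s], y[s]) for s in shards]
    ps.average(pushed)                          # workers pull the new model next round
print(ps.weights)                               # approaches [1.0, -2.0, 0.5]
```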

    Survey on Federated RDF Systems
    Peng Peng, Lei Zou
    Frontiers of Data and Computing, 2019, 1(1): 73-81. DOI: 10.11871/jfdc.issn.2096.742X.2019.01.008

    [Objective] The Resource Description Framework (RDF), a standard model for knowledge representation, has been widely used in scientific data management applications to represent scientific data as knowledge graphs, and SPARQL (the SPARQL Protocol and RDF Query Language) is the structured query language for accessing RDF repositories. As more and more data publishers release their datasets in RDF, how to integrate the RDF datasets of different publishers into a federated RDF system becomes a challenge. [Coverage] This paper provides an overview of studies on federated RDF systems. [Methods] The major differences among federated RDF systems lie in their strategies for source selection, the query decomposition it guides, and query processing optimization. [Results] Existing query decomposition and source selection strategies in federated RDF systems fall into two categories, metadata-based and ASK-based (the latter is sketched below); query optimization strategies in existing federated RDF systems are mostly join optimizations based on System-R-style dynamic programming. [Limitations] Existing federated RDF systems do not yet discuss how to support SPARQL 1.1. [Conclusions] Federated RDF systems can integrate distributed RDF graphs across different sources, making them an important direction for future research.
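
    The ASK-based strategy can be illustrated directly: probe each endpoint with a SPARQL ASK query per triple pattern and keep only the endpoints that answer true. The Python sketch below uses the standard SPARQL protocol and JSON results format; the endpoint URLs and the triple pattern are placeholders.

```python
# ASK-based source selection in miniature: an ASK probe per triple
# pattern tells us which endpoints can contribute matches.
import json
import urllib.parse
import urllib.request

def ask(endpoint: str, triple_pattern: str) -> bool:
    """Send `ASK { pattern }` to a SPARQL endpoint, return its boolean."""
    query = f"ASK {{ {triple_pattern} }}"
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["boolean"]

def select_sources(endpoints, patterns):
    """Map each triple pattern to the endpoints that can match it."""
    return {p: [e for e in endpoints if ask(e, p)] for p in patterns}

endpoints = ["http://example.org/sparql-a", "http://example.org/sparql-b"]
patterns = ["?drug <http://example.org/treats> ?disease"]
# print(select_sources(endpoints, patterns))  # requires live endpoints
```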

    SKS: A Platform for Big Data Based Scientific Knowledge Graph
    Yuanchun Zhou, Qingling Chang, Yi Du
    Frontiers of Data and Computing, 2019, 1(1): 82-93. DOI: 10.11871/jfdc.issn.2096.742X.2019.01.009

    [Objective] The big data knowledge graph for the field of science and technology is dedicated to providing researchers with more accurate, comprehensive, deeper, and broader search and analysis results, thus offering a practical and valuable reference for disciplinary research. [Scope of the literature] The article focuses on research, at home and abroad, on data-based science and technology evaluation methods, interdisciplinary research based on knowledge graphs, key technical methods for constructing knowledge graphs, and applications of knowledge graphs grounded in domain knowledge. [Methods] This paper presents SKS, a big data knowledge graph platform for the field of science and technology. Based on the overall architecture of the SKS platform, we expound the key technologies and platform tools for constructing knowledge graphs in this field, and we describe the relevant key technologies and applications in different domains. [Results] The SKS platform and its applications provide precise, multi-dimensional, and interrelated intelligent retrieval services for researchers while constructing a resource knowledge management system for related fields. [Limitations] The big data knowledge graph for science and technology is still developing; data quality (errors introduced by data fusion and the quality of the source data) affects the effectiveness of the platform to a certain extent. In the future, we hope to make further progress on data disambiguation. [Conclusions] With its strong semantic processing and relationship exploration abilities, the big data knowledge graph for science and technology can better organize massive data on personnel, institutions, achievements, events, etc., providing auxiliary functions for technology evaluation; a toy example of such relationship exploration follows. Its application in specific projects has been recognized by the corresponding domain experts.
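
    As a toy illustration of relationship exploration over such a graph, the following Python sketch stores invented (subject, predicate, object) triples about people and papers and traverses them to find collaborators; it is not the SKS platform's code, and the schema is assumed for the example.

```python
# Relationship exploration in miniature: index authorship triples,
# then answer "who has collaborated with whom" by traversal.
from collections import defaultdict

triples = [
    ("alice", "authored", "paper1"),
    ("bob",   "authored", "paper1"),
    ("bob",   "authored", "paper2"),
    ("carol", "authored", "paper2"),
    ("alice", "affiliated_with", "inst_x"),
]

# Index papers by author for fast traversal.
papers_of = defaultdict(set)
for s, p, o in triples:
    if p == "authored":
        papers_of[s].add(o)

def collaborators(person: str) -> set:
    """People who share at least one authored paper with `person`."""
    return {other for other, ps in papers_of.items()
            if other != person and ps & papers_of[person]}

print(collaborators("bob"))   # {'alice', 'carol'}
```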

    Big Data 3.0—The Key Technologies of Big Data in Post-Hadoop Era
    Wanggen Liu, Yuanhao Sun
    Frontiers of Data and Computing, 2019, 1(1): 94-104. DOI: 10.11871/jfdc.issn.2096.742X.2019.01.010

    [Objective] As cloud computing and new hardware technologies have been quickly adopted by industry, more and more users complain about the architecture of Hadoop: it is highly complex, insufficiently mature and stable, and inflexible for cloud computing. Transwarp redesigned the big data software stack so that users can adopt big data technology more easily and effectively. [Methods] The new stack includes a new resource management and scheduling layer that can manage tasks with different kinds of life cycles; a new storage management layer that can add or remove storage plugins for different data types and acts as a new distributed storage; and a unified DAG-based computing engine that can serve data warehousing, stream computing, graph computing, etc. (the DAG execution pattern is sketched below). A development interface supporting SQL and Python is provided to reduce coding complexity for developers. [Results] Big data technology can finally work well with cloud computing by using Kubernetes for resource management, and applications can run together with big data system software on one unified platform. [Conclusions] The refined big data system stack not only solves the technical issues of Hadoop but also makes big data system software work well with cloud computing and new hardware, which points to the future research direction of big data technology.
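
    To illustrate the unified DAG-based engine idea, here is a minimal Python sketch that runs a task graph in dependency order using the standard library's graphlib; it illustrates the general pattern only and is not Transwarp's implementation.

```python
# A DAG computing engine in miniature: tasks and their data dependencies
# form a directed acyclic graph, and each task runs once its inputs exist.
from graphlib import TopologicalSorter

def extract():      return [3, 1, 2]
def sort_data(xs):  return sorted(xs)
def total(xs):      return sum(xs)

# Edges point from a task to the tasks it depends on.
dag = {"extract": set(), "sort": {"extract"}, "sum": {"extract"}}
funcs = {"extract": extract, "sort": sort_data, "sum": total}

results = {}
for task in TopologicalSorter(dag).static_order():
    deps = [results[d] for d in sorted(dag[task])]
    results[task] = funcs[task](*deps)   # run once inputs are available
print(results)   # {'extract': [3, 1, 2], 'sort': [1, 2, 3], 'sum': 6}
```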

    PaddlePaddle: An Open-Source Deep Learning Platform from Industrial Practice
    Yanjun Ma, Dianhai Yu, Tian Wu, Haifeng Wang
    Frontiers of Data and Computing, 2019, 1(1): 105-115. DOI: 10.11871/jfdc.issn.2096.742X.2019.01.011

    [Objective] Deep learning is widely recognized as the core technology driving breakthroughs in artificial intelligence, and deep learning frameworks can be considered the operating systems of the artificial intelligence era. This paper comprehensively introduces PaddlePaddle, the only fully-functioning open-source deep learning platform developed in China. [Methods] A brief history of deep learning frameworks is given first, followed by an overview of PaddlePaddle, which comprises the core framework, toolkits, and service platforms. We then elaborate on the core technologies of PaddlePaddle, including the front-end programming language and the modeling paradigm, and summarize its main innovations. [Results] PaddlePaddle has been intensively tested in Baidu's production systems for years, with unique strengths in distributed training on ultra-large data and fast inference on servers, mobile devices, and edge devices. [Conclusions] The main innovations and the research and development trends are discussed systematically.
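
    As a minimal usage sketch, the following trains a tiny network with PaddlePaddle's imperative 2.x API; note that this API postdates the 2019 article, so treat it as an assumption about current usage rather than the interface the paper describes.

```python
# Fit a small regression network on random data with PaddlePaddle 2.x.
import paddle

model = paddle.nn.Sequential(
    paddle.nn.Linear(10, 32),
    paddle.nn.ReLU(),
    paddle.nn.Linear(32, 1),
)
opt = paddle.optimizer.Adam(learning_rate=1e-3, parameters=model.parameters())

x = paddle.randn([64, 10])   # toy inputs
y = paddle.randn([64, 1])    # toy targets

for step in range(200):
    loss = paddle.nn.functional.mse_loss(model(x), y)
    loss.backward()          # autograd in imperative (dygraph) mode
    opt.step()
    opt.clear_grad()
print(float(loss))           # loss decreases over training
```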
