[Objective] To demonstrate the important driving role of data and computing in scientific research, this paper studies the development and essence of data, computing, and scientific research. [Methods] This paper expounds the essence of data technology and computing technology. Several typical cases illustrate that data and computing, as one of the keys to scientific activity, greatly expand the depth and breadth of scientific and technological innovation. [Results] Data are quantitative or qualitative records of natural and social phenomena and of scientific experiments, and they are an important basis for scientific research. Data technology refers to a series of scientific and technological activities such as collecting, classifying, transporting, storing, analyzing and visualizing data. Its goal is to turn data into information, knowledge and patterns that help human beings understand the natural world and human society. Since the invention of the von Neumann computer, the rapid development of computing technology driven by Moore's law has made the research and application of data technology and artificial intelligence more active. The collaborative progress of data technology, artificial intelligence and computing technology has brought about a leap forward in human understanding of the knowledge and patterns of nature and society. [Conclusions] Data technology, computing technology and artificial intelligence therefore provide the most basic technology platform for building the "digital twin" of the "human-machine-object" ternary fusion. Advanced data and computing technology, represented by big data and artificial intelligence, will integrate theory, experiment and simulation and form a new scientific research paradigm. Over the past thousand years of scientific development, people, capital, tools (scientific instruments) and methods (theories) have become the necessary input elements of scientific research.
In the past decades, computers have helped scientists carry out a great deal of calculation and have become one kind of tool among these input elements. With the rapid development of data and computing technology, however, these technologies no longer play only an auxiliary and supporting role in scientific research: they can also rely on their own logical methods to carry out scientific research within the "digital twin" of the "human-machine-object" ternary fusion. Data and computing technology will therefore be regarded as a new and indispensable input element of scientific research.
[Objective] To reduce the difficulty of parallel programming and accelerate the development of application programs, this paper designs and implements a parallel framework, SC_Tangram, in which SC stands for scientific computing and Tangram implies flexible assembly. [Methods] To guarantee massively parallel scalability and adaptivity, the Charm++ programming model is adopted in the runtime system layer of the framework. Following a component-based software development method, SC_Tangram encapsulates and hides the common parts, which users invoke through component or configuration-file interfaces. [Results] At the current development stage, the framework has been applied to mechanical calculation, phase field simulation and other applications. The experimental results show that it can perform more efficient computations. [Limitations] At present, the functional modules of the framework are not comprehensive, so corresponding interfaces still need to be developed for different application requirements. [Conclusions] SC_Tangram can support the development of both common and application-specific components. As more functional components are developed within the framework, it will be applied to more fields of scientific computing in the future.
[Objective] This article briefly reviews big data theory and systems, including the research background, the technical architecture and the key technologies, followed by an outlook on future research directions. [Method] After a brief introduction to big data processing theory, this paper introduces the key technologies for big data systems from three aspects: data parallel processing methods, Resource Description Framework (RDF) graph data query and matching, and big data analysis technologies. [Results] The speed of data generation will accelerate further in the near future, so how to quickly process data at the edge is likely to become a research trend. [Conclusion] New technologies for big data theory and systems still warrant further attention, particularly research on data processing with edge computing and fog computing in Internet of Things scenarios.
[Objective] To provide a reference for computational innovation, this paper proposes an integration platform for big data and artificial intelligence analysis and application, driven by industrial needs, to promote the intelligent upgrading of traditional industries and the industrialization of intelligent technologies. [Methods] Based on an understanding of both data features and platform requirements in industry-oriented application scenarios, the application-driven platform hierarchy in the supercomputer center is designed as a fused architecture consisting of supercomputing, big data, cloud computing, artificial intelligence and the Internet of Things, covering physical facilities, system software and the management system. The supercomputer center mainly integrates service-related hardware facilities for big data, supercomputing and cloud computing to realize data sharing, high-performance processing and data security control. By eliminating the differences between data sources, the platform provides a unified, standard data access interface for upper-layer applications, which promotes the standardization of big data processing in related industries for resource and data sharing. As an important field of big data applications, the high-efficiency industrial big data application platform combines with the industrial cloud platform to realize data collection, transmission, collaboration and application by integrating physical devices, virtual networks and big data analysis methods. The characteristics of industrial big data and artificial intelligence call for innovative applications that support production tasks such as design, production, sales, operation and maintenance. [Results] Typical applications built on the platform in industrial fields such as equipment manufacturing, networked vehicles and medical health show good applicability.
In manufacturing, the platform serves as a tool for supplier quality management and control, carrying out anomaly detection and prediction for parts and components and enabling control over the entire product chain. For networked vehicles, collecting vehicle driving data and modeling it with deep learning makes it possible to analyze driving behavior and the safety of autonomous driving. In disease screening, big data and artificial intelligence analysis of radiological images, pathology images and electronic medical records can help doctors complete both repetitive and complex analysis tasks. [Limitations] Because the platform provides services as a public open platform, institutional credibility and data security are important issues to be solved in the next step. [Conclusions] The application-driven big data and artificial intelligence integration platform is an important part of social development and of a government-controllable ecosystem for the scientific development of the intelligent industry, and it helps address the practical problem of insufficient innovation capability in China's intelligent industry.
[Objective] To solve a series of problems brought about by large-scale scientific data visualization, and to provide a flexible and scalable scientific data visualization framework, this paper proposes GPVis, a scientific visualization system for large-scale data. [Methods] We analyze the challenges and opportunities faced by scientific data visualization at both the method and the tool level. A new visual computing and service framework, GPVis, is proposed that uses advanced technologies such as data pre-organization, graphics rendering, high-performance computing, human-computer interaction and VR/AR. [Results] For some common visualization methods, this paper proposes several visualization processing models for the GPVis framework and describes several application cases of the system in typical fields, together with their specific implementation methods and results. In these cases, the different visualization needs of scientific researchers for data analysis were met. [Limitations] GPVis needs more intelligence for data analysis, which leads us to incorporate artificial intelligence technology into future developments and to introduce more natural human-computer interaction methods. [Conclusions] GPVis provides a powerful and scalable platform framework for large-scale scientific data visualization, enabling flexible component design for different data types and application requirements. As the system continues to evolve with more complete framework functions and visualization algorithms, it will be applied to more scientific fields in the future.
[Objective] In the era of big data, real-world data has become much more complex, sparse and high-dimensional. Modern machine learning (ML) models have accordingly become deep and complicated, which raises challenges for the design of distributed ML systems. Although researchers have developed many efficient centralized ML systems such as TensorFlow, PyTorch and XGBoost, these systems suffer from two problems: (1) they cannot integrate well with existing big data systems, and (2) they are not general enough and are usually designed for specific ML models. [Methods] To tackle these challenges, we introduce Angel+, a large-scale ML platform based on parameter servers. [Results] With the power of parameter servers, Angel+ can efficiently support existing big data systems and ML systems without either breaking the core of big data systems such as Apache Spark or degrading the computation efficiency of current ML frameworks such as PyTorch. Furthermore, Angel+ provides algorithms such as model averaging, gradient compression and heterogeneity-aware stochastic gradient descent to deal with the huge communication cost and the straggler problem in distributed training. [Conclusions] We also enhance the usability of Angel+ by providing efficient implementations of many ML models. We conduct extensive experiments to demonstrate the superiority of Angel+.
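As a rough illustration of two of the communication-reduction techniques named above, the following Python sketch shows model averaging and one possible form of gradient compression on plain lists. It is not the platform's implementation: the function names `average_models`, `top_k_compress` and `decompress`, the choice of top-k sparsification as the compression scheme, and the toy values are all assumptions made for illustration.

```python
from typing import Dict, List


def average_models(worker_models: List[List[float]]) -> List[float]:
    """Model averaging: element-wise mean of the workers' parameter vectors."""
    n_workers = len(worker_models)
    return [sum(params) / n_workers for params in zip(*worker_models)]


def top_k_compress(gradient: List[float], k: int) -> Dict[int, float]:
    """Keep only the k largest-magnitude entries as {index: value}."""
    largest = sorted(
        range(len(gradient)), key=lambda i: abs(gradient[i]), reverse=True
    )[:k]
    return {i: gradient[i] for i in sorted(largest)}


def decompress(sparse: Dict[int, float], size: int) -> List[float]:
    """Rebuild a dense gradient, filling dropped entries with zero."""
    return [sparse.get(i, 0.0) for i in range(size)]


# Hypothetical toy data: three workers, four parameters each.
models = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [2.0, 2.0, 2.0, 2.0]]
averaged = average_models(models)

# Only the two largest-magnitude gradient entries are sent over the network.
gradient = [0.1, -2.0, 0.05, 1.5]
sparse = top_k_compress(gradient, k=2)
restored = decompress(sparse, len(gradient))
```

In this sketch each worker would transmit only the `sparse` dictionary instead of the full gradient, which is where the communication saving comes from; practical schemes usually also accumulate the dropped entries locally, which is omitted here.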
[Objective] The Resource Description Framework (RDF), a standard model for knowledge representation, has been widely used in scientific data management applications to represent scientific data as a knowledge graph. SPARQL Protocol and RDF Query Language (SPARQL) is the structured query language used to access RDF repositories. As more and more data publishers release their datasets in the RDF model, how to integrate the RDF datasets provided by different publishers into a federated RDF system becomes a challenge. [Coverage] In this paper we provide an overview of studies of federated RDF systems. [Methods] The major differences among federated RDF systems lie in their strategies for source-selection-guided query decomposition and for query processing optimization. [Results] Existing query decomposition and source selection strategies in federated RDF systems can be divided into two categories: metadata-based and ASK-based strategies. Query optimization strategies in existing federated RDF systems are mostly join optimizations based on System-R style dynamic programming. [Limitations] Existing federated RDF systems still do not discuss how to support SPARQL 1.1. [Conclusions] Federated RDF systems can integrate distributed RDF graphs from different sources, which makes them an important direction for future research.
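The idea behind ASK-based source selection can be sketched in a few lines of Python: for each triple pattern in a query, every source is asked whether it holds at least one matching triple, and only the sources that answer yes are kept for that pattern. This is a minimal in-memory simulation rather than any surveyed system's code. Real systems send SPARQL ASK queries to remote endpoints, whereas here the sources are plain sets of triples, and the names `ask`, `select_sources`, `endpoint_a` and `endpoint_b` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
# A pattern uses None for a variable position.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def ask(source: Set[Triple], pattern: Pattern) -> bool:
    """Simulate a SPARQL ASK query: True if any triple matches the pattern."""
    return any(
        all(p is None or p == t for p, t in zip(pattern, triple))
        for triple in source
    )


def select_sources(
    sources: Dict[str, Set[Triple]], patterns: List[Pattern]
) -> Dict[Pattern, List[str]]:
    """For each triple pattern, list the sources whose ASK answer is True."""
    return {
        pattern: sorted(
            name for name, data in sources.items() if ask(data, pattern)
        )
        for pattern in patterns
    }


# Hypothetical sources and query patterns.
sources = {
    "endpoint_a": {("alice", "knows", "bob"), ("alice", "worksAt", "lab1")},
    "endpoint_b": {("lab1", "locatedIn", "beijing")},
}
patterns = [
    (None, "worksAt", None),
    (None, "locatedIn", None),
    (None, "knows", "carol"),
]
relevant = select_sources(sources, patterns)
```

A pattern with an empty source list, such as the third one here, can be answered without contacting any source, and the remaining patterns are routed only to the sources that can contribute results.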
[Objective] The big data knowledge graph in the field of science and technology is dedicated to providing researchers with more accurate, comprehensive, deeper and broader search and analysis results, and thus to providing a practical and valuable reference for disciplinary research. [Scope of the literature] The article focuses on research into data-based scientific and technological evaluation methods at home and abroad, interdisciplinary research based on knowledge graphs, the key technical methods for constructing knowledge graphs, and applications of knowledge graphs based on domain knowledge. [Methods] This paper presents SKS, a big data knowledge graph platform for the field of science and technology. Based on the overall architecture of the SKS platform, we expound the key technologies and platform tools for constructing knowledge graphs in the field of science and technology, and give relevant key technologies and applications in different fields. [Results] The SKS platform and its applications provide a precise, multi-dimensional and interrelated intelligent retrieval service for researchers while constructing a resource knowledge management system for related fields. [Limitations] The big data knowledge graph in the field of science and technology is constantly developing. Data quality (errors caused by data fusion and the quality of the source data) affects the application effect of the platform to a certain extent. In the future, we hope to do more work on data disambiguation. [Conclusions] The big data knowledge graph in the field of science and technology, with its strong semantic processing and relationship exploration abilities, can better organize massive data on personnel, institutions, achievements, events and so on, and thereby provides auxiliary functions for technology evaluation. Its application effects in specific projects have been recognized by the corresponding domain experts.
[Objective] As cloud computing and new hardware technologies have been quickly adopted by industry, more and more users complain about the architecture of Hadoop because of its high complexity, its lack of maturity and stability, and its inflexibility in cloud environments. Transwarp redesigned the big data software stack to make big data technology better and easier to use. [Methods] The new stack includes a new resource management and scheduling layer, which can manage tasks with different kinds of life cycles; a new storage management layer, which can add or remove storage plugins for different data types and acts as a new distributed storage system; and a unified DAG-based computing engine, which can be used for data warehousing, stream computing, graph computing and so on. A development interface supporting SQL and Python is designed to reduce coding complexity for developers. [Results] Big data technology can finally work well with cloud computing by using Kubernetes for resource management. In addition, applications can work well with big data system software using these technologies on one unified platform. [Conclusions] By refining the big data system stack, we not only solved the technical issues related to Hadoop but also made big data system software work well with cloud computing and new hardware, which points to a research direction for big data technology in the future.
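To make the notion of a DAG-based computing engine concrete, the sketch below runs a small job as a directed acyclic graph of tasks, executing each task only after its dependencies have finished. It is a generic illustration under stated assumptions, not Transwarp's engine: the `run_dag` helper and the three-stage job are hypothetical, scheduling is single-threaded, and the standard-library `graphlib` module it relies on requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def run_dag(
    tasks: Dict[str, Callable[[Dict[str, object]], object]],
    deps: Dict[str, List[str]],
) -> Dict[str, object]:
    """Run each task after all of its dependencies; return results by name."""
    results: Dict[str, object] = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        results[name] = tasks[name](results)
    return results


# Hypothetical three-stage job: load -> filter -> count.
tasks = {
    "load": lambda r: [3, 8, 1, 9, 4],
    "filter": lambda r: [x for x in r["load"] if x > 3],
    "count": lambda r: len(r["filter"]),
}
# Each task maps to the list of tasks it depends on.
deps = {"load": [], "filter": ["load"], "count": ["filter"]}
results = run_dag(tasks, deps)
```

Because the execution order is derived from the dependency graph rather than from the order in which tasks are declared, the same runner could in principle serve batch, streaming or graph workloads expressed as task graphs, which is the appeal of a unified DAG-based design.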
[Objective] Deep learning is widely recognized as a core technology driving the breakthroughs in artificial intelligence. Deep learning frameworks can be considered the operating systems of the artificial intelligence era. This paper gives a comprehensive introduction to PaddlePaddle, the only fully functioning open-source deep learning platform in China. [Methods] A brief history of deep learning frameworks is given first, followed by an overview of PaddlePaddle, which comprises the core framework, toolkits and service platforms. We then elaborate on the core technologies of PaddlePaddle, including the front-end programming language and the modeling paradigm. Finally, the main innovations in PaddlePaddle are summarized. [Results] PaddlePaddle has been intensively tested in production at Baidu for years, with unique features in supporting distributed training on ultra-large data and fast inference on servers, mobile devices and edge devices. [Conclusions] The main innovations and the research and development trends are discussed systematically.
[Objective] This article introduces the research background, technical framework and key technologies of the data mid-end, as well as its application in industry, and finally proposes directions for future research and application based on technology development trends. [Methods] The research background part summarizes existing research on the data mid-end and related fields in China and other countries. The technical architecture part synthesizes research results at home and abroad by surveying applications in various industries and puts forward a general architecture for the data mid-end. The industry application part introduces the application status and value of the data mid-end in the Internet sector, traditional industries and government departments. The future trends and prospects part discusses the future development of the data mid-end based on relevant technologies. [Results] Based on the technical framework described in the article, the data mid-end has been preliminarily applied in relevant industries, with the Internet, finance, government affairs and other sectors leading the trend. [Conclusion] The technologies of the data mid-end will develop towards greater automation and intelligence. The upper-layer business applications supported by the data mid-end will see explosive growth in various industries, driven by a series of relevant technological breakthroughs.