[Objective] Through a comprehensive review of the current status and future development of genomics data analysis methods, we provide suggestions for improving the algorithms and tools for related omics data analysis in precision medicine, precision breeding, biosafety, biodiversity, and molecular evolution. [Results] Genomics data analysis mainly covers genomic, transcriptomic, and epigenomic data. At present, it faces challenges primarily because the data are massive, multidimensional, and heterogeneous. This review elaborates on the current status, applications, challenges, and prospects of algorithm and tool development for genomics data analysis. [Conclusions] The future direction of algorithm and tool development for genomics data analysis is to make full use of advanced technologies such as artificial intelligence, statistical models, and knowledge graphs, and to continuously optimize and develop more advanced algorithms and robust models that are error-tolerant, highly accurate, and highly efficient at a low cost of computing resources.
[Objective] Driven by big data and supported by information technology, it has become possible to break through the core problem of comprehensive research in resource science, promoting new developments and innovative applications of the discipline. [Methods] Based on the domain demands of the resource discipline, this paper expounds the frontiers of data analysis technology in resource science, including remote sensing monitoring, resource surveying, resource network mining, and comprehensive resource analysis, and takes the “Big Data Driven Resource Discipline Innovation Platform” supported by the 13th Five-Year Informatization Plan of the Chinese Academy of Sciences as an example to demonstrate its typical application architecture. [Results] Based on application cases, three big data-driven scenarios from typical scientific research activities in the resource discipline are presented: ecological risk prevention for transportation and pipeline control in the China-Mongolia-Russia economic corridor, assessment of the carrying capacity of resources and environment in the Beijing-Tianjin-Hebei region, and big data-driven assessment of Beautiful China. [Conclusions] Big data-driven data analysis technologies in the resource discipline have great potential, and some of them have already been applied in practice. However, more new methods and models adapted to the development of the resource discipline are needed to promote its paradigm shift toward comprehensive scientific research.
[Objective] With the development of scientific big data technology, problem-oriented analysis has become the norm. In view of the high cost of data migration and the reliance of data analysis on scientific big data, it is necessary to provide a scientific data analysis service engine in the data cloud, offering scalable computing and storage resources, optional algorithm resource libraries, and high-efficiency access interfaces with convenient user interaction tools and a secure user access policy. Scientists can thereby be freed from problems such as large-scale data migration and adaptation to programming languages, algorithm environments, version issues, and resource calls. [Methods] An interactive analysis service management engine in the scientific data cloud is presented. In our solution, resource nodes are scaled out through automatic registration; resource nodes can be physical or virtual hosts. When the utilization rate of computing resources reaches a threshold, the management node starts resource registration: a new resource host is registered, and its available container instances are added to the pool. The optional algorithm resource libraries and the high-efficiency access interfaces for data and computing resources are versioned in the form of container images for constructing the computing resource pools. The health of the container instance pool is maintained within each host. Instance lifecycle management is performed according to the maximum usage time and maximum silent time of each instance. With the resource pool kept at a fixed size, each container instance in the pool is in one of four states: preparing, ready, in use, or disappearing.
The scientific analysis service system comprises several components, including the proxy component, the orchestration module component, the user authentication component, the monitoring management component, the buffer component, and a cache database. When a user accesses the system, resources are allocated according to the selected algorithm library and the resource pool utilization rate, and a unique identity port (PID) is assigned for user access through proxy configuration. Access takes place over a secure encrypted network, through which users interact with programming components or interactive application components that can use data and computing resources on the cloud. Each interactive component runs in a separate container instance for effective resource isolation. [Results] Based on the interactive analysis service management engine in the scientific data cloud, iAnalysis (IA for short), an interactive analysis cloud service system V1.0, provides a unified cloud resource management service for scientific data analysis. It can be used directly by end-user scientists through the IA service portal, or called by other existing data systems in the form of a Docker container. To date, IA has provided several scientific cloud analysis services in fields such as life and health, ecological environment, meteorology, and hydrology. It has been applied to major projects such as the Strategic Priority Research Program of the Chinese Academy of Sciences (both A and B) and the Major Project of the State Tobacco Monopoly Administration. It has also been applied to several National Scientific Data Centers, such as the National Microbial Science Data Center and the National Space Science Data Center, and to public platforms such as GSCloud (www.gscloud.cn) and DarwinTree (www.darwintree.cn). It also provides common coding tools for “R”, “TensorFlow”, “Data Science”, “All Spark”, and so on.
Users can access the interactive programming component (iJupyter) or the interactive application component (iWorkflow) over HTTPS to use the data and computing resources of the data cloud.
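The four-state instance lifecycle described above can be sketched as a small state machine. This is a minimal illustration, assuming simple wall-clock thresholds; the class name, method names, and default limits are hypothetical and are not the engine's actual API.

```python
import time

# Lifecycle states of a pooled container instance, as described above.
PREPARING, READY, IN_USE, DISAPPEARING = "preparing", "ready", "in use", "disappearing"

class ContainerInstance:
    """Illustrative sketch of pool-managed instance lifecycle (hypothetical API)."""

    def __init__(self, max_usage_s=3600, max_silent_s=600):
        self.state = PREPARING
        self.max_usage_s = max_usage_s    # maximum usage time
        self.max_silent_s = max_silent_s  # maximum silent (idle) time
        self.started = self.last_active = time.monotonic()

    def mark_ready(self):
        """Image pulled and container started: instance joins the ready pool."""
        self.state = READY

    def assign(self):
        """Hand the instance to a user session."""
        self.state = IN_USE
        self.started = self.last_active = time.monotonic()

    def touch(self):
        """Record user activity to reset the silent-time clock."""
        self.last_active = time.monotonic()

    def reap_if_expired(self, now=None):
        """Retire the instance once usage or idle time exceeds its limit."""
        now = time.monotonic() if now is None else now
        if self.state == IN_USE and (
            now - self.started > self.max_usage_s
            or now - self.last_active > self.max_silent_s
        ):
            self.state = DISAPPEARING
        return self.state
```

A pool manager would periodically call `reap_if_expired` on each instance and launch a fresh `preparing` instance for every one that disappears, keeping the pool at its fixed size.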
[Objective] The High Energy Photon Source (HEPS) is one of the key scientific and technological infrastructure projects in China. It provides an important platform for original and innovative research in basic science and engineering to meet the demands of major national strategies; on it, high-throughput synchrotron radiation experiments can be carried out, including X-ray diffraction experiments with ultra-high spatial, time, and energy resolution. It is estimated that over 200 TB of raw data on average, and up to 500 TB at peak, will be produced every day by the 15 public beamlines and stations of the first construction phase. These experimental data will be permanently stored in central storage, shared with research members, and processed and analyzed in near real time. [Methods] The science data platform consists of multiple systems, including IT infrastructure, scientific software, network, computing units, storage devices, and public information services. [Results] It will provide scientific researchers, engineers, and users with networks, computing units, storage devices, and other infrastructure capabilities for scientific research collaboration, data transmission, data storage, data analysis, and data sharing, as well as services for scientific software, general software, general information systems, information security, etc.
[Objective] A multi-aspect rating system can help customers better understand an item or service, because it provides not only the overall rating but also more detailed aspect ratings. By modeling the rating patterns in multi-aspect rating systems, we can better discover latent rating groups and quantitatively understand the rating behaviors within these groups. This can also help service providers improve their service and attract more targeted customers. However, due to the complex nature of multi-aspect rating systems, modeling their rating patterns is challenging. [Methods] To address this problem, we propose a two-step framework to learn the rating patterns of multi-aspect rating systems. Specifically, we first propose a multi-factorization relationship learning (MFRL) method to obtain the user and item aspect factor matrices. MFRL unifies matrix factorization, multi-task learning, and task relationship learning into one optimization framework. We then model the rating patterns by group-wise overall rating prediction via mixture regression, whose inputs are the user and item factor vectors learned by MFRL. [Results] We apply the proposed framework to a real-world dataset (a hotel rating dataset crawled from TripAdvisor.com) to evaluate its performance. Extensive experimental results demonstrate the effectiveness of the proposed framework. [Conclusions] Individual and group heterogeneity can affect the behaviors behind rating acts and should be taken into account when modeling rating patterns.
[Objective] With the advent of the "Big Data" era, big data technology has become one of the hottest technologies among materials science researchers because it can significantly accelerate the development of materials. Material big data technology based on material database platforms is one of the three core technologies of "Materials Genome Engineering". Therefore, the construction of material databases is very important for accelerating the development of new materials. [Methods] This article summarizes the construction and applications of materials science databases in China and abroad, and puts forward future research directions based on their development trends. [Results] The advancement of the materials genome (engineering) concept and the rapid development of big data technologies have promoted the establishment of a large number of materials science databases in China and abroad. Compared with developed countries, materials science database construction in China started relatively late. However, with the support of the ‘Thirteenth Five-Year Plan’ national key research and development program of China, the construction of China's materials science database platforms is expected to achieve initial results in the next few years. [Conclusions] The construction of materials science databases has become an indispensable element of materials genome engineering. However, many difficulties remain to be resolved in database construction and application, and the development of a materials science database remains a challenging task.
[Objective] As a distinctive data science approach, big data brings great opportunities for geological research. Meanwhile, characteristics of geological data such as multi-source heterogeneity, spatial-temporal correlation, multiple scales, and uncertainty pose great challenges for data processing. [Methods] On the basis of a detailed analysis of these characteristics, this study proposes a geological data processing framework to solve the problems of multi-source data integration and heterogeneous data synthesis in the geoscience field, combining a variety of big data technologies such as data association, middleware systems, microservices, and container technology. In addition, geological models are embedded in the framework to strengthen the domain expertise of data processing. [Results] The framework and its key technologies have been applied in the construction of the National Glacier and Frozen Soil Scientific Data Center, the disaster datasets for the China-Pakistan Corridor, and the High and Cold Environment United Observation Cloud. [Conclusions] This study is expected to broaden the dimensions of data processing and support multi-theme, multi-scale research and knowledge discovery in geoscience. In the future, the framework will be adapted to process geological data from a wider range of sources, such as the internet, social networks, and printed media. The integration of artificial intelligence technologies will enable it to deliver smarter and faster geological data processing.
[Objective] Agriculture is an important area for big data technology applications. By reviewing and analyzing the focuses and directions of big data applications in agriculture, we aim to find an effective way to promote the development of agricultural big data technologies. [Methods] This paper introduces big data technology from a holistic perspective, and puts forward the development demands and characteristics of agricultural big data in key fields including precision agriculture, the agricultural Internet of Things, and agricultural remote sensing. [Results] The definition and characteristics of agricultural big data technologies are analyzed from the perspectives of government and academia. The key technologies for the acquisition, management, and processing of agricultural big data are introduced, and big data-based intelligent control for precision agriculture, agricultural production environment monitoring, agricultural remote sensing, and early warning technologies are analyzed. [Limitations] As one of the important fields of big data applications, agricultural big data needs to be further strengthened in aspects such as data platforms, management mechanisms, and technical support. [Conclusions] Agricultural big data research will be an important direction for agricultural and rural innovation and development. It will play an important role in reshaping agricultural production relations, building agricultural information ecosystems, improving the rural governance system, and supporting the development of green agriculture.
[Objective] This paper focuses on image recognition of agricultural diseases and explores the integration of different machine learning methods at different data scales to improve recognition accuracy. [Methods] Focusing on the problem of machine learning modeling with small-scale agricultural disease image data, the deep transfer learning method is introduced and specific experiments are conducted to explore how to improve the modeling effect under small-sample conditions. [Results] On a high-quality agricultural disease image dataset, the introduced deep transfer learning method can effectively improve the accuracy of agricultural disease image recognition. [Limitations] In machine learning methods based on deep neural networks, both the quality and the scale of agricultural disease images influence the modeling effect. In the future, we will further explore modeling methods that generalize better across data quality and scale. [Conclusions] In agricultural disease image recognition, adopting the deep transfer learning method can effectively improve the machine learning modeling effect and the final recognition accuracy under small-sample conditions, providing good technical support for the subsequent construction of agricultural disease image recognition systems.
[Objective] In this paper, we aim to improve detection performance for small objects by considering their characteristics under deep learning-based detection frameworks. [Methods] This paper improves small object detection and recognition performance from different aspects, including feature fusion, context learning, and attention mechanisms. Since the features of small objects are not salient, a bidirectional feature fusion method is proposed to improve the feature expression capability for small objects. In addition, a novel method is proposed to improve detection performance by using the context information of small objects. Furthermore, to better identify the categories of small objects, an attention transfer method is proposed to improve the recognition rate. [Results] Experimental results show that the three proposed methods significantly improve detection and recognition performance for small objects on public datasets. [Conclusions] Research on feature fusion, context utilization, and attention mechanisms is very valuable for improving small object detection in complex scenes.
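The general idea behind fusing features across scales can be illustrated with the top-down half of such a pathway: upsample a coarse, semantically strong feature map and combine it element-wise with a fine, spatially detailed one. This is a minimal sketch of the generic technique on single-channel maps, not the paper's bidirectional fusion method; all function names are illustrative.

```python
def upsample_nearest(fmap, factor=2):
    """Nearest-neighbor upsampling of a 2-D feature map (list of lists)."""
    return [
        [fmap[i // factor][j // factor] for j in range(len(fmap[0]) * factor)]
        for i in range(len(fmap) * factor)
    ]

def fuse_top_down(fine, coarse):
    """Element-wise sum of a fine-resolution map with the upsampled coarse
    map: the top-down half of a bidirectional fusion pathway."""
    up = upsample_nearest(coarse, factor=len(fine) // len(coarse))
    return [
        [f + u for f, u in zip(frow, urow)] for frow, urow in zip(fine, up)
    ]
```

A bidirectional variant would additionally downsample fine maps into the coarse levels (the bottom-up direction), so small-object features are reinforced from both sides.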
[Objective] The distributed testing framework is a method for large-scale testing by clusters. It controls a large number of inexpensive hosts through a central control system and makes them work in a standardized mode, which has important practical significance for testing large-scale systems. [Methods] The article first introduces the distributed task cluster deployment scheme, the master central control architecture, and the design of a testing framework consisting of three implementation modules. It then introduces the software and hardware environment of the distributed testing framework and the architecture of the object cloud storage system under test. Finally, single-bucket tests and throughput tests performed on the storage system are presented. [Conclusions] The test results show that the large-scale distributed testing framework is fast, supports multiple test modes, and is highly efficient, satisfying the performance requirements of large-scale system testing.
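The master central control pattern described above can be sketched with a shared task queue standing in for the master and worker threads standing in for the controlled hosts. This is a single-process illustration of the dispatch pattern only; all names are hypothetical, not the framework's real interfaces.

```python
import queue
import threading

def run_distributed_test(tasks, n_workers=4):
    """Dispatch standardized test tasks from a central queue to workers
    and collect their results (a minimal stand-in for master/host RPC)."""
    task_q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = task_q.get_nowait()
            except queue.Empty:
                return                 # queue drained: worker exits
            outcome = task()           # run one standardized test task
            with lock:
                results.append(outcome)
            task_q.task_done()

    for t in tasks:
        task_q.put(t)
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

In the real framework the queue and workers live on separate machines, but the control flow is the same: the master hands out uniform tasks and aggregates per-host results into overall throughput figures.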
[Objective] The Sustainable Development Goals (SDGs) have become the most important sustainable development agenda in the world. However, the high rate of missing data for SDG indicators has hindered the UN's effective monitoring of the implementation of the goals in various countries. Completing the missing data for SDG indicators is technically challenging and of great significance in urging countries to achieve the goals. [Methods] This paper proposes a transfer learning method named TLM, which incorporates the maximal information coefficient (MIC) for feature selection. It constructs features for the target data from other public data and builds a prediction model with related regression techniques to predict the missing values of the target data. [Results] This article takes the dataset of SDG indicator 3.2.1 in a specific country as an example and uses TLM to predict the missing values of the target data, verifying the effectiveness of TLM. [Limitations] Because many factors can affect SDG indicators, exploring more correlation analysis methods that can be combined with TLM to make more accurate predictions of missing values is the focus of our future research. [Conclusions] The TLM method, which combines MIC with transfer learning, can improve the accuracy of data prediction. It can also provide an effective reference for researchers in SDG-related fields when dealing with missing data problems.
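The select-then-regress pipeline described above can be sketched as follows. Since MIC requires a full grid-search implementation, this sketch substitutes absolute Pearson correlation as the feature-ranking criterion and fits ordinary least squares on the single best source feature; the function names, the feature dictionary, and this simplification are all assumptions for illustration, not the TLM method itself.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_and_fit(features, target):
    """Rank candidate source features by |correlation| with the target
    (a stand-in for TLM's MIC ranking), then fit simple least-squares
    regression on the best one to predict missing target values."""
    best_name = max(features, key=lambda k: abs(pearson(features[k], target)))
    x, y = features[best_name], target
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    return best_name, (lambda v: intercept + slope * v)
```

In TLM proper, MIC replaces the correlation ranking (so nonlinear dependencies are also captured), and the regression is trained on features transferred from other public datasets rather than a single column.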
[Objective] In this paper, the kernel functions of PhoToNs, an astronomical N-body simulation code based on the fast multipole method (FMM) and the particle-mesh method (PM), are accelerated and optimized with CUDA on a multi-GPU platform. [Methods] The main optimization methods adopted in the CUDA kernels include algorithm parameter optimization, the use of page-locked memory and CUDA streams, and the use of mixed precision and the fast math library. [Results] The kernel function for short-range force interaction is deeply optimized, achieving a speedup of about 410× on four Titan V GPUs over the pure MPI code running on four Intel Xeon CPU cores. [Conclusions] The optimization methods in this paper can support further algorithm research and hyper-scale N-body simulation on other high-performance GPU-based heterogeneous platforms.