Frontiers of Data and Computing, 2025, Vol. 7, Issue 2: 49-59.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.02.006

doi: 10.11871/jfdc.issn.2096-742X.2025.02.006

• Special Issue: 10th Anniversary of the China Science and Technology Cloud •

Research on Storage Optimization and Efficient Preprocessing Methods for SKA-MWA Astronomical Data

ZHOU Han1, TANG Jianing1, XUE Mengyao2, WU Kaichao1,*, ZHANG Bo1

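  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  2. National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100101, China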
  • Received: 2025-02-19 Online: 2025-04-20 Published: 2025-04-23
  • Corresponding author: WU Kaichao
  • About the authors: ZHOU Han is a master's student at the University of Chinese Academy of Sciences and in the Science and Technology Cloud Technology and Application Development Department of the Computer Network Information Center, Chinese Academy of Sciences. Her research interests include containerization and parallel computing.
    In this paper, she is responsible for the research on parallel optimization schemes and for writing the article.
    E-mail: zhouhan221@mails.ucas.ac.cn
    WU Kaichao is a professor-level senior engineer in the Science and Technology Cloud Technology and Application Development Department of the Computer Network Information Center, Chinese Academy of Sciences. His research interests include data-intensive computing and astronomical data processing.
    In this paper, he is responsible for guiding the overall design and optimization of the experimental scheme.
    E-mail: kaichao@cnic.cn
  • Funding:
    National Natural Science Foundation of China (6217023073)

Research on Storage Optimization and Efficient Pre-Processing Methods for SKA-MWA Astronomical Data

ZHOU Han1, TANG Jianing1, XUE Mengyao2, WU Kaichao1,*, ZHANG Bo1

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China
  2. National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100101, China
  • Received: 2025-02-19 Online: 2025-04-20 Published: 2025-04-23
  • Contact: WU Kaichao

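Abstract:

[Background] The Murchison Widefield Array (MWA) is a low-frequency precursor telescope of the Square Kilometre Array (SKA) and is widely used to study astronomical phenomena such as pulsars. Because its data are transferred, read, and written at large scale and its processing steps are coupled, read-write performance is low, which limits data-processing efficiency. [Objective] To improve MWA data-processing efficiency, the storage layout is optimized through preprocessing to relieve the read-write bottleneck in data processing. [Methods] Based on an analysis of MWA data characteristics and the computational workflow, a vertical data layout strategy is proposed. Combining a local computing mode with a pipeline architecture enables efficient data preprocessing and completes the layout adjustment. [Results] The scheme optimizes the data access strategy: packing and compression reduce the number of files to 1/40 and the data volume to 70% of the original. Combined with the local computing mode, this lowers the I/O load on shared storage and can greatly improve the efficiency of astronomical data analysis. With the local computing mode, the computational efficiency of data preprocessing improves more than threefold. [Conclusions] The data layout strategy and preprocessing optimization proposed in this paper improve the storage performance of SKA-MWA astronomical data and the computational efficiency of subsequent beamforming, providing high-quality data support for astronomical computing. The method is general and has broad application prospects.

Keywords: data preprocessing, distributed parallelism, storage optimization, local storage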

Abstract:

[Context] The Murchison Widefield Array (MWA) is a low-frequency precursor telescope for the Square Kilometre Array (SKA) and is widely used to study astronomical phenomena such as pulsars. Because its data are transferred, read, and written at large scale and its processing steps are coupled, read-write performance is low, which limits data-processing efficiency. [Objective] To improve the data-processing efficiency of the MWA, the storage layout is optimized through pre-processing to alleviate read-write bottlenecks. [Methods] By analyzing the data characteristics and computational workflow of the MWA, a vertical data layout strategy is proposed. Combining a local computation mode with a pipeline architecture achieves efficient data pre-processing and completes the layout adjustment. [Results] The proposed solution optimizes the data access strategy: packing and compression reduce the number of files by a factor of 40 and the data volume to 70% of the original. With the local computation mode, the I/O load on shared storage is reduced, which can greatly improve the efficiency of astronomical data analysis. The data pre-processing solution using the local computation mode achieves a more than threefold improvement in computational efficiency. [Conclusions] The data layout strategy and pre-processing optimization proposed in this study improve the storage performance of SKA-MWA astronomical data and the computational efficiency of subsequent beamforming. The approach provides high-quality data support for astronomical computation, is generally applicable, and has broad application prospects.

Keywords: data preprocessing, distributed parallelism, storage optimization, local storage
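
The packing-and-compression step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the archive format, the compressor, or how files are grouped, so the gzip-compressed tar archives, the group size of 40 (chosen only to mirror the reported 1/40 file-count reduction), and the names `pack_in_groups` and `demo` are all assumptions made for illustration.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import List, Tuple


def pack_in_groups(src_dir: Path, out_dir: Path, group_size: int = 40) -> List[Path]:
    """Pack every `group_size` files in src_dir into one gzip-compressed tar archive.

    Hypothetical helper: format, compressor, and grouping are assumptions.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # Sort so that the grouping is deterministic across runs.
    files = sorted(p for p in src_dir.iterdir() if p.is_file())
    archives = []
    for start in range(0, len(files), group_size):
        archive = out_dir / f"pack_{start // group_size:05d}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            for path in files[start:start + group_size]:
                # Store only the file name, not the full source path.
                tar.add(path, arcname=path.name)
        archives.append(archive)
    return archives


def demo() -> Tuple[int, int]:
    """Create 80 small dummy files, pack them, and return (file count, archive count)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "raw"
        out = Path(tmp) / "packed"
        src.mkdir()
        for i in range(80):
            # Dummy 1 KiB payloads stand in for real observation files.
            (src / f"chunk_{i:04d}.dat").write_bytes(bytes(1024))
        n_files = len(list(src.iterdir()))
        archives = pack_in_groups(src, out)
        return n_files, len(archives)
```

Calling `demo()` packs 80 dummy files into two archives. Fewer, larger files generally reduce metadata operations on a shared file system, which is the kind of I/O relief the abstract describes; the actual size reduction depends on the data and the compressor used.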