Frontiers of Data & Computing ›› 2025, Vol. 7 ›› Issue (4): 182-195.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.04.015

doi: 10.11871/jfdc.issn.2096-742X.2025.04.015

• Technology and Application •

Computation Resource Assessment Methodology for Large Meteorological AI Models

SHI Yiheng, WANG Qiyi, SUN Jing, ZHAO Chunyan*, DENG Shuai, WU Peng, YAO Wang

  1. National Meteorological Information Center, Beijing 100081, China
  • Received: 2025-03-20 Online: 2025-08-20 Published: 2025-08-21
  • Corresponding author: ZHAO Chunyan
  • About the authors:
    SHI Yiheng, Ph.D., is a full-time engineer at the Advanced Computing Division, National Meteorological Information Center. Her main research interests include AI-driven meteorological applications and intelligent computing infrastructure support.
    In this paper, she is responsible for methodology design and manuscript preparation.
    E-mail: shiyiheng@cma.gov.cn
    ZHAO Chunyan is a professor-level senior engineer at the Advanced Computing Division, National Meteorological Information Center. Her research interests include AI-driven meteorological applications and their supporting technologies.
    In this paper, she is responsible for analytical framework design and manuscript revision.
    E-mail: zhaocy@cma.gov.cn
  • Funding:
    National Natural Science Foundation of China project: Top-level design of research on multi-scale natural-social system interaction model coupling, data monitoring support, and decision support toward achieving China's optimal carbon-neutrality pathway (423412021008705); Guanghe Fund project: Practice of a short-term nowcasting precipitation model based on the TrajGRU radar echo extrapolation algorithm on domestic accelerators (202302034726); Youth Science and Technology Fund of the National Meteorological Information Center: Research on accelerator evaluation techniques for the meteorological domain (NMICQJ11-202407)

Abstract:

[Objective] In recent years, large meteorological AI models have shown the potential to surpass traditional numerical methods in weather forecasting. However, training and deploying them at scale poses severe computational resource challenges. Existing resource assessment methods are designed primarily for large models in natural language processing (NLP) and struggle to accommodate both the dynamic computational demands of meteorological tasks, such as their spatiotemporal multidimensionality, and the distinctive architectures of meteorological models. The result is low resource utilization and high computing costs. To address this, the study builds a computational resource assessment framework for large meteorological models. By quantifying parameter count, computational load, memory usage, and communication overhead, the framework provides a theoretical basis for hardware configuration and resource allocation, with the aim of reducing computational costs and ensuring efficient, stable development and operation of large meteorological models. [Methods] We propose the Multi-Granularity Computing Resource Joint Evaluation Framework (MGCRJEF). It builds separate modules for parameter counting, spatiotemporal-aware FLOPs assessment, memory usage modeling, and distributed communication analysis, and it incorporates the spatiotemporal heterogeneity of meteorological data to evaluate the core hardware resource requirements of large meteorological models. [Results] The Pangu-Weather model, which is based on the Swin-Transformer architecture, is analyzed as a case study. The framework reveals the model's resource demand characteristics: for instance, memory usage increases significantly with higher input resolutions, and communication overhead becomes a major performance bottleneck in multi-node training. These findings provide practical guidance for resource optimization. In addition, the resource demands estimated by the framework are largely consistent with actual consumption, which supports the framework's soundness and effectiveness. [Conclusions] MGCRJEF provides a standardized approach to assessing the resource demands of large meteorological models and supports resource planning in intelligent computing hardware environments. It offers a theoretical basis and a practical reference for model deployment and hardware optimization in meteorology.

Key words: large meteorological AI models, multi-granularity computing resource joint evaluation framework, resource optimization, intelligent computing
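
The four quantities named in the abstract (parameter count, FLOPs, memory usage, and communication overhead) can be illustrated with a minimal back-of-the-envelope sketch. This is not the MGCRJEF formulation, which this page does not give. It assumes a generic windowed-attention Transformer block with an MLP expansion ratio of 4, ignores biases, normalization layers, and activation memory, assumes an Adam-style optimizer with 32-bit state, and assumes gradients are synchronized with a ring all-reduce. The names `BlockConfig`, `param_count`, `forward_flops`, `training_state_bytes`, and `ring_allreduce_bytes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BlockConfig:
    """Hypothetical description of a stack of windowed-attention blocks."""
    hidden_dim: int      # channel width d of each token
    tokens: int          # tokens per sample after patch embedding
    window_tokens: int   # tokens inside one attention window
    layers: int = 1      # number of identical blocks
    mlp_ratio: int = 4   # MLP hidden width = mlp_ratio * hidden_dim


def params_per_layer(cfg: BlockConfig) -> int:
    # Q, K, V and output projections: 4 matrices of size d x d.
    attention = 4 * cfg.hidden_dim ** 2
    # Two MLP matrices: d x (r*d) and (r*d) x d.
    mlp = 2 * cfg.mlp_ratio * cfg.hidden_dim ** 2
    return attention + mlp


def param_count(cfg: BlockConfig) -> int:
    return cfg.layers * params_per_layer(cfg)


def forward_flops(cfg: BlockConfig) -> int:
    # Dense layers: one multiply-add (2 FLOPs) per weight per token.
    dense = 2 * cfg.tokens * params_per_layer(cfg)
    # Windowed attention: QK^T and attention-times-V each cost
    # 2 * tokens * window_tokens * d FLOPs.
    attention = 4 * cfg.tokens * cfg.window_tokens * cfg.hidden_dim
    return cfg.layers * (dense + attention)


def training_state_bytes(params: int, bytes_per_value: int = 4) -> int:
    # Weights + gradients + two Adam moment buffers (activations excluded).
    return 4 * params * bytes_per_value


def ring_allreduce_bytes(params: int, workers: int,
                         bytes_per_value: int = 4) -> float:
    # Each worker sends 2 * (N - 1) / N times the gradient size per step.
    if workers < 2:
        return 0.0
    return 2 * (workers - 1) / workers * params * bytes_per_value


if __name__ == "__main__":
    cfg = BlockConfig(hidden_dim=192, tokens=1024, window_tokens=64, layers=2)
    n = param_count(cfg)
    print("parameters:", n)
    print("forward FLOPs per sample:", forward_flops(cfg))
    print("training state bytes:", training_state_bytes(n))
    print("all-reduce bytes per worker (4 workers):", ring_allreduce_bytes(n, 4))
```

The configuration values in the example are illustrative only and are not taken from Pangu-Weather. Because the dense term scales linearly with `tokens`, raising the input resolution raises FLOPs proportionally in this sketch; the activation memory that the abstract highlights for high-resolution inputs would need a separate term.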