[1] |
PATHAK J, SUBRAMANIAN S, HARRINGTON P, et al. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators[J/OL]. arXiv, 2022. arXiv:2202.11214. https://arxiv.org/abs/2202.11214.
|
[2] |
LAM R, SANCHEZ-GONZALEZ A, WILLSON M, et al. Learning skillful medium-range global weather forecasting[J]. Science, 2023, 382(6677): 1416-1421.
|
[3] |
BI K, XIE L, ZHANG H, et al. Accurate medium-range global weather forecasting with 3D neural networks[J]. Nature, 2023, 619(7970): 533-538.
|
[4] |
CHEN K, HAN T, GONG J, et al. FengWu: Pushing the Skillful Global Medium-range Weather Forecast beyond 10 Days Lead[J/OL]. arXiv, 2023. arXiv:2304.02948. https://arxiv.org/abs/2304.02948.
|
[5] |
CHEN L, ZHONG X, ZHANG F, et al. FuXi: a cascade machine learning forecasting system for 15-day global weather forecast[J]. npj Climate and Atmospheric Science, 2023, 6(1): 190.
|
[6] |
ECMWF. ECMWF Annual Report 2022[M]. Reading: ECMWF Publications, 2022: 43.
|
[7] |
RONNEBERGER O, FISCHER P, BROX T. U-Net: Convolutional Networks for Biomedical Image Segmentation[C]. Medical Image Computing and Computer-Assisted Intervention-MICCAI 2015. Cham: Springer, 2015: 234-241.
|
[8] |
LIU Z, LIN Y, CAO Y, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows[C]. IEEE International Conference on Computer Vision (ICCV). Montreal: IEEE, 2021: 9992-10002.
|
[9] |
SCARSELLI F, GORI M, TSOI AC, et al. The Graph Neural Network Model[J]. IEEE Transactions on Neural Networks, 2009, 20(1): 61-80.
|
[10] |
BROWN TB, MANN B, RYDER N, et al. Language Models are Few-Shot Learners[J/OL]. arXiv, 2020. arXiv:2005.14165. https://arxiv.org/abs/2005.14165.
|
[11] |
KAPLAN J, MCCANDLISH S, HENIGHAN T, et al. Scaling Laws for Neural Language Models[J/OL]. arXiv, 2020. arXiv:2001.08361. https://arxiv.org/abs/2001.08361.
|
[12] |
ZAHEER M, GURUGANESH G, DUBEY A, et al. Big Bird: Transformers for Longer Sequences[C]. Advances in Neural Information Processing Systems (NeurIPS). Virtual: NeurIPS Foundation, 2020: 12.
|
[13] |
CHILD R, GRAY S, RADFORD A, et al. Generating Long Sequences with Sparse Transformers[J/OL]. arXiv, 2019. arXiv:1904.10509. https://arxiv.org/abs/1904.10509.
|
[14] |
RAJBHANDARI S, RASLEY J, RUWASE O, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models[C]. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2020). Atlanta: IEEE, 2020: 1-24.
|
[15] |
BAUER P, THORPE A, BRUNET G. The Quiet Revolution of Numerical Weather Prediction[J]. Nature, 2015, 525(7567): 47-55.
|
[16] |
SHOEYBI M, PATWARY M, PURI R, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism[J/OL]. arXiv, 2019. arXiv:1909.08053. https://arxiv.org/abs/1909.08053.
|
[17] |
NARAYANAN D, SHOEYBI M, CASPER J, et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM[C]. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2021). St. Louis: IEEE, 2021: 1-15.
|
[18] |
POPE R, DOUGLAS S, CHOWDHERY A, et al. Efficiently Scaling Transformer Inference[J/OL]. arXiv, 2022. arXiv:2211.05102. https://arxiv.org/abs/2211.05102.
|
[19] |
QI P, WAN X, HUANG G, et al. Zero Bubble (Almost) Pipeline Parallelism[C]. 12th International Conference on Learning Representations (ICLR 2024). Vienna: ICLR, 2024: 1-19.
|
[20] |
AMINABADI RY, RAJBHANDARI S, AWAN AA, et al. DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale[C]. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2022). Dallas: IEEE, 2022: 1-15.
|
[21] |
DENG Q, LU P, ZHAO S, et al. U-Net: A Deep-Learning Method for Improving Summer Precipitation Forecasts in China[J]. Atmospheric and Oceanic Science Letters, 2023, 16(4): 100322.
|
[22] |
TREBING K, STAŃCZYK T, MEHRKANOON S. SmaAt-UNet: Precipitation Nowcasting Using a Small Attention-UNet Architecture[J]. Pattern Recognition Letters, 2021, 145: 178-186.
|
[23] |
TISHBY N, PEREIRA FC, BIALEK W. The Information Bottleneck Method[J/OL]. arXiv, 2000. arXiv:physics/0004057. https://arxiv.org/abs/physics/0004057.
|
[24] |
SOHONI NS, ABERGER CR, LESZCZYNSKI M, et al. Low-Memory Neural Network Training: A Technical Report[J/OL]. arXiv, 2019. arXiv:1904.10631. https://arxiv.org/abs/1904.10631.
|
[25] |
HUANG Y, CHENG Y, BAPNA A, et al. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism[C]. Advances in Neural Information Processing Systems 32 (NeurIPS 2019). Vancouver: NeurIPS Foundation, 2019: 103-113.
|
[26] |
FOLEY D, DANSKIN J. Ultra-Performance Pascal GPU and NVLink Interconnect[J]. IEEE Micro, 2017, 37(2): 7-17.
|
[27] |
HAN Y, ZHANG Q, LI S, et al. Latency-Aware Unified Dynamic Networks for Efficient Image Recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 7760-7774.
|
[28] |
Pangu-Weather[EB/OL]. GitHub. https://github.com/198808xc/Pangu-Weather.
|
[29] |
CHOWDHERY A, NARANG S, DEVLIN J, et al. PaLM: Scaling Language Modeling with Pathways[J/OL]. arXiv, 2022. arXiv:2204.02311. https://arxiv.org/abs/2204.02311.
|