Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (5): 65-87.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.05.006

doi: 10.11871/jfdc.issn.2096-742X.2025.05.006

• Special Issue: New Strength of Domestic Computing Power, Boosting New Development of Scientific Computing Applications •

  • About the authors: ZENG Yan, Ph.D., is an Associate Professor at Hangzhou Dianzi University. Her research interests include distributed and parallel computing, distributed machine learning, and big data processing.
    In this paper, she is responsible for conceptualization, supervision, and methodology design. E-mail: yz@hdu.edu.cn
    WAN Jian, Ph.D., is a Professor at Hangzhou Dianzi University. His research interests include grid computing, service computing, and cloud computing.
    In this paper, he is responsible for project administration, original draft writing, and review and editing of the manuscript. E-mail: wanjian@hdu.edu.cn

FlowAware: A Feature-Aware Automated Model Parallelization Method for AI-for-Science Tasks

ZENG Yan1, WU Baofu1, YI Guangzheng1, HUANG Chengchuang1, QIU Yang1, CHEN Yue1, WAN Jian1,2,*, HU Fan3, JIN Sicong1, LIANG Jiajun1, LI Xin1

  1. Hangzhou Dianzi University, Hangzhou, Zhejiang 310018, China
    2. Zhejiang University of Science and Technology, Hangzhou, Zhejiang 310023, China
    3. Zhejiang Sugon Information Technology Co., Ltd, Hangzhou, Zhejiang 310013, China
  • Received: 2025-02-28 Online: 2025-10-20 Published: 2025-10-23
  • Contact: WAN Jian
  • Supported by:
    National Key Research and Development Program of China (2023YFB3001501); National Natural Science Foundation of China (NSFC) (62302133); Key Research and Development Program of Zhejiang Province (2024C01026); Yangtze River Delta Project (2023ZY1068); Hangzhou Key Research Plan Project (2024SZD1A02); GHfund A (202302019816)


Abstract:

[Objective] This study aims to address the computational inefficiency of AI-for-Science tasks caused by the difficulty of designing and implementing distributed parallel computing strategies for deep learning models. [Methods] We propose FlowAware, an automatic distributed model-parallelization method for AI-for-Science tasks. Built on the AI-for-Science framework JAX, FlowAware analyzes task characteristics together with the operator structure and data-flow properties of deep learning models, and incorporates cluster topology to construct a search space of distributed parallel computing strategies. Guided by load-balancing and communication-optimization objectives, it then searches this space for the optimal strategy for a given AI model. [Results] Comparative experiments on both GPU-like accelerator clusters and GPU clusters show that FlowAware achieves a throughput improvement of up to 7.8× compared to Alpa. [Conclusions] FlowAware provides an efficient distributed parallel strategy-search method for AI models in AI-for-Science tasks and significantly accelerates their computation.

Key words: AI for Science, deep learning, distributed parallel computing
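
To give a rough intuition for the load-balancing and communication objectives described in the abstract, the sketch below brute-forces a contiguous partition of a model's operator chain across devices, scoring each candidate by its maximum per-device load plus cross-cut communication cost. All names and costs here are hypothetical illustrations; this is not FlowAware's actual algorithm, whose search space and cost model are far richer.

```python
from itertools import combinations

def search_strategy(op_costs, comm_costs, num_devices):
    """Toy strategy search: split a chain of operators into
    num_devices contiguous stages, minimizing the sum of the
    heaviest stage's compute load (balance objective) and the
    activation-transfer cost at each cut (communication objective).

    op_costs[i]   : compute cost of operator i
    comm_costs[i] : cost of sending activations from op i to op i+1
    """
    n = len(op_costs)
    best_score, best_stages = float("inf"), None
    # Choose num_devices - 1 cut points between the n operators.
    for cuts in combinations(range(1, n), num_devices - 1):
        bounds = [0, *cuts, n]
        stages = [list(range(bounds[j], bounds[j + 1]))
                  for j in range(num_devices)]
        load = max(sum(op_costs[i] for i in stage) for stage in stages)
        comm = sum(comm_costs[c - 1] for c in cuts)
        score = load + comm
        if score < best_score:
            best_score, best_stages = score, stages
    return best_score, best_stages

# Example: 6 operators split across 3 devices.
score, stages = search_strategy(
    op_costs=[4, 2, 3, 1, 5, 2],
    comm_costs=[1, 1, 2, 1, 1],
    num_devices=3,
)
```

Exhaustive enumeration like this is only feasible for tiny operator chains; a practical system must prune the search space, e.g. by exploiting the model's data-flow structure and the cluster topology, as the abstract indicates.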