Frontiers of Data and Computing ›› 2025, Vol. 7 ›› Issue (5): 65-87.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.05.006

doi: 10.11871/jfdc.issn.2096-742X.2025.05.006

• Special Issue: New Domestic Computing Power Empowers the Development of Scientific Computing Applications •

FlowAware: A Feature-Aware Automated Model Parallelization Method for AI-for-Science Tasks

ZENG Yan1, WU Baofu1, YI Guangzheng1, HUANG Chengchuang1, QIU Yang1, CHEN Yue1, WAN Jian1,2,*, HU Fan3, JIN Sicong1, LIANG Jiajun1, LI Xin1

  1. Hangzhou Dianzi University, Hangzhou, Zhejiang 310018, China
    2. Zhejiang University of Science and Technology, Hangzhou, Zhejiang 310023, China
    3. Zhejiang Sugon Information Technology Co., Ltd, Hangzhou, Zhejiang 310013, China
  • Received: 2025-02-28; Online: 2025-10-20; Published: 2025-10-23
  • Contact: WAN Jian, E-mail: yz@hdu.edu.cn; wanjian@hdu.edu.cn
  • Supported by:
    National Key Research and Development Program of China (2023YFB3001501); National Natural Science Foundation of China (NSFC) (62302133); Key Research and Development Program of Zhejiang Province (2024C01026); Yangtze River Delta Project (2023Z-Y1068); Hangzhou Key Research Plan Project (2024SZD1A02); GHfund A (202302019816)

Abstract:

[Objective] This study addresses the inefficiency of AI-for-Science tasks that stems from the difficulty of designing and implementing distributed parallel computing strategies for deep learning models, and from the inefficient execution of those strategies. [Methods] We propose FlowAware, an automatic distributed parallelization method for AI-for-Science tasks. Built on the AI-for-Science framework JAX, FlowAware thoroughly analyzes the task characteristics, operator structures, and data-flow properties of deep learning models and, by incorporating cluster topology information, constructs a search space of distributed parallel computing strategies. Guided by load-balancing and communication-optimization objectives, it then automatically identifies the optimal distributed parallel computing strategy for an AI model. [Results] Comparative experiments conducted on both GPU-like accelerator clusters and GPU clusters demonstrate that FlowAware achieves a throughput improvement of up to 7.8× compared to Alpa. [Conclusions] FlowAware effectively improves the search efficiency of distributed parallel computing strategies for AI models in scientific computing tasks and significantly enhances their computational performance.
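The abstract describes, at a high level, what FlowAware searches over: mappings of a model's tensors and operators onto a device cluster. The abstract itself contains no code, so the sketch below is only an illustrative assumption of what a single point in such a search space looks like, written against JAX's public sharding API (Mesh, NamedSharding, PartitionSpec). The mesh shape, tensor shapes, and the layer function are hypothetical examples, not the authors' implementation.

    # Illustrative sketch only (not the authors' code): one candidate
    # distributed parallel strategy expressed with JAX's sharding API.
    import numpy as np
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    # Arrange the visible devices into a 2D mesh: one axis for data
    # parallelism ("data"), one for tensor/model parallelism ("model").
    # The mesh shape itself is one dimension a strategy search explores.
    devs = np.array(jax.devices())
    mesh = Mesh(devs.reshape(-1, 1), axis_names=("data", "model"))

    @jax.jit
    def layer(x, w):
        # A single dense layer; the input shardings chosen below determine
        # which collectives XLA must insert, i.e., the strategy's
        # communication cost.
        return jnp.tanh(x @ w)

    # Shard the batch dimension over the "data" axis and the weight's output
    # dimension over the "model" axis (batch size is assumed divisible by
    # the data-axis size).
    x = jax.device_put(jnp.ones((8, 256)), NamedSharding(mesh, P("data", None)))
    w = jax.device_put(jnp.ones((256, 512)), NamedSharding(mesh, P(None, "model")))

    y = layer(x, w)    # runs distributed; XLA inserts the needed communication
    print(y.sharding)  # the output inherits a sharding from the inputs

In this framing, FlowAware's role as described in the abstract is to choose the mesh shape and the per-tensor partition specifications automatically, under load-balancing and communication-cost objectives, rather than requiring them to be written by hand.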

Key words: AI for Science, deep learning, distributed parallel computing