Implementation of Parallel FMM Based on Charm++

doi:10.11871/jfdc.issn.2096-742X.2020.03.009

Abstract

[Objective] This paper implements a parallel fast multipole method (FMM) on Charm++ to take advantage of the over-decomposition and object migratability that Charm++ provides. [Methods] The implementation is derived by analyzing the communication pattern, separating the parallel tasks, and converting synchronous communication into asynchronous communication. Structured Dagger (SDAG) is used to implement the basic communication calls, and the LPT (longest processing time first) approximation strategy is adopted for dynamic load balancing. [Results] The Charm++ implementation achieves the same accuracy as the MPI implementation and runs faster than it at the thousand-core scale. For an unbalanced particle distribution, over-decomposition combined with the load-balancing strategy reduces execution time by 10%. [Limitations] The current implementation does not use the shared-memory features of Charm++ and needs further optimization; the load-balancing strategy is also simple. [Conclusions] This paper gives a relatively general method for converting MPI-style programs into Charm++-style programs and shows that over-decomposition and load balancing can accelerate FMM execution.
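
The LPT strategy named in the abstract can be illustrated with a minimal sketch. This is the generic longest-processing-time-first greedy rule, not the authors' Charm++ load balancer: the function name `lpt_assign`, its parameters, and the sample loads are hypothetical and chosen only for illustration. Objects (standing in for chares) are sorted by decreasing measured load, and each is placed on the processor that currently carries the least load.

```python
import heapq


def lpt_assign(loads, num_pes):
    """Greedy LPT placement (illustrative sketch, not the paper's code).

    loads   : measured load of each migratable object (hypothetical input)
    num_pes : number of processors to place the objects on
    Returns (assignment, pe_loads), where assignment[i] is the processor
    chosen for object i and pe_loads[p] is the total load placed on p.
    """
    if num_pes <= 0:
        raise ValueError("num_pes must be positive")

    # Min-heap of (current load, processor id): the least-loaded
    # processor is always at the top.
    heap = [(0.0, pe) for pe in range(num_pes)]
    heapq.heapify(heap)

    assignment = [None] * len(loads)
    # Visit objects from heaviest to lightest.
    order = sorted(range(len(loads)), key=lambda i: loads[i], reverse=True)
    for idx in order:
        pe_load, pe = heapq.heappop(heap)
        assignment[idx] = pe
        heapq.heappush(heap, (pe_load + loads[idx], pe))

    pe_loads = [0.0] * num_pes
    for idx, pe in enumerate(assignment):
        pe_loads[pe] += loads[idx]
    return assignment, pe_loads


if __name__ == "__main__":
    # Six objects on two processors: over-decomposition means there are
    # more objects than processors, which gives the rule room to balance.
    assignment, pe_loads = lpt_assign([7, 5, 4, 3, 3, 2], num_pes=2)
    print(assignment, pe_loads)
```

Because the rule only moves whole objects, its balance improves as the number of objects per processor grows, which is one way to read the abstract's pairing of over-decomposition with load balancing.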

Key words: Charm++, FMM, load balancing, over-decomposition

Ding Lei, Wang Wu, Jiang Jinrong, Zhao Lian. Implementation of Parallel FMM Based on Charm++[J]. Frontiers of Data and Computing, 2020, 2(3): 101-112.

Total number of chares	Time (s)	Standard deviation	Range	Mean
64	640.53	527032	2755781	10240000
128	591.73	85038	432258	10240000
256	576.40	112005	563038	10240000
512	585.27	164129	593829	10240000
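
The rows above can be summarized with a short calculation. The sketch below is only an aid to reading the table: it takes the 64-chare row as the baseline for the time comparison (an assumption), and it reports the ratio of the third numeric column to the last one as a unitless spread indicator. The table does not state which quantity those statistics describe, so no interpretation beyond the ratio is attached. The names `ROWS`, `time_reduction`, and `summarize` are hypothetical.

```python
# Rows copied from the table above:
# (chares, time in seconds, standard deviation, range, mean)
ROWS = [
    (64, 640.53, 527032, 2755781, 10240000),
    (128, 591.73, 85038, 432258, 10240000),
    (256, 576.40, 112005, 563038, 10240000),
    (512, 585.27, 164129, 593829, 10240000),
]


def time_reduction(baseline, t):
    """Fraction by which time t is shorter than the baseline time."""
    return (baseline - t) / baseline


def summarize(rows):
    """Per-row time reduction vs. the first row and std/mean ratio, in percent."""
    baseline = rows[0][1]  # assumption: the first row is the reference run
    return [
        {
            "chares": chares,
            "time_reduction_pct": 100.0 * time_reduction(baseline, t),
            "std_over_mean_pct": 100.0 * std / mean,
        }
        for chares, t, std, _range, mean in rows
    ]


if __name__ == "__main__":
    for row in summarize(ROWS):
        print(
            f"{row['chares']:>4} chares: "
            f"time reduction {row['time_reduction_pct']:.1f}%, "
            f"std/mean {row['std_over_mean_pct']:.2f}%"
        )
```

On these numbers the 256-chare run is the fastest, about 10% shorter than the 64-chare run, while the smallest spread occurs at 128 chares. Whether this comparison is the one behind the 10% figure quoted in the abstract is not stated in the source.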