Multi-Layer Game System of Power Supply Chain Based on Reinforcement Learning

doi:10.11871/jfdc.issn.2096-742X.2026.03.014

Abstract

[Objective] This study addresses the multi-level game optimization problem in the coal and power supply chain, using China's provincial coal-fired power supply chain as the setting. Traditional rule-based decision-making and game-theoretic equilibrium solutions struggle with dynamic, high-dimensional action spaces and do not capture participants' ability to learn and adapt. To overcome these limitations, the study constructs a multi-agent model comprising provincial operators, municipal power plants, and coal mines. [Methods] The model uses a Stackelberg game framework for hierarchical coordination, a Nash equilibrium to model intra-level competition, and the TD3BC reinforcement learning algorithm to optimize agent decisions. A uniform-price auction market clearing mechanism matches supply and demand. [Results] Among three power plant game objectives (profit protection, pure cost optimization, and market-based bidding), market-based bidding delivers the best overall system efficiency and supply-demand balance. In addition, the TD3BC algorithm significantly improves total system profit, market efficiency, and stability compared with traditional rule-based decision-making. [Limitations] The study uses simplified market parameters and does not consider factors present in real markets, such as transportation topology, long-term contracts, and unit constraints. [Conclusions] Combining reinforcement learning with multi-layer game theory can effectively optimize decision-making in the power supply chain and provides theoretical support for the integrated operation of coal and electricity. The market-based bidding strategy is better suited to scenarios that pursue system efficiency and profit growth.
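To make the uniform-price clearing step named in the abstract concrete, the following is a minimal Python sketch of one common form of such an auction: offers are dispatched in ascending price order until demand is met, and every accepted offer is paid the price of the last (marginal) accepted offer. The `Bid` type, the function name `clear_uniform_price`, the merit-order rule, and the sample numbers are all assumptions made for illustration; the abstract does not specify the paper's actual clearing rules.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    plant: str       # hypothetical plant identifier
    quantity: float  # energy offered
    price: float     # offer price per unit of energy


def clear_uniform_price(bids, demand):
    """Dispatch the cheapest offers first; the marginal offer sets one price for all.

    Returns (clearing_price, dispatch_by_plant, unmet_demand).
    clearing_price is None when nothing is dispatched.
    """
    dispatch = {}
    remaining = demand
    clearing_price = None
    for bid in sorted(bids, key=lambda b: b.price):
        if remaining <= 0:
            break
        accepted = min(bid.quantity, remaining)
        dispatch[bid.plant] = dispatch.get(bid.plant, 0.0) + accepted
        remaining -= accepted
        clearing_price = bid.price  # last accepted offer is the marginal one
    return clearing_price, dispatch, max(remaining, 0.0)


# Illustrative numbers only (not taken from the paper).
bids = [Bid("A", 100.0, 0.35), Bid("B", 80.0, 0.40), Bid("C", 120.0, 0.45)]
price, dispatch, unmet = clear_uniform_price(bids, demand=150.0)
print(price, dispatch, unmet)  # 0.4 {'A': 100.0, 'B': 50.0} 0.0
```

In this sketch plant A is paid 0.40 even though it offered 0.35, which is the defining feature of uniform pricing as opposed to pay-as-bid settlement.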

Key words: power supply chain, multi-layer game, reinforcement learning, market clearing mechanism, TD3BC algorithm, Stackelberg game, supply and demand balance

NIU Xinxin, LIU Yuxuan, WANG Yijing, YOU Bo, LI Xueen. Multi-Layer Game System of Power Supply Chain Based on Reinforcement Learning[J]. Frontiers of Data and Computing, 2026, 8(3): 166-180.

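Metric	Game mode 1 (profit protection)	Game mode 2 (pure cost optimization)	Game mode 3 (market-based bidding)	Comparison
Power generation (100 million kWh)	473.225	443.308	499.985	Mode 3 is highest, close to full load; Mode 2 is lowest, possibly because cost optimization suppresses output.
Average electricity price (CNY/kWh)	0.5	0.392	0.428	Mode 1 is highest, reflecting price support under the protection mechanism; Mode 2 is lowest, as cost orientation pushes the price down.
Self-produced coal output (10,000 t)	1,295.14	1,215.24	1,092.44	Mode 1 is highest, as the coordination orientation promotes coal production; Mode 3 is lowest, as supply is optimized under competition.
Self-produced coal selling price (CNY/t)	815.8	690.9	894.02	Mode 3 is highest, as market competition pushes the coal price up; Mode 2 is lowest, as cost optimization pushes the selling price down.
Marginal profit per tonne of coal (CNY/t)	170.22	76.43	248.55	Mode 3 is highest, as competition rewards high profit; Mode 2 is lowest, as pure cost orientation sacrifices profit margin.
Provincial coordination reward (10,000 CNY)	754,410.86	2,710,184.15	2,618,031.13	Modes 2 and 3 are far higher than Mode 1; cost and competition orientations earn provincial incentives more readily.
Total system profit (10,000 CNY)	2,311,260.44	3,589,141.4	4,017,753.98	Mode 3 is highest, as competition drives overall efficiency; Mode 1 is lowest, as the protection mechanism disperses profit.
Supply-demand balance	0.959	1.019	1.08	Mode 3 is highest (slight oversupply) and Mode 1 is lowest (slight undersupply), reflecting output expansion under competition.
Coordination efficiency	0.326	0.755	0.652	Mode 2 is highest, as cost optimization improves coordination; Mode 1 is lowest, as the protection orientation makes coordination harder.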

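Metric	Game mode 1	Game mode 2	Game mode 3	Comparison
Average provincial reward	809,455.54	2,411,857.5	2,387,996.1	The provincial level gains most in Modes 2 and 3; cost and competition orientations amplify the upper-level reward.
Average power plant reward	254,147.7	149,595.17	230,678.54	Mode 1 is highest, as the protection mechanism improves plant stability; Mode 2 is lowest, owing to heavy cost pressure.
Average coal mine reward	28,262.91	25,059.59	35,137.85	Mode 3 is highest, as coal mines earn more under competition; Mode 2 is lowest.
Average power generation (100 million kWh)	448.187	447.206	499.984	Mode 3 is close to the capacity limit, as competition drives expansion.
Average coal production (10,000 t)	1,305.51	1,253.66	1,204.5	Mode 1 is highest, as coordination promotes production.
Average electricity price (CNY/kWh)	0.5	0.392	0.428	Mode 1 is highest, as protection supports the price.
Average coal price (CNY/t)	767.59	755.95	806.96	Mode 3 is highest, as competition pushes the coal price up.
System stability (lower is better)	0.039	0.028	0	Mode 3 is the most stable, as the competitive mechanism converges quickly; Mode 1 fluctuates more.
Electricity market HHI	0.204	0.205	0.204	No significant difference; concentration is low in all modes (strong competition).
Coal market HHI	0.222	0.194	0.272	Mode 3 is highest (slightly concentrated); Mode 2 is lowest (more dispersed).
Electricity price volatility	0.001	0.001	0.001	Low in all modes; the slight fluctuation in Mode 3 reflects competitive dynamics.
Coal price volatility	0.084	0.069	0.085	Mode 2 is lowest, as cost optimization stabilizes the price.
Market efficiency index (higher is better)	0.923	0.936	0.921	Mode 2 is highest, as cost orientation allocates resources efficiently.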
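
For readers unfamiliar with the HHI rows above, the Herfindahl-Hirschman Index is the sum of squared market shares, here on a 0 to 1 scale. The short Python sketch below shows the standard calculation; the function name `hhi` and the sample quantities are hypothetical, and the table does not state which quantity (output, sales, or capacity) the authors used to form the shares.

```python
def hhi(quantities):
    """Herfindahl-Hirschman Index on a 0-1 scale: sum of squared market shares."""
    total = sum(quantities)
    if total <= 0:
        raise ValueError("total quantity must be positive")
    return sum((q / total) ** 2 for q in quantities)


# Illustrative quantities only (not taken from the paper).
equal_four = hhi([25.0, 25.0, 25.0, 25.0])  # four equal participants
monopoly = hhi([40.0, 0.0, 0.0])            # one participant holds everything
print(equal_four, monopoly)  # 0.25 1.0
```

With n equal participants the index equals 1/n, so lower values indicate a less concentrated market.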