Frontiers of Data and Computing ›› 2022, Vol. 4 ›› Issue (3): 66-77.

CSTR: 32002.14.jfdc.CN10-1649/TP.2022.03.005

doi: 10.11871/jfdc.issn.2096-742X.2022.03.005

• Special Issue: Advanced Intelligent Computing Platforms and Applications (Part II) •

Research on the Construction and Optimization of a General Game Playing System Based on Monte Carlo Tree Search

LIANG Sili1, JIANG Guifei1,*, CHEN Taijie2, DENG Yichao3, ZHAN Yufan4, ZHANG Yuzhi1

  1. College of Software, Nankai University, Tianjin 300350, China
    2. Faculty of Engineering, University of Hong Kong, Hong Kong 999077, China
    3. School of Computing, National University of Singapore, Singapore 117417, Singapore
    4. School of Finance, Nankai University, Tianjin 300381, China
  • Received: 2022-02-07 Online: 2022-06-20 Published: 2022-06-20
  • Contact: JIANG Guifei
  • About the authors: LIANG Sili is currently a postgraduate at the College of Software, Nankai University, Tianjin, China. His research focuses on general game playing.
    In this paper, his contributions are the design of the core part of the algorithm and the implementation of the experiments.
    E-mail: 2120210551@mail.nankai.edu.cn|JIANG Guifei, Ph.D., is currently an assistant professor at the College of Software, Nankai University, Tianjin, China. Her research interests include general game playing, algorithmic game theory, and multi-agent systems. She has published more than fifteen high-quality research papers.
    In this paper, her contributions are the design of the overall framework of the paper, and the manuscript writing and revision.
    E-mail: g.jiang@nankai.edu.cn|CHEN Taijie is currently a research assistant at the Department of Computer Science, University of Hong Kong, Hong Kong, China. His research interests include deep learning and reinforcement learning.
    In this paper, his contribution is the implementation of the experiments.|DENG Yichao is currently a postgraduate at the School of Computing, National University of Singapore. His research focuses on artificial intelligence.
    In this paper, his contributions are the design and implementation of the experiments.
    E-mail: e0792453@u.nus.edu|ZHAN Yufan is currently a postgraduate at the School of Finance, Nankai University, Tianjin, China. Her research focuses on fintech.
    In this paper, her contributions are the design and implementation of the general game player.
    E-mail: 2120202478@mail.nankai.edu.cn|ZHANG Yuzhi, Ph.D., is currently a distinguished professor and the dean of the College of Software, Nankai University, Tianjin, China. His research focuses on artificial intelligence.
    In this paper, his contribution is the guidance of the overall organization of the paper.
    E-mail: zyz@nankai.edu.cn
  • Funding: National Key R&D Program of China (2021YFB0300104); National Natural Science Foundation of China (61806102)

Optimizing Monte Carlo Tree Search for General Game Playing

LIANG Sili1, JIANG Guifei1,*, CHEN Taijie2, DENG Yichao3, ZHAN Yufan4, ZHANG Yuzhi1

  1. College of Software, Nankai University, Tianjin 300350, China
    2. Department of Computer Science, University of Hong Kong, Hong Kong 999077, China
    3. School of Computing, National University of Singapore, Singapore 117417, Singapore
    4. School of Finance, Nankai University, Tianjin 300381, China
  • Received: 2022-02-07 Online: 2022-06-20 Published: 2022-06-20
  • Contact: JIANG Guifei

Abstract (Chinese version):

[Background] As a major research area of artificial intelligence, General Game Playing (GGP) aims to build game-playing systems with general intelligence. Given the rules of a game, such systems can successfully play many games, even entirely new ones, without human intervention. [Objective] Unlike specialized game-playing systems, the strategy-generation algorithms used by general game players are not tailored to any particular game; instead, they are general-purpose algorithms that automatically derive strategies from the given game rules. GGP has become an important research area for testing the level of AI, and of artificial general intelligence in particular. How to build an efficient general game player is the central problem of GGP research. [Scope of the literature] Strategy generation is the key technique for building a general game player. The mainstream algorithms currently in use are Monte Carlo Tree Search (MCTS) and its variants. Because these algorithms do not rely on game-specific information, they are widely applied in GGP. However, game-specific information derived from the game rules often plays an important role in building effective decision-making algorithms for that game. [Methods] To exploit this, this paper augments MCTS with a memory structure that stores real-time information gathered during online play, and estimates the value of a game state from the stored values of similar states, thereby improving the accuracy of state evaluation. [Results] Based on this method, we built a general game player and evaluated its performance comprehensively. The experiments show that, compared with the original Monte Carlo player, our player improves significantly in both decision quality and efficiency: in two-player zero-sum turn-based games with symmetric information it maintains a winning rate above 55%, and its performance improves markedly as the game size grows, achieving an absolute advantage, i.e., a 100% winning rate, in large-scale games such as Connect 5 and Breakthrough. [Conclusions] This shows that the proposed method can effectively improve the performance of MCTS by exploiting game-specific information.

Keywords: general game playing, Monte Carlo tree search, algorithmic game theory, multi-agent systems

Abstract:

[Background] General Game Playing (GGP) is concerned with creating intelligent agents that understand the rules of previously unknown games and learn to play these games well without human intervention. [Objective] Unlike specialized systems, a general game player cannot rely on algorithms designed in advance for specific games. Such a system rather requires a form of general intelligence that enables it to autonomously generate strategies based on the given game rules. After a decade of development, GGP provides an important testbed for AI, especially artificial general intelligence. The main problem of GGP is how to build an efficient general game player, and strategy generation is the core technique for doing so. [Scope of the literature] The main algorithms used in previous successful general game players are the Monte Carlo Tree Search (MCTS) algorithm and its variants. [Methods] To improve MCTS during the online real-time search, this paper incorporates it with a memory structure, where each entry contains information about a particular state. This memory is used to generate an approximate value estimation by combining the estimations of similar states. [Results] Based on this method, we implement and evaluate a general game player. The experimental results show that it outperforms the original Monte Carlo player in a variety of games. In particular, in two-player zero-sum turn-based games with symmetric information, the general game player achieves a winning rate of more than 55%, and its performance improves significantly with increasing game size, even reaching a 100% winning rate in large-scale games such as Connect 5 and Breakthrough. [Conclusions] These results confirm the feasibility of using game-dependent information to improve the performance of MCTS.
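The memory-augmented evaluation described in [Methods] can be illustrated with a minimal sketch. Note that the class name, the table layout, and the way "similar states" are supplied (as caller-provided keys) are illustrative assumptions for exposition, not the paper's actual implementation:

```python
# Illustrative sketch of memory-augmented state evaluation for MCTS.
# All names and the similarity scheme are assumptions, not the paper's API.

class StateMemory:
    """Stores running reward statistics for states visited during online play."""

    def __init__(self, prior=0.5):
        self.table = {}      # state key -> (reward_sum, visit_count)
        self.prior = prior   # fallback value when no information is available

    def update(self, key, reward):
        """Record one playout reward for a state (e.g. during backpropagation)."""
        s, n = self.table.get(key, (0.0, 0))
        self.table[key] = (s + reward, n + 1)

    def estimate(self, key, similar_keys=()):
        """Estimate a state's value by pooling its own statistics with
        those of similar states stored in memory."""
        total, count = 0.0, 0
        for k in (key, *similar_keys):
            if k in self.table:
                s, n = self.table[k]
                total += s
                count += n
        return total / count if count else self.prior
```

In an MCTS loop, each visited state's entry would be updated with the simulation reward during backpropagation, and the pooled estimate could replace or be blended with a node's plain average during selection, sharpening the evaluation of rarely visited states.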

Key words: general game playing, Monte Carlo tree search, algorithmic game theory, multi-agent systems