Frontiers of Data and Computing ›› 2022, Vol. 4 ›› Issue (3): 66-77.

CSTR: 32002.14.jfdc.CN10-1649/TP.2022.03.005

doi: 10.11871/jfdc.issn.2096-742X.2022.03.005

• Special Issue: Advanced Intelligent Computing Platforms and Applications (Part II) •

Research on the Construction and Optimization of a General Game Playing System Based on Monte Carlo Tree Search

LIANG Sili1, JIANG Guifei1,*, CHEN Taijie2, DENG Yichao3, ZHAN Yufan4, ZHANG Yuzhi1

  1. College of Software, Nankai University, Tianjin 300350, China
    2. Faculty of Engineering, University of Hong Kong, Hong Kong 999077, China
    3. School of Computing, National University of Singapore, Singapore 117417, Singapore
    4. School of Finance, Nankai University, Tianjin 300381, China
  • Received: 2022-02-07 Online: 2022-06-20 Published: 2022-06-20
  • Contact: JIANG Guifei
  • About the authors: LIANG Sili is currently a postgraduate at the College of Software, Nankai University, Tianjin, China. His research focuses on general game playing.
    In this paper, his contributions are the design of the core part of the algorithm and the implementation of the experiments.
    E-mail: 2120210551@mail.nankai.edu.cn|JIANG Guifei, Ph.D., is currently an assistant professor at the College of Software, Nankai University, Tianjin, China. Her research interests include general game playing, algorithmic game theory, and multi-agent systems. She has published more than fifteen high-quality research papers.
    In this paper, her contributions are the design of the overall framework of the paper, and the manuscript writing and revision.
    E-mail: g.jiang@nankai.edu.cn|CHEN Taijie is currently a research assistant at the Department of Computer Science, University of Hong Kong, Hong Kong, China. His research interests include deep learning and reinforcement learning.
    In this paper, his contribution is the implementation of the experiments.|DENG Yichao is currently a postgraduate at the School of Computing, National University of Singapore. His research focuses on artificial intelligence.
    In this paper, his contributions are the design and implementation of the experiments.
    E-mail: e0792453@u.nus.edu|ZHAN Yufan is currently a postgraduate at the School of Finance, Nankai University, Tianjin, China. Her research focuses on fintech.
    In this paper, her contributions are the design and implementation of the general game player.
    E-mail: 2120202478@mail.nankai.edu.cn|ZHANG Yuzhi, Ph.D., is currently a distinguished professor and the dean of the College of Software, Nankai University, Tianjin, China. His research focuses on artificial intelligence.
    In this paper, his contribution is the guidance of the overall organization of the paper.
    E-mail: zyz@nankai.edu.cn
  • Funding: National Key R&D Program of China (2021YFB0300104); National Natural Science Foundation of China (61806102)

Optimizing Monte Carlo Tree Search for General Game Playing

LIANG Sili1, JIANG Guifei1,*, CHEN Taijie2, DENG Yichao3, ZHAN Yufan4, ZHANG Yuzhi1

  1. College of Software, Nankai University, Tianjin 300350, China
    2. Department of Computer Science, University of Hong Kong, Hong Kong 999077, China
    3. School of Computing, National University of Singapore, Singapore 117417, Singapore
    4. School of Finance, Nankai University, Tianjin 300381, China
  • Received: 2022-02-07 Online: 2022-06-20 Published: 2022-06-20
  • Contact: JIANG Guifei

Abstract (Chinese version):

[Background] As a major research area of artificial intelligence, General Game Playing (GGP) aims to build game-playing systems with general intelligence. Given the rules of a game, such systems can successfully play many games, even entirely new ones, without human intervention. [Objective] Unlike specialized game-playing systems, the strategy-generation algorithms used by general game players are not tailored to any particular game; instead, they are general-purpose algorithms that automatically derive strategies from the given game rules. GGP has become an important research area for testing the level of AI, and of artificial general intelligence in particular. How to build an efficient general game player is the central problem of GGP research. [Scope of the literature] Strategy generation is the key technique for building a general game player. The mainstream algorithms currently in use are Monte Carlo Tree Search (MCTS) and its variants. Because these algorithms do not rely on game-specific information, they are widely applied in GGP. However, game-specific information derived from the game rules often plays an important role in building effective decision-making algorithms for that game. [Methods] To exploit this, this paper augments MCTS with a memory structure that stores real-time information gathered during online play, and estimates the value of a game state from the stored values of similar states, thereby improving the accuracy of state evaluation. [Results] Based on this method, we built a general game player and evaluated its performance comprehensively. The experiments show that, compared with the original Monte Carlo player, our player improves significantly in both decision quality and efficiency: in two-player zero-sum turn-based games with symmetric information it maintains a winning rate above 55%, and its performance improves markedly as the game size grows, achieving an absolute advantage, i.e., a 100% winning rate, in large-scale games such as Connect 5 and Breakthrough. [Conclusions] This shows that the proposed method can effectively improve the performance of MCTS by exploiting game-specific information.

Keywords: general game playing, Monte Carlo tree search, algorithmic game theory, multi-agent systems

Abstract:

[Background] General Game Playing (GGP) is concerned with creating intelligent agents that understand the rules of previously unknown games and learn to play these games well without human intervention. [Objective] Unlike specialized systems, a general game player cannot rely on algorithms designed in advance for specific games. Such a system rather requires a form of general intelligence that enables it to autonomously generate strategies based on the given game rules. After a decade of development, GGP provides an important testbed for AI, especially artificial general intelligence. The main problem of GGP is how to build an efficient general game player, and strategy generation is the core technique for doing so. [Scope of the literature] The main algorithms used in previous successful general game players are the Monte Carlo Tree Search (MCTS) algorithm and its variants. [Methods] To improve MCTS during the online real-time search, this paper incorporates it with a memory structure, where each entry contains information about a particular state. This memory is used to generate an approximate value estimation by combining the estimations of similar states. [Results] Based on this method, we implement and evaluate a general game player. The experimental results show that it outperforms the original Monte Carlo player in a variety of games. In particular, in two-player zero-sum turn-based games with symmetric information, the general game player achieves a winning rate of more than 55%, and its performance improves significantly with increasing game size, even reaching a 100% winning rate in large-scale games such as Connect 5 and Breakthrough. [Conclusions] These results confirm the feasibility of using game-dependent information to improve the performance of MCTS.
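The memory-augmented evaluation described in [Methods] can be illustrated with a minimal sketch. Note that the class name, the table layout, and the way "similar states" are supplied (as caller-provided keys) are illustrative assumptions for exposition, not the paper's actual implementation:

```python
# Illustrative sketch of memory-augmented state evaluation for MCTS.
# All names and the similarity scheme are assumptions, not the paper's API.

class StateMemory:
    """Stores running reward statistics for states visited during online play."""

    def __init__(self, prior=0.5):
        self.table = {}      # state key -> (reward_sum, visit_count)
        self.prior = prior   # fallback value when no information is available

    def update(self, key, reward):
        """Record one playout reward for a state (e.g. during backpropagation)."""
        s, n = self.table.get(key, (0.0, 0))
        self.table[key] = (s + reward, n + 1)

    def estimate(self, key, similar_keys=()):
        """Estimate a state's value by pooling its own statistics with
        those of similar states stored in memory."""
        total, count = 0.0, 0
        for k in (key, *similar_keys):
            if k in self.table:
                s, n = self.table[k]
                total += s
                count += n
        return total / count if count else self.prior
```

In an MCTS loop, each visited state's entry would be updated with the simulation reward during backpropagation, and the pooled estimate could replace or be blended with a node's plain average during selection, sharpening the evaluation of rarely visited states.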

Key words: general game playing, Monte Carlo tree search, algorithmic game theory, multi-agent systems