Frontiers of Data and Computing (数据与计算发展前沿), 2026, Vol. 8, Issue (2): 204-214.

CSTR: 32002.14.jfdc.CN10-1649/TP.2026.02.015

doi: 10.11871/jfdc.issn.2096-742X.2026.02.015

• Technology and Applications •

Advances in Zero-Shot Text-to-Speech Technology

WANG Danlin1,2, TANG Yunqi2,*

  1. Department of Criminal Science and Technology, Beijing Police College, Beijing 102202, China
  2. School of Investigation, People’s Public Security University of China, Beijing 100038, China
  • Received: 2025-08-11 Online: 2026-04-20 Published: 2026-04-23
  • Corresponding author: *TANG Yunqi (E-mail: tangyunqi@ppsuc.edu.cn)
  • About the authors: WANG Danlin is a lecturer in the Department of Criminal Science and Technology, Beijing Police College. Her research interests include forensic audio-visual examination technology and intelligent image recognition and interpretation.
    In this paper, she is mainly responsible for the systematic literature review and manuscript writing.
    E-mail: wangdl.1101@163.com
    TANG Yunqi is a professor and doctoral supervisor at the School of Investigation, People’s Public Security University of China. His research interests include electronic data examination and forensics and intelligent image recognition and interpretation.
    In this paper, he is mainly responsible for the framework design, revision, and final approval of the paper.
    E-mail: tangyunqi@ppsuc.edu.cn

Abstract:

[Purpose] This paper reviews the current state and development trends of zero-shot text-to-speech (TTS) research and analyzes the challenges the technology faces in improving naturalness, fidelity, and diversity. [Literature Scope] The review draws on relevant academic papers published in China and internationally from 2021 to 2025, covering studies on different model architectures, tasks, and application scenarios. [Methods] The paper introduces the definition and core characteristics of zero-shot TTS, classifies existing mainstream end-to-end zero-shot TTS models by their deep learning generative architecture, and compares representative methods. [Results] The paper summarizes recent advances by representative methods in speech naturalness, speaker similarity, multilingual adaptability, and expressive capacity. It notes that the development of generative AI and large language models (LLMs) has driven new trends in cross-lingual transfer, in-context learning, and multimodal generation, while limitations remain in the stability of cross-speaker transfer, fine-grained modeling of emotion and speaking style, and diversity. [Conclusions] Future research should further enhance high-fidelity, highly expressive speech generation; strengthen the modeling of attributes such as emotion, register, and style; improve generalization across languages and speakers while keeping models lightweight and capable of real-time operation; and explore LLM-driven unified multimodal generation frameworks.

Key words: text-to-speech, zero-shot, non-autoregressive models, autoregressive models, hybrid models, large language models