Advances in Zero-Shot Text-to-Speech Technology

doi:10.11871/jfdc.issn.2096-742X.2026.02.015

Frontiers of Data and Computing, 2026, Vol. 8, Issue (2): 204-214.

CSTR: 32002.14.jfdc.CN10-1649/TP.2026.02.015

WANG Danlin¹,², TANG Yunqi²,*

¹ Department of Criminal Science and Technology, Beijing Police College, Beijing 102202, China
² School of Investigation, People's Public Security University of China, Beijing 100038, China

Received: 2025-08-11  Online: 2026-04-20  Published: 2026-04-23
Corresponding author: *TANG Yunqi (E-mail: tangyunqi@ppsuc.edu.cn)
About the authors:
WANG Danlin is a lecturer in the Department of Criminal Science and Technology, Beijing Police College. Her research interests include forensic audio-visual examination technology and intelligent image recognition and understanding. In this paper, she is mainly responsible for the systematic literature review and manuscript writing.
E-mail: wangdl.1101@163.com
TANG Yunqi is a professor and doctoral supervisor at the School of Investigation, People's Public Security University of China. His research interests include electronic data examination and intelligent image recognition and understanding. In this paper, he is mainly responsible for the framework design, revision, and final approval of the paper.
E-mail: tangyunqi@ppsuc.edu.cn

Abstract

[Purpose] This paper reviews the current state and development trends of zero-shot text-to-speech (TTS) synthesis and analyzes the challenges the technology faces in improving naturalness, fidelity, and diversity. [Literature Scope] The review draws on relevant academic papers published in China and internationally from 2021 to 2025, covering different model architectures, tasks, and application scenarios. [Methods] The paper introduces the definition and core characteristics of zero-shot TTS, classifies the mainstream end-to-end zero-shot TTS models by their deep learning generative architecture, and compares representative methods. [Results] The paper summarizes the progress of recent representative methods in speech naturalness, speaker similarity, multilingual adaptability, and expressiveness. It finds that the development of generative AI and large language models (LLMs) has driven new trends in cross-lingual transfer, in-context learning, and multimodal generation, while shortcomings remain in cross-speaker transfer stability, fine-grained modeling of emotion and speaking style, and diversity. [Conclusions] Future research should further improve high-fidelity, highly expressive speech generation, strengthen the modeling of attributes such as emotion, register, and style, improve generalization across languages and speakers while keeping models lightweight and capable of real-time operation, and explore multimodal unified generation frameworks driven by large language models.

Key words: text-to-speech, zero-shot, non-autoregressive models, autoregressive models, hybrid models, large language models

WANG Danlin, TANG Yunqi. Advances in Zero-Shot Text-to-Speech Technology[J]. Frontiers of Data and Computing, 2026, 8(2): 204-214, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2026.02.015.

References

[1]	TAN X, QIN T, SOONG F, et al. A survey on neural speech synthesis[J]. arXiv, 2021, 2106.15561.
[2]	JIA Y, ZHANG Y, WEISS R, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis[C]// Advances in Neural Information Processing Systems (NeurIPS). 2018, 31.
[3]	CASANOVA E, WEBER J, SHULBY C D, et al. YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone[C]// International Conference on Machine Learning (ICML). PMLR, 2022: 2709-2720.
[4]	WANG C, CHEN S, WU Y, et al. Neural codec language models are zero-shot Text-to-Speech synthesizers[J]. arXiv, 2023, 2301.02111.
[5]	LI Y A, HAN C, RAGHAVAN V, et al. StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models[J]. Advances in Neural Information Processing Systems, 2023, 36: 19594-19621. pmid: 39866554
[6]	PENG P, HUANG P Y, LI S W, et al. VoiceCraft: Zero-shot speech editing and text-to-speech in the wild[C]// Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024: 12442-12462.
[7]	LAM P, ZHANG H, CHEN N F, et al. PRESENT: zero-shot text-to-prosody control[J]. IEEE Signal Processing Letters, 2025.
[8]	DU Z, WANG Y, CHEN Q, et al. CosyVoice 2: Scalable streaming speech synthesis with large language models[J]. arXiv, 2024, 2412.10117.
[9]	JIANG Z, REN Y, LI R, et al. MegaTTS 3: Sparse alignment enhanced latent diffusion transformer for zero-shot speech synthesis[J]. arXiv, 2025, 2502.18924.
[10]	LI Y A, HAN C, MESGARANI N. StyleTTS: A style-based generative model for natural and diverse text-to-speech synthesis[J]. IEEE Journal of Selected Topics in Signal Processing, 2025.
[11]	GUO Y, DU C, CHEN X, et al. EmoDiff: Intensity-controllable emotional text-to-speech with soft-label guidance[C]// ICASSP 2023—IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2023: 1-5.
[12]	CHEN Y, NIU Z, MA Z, et al. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching[J]. arXiv, 2024, 2410.06885.
[13]	唐浩彬, 张旭龙, 王健宗, et al. A survey of expressive speech synthesis (in Chinese)[J]. Big Data Research, 2023, 9(6): 53-71. doi: 10.11959/j.issn.2096-0271.2022082
[14]	KONG J, KIM J, BAE J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis[J]. Advances in Neural Information Processing Systems, 2020, 33: 17022-17033.
[15]	LEE S, PING W, GINSBURG B, et al. BigVGAN: A universal neural vocoder with large-scale training[J]. arXiv, 2022, 2206.04658.
[16]	QIU Z, TANG J, ZHANG Y, et al. A voice cloning method based on the improved HiFi-GAN model[J]. Computational Intelligence and Neuroscience, 2022, 2022: 6707304.
[17]	LI Y, YU C, SUN G, et al. Cross-Utterance conditioned VAE for speech generation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 4263-4276. doi: 10.1109/TASLP.2024.3453598
[18]	KIM J, KONG J, SON J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech[C]// International Conference on Machine Learning (ICML). PMLR, 2021: 5530-5540.
[19]	CASANOVA E, SHULBY C, GÖLGE E, et al. SC-GlowTTS: An efficient zero-shot multi-speaker text-to-speech model[J]. arXiv, 2021, 2104.05557.
[20]	KIM S, SHIH K, SANTOS J F, et al. P-Flow: A fast and data-efficient Zero-shot TTS through speech prompting[J]. Advances in Neural Information Processing Systems, 2023, 36: 74213-74228.
[21]	BILINSKI P, MERRITT T, EZZERG A, et al. Creating new voices using normalizing flows[C]// Interspeech 2022. 2022: 2958-2962.
[22]	LE M, VYAS A, SHI B, et al. Voicebox: Text-guided multilingual universal speech generation at scale[J]. Advances in Neural Information Processing Systems, 2023, 36: 14005-14034.
[23]	BANG C W, CHUN C. Effective zero-shot multi-speaker text-to-speech technique using information perturbation and a speaker encoder[J]. Sensors, 2023, 23(23): 9591. doi: 10.3390/s23239591
[24]	SHEN K, JU Z, TAN X, et al. NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers[J]. arXiv, 2023, 2304.09116.
[25]	JU Z, WANG Y, SHEN K, et al. NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models[J]. arXiv, 2024, 2403.03100.
[26]	ESKIMEZ S E, WANG X, THAKKER M, et al. E2 TTS: Embarrassingly easy fully non-autoregressive Zero-shot TTS[C]// IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024: 682-689.
[27]	GOSWAMI N, HARADA T. SA-TTS: Speaker attractor text-to-speech, learning to speak by learning to separate[J]. arXiv, 2022, 2207.06011.
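[28]	刘朝辞. Research on prosody modeling methods for speech synthesis based on latent representations (in Chinese)[D]. Hefei: University of Science and Technology of China, 2025.
[29]	岳焕景, 王嘉玮, 杨敬钰. Autoregressive zero-shot speech synthesis based on phoneme-level prosody modeling (in Chinese)[J]. Journal of Hunan University (Natural Sciences), 2025, 52(4): 114-123.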
[30]	PANKOV V, PRONINA V, KUZMIN A, et al. DINO-VITS: Data-efficient Zero-shot TTS with self-supervised speaker verification loss for noise robustness[J]. arXiv, 2023, 2311.09770.
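[31]	付毅冲. Research on zero-shot personalized speech synthesis (in Chinese)[D]. Beijing: Beijing University of Posts and Telecommunications, 2024.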
[32]	LI J, ZHANG L. ZSE-VITS: A zero-shot expressive voice cloning method based on VITS[J]. Electronics, 2023, 12(4): 820. doi: 10.3390/electronics12040820
[33]	CASANOVA E, DAVIS K, GÖLGE E, et al. XTTS: A massively multilingual zero-shot Text-to-Speech model[J]. arXiv, 2024, 2406.04904.
[34]	王陈偲, 杨思燕, 苗启广. Research on a voice cloning system based on the XTTS model (in Chinese)[J/OL]. Computer Science, 1-10 [2025-09-12]. https://link.cnki.net/urlid/50.1075.tp.20250815.0949.022.
[35]	JIANG Z, LIU J, REN Y, et al. Mega-TTS 2: Boosting prompting mechanisms for zero-shot speech synthesis[J]. arXiv, 2023, 2307.07218.
[36]	LEI Y, YANG S, CONG J, et al. Glow-WaveGAN 2: High-quality zero-shot Text-to-Speech synthesis and any-to-any voice conversion[J]. arXiv, 2022, 2207.01832.
[37]	CHEN S, LIU S, ZHOU L, et al. VALL-E 2: Neural codec language models are human-parity zero-shot Text-to-Speech synthesizers[J]. arXiv, 2024, 2406.05370.
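[38]	韩冰, 钱彦旻. VALL-E R: Robust and efficient zero-shot speech synthesis using a monotonic alignment strategy (in Chinese)[J/OL]. Journal of Signal Processing, 1-13 [2025-09-12]. https://link.cnki.net/urlid/11.2406.tn.20250728.1918.006.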
[39]	ZHANG Z, ZHOU L, WANG C, et al. Speak foreign languages with your own voice: Cross-lingual neural codec language modeling[J]. arXiv, 2023, 2303.03926.
[40]	ANASTASSIOU P, CHEN J, CHEN J, et al. Seed-TTS: A family of high-quality versatile speech generation models[J]. arXiv, 2024, 2406.02430.
[41]	YE Z, ZHU X, CHAN C M, et al. LLASA: Scaling train-time and inference-time compute for LLaMA-based speech synthesis[J]. arXiv, 2025, 2502.04128.

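Table 1
Method	Cross-lingual	Dialogue support	Emotion/style/prosody control
SC-GlowTTS	No	No	No
P-Flow	No	No	No
Voicebox	Yes	No	Implicit style control
Grad-TTS	No	No	Implicit style control
NaturalSpeech 2	Yes	No	Implicit style control; explicit prosody control
NaturalSpeech 3	Yes	No	Explicit style/prosody control
E2 TTS	No	No	No
SA-TTS	No	No	No
VoiceCraft	No	No	Implicit style control
PRESENT	No	No	Explicit style/prosody control
VALL-E	No	No	Implicit style/prosody control
VALL-E X	Yes	No	Implicit style/prosody control
VALL-E R	No	No	Explicit emotion/prosody control; implicit style control
StyleTTS	Yes	No	Implicit style control
StyleTTS 2	Yes	No	Implicit emotion/style/prosody control
Seed-TTS	Yes	No	Implicit emotion/style/prosody control
XTTS	Yes	No	Implicit style/prosody control
DINO-VITS	No	No	No
YourTTS	Yes	No	Implicit style/prosody control
ZSE-VITS	No	No	Explicit emotion/prosody control; implicit style control
Mega-TTS 2	Yes	No	Implicit style/prosody control
MegaTTS 3	Yes	No	Explicit style/prosody control
Glow-WaveGAN 2	No	No	No
CosyVoice 2	Yes	Yes	Implicit emotion/style/prosody control
LLASA	Yes	No	Implicit emotion/style/prosody control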
