Frontiers of Data and Computing ›› 2026, Vol. 8 ›› Issue (2): 204-214.

CSTR: 32002.14.jfdc.CN10-1649/TP.2026.02.015

doi: 10.11871/jfdc.issn.2096-742X.2026.02.015

• Technology and Application •

Advances in Zero-Shot Text-to-Speech Technology

WANG Danlin1,2, TANG Yunqi2,*

  1 Department of Criminal Science and Technology, Beijing Police College, Beijing 102202, China
  2 School of Investigation, People’s Public Security University of China, Beijing 100038, China
  • Received: 2025-08-11 Online: 2026-04-20 Published: 2026-04-23

Abstract:

[Purpose] This paper reviews the research status and development trends of zero-shot text-to-speech (TTS) and analyzes the challenges the technology faces in improving naturalness, fidelity, and diversity. [Literature Scope] The review draws on academic publications from 2021 to 2025, covering different model architectures, tasks, and application scenarios. [Methods] The definition and core characteristics of zero-shot TTS are introduced, existing mainstream end-to-end zero-shot TTS models are classified by their deep learning generative architecture, and representative methods are compared and analyzed. [Results] Recent advances are summarized in terms of speech naturalness, speaker similarity, multilingual adaptability, and expressive capacity. The development of generative AI and large language models (LLMs) has promoted new trends in cross-lingual transfer, in-context learning, and multimodal generation. However, limitations remain in the stability of cross-speaker transfer, the fine-grained modeling of emotion and speaking style, and diversity. [Conclusions] Future research should further enhance high-fidelity and expressive speech generation, strengthen the modeling of emotion, register, and style, improve generalization across languages and speakers, pursue lightweight and real-time deployment, and explore unified multimodal generation frameworks driven by large language models.

Keywords: text-to-speech, zero-shot, non-autoregressive models, autoregressive models, hybrid models, large language models