The Research Development and Challenge of Automatic Speech Recognition

doi:10.11871/jfdc.issn.2096-742X.2019.02.003

Frontiers of Data and Computing ›› 2019, Vol. 1 ›› Issue (2): 26-36.

Special topic: "Artificial Intelligence" special issue

Liu Qingfeng, Gao Jianqing^*, Wan Genshun

iFLYTEK Co., Ltd., Hefei, Anhui 230088, China

Received: 2019-09-17 Online: 2019-12-20 Published: 2020-01-15
Corresponding author: Gao Jianqing
About the authors:
Liu Qingfeng was born in 1973. He is the chairman of iFLYTEK Co., Ltd. and received his Ph.D. degree in signal and information processing from the University of Science and Technology of China (USTC). He is the director of the National Engineering Laboratory for Speech and Language Information Processing, and an adjunct professor and Ph.D. supervisor at USTC. He has served as a deputy to the 10th, 11th, 12th, and 13th National People's Congress. He is the first chairman of the national union for college students' innovation and entrepreneurship and the chairman of the Speech Industry Alliance of China. His research interests include signal processing and speech and language information processing.
In this paper, Liu Qingfeng designed the overall structure and supervised the research.
E-mail: qfliu@iflytek.com

Gao Jianqing was born in 1983. He received his D.Eng. degree in electronics and information from USTC and is the vice dean of iFLYTEK AI Research. His research interests include automatic speech recognition, speech and language information processing, and spoken dialogue systems.
In this paper, Gao Jianqing was the main contributor to Sections 1 and 2.1 and revised the entire paper.

Wan Genshun was born in 1989. He received his master's degree in communication and information systems from Jiangsu University and is a research director at iFLYTEK AI Research. His research interests include automatic speech recognition and speech and language information processing.
In this paper, Wan Genshun was the main contributor to Sections 2.2, 2.3, and 2.4.
E-mail: gswan@iflytek.com

Abstract

[Objective] This paper gives a systematic overview of the mainstream technical frameworks and main challenges of automatic speech recognition (ASR) systems, as a reference for further research in the field. [Methods] First, the mainstream end-to-end speech recognition frameworks are introduced, including Connectionist Temporal Classification (CTC) and attention-based frameworks. Second, four challenging problems in ASR applications are presented: recognition of noisy and far-field speech, recognition of Chinese-English code-switching speech, recognition of domain-specific terms, and recognition of low-resource minority languages. [Results] To address the insufficient stability of end-to-end frameworks, an improved scheme with enhancement and filtering attention mechanisms is proposed. For the four challenging problems, the mainstream solutions and future development directions are discussed. [Conclusions] Large-scale commercial use of end-to-end frameworks still faces major challenges, and solving the four challenging problems will play a key role in promoting ASR in industry applications.

Key words: automatic speech recognition, end-to-end, far-field speech recognition, Chinese-English code-switching, domain-specific terms

Liu Qingfeng, Gao Jianqing, Wan Genshun. The Research Development and Challenge of Automatic Speech Recognition[J]. Frontiers of Data and Computing, 2019, 1(2): 26-36.
