The Research Development and Challenge of Automatic Speech Recognition

doi:10.11871/jfdc.issn.2096-742X.2019.02.003

Abstract


[Objective] This paper introduces the state-of-the-art technical framework and main challenges of Automatic Speech Recognition (ASR) systems, and provides a reference for further research in the field. [Methods] First, the latest end-to-end speech recognition frameworks are introduced, including Connectionist Temporal Classification (CTC) and attention-based frameworks. Second, four challenging problems in ASR applications are presented: recognition of noisy and distant-field speech, recognition of code-switching speech, recognition of domain-related terms, and speech recognition for minority languages with limited resources. [Results] To address the robustness of end-to-end ASR systems, an improved enhancement method and a filtering attention mechanism are proposed. State-of-the-art methods and future development directions are discussed for each of these challenging problems. [Conclusions] Commercializing end-to-end ASR systems remains a major challenge, and research on these four problems plays a key role in the application of ASR systems.

Key words: automatic speech recognition, end-to-end, distant-field speech, code-switching, domain-related terms

Liu Qingfeng, Gao Jianqing, Wan Genshun. The Research Development and Challenge of Automatic Speech Recognition[J]. Frontiers of Data and Computing, 2019, 1(2): 26-36.
