
Speech recognition

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

For the human linguistic concept, see Speech perception.

Some speech recognition systems require "training" (also called "enrollment") where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker-independent"[1] systems. Systems that use training are called "speaker-dependent".


Speech recognition applications include voice user interfaces such as voice dialing (e.g. "call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, search key words (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), determining speaker characteristics,[2] speech-to-text processing (e.g., word processors or emails), and aircraft (usually termed direct voice input). Automatic pronunciation assessment is used in education such as for spoken language learning.


The term voice recognition[3][4][5] or speaker identification[6][7][8] refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice or it can be used to authenticate or verify the identity of a speaker as part of a security process.


From the technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in deep learning and big data. The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems.

History

1952 – Three Bell Labs researchers, Stephen Balashek,[9] R. Biddulph, and K. H. Davis built a system called "Audrey"[10] for single-speaker digit recognition. Their system located the formants in the power spectrum of each utterance.[11]

1960 – Gunnar Fant developed and published the source-filter model of speech production.

1962 – IBM demonstrated its 16-word "Shoebox" machine's speech recognition capability at the 1962 World's Fair.[12]

1966 – Linear predictive coding (LPC), a speech coding method, was first proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT), while working on speech recognition.[13]

1969 – Funding at Bell Labs dried up for several years when the influential John Pierce wrote an open letter that was critical of and defunded speech recognition research.[14] This defunding lasted until Pierce retired and James L. Flanagan took over.

Applications

In-car systems

Typically, a manual control input, such as a finger control on the steering wheel, enables the speech recognition system, and this is signaled to the driver by an audio prompt. Following the audio prompt, the system has a "listening window" during which it may accept speech input for recognition.
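As a rough illustration of this interaction flow, the sketch below models a push-to-talk listening window in Python; the callback names and the three-second window length are hypothetical rather than taken from any particular in-car system.

```python
import time

LISTEN_WINDOW_SECONDS = 3.0  # hypothetical window length; real systems vary

def handle_push_to_talk(play_prompt, capture_frame, recognize):
    """Button press -> audio prompt -> short listening window -> recognition.
    play_prompt, capture_frame, and recognize are placeholder callbacks."""
    play_prompt()                       # audio prompt signals that the system is listening
    deadline = time.monotonic() + LISTEN_WINDOW_SECONDS
    frames = []
    while time.monotonic() < deadline:  # speech is only accepted inside the window
        frames.append(capture_frame())  # e.g. one short chunk of microphone samples (bytes)
    return recognize(b"".join(frames))  # hand the captured audio to the recognizer
```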


Simple voice commands may be used to initiate phone calls, select radio stations or play music from a compatible smartphone, MP3 player or music-loaded flash drive. Voice recognition capabilities vary between car make and model. Some of the most recent car models offer natural-language speech recognition in place of a fixed set of commands, allowing the driver to use full sentences and common phrases. With such systems there is, therefore, no need for the user to memorize a set of fixed command words.

The accuracy of speech recognition may vary with the following:

Vocabulary size and confusability

Speaker dependence versus independence

Isolated, discontinuous or continuous speech

Task and language constraints

Read versus spontaneous speech

Adverse conditions

Further information

Conferences and journals

Popular speech recognition conferences held each year or two include SpeechTEK and SpeechTEK Europe, ICASSP, Interspeech/Eurospeech, and the IEEE ASRU. Conferences in the field of natural language processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the IEEE Transactions on Speech and Audio Processing (later renamed IEEE Transactions on Audio, Speech and Language Processing and, since September 2014, IEEE/ACM Transactions on Audio, Speech and Language Processing after merging with an ACM publication), Computer Speech and Language, and Speech Communication.

Books

Books such as "Fundamentals of Speech Recognition" by Lawrence Rabiner (1993) are useful for acquiring basic knowledge but may not be fully up to date. Other good sources include "Statistical Methods for Speech Recognition" by Frederick Jelinek, "Spoken Language Processing" (2001) by Xuedong Huang et al., "Computer Speech" by Manfred R. Schroeder (second edition, 2004), and "Speech Processing: A Dynamic and Optimization-Oriented Approach" (2003) by Li Deng and Doug O'Shaughnessy. The updated textbook Speech and Language Processing (2008) by Jurafsky and Martin presents the basics and the state of the art for ASR. Speaker recognition also uses many of the same features, much of the same front-end processing, and the same classification techniques as speech recognition. The comprehensive textbook "Fundamentals of Speaker Recognition" is an in-depth source for up-to-date details on the theory and practice.[138] A good insight into the techniques used in the best modern systems can be gained by paying attention to government-sponsored evaluations such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).


A good and accessible introduction to speech recognition technology and its history is provided by the general audience book "The Voice in the Machine. Building Computers That Understand Speech" by Roberto Pieraccini (2012).


A more recent book on speech recognition is Automatic Speech Recognition: A Deep Learning Approach (Springer), written by Microsoft researchers D. Yu and L. Deng and published near the end of 2014, with highly mathematically oriented technical detail on how deep learning methods are derived and implemented in modern speech recognition systems based on DNNs and related deep learning methods.[84] A related book, published earlier in 2014, "Deep Learning: Methods and Applications" by L. Deng and D. Yu, provides a less technical but more methodology-focused overview of DNN-based speech recognition during 2009–2014, placed within the more general context of deep learning applications including not only speech recognition but also image recognition, natural language processing, information retrieval, multimodal processing, and multitask learning.[80]

Software

In terms of freely available resources, Carnegie Mellon University's Sphinx toolkit is one place to start both to learn about speech recognition and to start experimenting. Another resource (free but copyrighted) is the HTK book (and the accompanying HTK toolkit). For more recent and state-of-the-art techniques, the Kaldi toolkit can be used.[139] In 2017 Mozilla launched the open-source project Common Voice[140] to gather a large database of voices to help build the free speech recognition project DeepSpeech (available free on GitHub),[141] using Google's open-source platform TensorFlow.[142] When Mozilla redirected funding away from the project in 2020, it was forked by its original developers as Coqui STT,[143] which uses the same open-source license.[144][145]
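As a concrete starting point, the snippet below is a minimal sketch of offline transcription with the DeepSpeech Python package mentioned above (Coqui STT exposes an essentially equivalent API under the stt package). The model and audio file names are placeholders, and the recording is assumed to be 16-bit mono PCM at the model's expected sample rate.

```python
import wave

import numpy as np
import deepspeech  # pip install deepspeech (the pre-fork Mozilla package)

MODEL_PATH = "deepspeech-0.9.3-models.pbmm"  # placeholder: a pretrained English model
AUDIO_PATH = "recording.wav"                 # placeholder: 16-bit mono PCM recording

model = deepspeech.Model(MODEL_PATH)

with wave.open(AUDIO_PATH, "rb") as wav:
    # The model expects audio at its native sample rate (16 kHz for the stock models).
    assert wav.getframerate() == model.sampleRate(), "resample the audio first"
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))  # prints the recognized transcript
```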


Google Gboard supports speech recognition on all Android applications. It can be activated through the microphone icon.[146]


Commercial cloud-based speech recognition APIs are broadly available.
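As an illustration of what calling such a cloud API typically involves, here is a hedged sketch using the Google Cloud Speech-to-Text Python client; the input file name is a placeholder, credentials are assumed to be configured in the environment, and exact class and field names can differ between client-library versions.

```python
from google.cloud import speech  # pip install google-cloud-speech

client = speech.SpeechClient()  # assumes credentials are configured in the environment

with open("recording.wav", "rb") as f:  # placeholder: 16 kHz mono LINEAR16 audio
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)  # best hypothesis per recognized segment
```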


For more software resources, see List of speech recognition software.

Further reading

Cole, Ronald; Mariani, Joseph; Uszkoreit, Hans; Varile, Giovanni Battista; Zaenen, Annie; Zampolli, Antonio; Zue, Victor, eds. (1997). Survey of the state of the art in human language technology. Cambridge Studies in Natural Language Processing. Vol. XII–XIII. Cambridge University Press. ISBN 978-0-521-59277-2.

Junqua, J.-C.; Haton, J.-P. (1995). Robustness in Automatic Speech Recognition: Fundamentals and Applications. Kluwer Academic Publishers. ISBN 978-0-7923-9646-8.

Karat, Clare-Marie; Vergo, John; Nahamoo, David (2007). "Conversational Interface Technologies". In Sears, Andrew; Jacko, Julie A. (eds.). The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies, and Emerging Applications (Human Factors and Ergonomics). Lawrence Erlbaum Associates Inc. ISBN 978-0-8058-5870-9.

Pieraccini, Roberto (2012). The Voice in the Machine. Building Computers That Understand Speech. The MIT Press. ISBN 978-0262016858.

Pirani, Giancarlo, ed. (2013). Advanced algorithms and architectures for speech understanding. Springer Science & Business Media. ISBN 978-3-642-84341-9.

Signer, Beat; Hoste, Lode (2013). "SpeeG2: A Speech- and Gesture-based Interface for Efficient Controller-free Text Entry". In Proceedings of ICMI 2013, 15th International Conference on Multimodal Interaction, Sydney, Australia, December 2013.

Woelfel, Matthias; McDonough, John (26 May 2009). Distant Speech Recognition. Wiley. ISBN 978-0470517048.

External links

Speech Technology at Curlie