Speech coding

Speech coding is an application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.^[1]

Common applications of speech coding are mobile telephony and voice over IP (VoIP).^[2] The most widely used speech coding technique in mobile telephony is linear predictive coding (LPC), while the most widely used in VoIP applications are the LPC and modified discrete cosine transform (MDCT) techniques.

The techniques employed in speech coding are similar to those used in audio data compression and audio coding where appreciation of psychoacoustics is used to transmit only data that is relevant to the human auditory system. For example, in voiceband speech coding, only information in the frequency band 400 to 3500 Hz is transmitted but the reconstructed signal retains adequate intelligibility.

Speech coding differs from other forms of audio coding in that speech is a simpler signal than other audio signals, and statistical information is available about the properties of speech. As a result, some auditory information that is relevant in general audio coding can be unnecessary in the speech coding context. Speech coding stresses the preservation of intelligibility and pleasantness of speech while using a constrained amount of transmitted data.^[3] In addition, most speech applications require low coding delay, as latency interferes with speech interaction.^[4]

Sample companding viewed as a form of speech coding[edit]

The A-law and μ-law algorithms used in G.711 PCM digital telephony can be seen as an earlier precursor of speech encoding, requiring only 8 bits per sample but giving effectively 12 bits of resolution.^[7] Logarithmic companding are consistent with human hearing perception in that a low-amplitude noise is heard along a low-amplitude speech signal but is masked by a high-amplitude one. Although this would generate unacceptable distortion in a music signal, the peaky nature of speech waveforms, combined with the simple frequency structure of speech as a periodic waveform having a single fundamental frequency with occasional added noise bursts, make these very simple instantaneous compression algorithms acceptable for speech.

A wide variety of other algorithms were tried at the time, mostly delta modulation variants, but after careful consideration, the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems. At the time of their design, their 33% bandwidth reduction for a very low complexity made an excellent engineering compromise. Their audio performance remains acceptable, and there was no need to replace them in the stationary phone network.

In 2008, G.711.1 codec, which has a scalable structure, was standardized by ITU-T. The input sampling rate is 16 kHz.^[8]

Speech coding

Sample companding viewed as a form of speech coding[edit]

AMR-WB

AAC-LD

G.722

Lyra

Digital signal processing

Speech interface guideline

Speech processing

Speech synthesis

Vector quantization

ITU-T Test Signals for Telecommunication Systems Test Samples

ITU-T Perceptual evaluation of speech quality (PESQ) tool Sources