Speech processing is the study of speech signals and the processing methods of signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. The input is called speech recognition and the output is called speech synthesis.

History

Early attempts at speech processing and recognition were primarily focused on understanding a handful of simple phonetic elements such as vowels. In 1952, three researchers at Bell Labs, Stephen Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker. Pioneering work in the field of speech recognition using analysis of the speech spectrum was reported in the 1940s.

Linear predictive coding (LPC), a speech processing algorithm, was first proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT) in 1966. Further developments in LPC technology were made by Bishnu S. Atal and Manfred R. Schroeder at Bell Labs during the 1970s. LPC was the basis for voice-over-IP (VoIP) technology, as well as speech synthesizer chips, such as the Texas Instruments LPC Speech Chips used in the Speak & Spell toys from 1978.

One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by Lawrence Rabiner and others at Bell Labs was used by AT&T in their Voice Recognition Call Processing service to route calls without a human operator. By this point, the vocabulary of these systems was larger than the average human vocabulary.

By the early 2000s, the dominant speech processing strategy started to shift away from hidden Markov models towards more modern neural networks and deep learning.

Techniques

Dynamic time warping

Dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) subject to certain restrictions and rules. The optimal match is the one that satisfies all the restrictions and rules and has the minimal cost, where the cost is computed as the sum, over all matched pairs of indices, of the absolute differences between their values.
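The cost minimisation described above can be sketched with a small dynamic-programming table. This is a minimal illustration rather than a reference implementation: the function name `dtw_distance` is hypothetical, and the restrictions are assumed to be the usual ones (the first and last elements are matched to each other, every index is matched at least once, and the matching never moves backwards in time).

```python
def dtw_distance(a, b):
    """Return the minimal DTW cost between two numeric sequences.

    Assumed rules (not spelled out in the text above): first matches first,
    last matches last, every index is matched, and the matching is monotonic.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Cost of matching this pair: absolute difference of the values.
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b[j-1] is matched again
                cost[i][j - 1],      # b advances, a[i-1] is matched again
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]
```

Under these assumptions, two sequences that differ only in speed, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, have a cost of zero, because the repeated value can be matched twice at no cost.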

Hidden Markov models

A hidden Markov model can be represented as the simplest dynamic Bayesian network. The goal of the algorithm is to estimate a hidden variable x(t) given a list of observations y(t). By applying the Markov property, the conditional probability distribution of the hidden variable x(t) at time t, given the values of the hidden variable x at all previous times, depends only on the value of the hidden variable x(t − 1). Similarly, the value of the observed variable y(t) depends only on the value of the hidden variable x(t) (both at time t).
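One common way to estimate the hidden sequence from the observations is the Viterbi algorithm, sketched below for a discrete model. The text above does not name a specific estimation algorithm, so Viterbi is an assumed choice here, and the two-state "voiced"/"unvoiced" model with its probabilities is invented purely for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely hidden state sequence, its probability)."""
    # best[t][s]: probability of the best state path ending in s at time t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Markov property: x(t) depends only on x(t - 1).
            prev, p = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            # y(t) depends only on x(t).
            best[t][s] = p * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the most likely final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model: hidden voicing state, observed frame energy.
STATES = ("voiced", "unvoiced")
START_P = {"voiced": 0.5, "unvoiced": 0.5}
TRANS_P = {
    "voiced": {"voiced": 0.8, "unvoiced": 0.2},
    "unvoiced": {"voiced": 0.2, "unvoiced": 0.8},
}
EMIT_P = {
    "voiced": {"high": 0.9, "low": 0.1},
    "unvoiced": {"high": 0.2, "low": 0.8},
}

path, prob = viterbi(["high", "high", "low", "low"], STATES, START_P, TRANS_P, EMIT_P)
```

For this toy model the decoded path is voiced, voiced, unvoiced, unvoiced. Practical recognisers typically work with log-probabilities to avoid numerical underflow on long sequences; plain probabilities are used here only to keep the sketch short.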

Artificial neural networks

An artificial neural network (ANN) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs.
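The computation performed by a single artificial neuron can be sketched as follows. The per-connection weights, the bias term, and the choice of the logistic sigmoid as the non-linear function are assumptions commonly made in practice; the description above only requires some non-linear function of the summed inputs. The names `neuron` and `layer` are hypothetical.

```python
import math


def neuron(inputs, weights, bias):
    """Output of one artificial neuron: sigmoid of the weighted input sum."""
    # Each incoming connection carries a real number, scaled by its weight.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Non-linear activation (logistic sigmoid, an assumed choice).
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weight_rows, biases):
    """Outputs of several neurons that all receive the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Two inputs feeding a layer of three neurons; the three outputs could in
# turn serve as the inputs of a following layer.
hidden = layer([0.5, -1.0], [[1.0, 2.0], [0.0, 0.0], [-1.0, 0.5]], [0.0, 0.0, 0.1])
```

With all weights and the bias set to zero, the sigmoid returns 0.5, which is why the second neuron in the example outputs exactly 0.5 regardless of its inputs.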

Phase-aware processing

Phase is usually assumed to be a uniformly distributed random variable and thus uninformative. This is due to the wrapping of phase: the result of the arctangent function is not continuous because of periodic jumps of 2π. After phase unwrapping (see Chapter 2.3, Instantaneous phase and frequency), it can be expressed as:

\phi(h,l) = \phi_{\mathrm{lin}}(h,l) + \Psi(h,l)

where

\phi_{\mathrm{lin}}(h,l) = \omega_0(l') \, \Delta t

is the linear phase (\Delta t is the temporal shift at each frame of analysis) and \Psi(h,l) is the phase contribution of the vocal tract and phase source. The obtained phase estimates can be used for noise reduction: temporal smoothing of the instantaneous phase and of its derivatives with respect to time (instantaneous frequency) and frequency (group delay), and smoothing of phase across frequency. Joint amplitude and phase estimators can recover speech more accurately, based on the assumption of a von Mises distribution of phase.
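The wrapping and unwrapping step can be illustrated with a short sketch. It assumes a purely linear phase that advances by a fixed amount per analysis frame (standing in for the product of frequency and temporal shift), wraps it with the arctangent, and then removes the 2π jumps. The function name `unwrap` and the chosen increment of 1 radian per frame are hypothetical, and the simple rule used here only works when the true phase changes by less than π between consecutive frames.

```python
import math


def unwrap(phases):
    """Remove 2*pi jumps from a sequence of wrapped phase values (radians)."""
    if not phases:
        return []
    out = [phases[0]]
    offset = 0.0
    for prev, cur in zip(phases, phases[1:]):
        delta = cur - prev
        # A jump larger than pi is treated as a wrap-around, not a real change.
        if delta > math.pi:
            offset -= 2.0 * math.pi
        elif delta < -math.pi:
            offset += 2.0 * math.pi
        out.append(cur + offset)
    return out


# Assumed linear phase: 1 radian of advance per frame over 20 frames.
true_phase = [1.0 * k for k in range(20)]
# atan2 returns values in (-pi, pi], so the linear trend is hidden by jumps.
wrapped = [math.atan2(math.sin(p), math.cos(p)) for p in true_phase]
recovered = unwrap(wrapped)
```

In this example `recovered` follows the original linear trend again, whereas `wrapped` stays confined to one period and looks discontinuous.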

Applications

* Interactive Voice Systems
* Virtual Assistants
* Voice Identification
* Emotion Recognition
* Call Center Automation
* Robotics

