An acoustic model is used in

automatic speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also k ...

to represent the relationship between an

audio signal An audio signal is a representation of sound, typically using either a changing level of electrical voltage for analog signals or a series of binary numbers for Digital signal (signal processing), digital signals. Audio signals have frequencies i ...

and the

phonemes A phoneme () is any set of similar speech sounds that are perceptually regarded by the speakers of a language as a single basic sound—a smallest possible phonetic unit—that helps distinguish one word from another. All languages con ...

or other linguistic units that make up speech. The model is learned from a set of audio recordings and their corresponding transcripts. It is created by taking audio recordings of speech, and their text transcriptions, and using software to create statistical representations of the sounds that make up each word.

Background

Modern

speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also ...

systems use both an acoustic model and a

language model A language model is a model of the human brain's ability to produce natural language. Language models are useful for a variety of tasks, including speech recognition, machine translation,Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013)"S ...

to represent the statistical properties of speech. The acoustic model models the relationship between the audio signal and the phonetic units in the language. The language model is responsible for modeling the word sequences in the language. These two models are combined to get the top-ranked word sequences corresponding to a given audio segment. Most modern

speech Speech is the use of the human voice as a medium for language. Spoken language combines vowel and consonant sounds to form units of meaning like words, which belong to a language's lexicon. There are many different intentional speech acts, suc ...

recognition systems operate on the audio in small chunks known as frames with an approximate duration of 10ms per frame. The raw audio signal from each frame can be transformed by applying the mel-frequency cepstrum. The coefficients from this transformation are commonly known as mel frequency cepstral coefficients (MFCC)s and are used as an input to the acoustic model along with other features. Recently, the use of

Convolutional Neural Networks A convolutional neural network (CNN) is a type of feedforward neural network that learns features via filter (or kernel) optimization. This type of deep learning network has been applied to process and make predictions from many different type ...

has led to big improvements in acoustic modeling.T. Sainath ''et al.''., "Convolutional neural networks for LVCSR," ''ICASSP'', 2013.

Speech audio characteristics

Audio can be

encoded In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communication ...

at different

sampling rate In signal processing, sampling is the reduction of a continuous-time signal to a discrete-time signal. A common example is the conversion of a sound wave to a sequence of "samples". A sample is a value of the signal at a point in time and/or s ...

s (i.e. samples per second – the most common being: 8, 16, 32, 44.1, 48, and 96 kHz), and different bits per sample (the most common being: 8-bits, 16-bits, 24-bits or 32-bits). Speech recognition engines work best if the acoustic model they use was trained with speech audio which was recorded at the same sampling rate/bits per sample as the speech being recognized.

Telephony-based speech recognition

The limiting factor for

telephony Telephony ( ) is the field of technology involving the development, application, and deployment of telecommunications services for the purpose of electronic transmission of voice, fax, or data, between distant parties. The history of telephony is ...

based speech recognition is the bandwidth at which speech can be transmitted. For example, a standard land-line telephone only has a bandwidth of 64 kbit/s at a sampling rate of 8 kHz and 8-bits per sample (8000 samples per second * 8-bits per sample = 64000 bit/s). Therefore, for telephony based speech recognition, acoustic models should be trained with 8 kHz/8-bit speech audio files. In the case of

Voice over IP Voice over Internet Protocol (VoIP), also known as IP telephony, is a set of technologies used primarily for voice communication sessions over Internet Protocol (IP) networks, such as the Internet. VoIP enables voice calls to be transmitted as ...

, the

codec A codec is a computer hardware or software component that encodes or decodes a data stream or signal. ''Codec'' is a portmanteau of coder/decoder. In electronic communications, an endec is a device that acts as both an encoder and a decoder o ...

determines the sampling rate/bits per sample of speech transmission. Codecs with a higher sampling rate/bits per sample for speech transmission (which improve the sound quality) necessitate acoustic models trained with audio data that matches that sampling rate/bits per sample.

Desktop-based speech recognition

For speech recognition on a standard desktop PC, the limiting factor is the

sound card A sound card (also known as an audio card) is an internal expansion card that provides input and output of audio signals to and from a computer under the control of computer programs. The term ''sound card'' is also applied to external audio ...

. Most sound cards today can record at sampling rates of between 16 kHz-48 kHz of audio, with bit rates of 8 to 16-bits per sample, and playback at up to 96 kHz. As a general rule, a speech recognition engine works better with acoustic models trained with speech audio data recorded at higher sampling rates/bits per sample. But using audio with too high a sampling rate/bits per sample can slow the recognition engine down. A compromise is needed. Thus for desktop speech recognition, the current standard is acoustic models trained with speech audio data recorded at sampling rates of 16 kHz/16bits per sample.

References

{{reflist

External links

Japanese acoustic models
for the use with

Julius Julius may refer to: People * Julius (name), a masculine given name and surname (includes a list of people with the name) * Julius (nomen), the name of a Roman family (includes a list of Ancient Romans with the name) ** Julius Caesar (100– ...

open source acoustic models
at VoxForge
HTK WSJ acoustic models
for HTK Computational linguistics Speech recognition