The noisy channel model is a framework used in
spell checker In software, a spell checker (or spelling checker or spell check) is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic ...
s,
question answering
Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural l ...
,
speech recognition
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the ma ...
, and
machine translation
Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation or interactive translation), is a sub-field of computational linguistics that investigates t ...
.
In this model, the goal is to find the intended word given a word where the
letters have been scrambled in some manner.
In spell-checking
See Chapter B of.
Given an alphabet
, let
be the set
of all finite strings over
. Let the dictionary
of valid words be some subset of
, i.e.,
.
The noisy channel is the matrix
:
,
where
is the intended word and
is the scrambled word that was actually received.
The goal of the noisy channel model is to find the intended word given the
scrambled word that was received. The decision function
is a function that, given a scrambled word, returns
the intended word.
Methods of constructing a decision function include the
maximum likelihood rule, the
maximum a posteriori rule, and the
minimum distance rule.
In some cases, it may be better to accept the scrambled word as the intended
word rather than attempt to find an intended word in the dictionary. For
example, the word ''
schönfinkeling
In mathematics and computer science, currying is the technique of translating the evaluation of a function that takes multiple arguments into evaluating a sequence of functions, each with a single argument. For example, currying a function f th ...
'' may not be in the dictionary, but might
in fact be the intended word.
Example
Consider the English alphabet
. Some subset
makes up the dictionary of valid English
words.
There are several mistakes that may occur while typing, including:
# Missing letters, e.g., ' instead of ''letter''
# Accidental letter additions, e.g., ' instead of ''mistake''
# Swapping letters, e.g., ' instead of ''received''
# Replacing letters, e.g., ' instead of ''finite''
To construct the noisy channel matrix
, we must consider
the probability of each mistake, given the intended word
(
for all
and
). These probabilities may be gathered, for
example, by considering the
Damerau–Levenshtein distance In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein.) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Leve ...
between
and
or by comparing the draft of an essay with one that has
been manually edited for spelling.
In machine translation
See chapter 1, and chapter 25 of.
Suppose we want to translate a foreign language to English, we could model
directly: the probability that we have English sentence E given foreign sentence F, then we pick the most likely one
. However, by Bayes law, we have the equivalent equation:
The benefit of the noisy-channel model is in terms of data: If collecting a
parallel corpus
A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Libra ...
is costly, then we would have only a small parallel corpus, so we can only train a moderately good English-to-foreign translation model, and a moderately good foreign-to-English translation model. However, we can collect a large corpus in the foreign language only, and a large corpus in the English language only, to train two good language models. Combining these four models, we immediately get a good English-to-foreign translator and a good foreign-to-English translator.
The cost of noisy-channel model is that using Bayesian inference is more costly than using a translation model directly. Instead of reading out the most likely translation by
, it would have to read out predictions by both the translation model and the language model, multiply them, and search for the highest number.
In speech recognition
Speech recognition can be thought of as translating from a sound-language to a text-language. Consequently, we have
where
is the probability that a speech sound S is produced if the speaker is intending to say text T. Intuitively, this equation states that the most likely text is a text that's both a likely text in the language, and produces the speech sound with high probability.
The utility of the noisy-channel model is not in capacity. Theoretically, any noisy-channel model can be replicated by a direct
model. However, the noisy-channel model factors the model into two parts which are appropriate for the situation, and consequently it is generally more well-behaved.
When a human speaks, it does not produce the sound directly, but first produces the text it wants to speak in the language centers of the brain, then the text is translated into sound by the motor cortex, vocal cords, and other parts of the body. The noisy-channel model matches this model of the human, and so it is appropriate. This is justified in the practical success of noisy-channel model in speech recognition.
Example
Consider the sound-language sentence (written in
IPA for English
Like many other languages, English has wide variation in pronunciation, both historically and from dialect to dialect. In general, however, the regional dialects of English share a largely similar (but not identical) phonological system. Am ...
) S = ''aɪ wʊd laɪk wʌn tuː''. There are three possible texts
:
*
I would like one to.
*
I would like one too.
*
I would like one two.
that are equally likely, in the sense that
. With a good English language model, we would have
, since the second sentence is grammatical, the first is not quite, but close to a grammatical one (such as "I would like one to
o"), while the third one is far from grammatical.
Consequently, the noisy-channel model would output
as the best transcription.
See also
*
Coding theory
Coding theory is the study of the properties of codes and their respective fitness for specific applications. Codes are used for data compression, cryptography, error detection and correction, data transmission and data storage. Codes are stud ...
References
*
{{Refend
Automatic identification and data capture
Computational linguistics
Statistical natural language processing