The Viterbi algorithm is a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models (HMM).
The algorithm has found universal application in decoding the convolutional codes used in both CDMA and GSM digital cellular, dial-up modems, satellite, deep-space communications, and 802.11 wireless LANs. It is now also commonly used in speech recognition, speech synthesis, diarization, keyword spotting, computational linguistics, and bioinformatics. For example, in speech-to-text (speech recognition), the acoustic signal is treated as the observed sequence of events, and a string of text is considered to be the "hidden cause" of the acoustic signal. The Viterbi algorithm finds the most likely string of text given the acoustic signal.
History
The Viterbi algorithm is named after Andrew Viterbi, who proposed it in 1967 as a decoding algorithm for convolutional codes over noisy digital communication links. It has, however, a history of multiple invention, with at least seven independent discoveries, including those by Viterbi, Needleman and Wunsch, and Wagner and Fischer. It was introduced to natural language processing as a method of part-of-speech tagging as early as 1987.
''Viterbi path'' and ''Viterbi algorithm'' have become standard terms for the application of dynamic programming algorithms to maximization problems involving probabilities. For example, in statistical parsing a dynamic programming algorithm can be used to discover the single most likely context-free derivation (parse) of a string, which is commonly called the "Viterbi parse". Another application is in target tracking, where the track is computed that assigns a maximum likelihood to a sequence of observations.
Extensions
A generalization of the Viterbi algorithm, termed the ''max-sum algorithm'' (or ''max-product algorithm''), can be used to find the most likely assignment of all or some subset of latent variables in a large number of graphical models, e.g. Bayesian networks, Markov random fields and conditional random fields. The latent variables need, in general, to be connected in a way somewhat similar to a hidden Markov model (HMM), with a limited number of connections between variables and some type of linear structure among the variables. The general algorithm involves ''message passing'' and is substantially similar to the belief propagation algorithm (which is the generalization of the forward-backward algorithm).
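On a chain-structured model, the max-sum form amounts to running the Viterbi recurrence on log-probabilities, so that products become sums. The following is a minimal sketch under that assumption. The names `max_sum_chain` and `to_log` and the dictionary-based parameter layout are illustrative choices, not part of any referenced library, and all probabilities are assumed to be strictly positive so that their logarithms exist.

```python
import math


def max_sum_chain(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence on a chain, using sums of log-probabilities."""
    # message[s]: best log-score of any path that ends in state s at the current step
    message = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        new_message, pointers = {}, {}
        for s in states:
            # Maximizing over predecessors replaces the summation used in sum-product.
            best_prev = max(states, key=lambda k: message[k] + log_trans[k][s])
            new_message[s] = message[best_prev] + log_trans[best_prev][s] + log_emit[s][o]
            pointers[s] = best_prev
        message = new_message
        backpointers.append(pointers)
    # Pick the best final state, then follow the stored pointers backwards.
    last = max(states, key=lambda s: message[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.insert(0, pointers[path[0]])
    return path, message[last]


def to_log(table):
    """Hypothetical helper: convert a (possibly nested) dict of probabilities to logs.

    Assumes every probability is > 0; math.log raises ValueError otherwise.
    """
    return {k: to_log(v) if isinstance(v, dict) else math.log(v) for k, v in table.items()}
```

The function returns the path together with its log-score; exponentiating the score recovers the joint probability of the path and the observations. Working in the log domain also avoids numerical underflow on long sequences.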
With the algorithm called iterative Viterbi decoding, one can find the subsequence of an observation that matches best (on average) to a given hidden Markov model. This algorithm was proposed by Qi Wang et al. to deal with turbo codes. Iterative Viterbi decoding works by iteratively invoking a modified Viterbi algorithm, reestimating the score for a filler until convergence.
An alternative algorithm, the Lazy Viterbi algorithm, has been proposed. For many applications of practical interest, under reasonable noise conditions, the lazy decoder (using the Lazy Viterbi algorithm) is much faster than the original Viterbi decoder (using the Viterbi algorithm). While the original Viterbi algorithm calculates every node in the trellis of possible outcomes, the Lazy Viterbi algorithm maintains a prioritized list of nodes to evaluate in order, and the number of calculations required is typically fewer (and never more) than the ordinary Viterbi algorithm for the same result. However, it is not so easy to parallelize in hardware.
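The idea of evaluating trellis nodes in priority order can be illustrated with a best-first search. The sketch below is not the published Lazy Viterbi decoder. It is a plain Dijkstra-style search over an HMM trellis, assuming branch costs are negative log-probabilities (and therefore non-negative). The name `best_first_viterbi` and the dictionary-based model layout are hypothetical.

```python
import heapq
import math


def best_first_viterbi(obs, states, start_p, trans_p, emit_p):
    """Dijkstra-style trellis search; returns (path, probability, nodes_expanded)."""

    def cost(p):
        # Zero-probability branches get infinite cost and are never queued.
        return -math.log(p) if p > 0 else math.inf

    # Heap entries: (accumulated cost, time index, state index, path so far).
    # Carrying the path in each entry keeps the sketch short at the price of memory.
    heap = []
    for i, s in enumerate(states):
        c = cost(start_p[s] * emit_p[s][obs[0]])
        if c < math.inf:
            heapq.heappush(heap, (c, 0, i, (s,)))

    done = set()  # trellis nodes (t, state index) whose best cost is final
    expanded = 0
    while heap:
        c, t, i, path = heapq.heappop(heap)
        if (t, i) in done:
            continue
        done.add((t, i))
        expanded += 1
        if t == len(obs) - 1:
            # With non-negative costs, the first final-step node popped is optimal.
            return list(path), math.exp(-c), expanded
        s = states[i]
        for j, nxt in enumerate(states):
            step = cost(trans_p[s][nxt] * emit_p[nxt][obs[t + 1]])
            if step < math.inf and (t + 1, j) not in done:
                heapq.heappush(heap, (c + step, t + 1, j, path + (nxt,)))
    return None, 0.0, expanded
```

The third return value counts how many trellis nodes were actually expanded. It can never exceed the number of states times the number of observations, which is what the exhaustive algorithm evaluates.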
Pseudocode
This algorithm generates a path X=(x_1,x_2,\ldots,x_T), which is a sequence of states x_n \in S=\{s_1,s_2,\dots,s_K\} that generate the observations Y=(y_1,y_2,\ldots,y_T) with y_n \in O=\{o_1,o_2,\dots,o_N\}, where N is the number of possible observations in the observation space O.
Two 2-dimensional tables of size K\times T are constructed:
* Each element T_1[i,j] of T_1 stores the probability of the most likely path so far that ends in state s_i and generates the observations y_1,\ldots,y_j.
* Each element T_2[i,j] of T_2 stores the index of the state that precedes s_i on that path, for 2\leq j\leq T.
;Input
* The observation space O=\{o_1,o_2,\dots,o_N\},
* the state space S=\{s_1,s_2,\dots,s_K\},
* an array of initial probabilities \Pi = (\pi_1,\pi_2,\dots,\pi_K) such that \pi_i stores the probability that x_1 = s_i ,
* a sequence of observations Y=(y_1,y_2,\ldots, y_T) such that y_t=o_i if the observation at time t is o_i ,
* transition matrix A of size K\times K such that A_{ij} stores the transition probability of transiting from state s_i to state s_j,
* emission matrix B of size K\times N such that B_{ij} stores the probability of observing o_j from state s_i.
;Output
* The most likely hidden state sequence X=(x_1,x_2,\ldots,x_T)
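function ''VITERBI''(O,S,\Pi,Y,A,B):X
    for each state i=1,2,\ldots,K do
        T_1[i,1]\leftarrow\pi_i\cdot B_{iy_1}
        T_2[i,1]\leftarrow 0
    end for
    for each observation j=2,3,\ldots,T do
        for each state i=1,2,\ldots,K do
            T_1[i,j]\leftarrow\max_{k}(T_1[k,j-1]\cdot A_{ki}\cdot B_{iy_j})
            T_2[i,j]\leftarrow\arg\max_{k}(T_1[k,j-1]\cdot A_{ki}\cdot B_{iy_j})
        end for
    end for
    z_T\leftarrow\arg\max_{k}(T_1[k,T])
    x_T\leftarrow s_{z_T}
    for j=T,T-1,\ldots,2 do
        z_{j-1}\leftarrow T_2[z_j,j]
        x_{j-1}\leftarrow s_{z_{j-1}}
    end for
    return X
end function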
Restated in a succinct near-Python:
function ''viterbi''(O, S, \Pi, Tm, Em): best\_path    # Tm: transition matrix, Em: emission matrix
    trellis \leftarrow matrix(length(S), length(O))    # To hold probability of each state given each observation
    pointers \leftarrow matrix(length(S), length(O))    # To hold backpointer to best prior state
    for s in range(length(S)):    # Determine each hidden state's probability at time 0…
        trellis[s, 0] \leftarrow \Pi[s] \cdot Em[s, O[0]]
    for o in range(1, length(O)):    # …and after, tracking each state's most likely prior state, k
        for s in range(length(S)):
            k \leftarrow \arg\max(k\ \mathsf{in}\ trellis[k, o-1] \cdot Tm[k, s] \cdot Em[s, O[o]])
            trellis[s, o] \leftarrow trellis[k, o-1] \cdot Tm[k, s] \cdot Em[s, O[o]]
            pointers[s, o] \leftarrow k
    best\_path \leftarrow list()
    k \leftarrow \arg\max(k\ \mathsf{in}\ trellis[k, length(O)-1])    # Find k of best final state
    for o in range(length(O)-1, -1, -1):    # Backtrack from last observation
        best\_path.insert(0, S[k])    # Insert previous state on most likely path
        k \leftarrow pointers[k, o]    # Use backpointer to find best previous state
    return best\_path
;Explanation:
Suppose we are given a hidden Markov model (HMM) with state space S, initial probabilities \pi_i of being in state i and transition probabilities a_{i,j} of transitioning from state i to state j. Say we observe outputs y_1,\dots,y_T. The most likely state sequence x_1,\dots,x_T that produces the observations is given by the recurrence relations[Xing E, slide 11.]
\begin{align}
V_{1,k} &= \mathrm{P}\big( y_1 \mid k \big) \cdot \pi_k, \\
V_{t,k} &= \max_{x \in S} \left( \mathrm{P}\big( y_t \mid k \big) \cdot a_{x,k} \cdot V_{t-1,x}\right).
\end{align}
Here V_{t,k} is the probability of the most probable state sequence \mathrm{P}\big(x_1,\dots,x_t,y_1,\dots,y_t\big) responsible for the first t observations that has k as its final state. The Viterbi path can be retrieved by saving back pointers that remember which state x was used in the second equation. Let \mathrm{Ptr}(k,t) be the function that returns the value of x used to compute V_{t,k} if t > 1, or k if t=1. Then
\begin{align}
x_T &= \arg\max_{x \in S} (V_{T,x}), \\
x_{t-1} &= \mathrm{Ptr}(x_t,t).
\end{align}
Here we are using the standard definition of arg max.
The complexity of this implementation is O(T\times\left|S\right|^2). A tighter bound exists if the maximum in the internal loop is instead found by iterating only over states that directly link to the current state (i.e. there is an edge from k to j). Then using amortized analysis one can show that the complexity is O(T\times(\left|S\right| + \left|E\right|)), where \left|E\right| is the number of edges in the graph.
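The edge-based bound can be made concrete by storing, for each state, only those predecessors that have a non-zero transition probability. The following is a minimal sketch, assuming the model is given as nested dictionaries in which omitted transitions and emissions have probability zero. The name `viterbi_sparse` is hypothetical.

```python
def viterbi_sparse(obs, states, start_p, trans_p, emit_p):
    """Viterbi that iterates over explicit edges only: O(T * (|S| + |E|)) work."""
    # predecessors[j] lists (i, a_ij) for every edge i -> j present in trans_p.
    predecessors = {s: [] for s in states}
    for i, row in trans_p.items():
        for j, p in row.items():
            if p > 0:
                predecessors[j].append((i, p))

    V = [{s: start_p.get(s, 0.0) * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = []
    for o in obs[1:]:
        scores, pointers = {}, {}
        for j in states:
            best_i, best = None, 0.0
            # Across all j, this inner loop touches each edge once per time step.
            for i, p in predecessors[j]:
                cand = V[-1][i] * p
                if cand > best:
                    best_i, best = i, cand
            scores[j] = best * emit_p[j].get(o, 0.0)
            pointers[j] = best_i
        V.append(scores)
        back.append(pointers)

    last = max(states, key=lambda s: V[-1][s])
    if V[-1][last] == 0.0:
        return None, 0.0  # no state sequence explains the observations
    path = [last]
    for pointers in reversed(back):
        path.insert(0, pointers[path[0]])
    return path, V[-1][last]


# Illustrative left-to-right model (made-up numbers): 3 states, only 5 edges.
states = ("A", "B", "C")
start_p = {"A": 1.0}
trans_p = {"A": {"A": 0.5, "B": 0.5}, "B": {"B": 0.5, "C": 0.5}, "C": {"C": 1.0}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}, "C": {"x": 0.5, "y": 0.5}}
path, prob = viterbi_sparse(("x", "y", "y"), states, start_p, trans_p, emit_p)
print(path)  # prints ['A', 'B', 'B']
```

For a fully connected model the number of edges equals the square of the number of states, and the bound reduces to the quadratic one. The saving appears for sparse topologies such as the left-to-right chain used here.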
Example
Consider a village where all villagers are either healthy or have a fever, and only the village doctor can determine whether each has a fever. The doctor diagnoses fever by asking patients how they feel. The villagers may only answer that they feel normal, dizzy, or cold.
The doctor believes that the health condition of the patients operates as a discrete Markov chain. There are two states, "Healthy" and "Fever", but the doctor cannot observe them directly; they are ''hidden'' from the doctor. On each day, there is a certain chance that a patient will tell the doctor "I feel normal", "I feel cold", or "I feel dizzy", depending on the patient's health condition.
The ''observations'' (normal, cold, dizzy) along with a ''hidden'' state (healthy, fever) form a hidden Markov model (HMM), and can be represented as follows in the Python programming language:
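obs = ("normal", "cold", "dizzy")
states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}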
In this piece of code, start_p represents the doctor's belief about which state the HMM is in when the patient first visits (all the doctor knows is that the patient tends to be healthy). The particular probability distribution used here is not the equilibrium one, which is (given the transition probabilities) approximately {'Healthy': 0.57, 'Fever': 0.43}. The trans_p represents the change of the health condition in the underlying Markov chain. In this example, a patient who is healthy today has only a 30% chance of having a fever tomorrow. The emit_p represents how likely each possible observation (normal, cold, or dizzy) is, given the underlying condition (healthy or fever). A patient who is healthy has a 50% chance of feeling normal; one who has a fever has a 60% chance of feeling dizzy.
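The equilibrium distribution mentioned above can be checked numerically. The sketch below uses simple power iteration and assumes the transition probabilities of this example: 0.3 from Healthy to Fever, as stated in the text, and 0.4 from Fever to Healthy. They are restated inside the snippet so that it runs on its own. The helper name `stationary_distribution` is hypothetical.

```python
def stationary_distribution(trans_p, iterations=200):
    """Approximate the equilibrium distribution of a Markov chain by power iteration."""
    states = list(trans_p)
    dist = {s: 1.0 / len(states) for s in states}  # start from a uniform guess
    for _ in range(iterations):
        # One step of the chain: new mass in j is the mass flowing in from every i.
        dist = {j: sum(dist[i] * trans_p[i][j] for i in states) for j in states}
    return dist


# Assumed example values, repeated here to keep the snippet self-contained.
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
equilibrium = stationary_distribution(trans_p)
print({s: round(p, 2) for s, p in equilibrium.items()})
# prints {'Healthy': 0.57, 'Fever': 0.43}
```

For a two-state chain the same result follows in closed form: the equilibrium probability of Healthy is 0.4 / (0.3 + 0.4) = 4/7.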
A patient visits three days in a row, and the doctor discovers that the patient feels normal on the first day, cold on the second day, and dizzy on the third day. The doctor has a question: what is the most likely sequence of health conditions of the patient that would explain these observations? This is answered by the Viterbi algorithm.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][st]["prob"] is the probability of the best path ending in st at time t;
    # V[t][st]["prev"] is the preceding state on that path.
    V = [{}]
    for st in states:
        V[0][st] = {"prob": start_p[st] * emit_p[st][obs[0]], "prev": None}
    # Run Viterbi when t > 0
    for t in range(1, len(obs)):
        V.append({})
        for st in states:
            max_tr_prob = V[t - 1][states[0]]["prob"] * trans_p[states[0]][st] * emit_p[st][obs[t]]
            prev_st_selected = states[0]
            for prev_st in states[1:]:
                tr_prob = V[t - 1][prev_st]["prob"] * trans_p[prev_st][st] * emit_p[st][obs[t]]
                if tr_prob > max_tr_prob:
                    max_tr_prob = tr_prob
                    prev_st_selected = prev_st
            max_prob = max_tr_prob
            V[t][st] = {"prob": max_prob, "prev": prev_st_selected}
    for line in dptable(V):
        print(line)
    opt = []
    max_prob = 0.0
    best_st = None
    # Get most probable state and its backtrack
    for st, data in V[-1].items():
        if data["prob"] > max_prob:
            max_prob = data["prob"]
            best_st = st
    opt.append(best_st)
    previous = best_st
    # Follow the backtrack till the first observation
    for t in range(len(V) - 2, -1, -1):
        opt.insert(0, V[t + 1][previous]["prev"])
        previous = V[t + 1][previous]["prev"]
    print("The steps of states are " + " ".join(opt) + " with highest probability of %s" % max_prob)

def dptable(V):
    # Print a table of steps from dictionary
    yield " " * 5 + "     ".join(("%3d" % i) for i in range(len(V)))
    for state in V[0]:
        yield "%.7s: " % state + " ".join("%.7s" % ("%f" % v[state]["prob"]) for v in V)
The function viterbi takes the following arguments: obs is the sequence of observations, e.g. ['normal', 'cold', 'dizzy']; states is the set of hidden states; start_p is the start probability; trans_p are the transition probabilities; and emit_p are the emission probabilities. For simplicity of code, we assume that the observation sequence obs is non-empty and that trans_p[i][j] and emit_p[i][o] are defined for all states i, j and all observations o.
In the running example, the forward/Viterbi algorithm is used as follows:
viterbi(obs,
states,
start_p,
trans_p,
emit_p)
The output of the script is
$ python viterbi_example.py
0 1 2
Healthy: 0.30000 0.08400 0.00588
Fever: 0.04000 0.02700 0.01512
The steps of states are Healthy Healthy Fever with highest probability of 0.01512
This reveals that the observations ['normal', 'cold', 'dizzy'] were most likely generated by states ['Healthy', 'Healthy', 'Fever']. In other words, given the reported symptoms, the patient was most likely to have been healthy on the first day and also on the second day (despite feeling cold that day), and only to have contracted a fever on the third day.
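For a model this small, the result can be cross-checked by enumerating all 2^3 = 8 possible state sequences. The sketch below assumes the example's parameters, restated so that it runs on its own. The helper name `brute_force_best_path` is hypothetical, and because its cost grows exponentially with the sequence length it is only useful as a test oracle.

```python
from itertools import product


def brute_force_best_path(obs, states, start_p, trans_p, emit_p):
    """Score every state sequence jointly with the observations and keep the best."""
    best_path, best_prob = None, -1.0
    for path in product(states, repeat=len(obs)):
        prob = start_p[path[0]] * emit_p[path[0]][obs[0]]
        for t in range(1, len(obs)):
            prob *= trans_p[path[t - 1]][path[t]] * emit_p[path[t]][obs[t]]
        if prob > best_prob:
            best_path, best_prob = list(path), prob
    return best_path, best_prob


# Assumed example values, repeated here to keep the snippet self-contained.
obs = ("normal", "cold", "dizzy")
states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

path, prob = brute_force_best_path(obs, states, start_p, trans_p, emit_p)
print(path)  # prints ['Healthy', 'Healthy', 'Fever']
```

The winning probability is the product 0.6 · 0.5 · 0.7 · 0.4 · 0.3 · 0.6 = 0.01512, up to floating-point rounding. This agrees with the last entry of the Fever row in the table.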
The operation of Viterbi's algorithm can be visualized by means of a trellis diagram. The Viterbi path is essentially the shortest path through this trellis.
Soft output Viterbi algorithm
The soft output Viterbi algorithm (SOVA) is a variant of the classical Viterbi algorithm.
SOVA differs from the classical Viterbi algorithm in that it uses a modified path metric which takes into account the ''a priori probabilities'' of the input symbols, and produces a ''soft'' output indicating the ''reliability'' of the decision.
The first step in the SOVA is the selection of the survivor path, passing through one unique node at each time instant, ''t''. Since each node has 2 branches converging at it (with one branch being chosen to form the ''Survivor Path'', and the other being discarded), the difference in the branch metrics (or ''cost'') between the chosen and discarded branches indicates the ''amount of error'' in the choice.
This ''cost'' is accumulated over the entire sliding window (usually ''at least'' five constraint lengths long), to indicate the ''soft output'' measure of reliability of the ''hard bit decision'' of the Viterbi algorithm.
See also
* Expectation–maximization algorithm
* Baum–Welch algorithm
* Forward-backward algorithm
* Forward algorithm
* Error-correcting code
* Viterbi decoder
* Hidden Markov model
* Part-of-speech tagging
* A* search algorithm
References
General references
* Rabiner, L. R., "A tutorial on hidden Markov models and selected applications in speech recognition," ''Proceedings of the IEEE'', vol. 77, no. 2, February 1989, pp. 257–286. doi:10.1109/5.18626. (Describes the forward algorithm and Viterbi algorithm for HMMs.)
* Shinghal, R. and Godfried T. Toussaint, "Experiments in text recognition with the modified Viterbi algorithm," ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', vol. PAMI-1, April 1979, pp. 184–193.
* Shinghal, R. and Godfried T. Toussaint, "The sensitivity of the modified Viterbi algorithm to the source statistics," ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', vol. PAMI-2, March 1980, pp. 181–185.
External links
* Implementations in Java, F#, Clojure, C# on Wikibooks
* Tutorial on convolutional coding with Viterbi decoding, by Chip Fleming
* A tutorial for a Hidden Markov Model toolkit (implemented in C) that contains a description of the Viterbi algorithm
* Viterbi algorithm by Dr. Andrew J. Viterbi (scholarpedia.org).
Implementations
* Mathematica has an implementation as part of its support for stochastic processes
* Susa signal processing framework provides a C++ implementation for forward error correction codes and channel equalization
* C++
* C#
* Java
* Java 8
* Julia (HMMBase.jl)
* Perl
* Prolog
* Go
* SFIHMM includes code for Viterbi decoding.