The Viterbi algorithm is a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models (HMM).

The algorithm has found universal application in decoding the convolutional codes used in both CDMA and GSM digital cellular, dial-up modems, satellite, deep-space communications, and 802.11 wireless LANs. It is now also commonly used in speech recognition, speech synthesis, diarization, keyword spotting, computational linguistics, and bioinformatics. For example, in speech-to-text (speech recognition), the acoustic signal is treated as the observed sequence of events, and a string of text is considered to be the "hidden cause" of the acoustic signal. The Viterbi algorithm finds the most likely string of text given the acoustic signal.

History

The Viterbi algorithm is named after Andrew Viterbi, who proposed it in 1967 as a decoding algorithm for convolutional codes over noisy digital communication links. It has, however, a history of multiple invention, with at least seven independent discoveries, including those by Viterbi, Needleman and Wunsch, and Wagner and Fischer. It was introduced to natural language processing as a method of part-of-speech tagging as early as 1987.

''Viterbi path'' and ''Viterbi algorithm'' have become standard terms for the application of dynamic programming algorithms to maximization problems involving probabilities. For example, in statistical parsing a dynamic programming algorithm can be used to discover the single most likely context-free derivation (parse) of a string, which is commonly called the "Viterbi parse". Another application is in target tracking, where the track is computed that assigns a maximum likelihood to a sequence of observations.

Extensions

A generalization of the Viterbi algorithm, termed the ''max-sum algorithm'' (or ''max-product algorithm''), can be used to find the most likely assignment of all or some subset of latent variables in a large number of graphical models, e.g. Bayesian networks, Markov random fields and conditional random fields. The latent variables need, in general, to be connected in a way somewhat similar to a hidden Markov model (HMM), with a limited number of connections between variables and some type of linear structure among the variables. The general algorithm involves ''message passing'' and is substantially similar to the belief propagation algorithm (which is the generalization of the forward-backward algorithm).

With the algorithm called iterative Viterbi decoding, one can find the subsequence of an observation that best matches (on average) a given hidden Markov model. This algorithm was proposed by Qi Wang et al. to deal with turbo codes. Iterative Viterbi decoding works by iteratively invoking a modified Viterbi algorithm, reestimating the score for a filler until convergence.

An alternative algorithm, the Lazy Viterbi algorithm, has been proposed. For many applications of practical interest, under reasonable noise conditions, the lazy decoder (using the Lazy Viterbi algorithm) is much faster than the original Viterbi decoder (using the Viterbi algorithm). While the original Viterbi algorithm calculates every node in the trellis of possible outcomes, the Lazy Viterbi algorithm maintains a prioritized list of nodes to evaluate in order, and the number of calculations required is typically fewer (and never more) than the ordinary Viterbi algorithm for the same result. However, it is not so easy to parallelize in hardware.

Pseudocode

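This algorithm generates a path <math>X=(x_1,x_2,\ldots,x_T)</math>, which is a sequence of states <math>x_n \in S=\{s_1,s_2,\dots,s_K\}</math> that generate the observations <math>Y=(y_1,y_2,\ldots, y_T)</math> with <math>y_n \in O=\{o_1,o_2,\dots,o_N\}</math>, where <math>N</math> is the number of possible observations in the observation space <math>O</math>.

Two 2-dimensional tables of size <math>K \times T</math> are constructed:
* Each element <math>T_1[i,j]</math> of <math>T_1</math> stores the probability of the most likely path so far <math>\hat{X}=(\hat{x}_1,\hat{x}_2,\ldots,\hat{x}_j)</math> with <math>\hat{x}_j=s_i</math> that generates <math>Y=(y_1,y_2,\ldots, y_j)</math>.
* Each element <math>T_2[i,j]</math> of <math>T_2</math> stores <math>\hat{x}_{j-1}</math> of the most likely path so far <math>\hat{X}=(\hat{x}_1,\hat{x}_2,\ldots,\hat{x}_{j-1},\hat{x}_j = s_i)</math>, <math>\forall j, 2\leq j \leq T</math>.

The table entries <math>T_1[i,j]</math> and <math>T_2[i,j]</math> are filled by increasing order of <math>K\cdot j+i</math>:

:<math>T_1[i,j]=\max_{k}{(T_1[k,j-1]\cdot A_{ki}\cdot B_{iy_j})}</math>,
:<math>T_2[i,j]=\operatorname{argmax}_{k}{(T_1[k,j-1]\cdot A_{ki}\cdot B_{iy_j})}</math>,

with <math>A_{ki}</math> and <math>B_{iy_j}</math> as defined below. Note that <math>B_{iy_j}</math> does not need to appear in the latter expression, as it is non-negative and independent of <math>k</math> and thus does not affect the argmax.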

;Input:
* The observation space <math>O=\{o_1,o_2,\dots,o_N\}</math>,
* the state space <math>S=\{s_1,s_2,\dots,s_K\}</math>,
* an array of initial probabilities <math>\Pi = (\pi_1,\pi_2,\dots,\pi_K)</math> such that <math>\pi_i</math> stores the probability that <math>x_1 = s_i</math>,
* a sequence of observations <math>Y=(y_1,y_2,\ldots, y_T)</math> such that <math>y_t=o_i</math> if the observation at time <math>t</math> is <math>o_i</math>,
* transition matrix <math>A</math> of size <math>K\times K</math> such that <math>A_{ij}</math> stores the transition probability of transiting from state <math>s_i</math> to state <math>s_j</math>,
* emission matrix <math>B</math> of size <math>K\times N</math> such that <math>B_{ij}</math> stores the probability of observing <math>o_j</math> from state <math>s_i</math>.

;Output:
* The most likely hidden state sequence <math>X=(x_1,x_2,\ldots,x_T)</math>

 function VITERBI(O, S, Π, Y, A, B) : X
     for each state i = 1, 2, …, K do
         T_1[i,1] ← π_i · B_{i,y_1}
         T_2[i,1] ← 0
     end for
     for each observation j = 2, 3, …, T do
         for each state i = 1, 2, …, K do
             T_1[i,j] ← max_k (T_1[k,j-1] · A_{ki} · B_{i,y_j})
             T_2[i,j] ← argmax_k (T_1[k,j-1] · A_{ki} · B_{i,y_j})
         end for
     end for
     z_T ← argmax_k (T_1[k,T])
     x_T ← s_{z_T}
     for j = T, T-1, …, 2 do
         z_{j-1} ← T_2[z_j, j]
         x_{j-1} ← s_{z_{j-1}}
     end for
     return X
 end function
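The table-filling pseudocode above can be sketched in plain Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name `viterbi_tables`, the 0-based indexing, the use of integer indices for states and observations, and the numbers in the small model are all assumptions made for the example.

```python
def viterbi_tables(pi, A, B, y):
    """Return (path, T1) for a sequence y of observation indices.

    pi[i]   : initial probability of state i
    A[k][i] : probability of moving from state k to state i
    B[i][o] : probability of emitting observation o from state i
    """
    K, T = len(pi), len(y)
    T1 = [[0.0] * T for _ in range(K)]  # probability of the best path ending in state i at step j
    T2 = [[0] * T for _ in range(K)]    # back-pointer: best predecessor of state i at step j

    for i in range(K):
        T1[i][0] = pi[i] * B[i][y[0]]

    for j in range(1, T):
        for i in range(K):
            # B[i][y[j]] does not depend on k, so it is left out of the argmax
            best_k = max(range(K), key=lambda k: T1[k][j - 1] * A[k][i])
            T1[i][j] = T1[best_k][j - 1] * A[best_k][i] * B[i][y[j]]
            T2[i][j] = best_k

    # Backtrack from the most probable final state
    z = max(range(K), key=lambda k: T1[k][T - 1])
    path = [z]
    for j in range(T - 1, 0, -1):
        z = T2[z][j]
        path.insert(0, z)
    return path, T1


# Hypothetical two-state, three-symbol model (illustrative numbers only)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
path, T1 = viterbi_tables(pi, A, B, [0, 1, 2])
print(path)  # state indices of the most likely path
```

Returning the whole `T1` table alongside the path is a choice made here so that intermediate probabilities can be inspected; a production decoder would more likely work with log-probabilities to avoid underflow on long sequences.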

Restated in a succinct near-Python (here O is the sequence of observations, indexed by time step):

 function viterbi(O, S, Π, Tm, Em): best_path
     # Tm: transition matrix, Em: emission matrix
     trellis ← matrix(length(S), length(O))   # To hold probability of each state given each observation
     pointers ← matrix(length(S), length(O))  # To hold backpointer to best prior state
     for s in range(length(S)):               # Determine each hidden state's probability at time 0…
         trellis[s, 0] ← Π[s] · Em[s, O[0]]
     for o in range(1, length(O)):            # …and after, tracking each state's most likely prior state, k
         for s in range(length(S)):
             k ← argmax(k in trellis[k, o-1] · Tm[k, s] · Em[s, O[o]])
             trellis[s, o] ← trellis[k, o-1] · Tm[k, s] · Em[s, O[o]]
             pointers[s, o] ← k
     best_path ← list()
     k ← argmax(k in trellis[k, length(O)-1]) # Find k of best final state
     for o in range(length(O)-1, -1, -1):     # Backtrack from last observation
         best_path.insert(0, S[k])            # Insert previous state on most likely path
         k ← pointers[k, o]                   # Use backpointer to find best previous state
     return best_path

;Explanation:
Suppose we are given a hidden Markov model (HMM) with state space <math>S</math>, initial probabilities <math>\pi_i</math> of being in state <math>i</math> and transition probabilities <math>a_{i,j}</math> of transitioning from state <math>i</math> to state <math>j</math>. Say, we observe outputs <math>y_1,\dots, y_T</math>. The most likely state sequence <math>x_1,\dots,x_T</math> that produces the observations is given by the recurrence relations (Xing E, slide 11):

:<math>
\begin{align}
V_{1,k} &= \mathrm{P}\big( y_1 \mid k \big) \cdot \pi_k, \\
V_{t,k} &= \max_{x \in S} \left( \mathrm{P}\big( y_t \mid k \big) \cdot a_{x,k} \cdot V_{t-1,x}\right).
\end{align}
</math>

Here <math>V_{t,k}</math> is the probability of the most probable state sequence <math>\mathrm{P}\big(x_1,\dots,x_t,y_1,\dots, y_t\big)</math> responsible for the first <math>t</math> observations that have <math>k</math> as its final state. The Viterbi path can be retrieved by saving back pointers that remember which state <math>x</math> was used in the second equation. Let <math>\mathrm{Ptr}(k,t)</math> be the function that returns the value of <math>x</math> used to compute <math>V_{t,k}</math> if <math>t > 1</math>, or <math>k</math> if <math>t=1</math>. Then

:<math>
\begin{align}
x_T &= \arg\max_{x \in S} (V_{T,x}), \\
x_{t-1} &= \mathrm{Ptr}(x_t,t).
\end{align}
</math>

Here we use the standard definition of arg max.

The complexity of this implementation is <math>O(T\times\left|{S}\right|^2)</math>. A better estimation exists if the maximum in the internal loop is instead found by iterating only over states that directly link to the current state (i.e. there is an edge from <math>k</math> to <math>j</math>). Then using amortized analysis one can show that the complexity is <math>O(T\times(\left|{S}\right| + \left|{E}\right|))</math>, where <math>E</math> is the number of edges in the graph.
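The edge-restricted inner loop can be sketched as below. The function name `viterbi_sparse`, the `edges` adjacency format and the three-state left-to-right model are hypothetical, chosen only to illustrate the idea; the sketch also assumes that at least one state path has non-zero probability.

```python
def viterbi_sparse(states, start_p, edges, emit_p, obs):
    """Viterbi over a sparse transition graph.

    edges[k] is a dict {j: probability} listing only the transitions k -> j that exist.
    """
    # Build predecessor lists once, so each step visits every edge exactly once
    preds = {j: [] for j in states}
    for k, out in edges.items():
        for j, p in out.items():
            preds[j].append((k, p))

    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    ptr = [{s: None for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        ptr.append({})
        for j in states:
            best_k, best = None, 0.0
            for k, p in preds[j]:  # only states k with an edge k -> j
                cand = V[t - 1][k] * p
                if best_k is None or cand > best:
                    best_k, best = k, cand
            V[t][j] = best * emit_p[j][obs[t]]
            ptr[t][j] = best_k

    # Backtrack from the most probable final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, ptr[t][path[0]])
    return path, V[-1][last]


# Hypothetical left-to-right model: A -> {A, B}, B -> {B, C}, C -> {C}
states = ("A", "B", "C")
start_p = {"A": 1.0, "B": 0.0, "C": 0.0}
edges = {"A": {"A": 0.5, "B": 0.5}, "B": {"B": 0.5, "C": 0.5}, "C": {"C": 1.0}}
emit_p = {
    "A": {"x": 0.9, "y": 0.1},
    "B": {"x": 0.2, "y": 0.8},
    "C": {"x": 0.9, "y": 0.1},
}
sparse_path, sparse_prob = viterbi_sparse(states, start_p, edges, emit_p, ["x", "y", "x"])
print(sparse_path, sparse_prob)
```

Each time step does work proportional to the number of states plus the number of edges, rather than to the square of the number of states, which is where the saving for sparse models comes from.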

Example

Consider a village where all villagers are either healthy or have a fever, and only the village doctor can determine whether each has a fever. The doctor diagnoses fever by asking patients how they feel. The villagers may only answer that they feel normal, dizzy, or cold.

The doctor believes that the health condition of the patients operates as a discrete Markov chain. There are two states, "Healthy" and "Fever", but the doctor cannot observe them directly; they are ''hidden'' from the doctor. On each day, there is a certain chance that a patient will tell the doctor "I feel normal", "I feel cold", or "I feel dizzy", depending on the patient's health condition.

The ''observations'' (normal, cold, dizzy) along with a ''hidden'' state (healthy, fever) form a hidden Markov model (HMM), and can be represented as follows in the Python programming language:

 obs = ("normal", "cold", "dizzy")
 states = ("Healthy", "Fever")
 start_p = {"Healthy": 0.6, "Fever": 0.4}
 trans_p = {
     "Healthy": {"Healthy": 0.7, "Fever": 0.3},
     "Fever": {"Healthy": 0.4, "Fever": 0.6},
 }
 emit_p = {
     "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
     "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
 }

In this piece of code, start_p represents the doctor's belief about which state the HMM is in when the patient first visits (all the doctor knows is that the patient tends to be healthy). The particular probability distribution used here is not the equilibrium one, which is (given the transition probabilities) approximately {'Healthy': 0.57, 'Fever': 0.43}. The trans_p represents the change of the health condition in the underlying Markov chain. In this example, a patient who is healthy today has only a 30% chance of having a fever tomorrow. The emit_p represents how likely each possible observation (normal, cold, or dizzy) is, given the underlying condition (healthy or fever). A patient who is healthy has a 50% chance of feeling normal; one who has a fever has a 60% chance of feeling dizzy.

(Figure: an example of an HMM.)
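The equilibrium distribution mentioned above can be checked numerically with a short power iteration. This is only a sketch: the helper name `stationary` and the iteration count are arbitrary, and the transition values are the ones assumed for this example (the text fixes only the 30% Healthy-to-Fever probability explicitly).

```python
# Transition probabilities assumed for the village example
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}


def stationary(trans_p, iterations=200):
    """Approximate the equilibrium distribution by repeatedly applying the chain."""
    states = list(trans_p)
    dist = {s: 1.0 / len(states) for s in states}  # start from a uniform guess
    for _ in range(iterations):
        dist = {j: sum(dist[i] * trans_p[i][j] for i in states) for j in states}
    return dist


eq = stationary(trans_p)
print({s: round(p, 2) for s, p in eq.items()})  # {'Healthy': 0.57, 'Fever': 0.43}
```

For this two-state chain the exact values are 4/7 and 3/7, so the start distribution used in the example is close to, but not equal to, the equilibrium one.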

A patient visits three days in a row, and the doctor discovers that the patient feels normal on the first day, cold on the second day, and dizzy on the third day. The doctor has a question: what is the most likely sequence of health conditions of the patient that would explain these observations? This is answered by the Viterbi algorithm.

 def viterbi(obs, states, start_p, trans_p, emit_p):
     V = [{}]
     for st in states:
         V[0][st] = {"prob": start_p[st] * emit_p[st][obs[0]], "prev": None}
     # Run Viterbi when t > 0
     for t in range(1, len(obs)):
         V.append({})
         for st in states:
             max_tr_prob = V[t - 1][states[0]]["prob"] * trans_p[states[0]][st] * emit_p[st][obs[t]]
             prev_st_selected = states[0]
             for prev_st in states[1:]:
                 tr_prob = V[t - 1][prev_st]["prob"] * trans_p[prev_st][st] * emit_p[st][obs[t]]
                 if tr_prob > max_tr_prob:
                     max_tr_prob = tr_prob
                     prev_st_selected = prev_st
             max_prob = max_tr_prob
             V[t][st] = {"prob": max_prob, "prev": prev_st_selected}
 
     for line in dptable(V):
         print(line)
 
     opt = []
     max_prob = 0.0
     best_st = None
     # Get most probable state and its backtrack
     for st, data in V[-1].items():
         if data["prob"] > max_prob:
             max_prob = data["prob"]
             best_st = st
     opt.append(best_st)
     previous = best_st
 
     # Follow the backtrack till the first observation
     for t in range(len(V) - 2, -1, -1):
         opt.insert(0, V[t + 1][previous]["prev"])
         previous = V[t + 1][previous]["prev"]
 
     print("The steps of states are " + " ".join(opt) + " with highest probability of %s" % max_prob)
 
 def dptable(V):
     # Print a table of steps from dictionary
     yield " " * 5 + "     ".join(("%3d" % i) for i in range(len(V)))
     for state in V[0]:
         yield "%.7s: " % state + " ".join("%.7s" % ("%lf" % v[state]["prob"]) for v in V)

The function viterbi takes the following arguments: obs is the sequence of observations, e.g. ['normal', 'cold', 'dizzy']; states is the set of hidden states; start_p is the start probability; trans_p are the transition probabilities; and emit_p are the emission probabilities. For simplicity of code, we assume that the observation sequence obs is non-empty and that trans_p[i][j] and emit_p[i][j] are defined for all states i,j.

In the running example, the forward/Viterbi algorithm is used as follows:


viterbi(obs,
        states,
        start_p,
        trans_p,
        emit_p)



The output of the script is


$ python viterbi_example.py
         0          1          2
Healthy: 0.30000 0.08400 0.00588
Fever: 0.04000 0.02700 0.01512
The steps of states are Healthy Healthy Fever with highest probability of 0.01512


This reveals that the observations ['normal', 'cold', 'dizzy'] were most likely generated by states ['Healthy', 'Healthy', 'Fever']. In other words, given the observed activities, the patient was most likely to have been healthy on the first day and also on the second day (despite feeling cold that day), and only to have contracted a fever on the third day.
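Because the example has only two states and three observations, the result can be cross-checked by brute force: enumerate all eight state sequences and keep the one with the highest joint probability. The sketch below assumes the probability tables of the village example; the helper name `joint` is hypothetical.

```python
from itertools import product

obs = ("normal", "cold", "dizzy")
states = ("Healthy", "Fever")
# Probability tables assumed for the village example
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}


def joint(path):
    """Joint probability of a state sequence and the observed sequence."""
    p = start_p[path[0]] * emit_p[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans_p[path[t - 1]][path[t]] * emit_p[path[t]][obs[t]]
    return p


# Exhaustive search over all len(states) ** len(obs) candidate paths
best = max(product(states, repeat=len(obs)), key=joint)
print(best, joint(best))
```

The exhaustive search grows exponentially with the number of observations, which is exactly the cost the dynamic programming tables avoid; it is useful only as a sanity check on small inputs.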

The operation of Viterbi's algorithm can be visualized by means of a trellis diagram. The Viterbi path is essentially the shortest path through this trellis.
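The shortest-path view can be made concrete by giving each trellis edge the weight −log(probability) and running a standard Dijkstra search with a priority queue. The sketch below is an illustration under that assumption, using the village example's probability tables; the function name `viterbi_shortest_path` is hypothetical, and this is not the published Lazy Viterbi algorithm, although it evaluates nodes in a prioritized order in a similar spirit.

```python
import heapq
from math import exp, log


def viterbi_shortest_path(obs, states, start_p, trans_p, emit_p):
    """Find the Viterbi path as a shortest path with edge weights -log(probability)."""
    last_step = len(obs) - 1
    heap = []
    for i, s in enumerate(states):
        p = start_p[s] * emit_p[s][obs[0]]
        if p > 0:  # zero-probability edges are simply absent from the graph
            heapq.heappush(heap, (-log(p), 0, i, (s,)))
    done = set()  # trellis nodes (t, state index) whose shortest distance is final
    while heap:
        cost, t, i, path = heapq.heappop(heap)
        if (t, i) in done:
            continue
        done.add((t, i))
        if t == last_step:
            # The first final-layer node popped has the smallest total cost
            return list(path), exp(-cost)
        for j, s in enumerate(states):
            p = trans_p[states[i]][s] * emit_p[s][obs[t + 1]]
            if p > 0 and (t + 1, j) not in done:
                heapq.heappush(heap, (cost - log(p), t + 1, j, path + (s,)))
    return None, 0.0  # no path with non-zero probability


# Probability tables assumed for the village example
obs = ("normal", "cold", "dizzy")
states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
sp_path, sp_prob = viterbi_shortest_path(obs, states, start_p, trans_p, emit_p)
print(sp_path, sp_prob)
```

Since all probabilities are at most 1, every edge weight is non-negative, which is the condition Dijkstra's algorithm needs. Storing the whole path in each queue entry keeps the sketch short at the cost of extra memory; back-pointers would be the more economical choice.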

  Soft output Viterbi algorithm 

The soft output Viterbi algorithm (SOVA) is a variant of the classical Viterbi algorithm.

SOVA differs from the classical Viterbi algorithm in that it uses a modified path metric which takes into account the ''a priori probabilities'' of the input symbols, and produces a ''soft'' output indicating the ''reliability'' of the decision.

The first step in the SOVA is the selection of the survivor path, passing through one unique node at each time instant, ''t''. Since each node has 2 branches converging at it (with one branch being chosen to form the ''Survivor Path'', and the other being discarded), the difference in the branch metrics (or ''cost'') between the chosen and discarded branches indicates the ''amount of error'' in the choice.

This ''cost'' is accumulated over the entire sliding window (usually equal to ''at least'' five constraint lengths) to indicate the ''soft output'' measure of reliability of the ''hard bit decision'' of the Viterbi algorithm.

  See also 

* Expectation–maximization algorithm
* Baum–Welch algorithm
* Forward-backward algorithm
* Forward algorithm
* Error-correcting code
* Viterbi decoder
* Hidden Markov model
* Part-of-speech tagging
* A* search algorithm

  References 



  General references 

*  (note: the Viterbi decoding algorithm is described in section IV.) Subscription required.
* Rabiner, L. R. (February 1989). "A tutorial on hidden Markov models and selected applications in speech recognition". ''Proceedings of the IEEE''. 77 (2): 257–286. doi:10.1109/5.18626. (Describes the forward algorithm and Viterbi algorithm for HMMs.)
* Shinghal, R. and Godfried T. Toussaint, "Experiments in text recognition with the modified Viterbi algorithm," ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', vol. PAMI-1, April 1979, pp. 184–193.
* Shinghal, R. and Godfried T. Toussaint, "The sensitivity of the modified Viterbi algorithm to the source statistics," ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', vol. PAMI-2, March 1980, pp. 181–185.

  External links 

* Implementations in Java, F#, Clojure, C# on Wikibooks
* Tutorial on convolutional coding with Viterbi decoding, by Chip Fleming
* A tutorial for a Hidden Markov Model toolkit (implemented in C) that contains a description of the Viterbi algorithm
* Viterbi algorithm by Dr. Andrew J. Viterbi (scholarpedia.org)

  Implementations 


* Mathematica has an implementation as part of its support for stochastic processes
* Susa signal processing framework provides the C++ implementation for forward error correction codes and channel equalization
* C++
* C#
* Java
* Java 8
* Julia (HMMBase.jl)
* Perl
* Prolog
* Go
* SFIHMM includes code for Viterbi decoding
