In
computer science
Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
, Hirschberg's algorithm, named after its inventor,
Dan Hirschberg, is a
dynamic programming
Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics.
I ...
algorithm
In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
that finds the optimal
sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Ali ...
between two
string
String or strings may refer to:
*String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects
Arts, entertainment, and media Films
* ''Strings'' (1991 film), a Canadian anim ...
s. Optimality is measured with the
Levenshtein distance
In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-charact ...
, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space-efficient version of the
Needleman–Wunsch algorithm
The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul ...
that uses
divide and conquer. Hirschberg's algorithm is commonly used in
computational biology
Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has fo ...
to find maximal global alignments of
DNA and
protein
Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, respon ...
sequences.
Algorithm information
Hirschberg's algorithm is a generally applicable algorithm for optimal sequence alignment.
BLAST
Blast or The Blast may refer to:
*Explosion, a rapid increase in volume and release of energy in an extreme manner
*Detonation, an exothermic front accelerating through a medium that eventually drives a shock front
Film
* ''Blast'' (1997 film), ...
and
FASTA
FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.
History
The original FASTA program ...
are suboptimal
heuristics
A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, ...
. If ''x'' and ''y'' are strings, where length(''x'') = ''n'' and length(''y'') = ''m'', the
Needleman–Wunsch algorithm
The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul ...
finds an optimal alignment in
O(''nm'') time, using O(''nm'') space. Hirschberg's algorithm is a clever modification of the Needleman–Wunsch Algorithm, which still takes O(''nm'') time, but needs only O(min) space and is much faster in practice.
One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the
longest common subsequence
A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from the longest common substring: unlike substrings, subsequences are not required to occupy conse ...
between two sets of data such as with the common
diff
In computing, the utility diff is a data comparison tool that computes and displays the differences between the contents of files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but i ...
tool.
The Hirschberg algorithm can be derived from the Needleman–Wunsch algorithm by observing that:
# one can compute the optimal alignment score by only storing the current and previous row of the Needleman–Wunsch score matrix;
# if
is the optimal alignment of
, and
is an arbitrary partition of
, there exists a partition
of
such that
.
Algorithm description
denotes the ''i''-th character of
, where
.
denotes a substring of size
, ranging from the ''i''-th to the ''j''-th character of
.
is the reversed version of
.
and
are sequences to be aligned. Let
be a character from
, and
be a character from
. We assume that
,
and
are well defined integer-valued functions. These functions represent the cost of deleting
, inserting
, and replacing
with
, respectively.
We define
, which returns the last line of the Needleman–Wunsch score matrix
:
function NWScore(X, Y)
Score(0, 0) = 0 // 2 * (length(Y) + 1) array
for j = 1 to length(Y)
Score(0, j) = Score(0, j - 1) + Ins(Y
j)
for i = 1 to length(X) // Init array
Score(1, 0) = Score(0, 0) + Del(X
i)
for j = 1 to length(Y)
scoreSub = Score(0, j - 1) + Sub(X
i, Y
j)
scoreDel = Score(0, j) + Del(X
i)
scoreIns = Score(1, j - 1) + Ins(Y
j)
Score(1, j) = max(scoreSub, scoreDel, scoreIns)
end
// Copy Score
to Score
Score(0, :) = Score(1, :)
end
for j = 0 to length(Y)
LastLine(j) = Score(1, j)
return LastLine
Note that at any point,
only requires the two most recent rows of the score matrix. Thus,
is implemented in
space.
The Hirschberg algorithm follows:
function Hirschberg(X, Y)
Z = ""
W = ""
if length(X)
0
for i = 1 to length(Y)
Z = Z + '-'
W = W + Y
i
end
else if length(Y)
0
for i = 1 to length(X)
Z = Z + X
i
W = W + '-'
end
else if length(X)
1 or length(Y)
1
(Z, W) = NeedlemanWunsch(X, Y)
else
xlen = length(X)
xmid = length(X) / 2
ylen = length(Y)
ScoreL = NWScore(X
1:xmid, Y)
ScoreR = NWScore(rev(X
xmid+1:xlen), rev(Y))
ymid =
arg max
In mathematics, the arguments of the maxima (abbreviated arg max or argmax) are the points, or elements, of the domain of some function at which the function values are maximized.For clarity, we refer to the input (''x'') as ''points'' and th ...
ScoreL + rev(ScoreR)
(Z,W) = Hirschberg(X
1:xmid, y
1:ymid) + Hirschberg(X
xmid+1:xlen, Y
ymid+1:ylen)
end
return (Z, W)
In the context of observation (2), assume that
is a partition of
. Index
is computed such that
and
.
Example
Let
The optimal alignment is given by
W = AGTACGCA
Z = --TATGC-
Indeed, this can be verified by backtracking its corresponding Needleman–Wunsch matrix:
T A T G C
0 -2 -4 -6 -8 -10
A -2 -1 0 -2 -4 -6
G -4 -3 -2 -1 0 -2
T -6 -2 -4 0 -2 -1
A -8 -4 0 -2 -1 -3
C -10 -6 -2 -1 -3 1
G -12 -8 -4 -3 1 -1
C -14 -10 -6 -5 -1 3
A -16 -12 -8 -7 -3 1
One starts with the top level call to
, which splits the first argument in half:
. The call to
produces the following matrix:
T A T G C
0 -2 -4 -6 -8 -10
A -2 -1 0 -2 -4 -6
G -4 -3 -2 -1 0 -2
T -6 -2 -4 0 -2 -1
A -8 -4 0 -2 -1 -3
Likewise,
generates the following matrix:
C G T A T
0 -2 -4 -6 -8 -10
A -2 -1 -3 -5 -4 -6
C -4 0 -2 -4 -6 -5
G -6 -2 2 0 -2 -4
C -8 -4 0 1 -1 -3
Their last lines (after reversing the latter) and sum of those are respectively
ScoreL =
-8 -4 0 -2 -1 -3 rev(ScoreR) =
-3 -1 1 0 -4 -8 Sum =
11 -5 1 -2 -5 -11
The maximum (shown in bold) appears at
ymid = 2
, producing the partition
.
The entire Hirschberg recursion (which we omit for brevity) produces the following tree:
(AGTACGCA,TATGC)
/ \
(AGTA,TA) (CGCA,TGC)
/ \ / \
(AG, ) (TA,TA) (CG,TG) (CA,C)
/ \ / \
(T,T) (A,A) (C,T) (G,G)
The leaves of the tree contain the optimal alignment.
See also
*
Longest common subsequence
A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from the longest common substring: unlike substrings, subsequences are not required to occupy conse ...
References
{{DEFAULTSORT:Hirschberg's Algorithm
Sequence alignment algorithms
Bioinformatics algorithms
Articles with example pseudocode
Dynamic programming