Chvátal–Sankoff Constants
   HOME

TheInfoList



OR:

In mathematics, the Chvátal–Sankoff constants are
mathematical constant A mathematical constant is a key number whose value is fixed by an unambiguous definition, often referred to by a symbol (e.g., an alphabet letter), or by mathematicians' names to facilitate using it across multiple mathematical problems. Cons ...
s that describe the lengths of
longest common subsequence A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from the longest common substring: unlike substrings, subsequences are not required to occupy conse ...
s of
random In common usage, randomness is the apparent or actual lack of pattern or predictability in events. A random sequence of events, symbols or steps often has no order and does not follow an intelligible pattern or combination. Individual rando ...
strings. Although the existence of these constants has been proven, their exact values are unknown. They are named after
Václav Chvátal Václav (Vašek) Chvátal () is a Professor Emeritus in the Department of Computer Science and Software Engineering at Concordia University in Montreal, Quebec, Canada and a Visiting Professor at Charles University in Prague. He has published ext ...
and
David Sankoff David Sankoff (born December 31, 1942) is a Canadian mathematician, bioinformatician, computer scientist and linguist. He holds the Canada Research Chair in Mathematical Genomics in the Mathematics and Statistics Department at the University o ...
, who began investigating them in the mid-1970s... There is one Chvátal–Sankoff constant \gamma_k for each positive integer ''k'', where ''k'' is the number of characters in the alphabet from which the random strings are drawn. The sequence of these numbers grows inversely proportionally to the square root of ''k''. However, some authors write "the Chvátal–Sankoff constant" to refer to \gamma_2, the constant defined in this way for the binary alphabet..


Background

A common subsequence of two strings ''S'' and ''T'' is a string whose characters appear in the same order (not necessarily consecutively) both in ''S'' and in ''T''. The problem of computing a
longest common subsequence A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from the longest common substring: unlike substrings, subsequences are not required to occupy conse ...
has been well studied in computer science. It can be solved in
polynomial time In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...
by
dynamic programming Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics. I ...
; this basic algorithm has additional speedups for small alphabets (the
Method of Four Russians In computer science, the Method of Four Russians is a technique for speeding up algorithms involving Boolean matrices, or more generally algorithms involving matrices in which each cell may take on only a bounded number of possible values. Idea ...
), for strings with few differences, for strings with few matching pairs of characters, etc. This problem and its generalizations to more complex forms of
edit distance In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to t ...
have important applications in areas that include
bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combin ...
(in the comparison of DNA and
protein sequence Protein primary structure is the linear sequence of amino acids in a peptide or protein. By convention, the primary structure of a protein is reported starting from the amino-terminal (N) end to the carboxyl-terminal (C) end. Protein biosynthesi ...
s and the reconstruction of
evolutionary tree A phylogenetic tree (also phylogeny or evolutionary tree Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA.) is a branching diagram or a tree showing the evolutionary relationships among various biological s ...
s),
geology Geology () is a branch of natural science concerned with Earth and other astronomical objects, the features or rocks of which it is composed, and the processes by which they change over time. Modern geology significantly overlaps all other Ea ...
(in
stratigraphy Stratigraphy is a branch of geology concerned with the study of rock layers ( strata) and layering (stratification). It is primarily used in the study of sedimentary and layered volcanic rocks. Stratigraphy has three related subfields: lithost ...
), and
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
(in
data comparison In computing, file comparison is the calculation and display of the differences and similarities between data objects, typically text files such as source code. The methods, implementations, and results are typically called a diff, after the Un ...
and
revision control In software engineering, version control (also known as revision control, source control, or source code management) is a class of systems responsible for managing changes to computer programs, documents, large web sites, or other collections o ...
).. One motivation for studying the longest common subsequences of random strings, given already by Chvátal and Sankoff, is to calibrate the computations of longest common subsequences on strings that are not random. If such a computation returns a subsequence that is significantly longer than what would be obtained at random, one might infer from this result that the match is meaningful or significant.


Definition and existence

The Chvátal–Sankoff constants describe the behavior of the following random process. Given parameters ''n'' and ''k'', choose two length-''n'' strings ''S'' and ''T'' from the same ''k''-symbol alphabet, with each character of each string chosen uniformly at random, independently of all the other characters. Compute a longest common subsequence of these two strings, and let \lambda_ be the
random variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the p ...
whose value is the length of this subsequence. Then the
expected value In probability theory, the expected value (also called expectation, expectancy, mathematical expectation, mean, average, or first moment) is a generalization of the weighted average. Informally, the expected value is the arithmetic mean of a ...
of \lambda_ is (up to lower-order terms) proportional to ''n'', and the ''k''th Chvátal–Sankoff constant \gamma_k is the
constant of proportionality In mathematics, two sequences of numbers, often experimental data, are proportional or directly proportional if their corresponding elements have a constant ratio, which is called the coefficient of proportionality or proportionality consta ...
. More precisely, the expected value \operatorname
lambda_ Lambda (}, ''lám(b)da'') is the 11th letter of the Greek alphabet, representing the voiced alveolar lateral approximant . In the system of Greek numerals, lambda has a value of 30. Lambda is derived from the Phoenician Lamed . Lambda gave rise ...
/math> is
superadditive In mathematics, a function f is superadditive if f(x+y) \geq f(x) + f(y) for all x and y in the domain of f. Similarly, a sequence \left\, n \geq 1, is called superadditive if it satisfies the inequality a_ \geq a_n + a_m for all m and n. The ...
: for all ''m'' and ''n'', \operatorname
lambda_ Lambda (}, ''lám(b)da'') is the 11th letter of the Greek alphabet, representing the voiced alveolar lateral approximant . In the system of Greek numerals, lambda has a value of 30. Lambda is derived from the Phoenician Lamed . Lambda gave rise ...
ge \operatorname
lambda_ Lambda (}, ''lám(b)da'') is the 11th letter of the Greek alphabet, representing the voiced alveolar lateral approximant . In the system of Greek numerals, lambda has a value of 30. Lambda is derived from the Phoenician Lamed . Lambda gave rise ...
\operatorname
lambda_ Lambda (}, ''lám(b)da'') is the 11th letter of the Greek alphabet, representing the voiced alveolar lateral approximant . In the system of Greek numerals, lambda has a value of 30. Lambda is derived from the Phoenician Lamed . Lambda gave rise ...
/math>. This is because, if strings of length ''m'' + ''n'' are broken into substrings of lengths ''m'' and ''n'', and the longest common subsequences of those substrings are found, they can be
concatenated In formal language theory and computer programming, string concatenation is the operation of joining character strings end-to-end. For example, the concatenation of "snow" and "ball" is "snowball". In certain formalisations of concatenatio ...
together to get a common substring of the whole strings. It follows from a lemma of
Michael Fekete Michael (Mihály) Fekete ( he, מיכאל פקטה; 19 July 1886 – 13 May 1957) was a Hungarian- Israeli mathematician. Biography Fekete was born in 1886 in Zenta, Austria-Hungary (today Senta, Serbia). He received his PhD in 1909 from ...
that the
limit Limit or Limits may refer to: Arts and media * ''Limit'' (manga), a manga by Keiko Suenobu * ''Limit'' (film), a South Korean film * Limit (music), a way to characterize harmony * "Limit" (song), a 2016 single by Luna Sea * "Limits", a 2019 ...
:\gamma_k = \lim_ \frac exists, and equals the
supremum In mathematics, the infimum (abbreviated inf; plural infima) of a subset S of a partially ordered set P is a greatest element in P that is less than or equal to each element of S, if such an element exists. Consequently, the term ''greatest l ...
of the values \operatorname
lambda_ Lambda (}, ''lám(b)da'') is the 11th letter of the Greek alphabet, representing the voiced alveolar lateral approximant . In the system of Greek numerals, lambda has a value of 30. Lambda is derived from the Phoenician Lamed . Lambda gave rise ...
n. These limiting values \gamma_k are the Chvátal–Sankoff constants.


Bounds

The exact values of the Chvátal–Sankoff constants remain unknown, but rigorous upper and lower bounds have been proven. Because \gamma_k is a supremum of the values \operatorname
lambda_ Lambda (}, ''lám(b)da'') is the 11th letter of the Greek alphabet, representing the voiced alveolar lateral approximant . In the system of Greek numerals, lambda has a value of 30. Lambda is derived from the Phoenician Lamed . Lambda gave rise ...
/math> which each depend only on a finite probability distribution, one way to prove rigorous lower bounds on \gamma_k would be to compute the exact values of \operatorname
lambda_ Lambda (}, ''lám(b)da'') is the 11th letter of the Greek alphabet, representing the voiced alveolar lateral approximant . In the system of Greek numerals, lambda has a value of 30. Lambda is derived from the Phoenician Lamed . Lambda gave rise ...
/math>; however, this method scales exponentially in ''n'', so it can only be implemented for small values of ''n'', leading to weak lower bound. In his Ph.D. thesis, Vlado Dančík pioneered an alternative approach in which a
deterministic finite automaton In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automa ...
is used to read symbols of two input strings and produce a (long but not optimal) common subsequence of these inputs. The behavior of this automaton on random inputs can be analyzed as a
Markov chain A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, "What happen ...
, the steady state of which determines the rate at which it finds elements of the common subsequence for large values of ''n''. This rate is necessarily a lower bound on the Chvátal–Sankoff constant. By using Dančík's method, with an automaton whose state space buffers the most recent ''h'' characters from its two input strings, and with additional techniques for avoiding the expensive steady-state Markov chain analysis of this approach, was able to perform a computerized analysis with ''n'' = 15 that proved \gamma_2\ge 0.788071. Similar methods can be generalized to non-binary alphabets. Lower bounds obtained in this way for various values of ''k'' are: also used automata-theoretic methods to prove upper bounds on the Chvátal–Sankoff constants, and again extended these results by computerized calculations. The upper bound he obtained was \gamma_2\le 0.826280. This result disproved a conjecture of J. Michael Steele that \gamma_2 = 2/(1+\sqrt 2), because this value is greater than the upper bound. Non-rigorous numerical evidence suggests that \gamma_2 is approximately 0.811, closer to the upper bound than the lower bound. In the limit as ''k'' goes to infinity, the constants \gamma_k grow inversely proportionally to the square root of ''k''. More precisely,. :\lim_ \gamma_k \sqrt k = 2.


Distribution of LCS lengths

There has also been research into the distribution of values of the longest common subsequence, generalizing the study of the expectation of this value. For instance, the standard deviation of the length of the longest common subsequence of random strings of length ''n'' is known to be proportional to the square root of ''n''. One complication in performing this sort of analysis is that the random variables describing whether the characters at different pairs of positions match each other are not independent of each other. For a more mathematically tractable simplification of the longest common subsequence problem, in which the allowed matches between pairs of symbols are not controlled by whether those symbols are equal to each other but instead by independent random variables with probability 1/''k'' of being 1 and (''k'' − 1)/''k'' of being 0, it has been shown that the distribution of the longest common subsequence length is controlled by the
Tracy–Widom distribution The Tracy–Widom distribution is a probability distribution from random matrix theory introduced by . It is the distribution of the normalized largest eigenvalue of a random Hermitian matrix. The distribution is defined as a Fredholm determinant ...
..


References

{{DEFAULTSORT:Chvatal-Sankoff constants Mathematical constants Probability theory Problems on strings