Phylogenetic invariants are polynomial relationships between the frequencies of various site patterns in an idealized DNA multiple sequence alignment. They have received substantial study in the field of biomathematics, and they can be used to choose among phylogenetic tree topologies in an empirical setting. The primary advantage of phylogenetic invariants relative to other methods of phylogenetic estimation like maximum likelihood or Bayesian MCMC analyses is that invariants can yield information about the tree without requiring the estimation of branch lengths or model parameters. The idea of using phylogenetic invariants was introduced independently by James Cavender and Joseph Felsenstein and by James A. Lake in 1987.
At present, only a limited number of programs allow empirical datasets to be analyzed using invariants. However, phylogenetic invariants may provide solutions to other problems in phylogenetics and they represent an area of active research for that reason. Felsenstein stated it best when he said, "invariants are worth attention, not for what they do for us now, but what they might lead to in the future" (p. 390).
If we consider a multiple sequence alignment with ''t'' taxa and no gaps or missing data (i.e., an ''idealized multiple sequence alignment''), there are $4^t$ possible site patterns. For example, there are 256 possible site patterns for four taxa ($f_{AAAA}$, $f_{AAAC}$, $f_{AAAG}$, … $f_{TTTT}$), which can be written as a vector. This site pattern frequency vector has 255 degrees of freedom because the frequencies must sum to one. However, any set of site pattern frequencies that resulted from some specific process of sequence evolution on a specific tree must obey many constraints, and therefore has many fewer degrees of freedom. Thus, there should be polynomials involving those frequencies that take on a value of zero if the DNA sequences were generated on a specific tree given a particular substitution model.
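As a minimal sketch of how the site pattern frequency vector described above could be tabulated, the following Python snippet counts the columns of a gap-free alignment. The function name `site_pattern_frequencies` and the toy alignment are assumptions made for illustration; they do not come from any particular phylogenetics package.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def site_pattern_frequencies(alignment):
    """Return {pattern: frequency} over all 4**t patterns of a gap-free alignment."""
    t = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    # Each alignment column is one site pattern, e.g. "AAAC" for four taxa.
    counts = Counter("".join(column) for column in zip(*alignment))
    if any(base not in BASES for pattern in counts for base in pattern):
        raise ValueError("alignment must contain only A, C, G and T")
    # Enumerate patterns in a fixed order so the result can be read as a vector.
    patterns = ("".join(p) for p in product(BASES, repeat=t))
    return {pattern: counts[pattern] / length for pattern in patterns}

# Hypothetical four-taxon alignment with six sites.
toy = ["ACGTAC", "ACGTAA", "ACGAAC", "ACTTAC"]
freqs = site_pattern_frequencies(toy)
```

For four taxa the returned dictionary has 256 entries that sum to one, which is why only 255 of them are free.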
Invariants are formulas in the expected pattern frequencies, not the observed pattern frequencies. When they are computed using the observed pattern frequencies, we will usually find that they are not precisely zero even when the model and tree topology are correct. By testing whether such polynomials for various trees are 'nearly zero' when evaluated on the observed frequencies of patterns in real data sequences, one should be able to infer which tree best explains the data.
Some invariants are straightforward consequences of symmetries in the model of nucleotide substitution and they will take on a value of zero regardless of the underlying tree topology. For example, if we assume the Jukes-Cantor model of sequence evolution and a four-taxon tree we expect:
This is a simple outgrowth of the fact that base frequencies are constrained to be equal under the Jukes-Cantor model. Thus, they are called ''symmetry invariants''. The equation shown above is only one of a large number of symmetry invariants for the Jukes-Cantor model; in fact, there are a total of 241 symmetry invariants for that model.
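One way to see where the figure of 241 comes from is to group the 256 four-taxon site patterns into classes that differ only by a relabeling of the bases. The sketch below assumes that, under the Jukes-Cantor model, all patterns in such a class share the same expected frequency, so that a pair like $f_{AACC} - f_{GGTT} = 0$ (chosen here purely as an illustration) is a symmetry invariant. The helper name `pattern_class` is hypothetical.

```python
from itertools import product

BASES = "ACGT"

def pattern_class(pattern):
    """Relabel bases by order of first appearance, e.g. 'GGTT' -> (0, 0, 1, 1)."""
    labels = {}
    return tuple(labels.setdefault(base, len(labels)) for base in pattern)

# Group all 4**4 = 256 patterns by their relabeling class.
classes = {}
for p in product(BASES, repeat=4):
    classes.setdefault(pattern_class(p), []).append("".join(p))

n_classes = len(classes)
# A class with k members contributes k - 1 independent equalities,
# so the total is the number of patterns minus the number of classes.
n_symmetry_invariants = sum(len(members) - 1 for members in classes.values())
```

Under these assumptions the patterns fall into 15 classes, which leaves 256 - 15 = 241 independent equalities, in agreement with the count given above.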
Symmetry invariants are non-phylogenetic in nature; they take on the expected value of zero regardless of the tree topology. However, they make it possible to determine whether a particular multiple sequence alignment fits the Jukes-Cantor model of evolution (i.e., by testing whether the site patterns of the appropriate types are present in equal numbers). More general tests for the best-fitting model using invariants are also possible. For example, Kedzierska et al. (2012) used invariants to establish the best-fitting model out of a specific model set.
The asterisk after the JC69, K80, and K81 models is used to emphasize the non-homogeneous nature of the models that can be examined using invariants. These non-homogeneous models include the commonly used continuous-time JC69, K80, and K81 models as submodels. The SSM (strand-specific model), also called the CS05 model, is a generalized non-homogeneous version of the HKY (Hasegawa-Kishino-Yano) model constrained to have equal distribution of the pairs of bases A,T and C,G at each node of the tree and no assumption regarding a stable base distribution. All models listed above are submodels of the general Markov model (GMM). The ability to perform tests using non-homogeneous models represents a major benefit of the invariants methods relative to the more commonly used maximum likelihood methods for phylogenetic model testing.
''Phylogenetic invariants'', which are defined as the subset of invariants that take on a value of zero only when the sequences were (or were not) generated on a specific topology, are likely to be the most useful invariants for phylogenetic studies.
Lake's linear invariants
Lake's invariants (which he called "evolutionary parsimony") provide an excellent example of phylogenetic invariants. Lake's invariants involve quartets, two of which (the incorrect topologies) yield values of zero and one of which yields a value greater than zero. This can be used to construct a test based on the following invariant relationship, which holds for the two incorrect trees when sites evolve under the Kimura two-parameter model of sequence evolution:
The indices of these site pattern frequencies indicate the bases scored relative to the base in the first taxon (which we call taxon A). If base 1 is a purine, then base 2 is the other purine and bases 3 and 4 are the pyrimidines. If base 1 is a pyrimidine, then base 2 is the other pyrimidine and bases 3 and 4 are the purines.
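The relative scoring described above can be sketched as a small recoding function. The text does not say which of the two remaining bases is labeled 3 and which is labeled 4, so this sketch assumes that the first one encountered when reading the pattern from left to right is labeled 3. That convention, and the function name `recode_relative_to_first`, are assumptions made for illustration only.

```python
# Each base paired with the other base of the same chemical class
# (purines: A and G; pyrimidines: C and T).
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}

def recode_relative_to_first(pattern):
    """Recode a site pattern as digits 1-4 relative to the base in the first taxon."""
    first = pattern[0]
    # Base 1 is the base in taxon A; base 2 is the other base of the same class.
    codes = {first: "1", TRANSITION_PARTNER[first]: "2"}
    for base in pattern:
        if base not in codes:
            # Assumption: the first base of the other class seen is labeled 3,
            # and the remaining one is labeled 4.
            codes[base] = "3" if "3" not in codes.values() else "4"
    return "".join(codes[base] for base in pattern)

examples = {p: recode_relative_to_first(p) for p in ("AGCT", "CCAG", "TTGA", "GATT")}
```

With this convention, patterns such as CCAG and TTGA receive the same code, which reflects the idea that only the relationship to the base in taxon A matters.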
We will call the three possible quartet trees $T_X$ [$T_X$ is ((A,B),(C,D)); in Newick format], $T_Y$ [$T_Y$ is ((A,C),(B,D)); in Newick format], and $T_Z$ [$T_Z$ is ((A,D),(B,C)); in Newick format]. We can calculate three values from the data to identify the best topology given the data:
Lake broke these values up into a "parsimony-like term" and a "background term" (defined for $T_X$, and analogously for the other two trees) and suggested testing for deviation from zero by calculating a test statistic from these two terms and performing a χ² test with one degree of freedom. Similar χ² tests can be performed for Y and Z. If one of the three values is significantly different from zero, the corresponding topology is the best estimate of phylogeny. The advantage of using Lake's invariants relative to maximum likelihood or neighbor joining of Kimura two-parameter distances is that the invariants should hold regardless of the model parameters, branch lengths, or patterns of among-sites rate heterogeneity.
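The test described above can be sketched in a few lines of Python. The exact statistic is not given in the text. This sketch assumes it has the form $(P - B)^2 / (P + B)$, where $P$ and $B$ are the counts contributing to the parsimony-like and background terms for one tree. That form, the symbols $P$ and $B$, and the function name are assumptions made for illustration and should be checked against Lake's original description before use.

```python
import math

def lake_style_chi_square(p_count, b_count):
    """Return (statistic, p_value) for a one-degree-of-freedom chi-square test.

    Assumed form of the statistic: (P - B)**2 / (P + B).
    """
    total = p_count + b_count
    if total == 0:
        raise ValueError("at least one informative site is required")
    statistic = (p_count - b_count) ** 2 / total
    # For one degree of freedom the upper tail of the chi-square distribution
    # equals erfc(sqrt(x / 2)), so no statistics library is needed.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts for one of the three quartet trees.
statistic, p_value = lake_style_chi_square(30, 10)
```

The same function would be called once for each of the three trees, and the tree whose value differs significantly from zero would be preferred.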
As expected for any phylogenetic method based on the Kimura two-parameter model, phylogenetic estimation using Lake's invariants is inconsistent when the model that generated the data strongly violates the Kimura two-parameter model. In a classic study that examined methods of phylogenetic estimation, John Huelsenbeck and David Hillis found that Lake's invariants are consistent over all of the branch length space they examined. However, they also found that Lake's invariants are very inefficient (large amounts of data are necessary to converge on the correct tree). This inefficiency has caused most empiricists to abandon the use of Lake's invariants.
Modern approaches using phylogenetic invariants
The low efficiency of Lake's invariants reflects the fact that they use a limited set of generators for the phylogenetic invariants. Casanellas et al. introduced methods to derive a much larger set of generators for DNA data and this has led to the development of invariants methods that are as efficient as maximum likelihood methods. Several of these methods have implementations that are practical for analyses of empirical datasets.
Eriksson proposed an invariants method for the general Markov model based on singular value decomposition (SVD) of matrices generated by "flattening" the nucleotides associated with each of the leaves (i.e., the site pattern frequency spectrum). Different flattening matrices are produced for each topology. Comparisons of the original Eriksson SVD method (ErikSVD) to neighbor joining and the maximum likelihood approach implemented in the PHYLIP program dnaml yielded mixed results; ErikSVD underperformed the other two methods when used with simulated data but it appeared to perform better than dnaml when applied to an empirical mammalian dataset based on an early release of data from the ENCODE project. The original ErikSVD method was improved by Fernández-Sánchez and Casanellas, who proposed a normalization they called Erik+2. The original ErikSVD method is statistically consistent (it converges on the true tree as the empirical distribution approaches the theoretical distribution); the Erik+2 normalization improves the performance of the method given finite datasets. It has been implemented in the software package PAUP* as an option for the SVDquartets method.
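The flattening idea can be illustrated with a small, self-contained sketch. It is not the ErikSVD or SVDquartets algorithm: those methods work with observed frequencies and use singular values to judge how close a flattening is to a low-rank matrix. This sketch instead uses exact expected frequencies and an exact rank computation. It assumes a Jukes-Cantor-style model on the tree ((A,B),(C,D)) with a uniform root distribution, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

BASES = range(4)

def jc_matrix(p):
    """Transition matrix in which each specific base change has probability p."""
    return [[(1 - 3 * p) if i == j else p for j in BASES] for i in BASES]

def quartet_distribution(p_leaf, p_internal):
    """Expected site pattern frequencies on ((A,B),(C,D)) with a uniform root."""
    leaf, inner = jc_matrix(p_leaf), jc_matrix(p_internal)
    dist = {}
    for a, b, c, d in product(BASES, repeat=4):
        # Sum over the unobserved states x and y at the two internal nodes.
        dist[a, b, c, d] = sum(
            Fraction(1, 4) * leaf[x][a] * leaf[x][b] * inner[x][y] * leaf[y][c] * leaf[y][d]
            for x in BASES for y in BASES
        )
    return dist

def flatten(dist, split):
    """Build the 16 x 16 flattening; split gives (row taxa, column taxa) indices."""
    rows, cols = split
    pairs = list(product(BASES, repeat=2))
    matrix = []
    for r in pairs:
        row = []
        for c in pairs:
            pattern = [None] * 4
            pattern[rows[0]], pattern[rows[1]] = r
            pattern[cols[0]], pattern[cols[1]] = c
            row.append(dist[tuple(pattern)])
        matrix.append(row)
    return matrix

def rank(matrix):
    """Exact rank by Gaussian elimination over the rationals."""
    m = [row[:] for row in matrix]
    found = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(found, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[found], m[pivot] = m[pivot], m[found]
        for r in range(found + 1, len(m)):
            factor = m[r][col] / m[found][col]
            if factor != 0:
                m[r] = [x - factor * y for x, y in zip(m[r], m[found])]
        found += 1
    return found

SPLITS = {"AB|CD": ((0, 1), (2, 3)), "AC|BD": ((0, 2), (1, 3)), "AD|BC": ((0, 3), (1, 2))}
dist = quartet_distribution(Fraction(1, 10), Fraction(1, 10))
ranks = {name: rank(flatten(dist, split)) for name, split in SPLITS.items()}
```

With these parameter values the flattening for the split that matches the generating tree (AB|CD) has rank 4, while the other two flattenings have full rank 16. With finite data no flattening is exactly low rank, which is why methods of this kind rely on singular values rather than on an exact rank.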
"Squangles" (stochastic quartet tangles) represents another example of an invariants method
that has been implemented in a software package that is practical to use with empirical datasets. Squangles permit the choice among the three possible quartets assuming that DNA sequences have evolved under the general Markov model; the quartets can then be assembled using a supertree method. There are three squangles that are useful for differentiating among quartets, which can be denoted as $q_1(f)$, $q_2(f)$, and $q_3(f)$ ($f$ is a 256-element vector containing the site frequency spectrum). Each $q$ has 66,744 terms and together they satisfy the linear relation $q_1 + q_2 + q_3 = 0$ (i.e., up to linear dependence there are only two $q$ values). Each possible quartet has different expected values for $q_1$, $q_2$, and $q_3$:
The expected values $q_1$, $q_2$, and $q_3$ are all zero on the star topology (a quartet with an internal branch length of zero). For practicality, Holland et al. used least squares to solve for the $q$ values. Empirical tests of the squangles method have been limited (Reddy et al. 2017, "Why Do Phylogenomic Data Sets Yield Conflicting Trees? Data Type Influences the Avian Tree of Life more than Taxon Sampling", Systematic Biology 66(5): 857–879, doi:10.1093/sysbio/syx041) but they appear to be promising.