Distance matrices are used in phylogeny as
non-parametric
Nonparametric statistics is a type of statistical analysis that makes minimal assumptions about the underlying distribution of the data being studied. Often these models are infinite-dimensional, rather than finite dimensional, as in parametric sta ...
distance methods and were originally applied to
phenetic
In biology, phenetics (; ), also known as taximetrics, is an attempt to classify organisms based on overall similarity, usually with respect to Morphology (biology), morphology or other observable traits, regardless of their phylogeny or evoluti ...
data using a matrix of pairwise distances. These distances are then reconciled to produce a tree (a
phylogram
A phylogenetic tree or phylogeny is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time.Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA. In o ...
, with informative branch lengths). The
distance matrix
In mathematics, computer science and especially graph theory, a distance matrix is a square matrix (two-dimensional array) containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the ''dist ...
can come from a number of different sources, including measured distance (for example from
immunological studies) or
morphometric analysis, various pairwise distance formulae (such as
euclidean distance
In mathematics, the Euclidean distance between two points in Euclidean space is the length of the line segment between them. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and therefore is o ...
) applied to discrete morphological characters, or
genetic distance
Genetic distance is a measure of the genetics, genetic divergence between species or between population#Genetics, populations within a species, whether the distance measures time from common ancestor or degree of differentiation. Populations with ...
from sequence,
restriction fragment
In molecular biology, a restriction fragment is a DNA fragment resulting from the cutting of a DNA strand by a restriction enzyme (restriction endonucleases). Each restriction enzyme is highly specific, recognising a particular short DNA sequen ...
, or
allozyme
Alloenzymes (or also called allozymes) are variant forms of an enzyme which differ structurally but not functionally from other allozymes coded for by different alleles at the same locus. These are opposed to isozymes, which are enzymes that p ...
data. For phylogenetic character data, raw distance values can be calculated by simply counting the number of pairwise differences in character states (
Hamming distance
In information theory, the Hamming distance between two String (computer science), strings or vectors of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number ...
).
Distance-matrix methods
Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore they start with a multiple sequence alignment (MSA) as an input. From it, they construct an all-to-all matrix describing the distance between each sequence pair. Finally, they construct a phylogenetic tree that places closely related sequences under the same
interior node and whose branch lengths closely reproduce the observed distances between sequences. The produced tree is either rooted or unrooted, depending on the algorithm used.
Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches.
[Mount DM. (2004). ''Bioinformatics: Sequence and Genome Analysis'' 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.]
Distance-matrix methods are frequently used as the basis for progressive and iterative types of
multiple sequence alignment
Multiple sequence alignment (MSA) is the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. These alignments are used to infer evolutionary relationships via phylogenetic analysis an ...
.
The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.
[Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA.]
Neighbor-joining
Neighbor-joining methods apply general
data clustering
Cluster analysis or clustering is the data analyzing technique in which task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some specific sense defined by the analyst) to each o ...
techniques to sequence analysis using genetic distance as a clustering metric. The simple
neighbor-joining
In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually based on DNA or protein sequence data, the algorithm require ...
method produces unrooted trees, but it does not assume a constant rate of evolution (i.e., a
molecular clock
The molecular clock is a figurative term for a technique that uses the mutation rate of biomolecules to deduce the time in prehistory when two or more life forms diverged. The biomolecular data used for such calculations are usually nucleot ...
) across lineages.
UPGMA and WPGMA
The
UPGMA
UPGMA (unweighted pair group method with arithmetic mean) is a simple agglomerative (bottom-up) hierarchical clustering method. It also has a weighted variant, WPGMA, and they are generally attributed to Sokal and Michener.
Note that the unwei ...
(''Unweighted Pair Group Method with Arithmetic mean'') and
WPGMA (''Weighted Pair Group Method with Arithmetic mean'') methods produce rooted trees and require a constant-rate assumption – that is, it assumes an
ultrametric
In mathematics, an ultrametric space is a metric space in which the triangle inequality is strengthened to d(x,z)\leq\max\left\ for all x, y, and z. Sometimes the associated metric is also called a non-Archimedean metric or super-metric.
Formal d ...
tree in which the distances from the root to every branch tip are equal.
Fitch–Margoliash method
The Fitch–Margoliash method uses a weighted
least squares
The method of least squares is a mathematical optimization technique that aims to determine the best fit function by minimizing the sum of the squares of the differences between the observed values and the predicted values of the model. The me ...
method for clustering based on genetic distance.
Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. In practice, the distance correction is only necessary when the evolution rates differ among branches.
The distances used as input to the algorithm must be normalized to prevent large artifacts in computing relationships between closely related and distantly related groups. The distances calculated by this method must be
linear
In mathematics, the term ''linear'' is used in two distinct senses for two different properties:
* linearity of a '' function'' (or '' mapping'');
* linearity of a '' polynomial''.
An example of a linear function is the function defined by f(x) ...
; the linearity criterion for distances requires that the
expected value
In probability theory, the expected value (also called expectation, expectancy, expectation operator, mathematical expectation, mean, expectation value, or first Moment (mathematics), moment) is a generalization of the weighted average. Informa ...
s of the branch lengths for two individual branches must equal the expected value of the sum of the two branch distances – a property that applies to biological sequences only when they have been corrected for the possibility of
back mutations at individual sites. This correction is done through the use of a
substitution matrix
In bioinformatics and evolutionary biology, a substitution matrix describes the frequency at which a character in a Nucleic acid sequence, nucleotide sequence or a Protein primary structure, protein sequence changes to other character states ove ...
such as that derived from the
Jukes–Cantor model of DNA evolution.
The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost. Finding the optimal least-squares tree with any correction factor is
NP-complete
In computational complexity theory, NP-complete problems are the hardest of the problems to which ''solutions'' can be verified ''quickly''.
Somewhat more precisely, a problem is NP-complete when:
# It is a decision problem, meaning that for any ...
,
so
heuristic
A heuristic or heuristic technique (''problem solving'', '' mental shortcut'', ''rule of thumb'') is any approach to problem solving that employs a pragmatic method that is not fully optimized, perfected, or rationalized, but is nevertheless ...
search methods like those used in maximum-parsimony analysis are applied to the search through tree space.
Using outgroups
Independent information about the relationship between sequences or groups can be used to help reduce the tree search space and root unrooted trees. Standard usage of distance-matrix methods involves the inclusion of at least one
outgroup sequence known to be only distantly related to the sequences of interest in the query set.
This usage can be seen as a type of
experimental control. If the outgroup has been appropriately chosen, it will have a much greater
genetic distance
Genetic distance is a measure of the genetics, genetic divergence between species or between population#Genetics, populations within a species, whether the distance measures time from common ancestor or degree of differentiation. Populations with ...
and thus a longer branch length than any other sequence, and it will appear near the root of a rooted tree. Choosing an appropriate outgroup requires the selection of a sequence that is moderately related to the sequences of interest; too close a relationship defeats the purpose of the outgroup and too distant adds
noise
Noise is sound, chiefly unwanted, unintentional, or harmful sound considered unpleasant, loud, or disruptive to mental or hearing faculties. From a physics standpoint, there is no distinction between noise and desired sound, as both are vibrat ...
to the analysis.
Care should also be taken to avoid situations in which the species from which the sequences were taken are distantly related, but the gene encoded by the sequences is highly
conserved across lineages.
Horizontal gene transfer
Horizontal gene transfer (HGT) or lateral gene transfer (LGT) is the movement of genetic material between organisms other than by the ("vertical") transmission of DNA from parent to offspring (reproduction). HGT is an important factor in the e ...
, especially between otherwise divergent
bacteria
Bacteria (; : bacterium) are ubiquitous, mostly free-living organisms often consisting of one Cell (biology), biological cell. They constitute a large domain (biology), domain of Prokaryote, prokaryotic microorganisms. Typically a few micr ...
, can also confound outgroup usage.
Weaknesses of different methods
In general, pairwise distance data are an underestimate of the path-distance between taxa on a
phylogram
A phylogenetic tree or phylogeny is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time.Felsenstein J. (2004). ''Inferring Phylogenies'' Sinauer Associates: Sunderland, MA. In o ...
. Pairwise distances effectively "cut corners" in a manner analogous to geographic distance: the distance between two cities may be 100 miles "as the crow flies," but a traveler may actually be obligated to travel 120 miles because of the layout of roads, the terrain, stops along the way, etc. Between pairs of taxa, some character changes that took place in ancestral lineages will be undetectable, because later changes have erased the evidence (often called
multiple hits and
back mutations in
sequence data
In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed and order matters. Like a set, it contains members (also called ''elements'', or ''terms''). The number of elements (possibly infinite) is calle ...
). This problem is common to all phylogenetic estimation, but it is particularly acute for distance methods, because only two samples are used for each distance calculation; other methods benefit from evidence of these hidden changes found in other taxa not considered in pairwise comparisons. For
nucleotide
Nucleotides are Organic compound, organic molecules composed of a nitrogenous base, a pentose sugar and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both o ...
and
amino acid
Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although over 500 amino acids exist in nature, by far the most important are the 22 α-amino acids incorporated into proteins. Only these 22 a ...
sequence data, the same stochastic models of nucleotide change used in maximum likelihood analysis can be employed to "correct" distances, rendering the analysis "semi-parametric."
Several simple algorithms exist to construct a tree directly from pairwise distances, including
UPGMA
UPGMA (unweighted pair group method with arithmetic mean) is a simple agglomerative (bottom-up) hierarchical clustering method. It also has a weighted variant, WPGMA, and they are generally attributed to Sokal and Michener.
Note that the unwei ...
and
neighbor joining
In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually based on DNA or protein sequence data, the algorithm require ...
(NJ), but these will not necessarily produce the best tree for the data. To counter potential complications noted above, and to find the best tree for the data, distance analysis can also incorporate a tree-search protocol that seeks to satisfy an explicit optimality criterion. Two optimality criteria are commonly applied to distance data,
minimum evolution
Minimum evolution is a distance method employed in phylogenetics modeling. It shares with maximum parsimony the aspect of searching for the phylogeny that has the shortest total sum of branch lengths.
The theoretical foundations of the minimum ...
(ME) and
least squares inference. Least squares is part of a broader class of regression-based methods lumped together here for simplicity. These regression formulae minimize the residual differences between path-distances along the tree and pairwise distances in the data matrix, effectively "fitting" the tree to the empirical distances. In contrast, ME accepts the tree with the shortest sum of branch lengths, and thus minimizes the total amount of evolution assumed. ME is closely akin to parsimony, and under certain conditions, ME analysis of distances based on a discrete character dataset will favor the same tree as conventional parsimony analysis of the same data.
Phylogeny estimation using distance methods has produced a number of controversies.
UPGMA
UPGMA (unweighted pair group method with arithmetic mean) is a simple agglomerative (bottom-up) hierarchical clustering method. It also has a weighted variant, WPGMA, and they are generally attributed to Sokal and Michener.
Note that the unwei ...
assumes an
ultrametric
In mathematics, an ultrametric space is a metric space in which the triangle inequality is strengthened to d(x,z)\leq\max\left\ for all x, y, and z. Sometimes the associated metric is also called a non-Archimedean metric or super-metric.
Formal d ...
tree (a tree where all the path-lengths from the root to the tips are equal). If the rate of evolution were equal in all sampled lineages (a
molecular clock
The molecular clock is a figurative term for a technique that uses the mutation rate of biomolecules to deduce the time in prehistory when two or more life forms diverged. The biomolecular data used for such calculations are usually nucleot ...
), and if the tree were completely balanced (equal numbers of taxa on both sides of any split, to counter the
node density effect), UPGMA should not produce a biased result. These expectations are not met by most datasets, and although UPGMA is somewhat robust to their violation, it is not commonly used for phylogeny estimation. The advantage of UPGMA is that it is fast and can handle many sequences.
Neighbor-joining
In bioinformatics, neighbor joining is a bottom-up (agglomerative) clustering method for the creation of phylogenetic trees, created by Naruya Saitou and Masatoshi Nei in 1987. Usually based on DNA or protein sequence data, the algorithm require ...
is a form of
star decomposition and, as a
heuristic
A heuristic or heuristic technique (''problem solving'', '' mental shortcut'', ''rule of thumb'') is any approach to problem solving that employs a pragmatic method that is not fully optimized, perfected, or rationalized, but is nevertheless ...
method, is generally the least computationally intensive of these methods. It is very often used on its own, and in fact quite frequently produces reasonable trees. However, it lacks any sort of tree search and optimality criterion, and so there is no guarantee that the recovered tree is the one that best fits the data. A more appropriate analytical procedure would be to use NJ to produce a starting tree, then employ a tree search using an optimality criterion, to ensure that the best tree is recovered.
Many scientists eschew distance methods, for various reasons. A commonly cited reason is that distances are inherently
phenetic
In biology, phenetics (; ), also known as taximetrics, is an attempt to classify organisms based on overall similarity, usually with respect to Morphology (biology), morphology or other observable traits, regardless of their phylogeny or evoluti ...
rather than
phylogenetic
In biology, phylogenetics () is the study of the evolutionary history of life using observable characteristics of organisms (or genes), which is known as phylogenetic inference. It infers the relationship among organisms based on empirical dat ...
, in that they do not distinguish between ancestral similarity (
symplesiomorphy
In phylogenetics, a plesiomorphy ("near form") and symplesiomorphy are synonyms for an ancestral character shared by all members of a clade, which does not distinguish the clade from other clades.
Plesiomorphy, symplesiomorphy, apomorphy, an ...
) and derived similarity (
synapomorphy
In phylogenetics, an apomorphy (or derived trait) is a novel Phenotypic trait, character or character state that has evolution, evolved from its ancestral form (or Plesiomorphy and symplesiomorphy, plesiomorphy). A synapomorphy is an apomorphy sh ...
). This criticism is not entirely fair: most currently implementations of parsimony, likelihood, and Bayesian phylogenetic inference use time-reversible character models, and thus accord no special status to derived or ancestral character states. Under these models, the tree is estimated unrooted; rooting, and consequently determination of polarity, is performed after the analysis. The primary difference between these methods and distances is that parsimony, likelihood, and Bayesian methods fit individual characters to the tree, whereas distance methods fit all the characters at once. There is nothing inherently less phylogenetic about this approach.
More practically, distance methods are avoided because the relationship between individual characters and the tree is lost in the process of reducing characters to distances. These methods do not use character data directly, and information locked in the distribution of character states can be lost in the pairwise comparisons. Also, some complex phylogenetic relationships may produce biased distances. On any phylogram, branch lengths will be underestimated because some changes cannot be discovered at all due to failure to sample some species due to either experimental design or extinction (a phenomenon called the node density effect). However, even if pairwise distances from genetic data are "corrected" using stochastic models of evolution as mentioned above, they may more easily sum to a different tree than one produced from analysis of the same data and model using
maximum likelihood
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed stati ...
. This is because pairwise distances are not independent; each branch on a tree is represented in the distance measurements of all taxa it separates. Error resulting from any characteristic of that branch that might confound phylogeny (stochastic variability, change in evolutionary parameters, an abnormally long or short branch length) will be propagated through all of the relevant distance measurements. The resulting distance matrix may then better fit an alternate (presumably less optimal) tree.
Despite these potential problems, distance methods are extremely fast, and they often produce a reasonable estimate of phylogeny. They also have certain benefits over the methods that use characters directly. Notably, distance methods allow use of data that may not be easily converted to character data, such as
DNA-DNA hybridization assays. They also permit analyses that account for the possibility that the rate at which particular nucleotides are incorporated into sequences may vary over the tree, using
LogDet distances. For some network-estimation methods (notably
NeighborNet), the abstraction of information about individual characters in distance data are an advantage. When considered character-by character, conflict between character and a tree due to reticulation cannot be told from conflict due either to homoplasy or error. However, pronounced conflict in distance data, which represents an amalgamation of many characters, is less likely due to error or homoplasy unless the data are strongly biased, and is thus more likely to be a result of reticulation.
Distance methods are popular among molecular systematists, a substantial number of whom use NJ without an optimization stage almost exclusively. With the increasing speed of character-based analyses, some of the advantages of distance methods will probably wane. However, the nearly instantaneous NJ implementations, the ability to incorporate an evolutionary model in a speedy analysis, LogDet distances, network estimation methods, and the occasional need to summarize relationships with a single number all mean that distance methods will probably stay in the mainstream for a long time to come.
See also
*
List of phylogenetics software
This list of phylogenetics software is a compilation of computational phylogenetics software used to produce phylogenetic trees. Such tools are commonly used in comparative genomics, cladistics, and bioinformatics. Methods for estimating phylogenie ...
References
{{phylo
Computational phylogenetics