Ancestral reconstruction (also known as ''Character Mapping'' or ''Character Optimization'') is the extrapolation back in time from measured characteristics of individuals (or populations) to their common ancestors. It is an important application of phylogenetics, the reconstruction and study of the evolutionary relationships of individuals, populations or species to their ancestors. In the context of evolutionary biology, ancestral reconstruction can be used to recover different kinds of ancestral character states of organisms that lived millions of years ago.
These states include the genetic sequence (ancestral sequence reconstruction), the amino acid sequence of a protein, the composition of a genome (e.g., gene order), a measurable characteristic of an organism (phenotype), and the geographic range of an ancestral population or species (ancestral range reconstruction). This is desirable because it allows us to examine parts of phylogenetic trees corresponding to the distant past, clarifying the evolutionary history of the species in the tree. Since modern genetic sequences are essentially a variation of ancient ones, access to ancient sequences may identify other variations and organisms which could have arisen from those sequences.
In addition to genetic sequences, one might attempt to track the changing of one character trait to another, such as fins turning to legs.
Non-biological applications include the reconstruction of the vocabulary or phonemes of ancient languages, and cultural characteristics of ancient societies such as oral traditions or marriage practices.
Ancestral reconstruction relies on a sufficiently realistic statistical model of evolution to accurately recover ancestral states. These models use the genetic information already obtained through methods such as phylogenetics to determine the route that evolution has taken and when evolutionary events occurred. No matter how well the model approximates the actual evolutionary history, however, one's ability to accurately reconstruct an ancestor deteriorates with increasing evolutionary time between that ancestor and its observed descendants. Additionally, more accurate reconstructions require more realistic models of evolution, which are inevitably more complex and more costly to compute. Progress in the field of ancestral reconstruction has relied heavily on the exponential growth of computing power and the concomitant development of efficient computational algorithms (e.g., a dynamic programming algorithm for the joint maximum likelihood reconstruction of ancestral sequences).
Methods of ancestral reconstruction are often applied to a given phylogenetic tree that has already been inferred from the same data. While convenient, this approach has the disadvantage that its results are contingent on the accuracy of a single phylogenetic tree (e.g., a phylogenetic tree that is biased because recombination was ignored can bias the reconstructed ancestral sequences). In contrast, some researchers advocate a more computationally intensive Bayesian approach that accounts for uncertainty in tree reconstruction by evaluating ancestral reconstructions over many trees.
History
The concept of ancestral reconstruction is often credited to Émile Zuckerkandl and Linus Pauling. Motivated by the development of techniques for determining the primary (amino acid) sequence of proteins by Frederick Sanger in 1955, Zuckerkandl and Pauling postulated that such sequences could be used to infer not only the phylogeny relating the observed protein sequences, but also the ancestral protein sequence at the earliest point (root) of this tree. However, the idea of reconstructing ancestors from measurable biological characteristics had already been developing in the field of cladistics, one of the precursors of modern phylogenetics. Cladistic methods, which appeared as early as 1901, infer the evolutionary relationships of species on the basis of the distribution of shared characteristics, of which some are inferred to be descended from common ancestors. Furthermore, Theodosius Dobzhansky and Alfred Sturtevant articulated the principles of ancestral reconstruction in a phylogenetic context in 1938, when inferring the evolutionary history of chromosomal inversions in ''Drosophila pseudoobscura''.
Thus, ancestral reconstruction has its roots in several disciplines. Today, computational methods for ancestral reconstruction continue to be extended and applied in a diversity of settings, so that ancestral states are being inferred not only for biological characteristics and molecular sequences, but also for the structure or catalytic properties of ancient versus modern proteins, the geographic location of populations and species (phylogeography) and the higher-order structure of genomes.
Methods and algorithms
Any attempt at ancestral reconstruction begins with a phylogeny. In general, a phylogeny is a tree-based hypothesis about the order in which populations (referred to as taxa) are related by descent from common ancestors. Observed taxa are represented by the ''tips'' or ''terminal nodes'' of the tree that are progressively connected by branches to their common ancestors, which are represented by the branching points of the tree that are usually referred to as the ''ancestral'' or ''internal nodes''. Eventually, all lineages converge to the most recent common ancestor of the entire sample of taxa. In the context of ancestral reconstruction, a phylogeny is often treated as though it were a known quantity (with Bayesian approaches being an important exception). Because there can be an enormous number of phylogenies that are nearly equally effective at explaining the data, reducing the subset of phylogenies supported by the data to a single representative, or point estimate, can be a convenient and sometimes necessary simplifying assumption.
Ancestral reconstruction can be thought of as the direct result of applying a hypothetical model of evolution to a given phylogeny. When the model contains one or more free parameters, the overall objective is to estimate these parameters on the basis of measured characteristics among the observed taxa (sequences) that descended from common ancestors.
Parsimony is an important exception to this paradigm: though it has been shown that there are circumstances under which it is the maximum likelihood estimator, at its core, it is simply based on the heuristic that changes in character state are rare, without attempting to quantify that rarity.
There are three different classes of method for ancestral reconstruction. In chronological order of discovery, these are maximum parsimony, maximum likelihood, and Bayesian inference. Maximum parsimony considers all evolutionary events equally likely; maximum likelihood accounts for the differing likelihood of certain classes of event; and Bayesian inference relates the conditional probability of an event to the likelihood of the tree, as well as the amount of uncertainty that is associated with that tree. Maximum parsimony and maximum likelihood yield a single most probable outcome, whereas Bayesian inference accounts for uncertainties in the data and yields a sample of possible trees.
Maximum parsimony
Parsimony, known colloquially as "Occam's razor", refers to the principle of selecting the simplest of competing hypotheses. In the context of ancestral reconstruction, parsimony endeavours to find the distribution of ancestral states within a given tree which minimizes the total number of character state changes that would be necessary to explain the states observed at the tips of the tree. This method of maximum parsimony is one of the earliest formalized algorithms for reconstructing ancestral states, as well as one of the simplest.
Maximum parsimony can be implemented by one of several algorithms. One of the earliest examples is Fitch's method, which assigns ancestral character states by parsimony via two traversals of a rooted binary tree. The first stage is a post-order traversal that proceeds from the tips toward the root of a tree by visiting descendant (child) nodes before their parents. In this first pass, the set of possible character states ''S<sub>i</sub>'' is determined for the ''i''-th ancestor based on the observed character states of its descendants. Each assignment is the set intersection of the character states of the ancestor's descendants; if the intersection is the empty set, then it is the set union. In the latter case, it is implied that a character state change has occurred between the ancestor and one of its two immediate descendants. Each such event counts towards the algorithm's cost function, which may be used to discriminate among alternative trees on the basis of maximum parsimony. Next, a pre-order traversal of the tree is performed, proceeding from the root towards the tips. Character states are then assigned to each descendant based on which character states it shares with its parent. Since the root has no parent node, one may be required to select a character state arbitrarily, specifically when more than one possible state has been reconstructed at the root.
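The two traversals of Fitch's method can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the `Node` class, the function names, the four-tip tree and its pollinator states are all hypothetical, and the second pass uses a simplified rule (keep the parent's state when it is in the node's candidate set, otherwise pick one candidate deterministically) that returns one most parsimonious assignment, not all of them.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)  # empty for tips, two nodes otherwise
    state_set: set = field(default_factory=set)   # candidate states from the first pass
    state: Optional[str] = None                    # state chosen in the second pass


def fitch_postorder(node, tip_states):
    """First pass (tips to root): build candidate sets and count implied changes."""
    if not node.children:
        node.state_set = {tip_states[node.name]}
        return 0
    cost = sum(fitch_postorder(child, tip_states) for child in node.children)
    left, right = node.children
    common = left.state_set & right.state_set
    if common:
        node.state_set = common
    else:
        # Empty intersection: take the union and count one state change.
        node.state_set = left.state_set | right.state_set
        cost += 1
    return cost


def fitch_preorder(node, parent_state=None):
    """Second pass (root to tips): choose one state per node."""
    if parent_state in node.state_set:
        node.state = parent_state
    else:
        # Arbitrary but deterministic choice (also used at the root).
        node.state = min(node.state_set)
    for child in node.children:
        fitch_preorder(child, node.state)


# Hypothetical tree ((A, B), (C, D)) with made-up pollinator states.
tips = {"A": "bee", "B": "hummingbird", "C": "hummingbird", "D": "hummingbird"}
root = Node("root", [
    Node("AB", [Node("A"), Node("B")]),
    Node("CD", [Node("C"), Node("D")]),
])
cost = fitch_postorder(root, tips)
fitch_preorder(root)
print(cost, root.state)
```

The value returned by the first pass is the parsimony score of the tree for this character, which is the quantity that would be compared across alternative trees.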

For example, consider a phylogeny recovered for a genus of plants containing 6 species A - F, where each plant is pollinated by either a "bee", "hummingbird" or "wind". One obvious question is what the pollinators at deeper nodes were in the phylogeny of this genus of plants. Under maximum parsimony, an ancestral state reconstruction for this clade reveals that "hummingbird" is the most parsimonious ancestral state for the lower clade (plants D, E, F), that the ancestral states for the nodes in the top clade (plants A, B, C) are equivocal, and that "hummingbird" and "bee" pollinators are equally plausible for the pollination state at the root of the phylogeny. Suppose we have strong evidence from the fossil record that the root state is "hummingbird". Resolving the root to "hummingbird" would yield the pattern of ancestral state reconstruction depicted by the symbols at the nodes, with the state requiring the fewest changes circled.
Parsimony methods are intuitively appealing and highly efficient, such that they are still used in some cases to seed maximum likelihood optimization algorithms with an initial phylogeny.
However, the underlying assumption that evolution attained a certain end result as fast as possible is inaccurate. Natural selection and evolution do not work towards a goal; they simply select for or against randomly occurring genetic changes. Parsimony methods impose six general assumptions: that the phylogenetic tree being used is correct, that all of the relevant data are available, that no mistakes were made in coding the data, that all branches of the phylogenetic tree are equally likely to change, that the rate of evolution is slow, and that the chance of losing or gaining a characteristic is the same.
In reality, these assumptions are often violated, leading to several issues:
# ''Variation in rates of evolution.'' Fitch's method assumes that changes between all character states are equally likely to occur; thus, any change incurs the same cost for a given tree. This assumption is often unrealistic and can limit the accuracy of such methods. For example, transitions tend to occur more often than transversions in the evolution of nucleic acids. This assumption can be relaxed by assigning differential costs to specific character state changes, resulting in a weighted parsimony algorithm.
# ''Rapid evolution.'' The "minimum evolution" heuristic underlying such methods assumes that changes are ''rare''; the methods are therefore inappropriate in cases where change is the norm rather than the exception.
# ''Variation in time among lineages.'' Parsimony methods implicitly assume that the same amount of evolutionary time has passed along every branch of the tree. Thus, they do not account for variation in branch lengths in the tree, which are often used to quantify the passage of evolutionary or chronological time. This limitation makes the technique liable to infer that one change occurred on a very short branch rather than multiple changes occurring on a very long branch, for example.
In addition, some branches of the tree could be experiencing higher selection and change rates than others, perhaps due to changing environmental factors. Some periods of time may represent more rapid evolution than others; when this happens, parsimony becomes inaccurate.
This shortcoming is addressed by model-based methods (both maximum likelihood and Bayesian methods) that infer the stochastic process of evolution as it unfolds along each branch of a tree.
# ''Statistical justification.'' Without a statistical model underlying the method, its estimates do not have well-defined uncertainties.
# ''Convergent evolution.'' When considering a single character state, parsimony will automatically assume that two organisms that share that characteristic will be more closely related than those who do not. For example, just because dogs and apes have fur does not mean that they are more closely related than apes are to humans.
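The weighted parsimony relaxation mentioned under the first issue above can be sketched with Sankoff's dynamic programming algorithm. The text does not name this algorithm; it is brought in here as one standard way to minimize a weighted cost. The cost values (1 for a transition, 2 for a transversion), the nested-tuple tree encoding and the tip states are hypothetical choices for illustration only.

```python
import math

STATES = ("A", "C", "G", "T")
PURINES = {"A", "G"}


def change_cost(a, b, transition_cost=1.0, transversion_cost=2.0):
    """Cost of changing state a into b; the weights are illustrative assumptions."""
    if a == b:
        return 0.0
    same_class = (a in PURINES) == (b in PURINES)  # purine<->purine or pyrimidine<->pyrimidine
    return transition_cost if same_class else transversion_cost


def sankoff(tree, tip_states):
    """Return {state: minimum cost of the subtree if this node has that state}.

    A tree is either a tip name (str) or a tuple of two subtrees.
    """
    if isinstance(tree, str):
        return {s: (0.0 if s == tip_states[tree] else math.inf) for s in STATES}
    child_tables = [sankoff(child, tip_states) for child in tree]
    table = {}
    for s in STATES:
        # For each child, pick the child state that is cheapest given state s here.
        table[s] = sum(
            min(change_cost(s, c) + child[c] for c in STATES)
            for child in child_tables
        )
    return table


# Hypothetical four-tip tree and observed nucleotides at one site.
tree = (("t1", "t2"), ("t3", "t4"))
tip_states = {"t1": "A", "t2": "G", "t3": "C", "t4": "C"}
root_table = sankoff(tree, tip_states)
best_cost = min(root_table.values())
best_root_states = sorted(s for s, c in root_table.items() if c == best_cost)
print(best_cost, best_root_states)
```

Setting both weights to the same value reduces the score to the unweighted count of changes that Fitch's method minimizes.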
Maximum likelihood
Maximum likelihood (ML) methods of ancestral state reconstruction treat the character states at internal nodes of the tree as parameters, and attempt to find the parameter values that maximize the probability of the data (the observed character states) given the hypothesis (a model of evolution and a phylogeny relating the observed sequences or taxa). In other words, this method assumes that the ancestral states are those which are statistically most likely, given the observed phenotypes. Some of the earliest ML approaches to ancestral reconstruction were developed in the context of genetic sequence evolution; similar models were also developed for the analogous case of discrete character evolution.
The use of a model of evolution accounts for the fact that not all events are equally likely to happen. For example, a transition, which is a type of point mutation from one purine to another, or from one pyrimidine to another, is much more likely to happen than a transversion, in which a purine is switched to a pyrimidine, or vice versa. These differences are not captured by maximum parsimony. However, just because some events are more likely than others does not mean that they always happen. Throughout evolutionary history there have been times when there was a large gap between what was most likely to happen and what actually occurred. When this is the case, maximum parsimony may actually be more accurate because it is more willing to make large, unlikely leaps than maximum likelihood is. Maximum likelihood has been shown to be quite reliable in reconstructing character states, but it is less reliable at estimating the stability of proteins. Maximum likelihood always overestimates the stability of proteins, which makes sense since it assumes that the proteins that were made and used were the most stable and optimal.
The merits of maximum likelihood have been subject to debate, with some having concluded that maximum likelihood represents a good compromise between accuracy and speed.
However, other studies have argued that maximum likelihood takes too much time and computational power to be useful in some scenarios.
These approaches employ the same probabilistic framework as used to infer the phylogenetic tree.
In brief, the evolution of a genetic sequence is modelled by a time-reversible continuous time Markov process. In the simplest of these, all characters undergo independent state transitions (such as nucleotide substitutions) at a constant rate over time. This basic model is frequently extended to allow different rates on each branch of the tree. In reality, mutation rates may also vary over time (due, for example, to environmental changes); this can be modelled by allowing the rate parameters to evolve along the tree, at the expense of having an increased number of parameters. A model defines transition probabilities from states ''i'' to ''j'' along a branch of length ''t'' (in units of evolutionary time). The likelihood of a phylogeny is computed from a nested sum of transition probabilities that corresponds to the hierarchical structure of the proposed tree. At each node, the likelihood of its descendants is summed over all possible ancestral character states at that node:
<math>L_x = \sum_{S_x \in \Omega} P(S_x) \left( \sum_{S_y \in \Omega} P(S_y \mid S_x, t_{xy}) L_y \sum_{S_z \in \Omega} P(S_z \mid S_x, t_{xz}) L_z \right)</math>
where we are computing the likelihood of the subtree rooted at node ''x'' with direct descendants ''y'' and ''z'', <math>S_i</math> denotes the character state of the ''i''-th node, <math>t_{ij}</math> is the branch length (evolutionary time) between nodes ''i'' and ''j'', and <math>\Omega</math> is the set of all possible character states (for example, the nucleotides A, C, G, and T).
Thus, the objective of ancestral reconstruction is to find the assignment to <math>S_x</math> for all ''x'' internal nodes that maximizes the likelihood of the observed data for a given tree.
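The nested sum over ancestral states can be sketched as a short recursive computation. The example below assumes the Jukes-Cantor substitution model, which the text does not specify and which is used here only because its transition probabilities have a simple closed form; the tree encoding, branch lengths and tip states are likewise hypothetical. The uniform weight of 0.25 at the root plays the role of the prior on the root state.

```python
import math

STATES = "ACGT"


def jc69_prob(i, j, t):
    """Jukes-Cantor probability of ending in state j after time t, starting in i.

    t is measured in expected substitutions per site.
    """
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial_likelihoods(tree, tip_states):
    """Return {state: P(tip data below this node | node has that state)}.

    A tree is a tip name (str) or a tuple of (subtree, branch_length) pairs.
    """
    if isinstance(tree, str):
        return {s: 1.0 if s == tip_states[tree] else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child, t in tree:
        child_l = partial_likelihoods(child, tip_states)
        for s in STATES:
            # Sum over the child's possible states, weighted by transition probability.
            result[s] *= sum(jc69_prob(s, c, t) * child_l[c] for c in STATES)
    return result


def tree_likelihood(tree, tip_states):
    """Likelihood of one site, averaging the root over a uniform state distribution."""
    root_l = partial_likelihoods(tree, tip_states)
    return sum(0.25 * root_l[s] for s in STATES)


# Hypothetical tree: ((t1:0.1, t2:0.1):0.2, t3:0.3)
tree = (((("t1", 0.1), ("t2", 0.1)), 0.2), ("t3", 0.3))
tip_states = {"t1": "A", "t2": "A", "t3": "G"}
site_likelihood = tree_likelihood(tree, tip_states)
print(site_likelihood)
```

Because each node's table is computed once from its children's tables, the work grows linearly with the number of nodes instead of with the number of possible ancestral state combinations.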
Marginal and joint likelihood
Rather than compute the overall likelihood for alternative trees, the problem for ancestral reconstruction is to find the combination of character states at each ancestral node with the highest marginal maximum likelihood. Generally speaking, there are two approaches to this problem. First, one can assign the most likely character state to each ancestor independently of the reconstruction of all other ancestral states. This approach is referred to as ''marginal reconstruction''. It is akin to summing over all combinations of ancestral states at all of the other nodes of the tree (including the root node), other than those for which data is available. Marginal reconstruction thus finds the state at the current node that maximizes the likelihood, integrating over all other states at all nodes in proportion to their probability. Second, one may instead attempt to find the combination of ancestral character states throughout the tree which jointly maximizes the likelihood of the entire dataset. This approach is therefore referred to as ''joint reconstruction''.
Not surprisingly, joint reconstruction is more computationally complex than marginal reconstruction. Nevertheless, efficient algorithms for joint reconstruction have been developed with a time complexity that is generally linear with the number of observed taxa or sequences.
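The difference between the two approaches can be made concrete by brute-force enumeration on a very small tree. The sketch below is deliberately naive: it lists every combination of internal states, which is exponential in the number of internal nodes and is not one of the efficient algorithms referred to above. The Jukes-Cantor model, the tree, its branch lengths and the tip states are all assumptions made for illustration.

```python
import math
from itertools import product

STATES = "ACGT"


def jc69_prob(i, j, t):
    """Jukes-Cantor transition probability (assumed model, t in substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


# Hypothetical tree: root R has children X and t3; X has children t1 and t2.
EDGES = [("R", "X", 0.3), ("R", "t3", 0.5), ("X", "t1", 0.1), ("X", "t2", 0.1)]
TIPS = {"t1": "A", "t2": "A", "t3": "G"}
INTERNAL = ["R", "X"]


def joint_probability(assignment):
    """P(tip data, internal states) for one complete assignment of internal states."""
    states = {**TIPS, **assignment}
    p = 0.25  # uniform probability of the root state
    for parent, child, t in EDGES:
        p *= jc69_prob(states[parent], states[child], t)
    return p


all_assignments = [
    dict(zip(INTERNAL, combo)) for combo in product(STATES, repeat=len(INTERNAL))
]

# Joint reconstruction: the single best combination of internal states.
joint_best = max(all_assignments, key=joint_probability)

# Marginal reconstruction: for each node, sum over the states of all other nodes.
marginal = {node: {s: 0.0 for s in STATES} for node in INTERNAL}
for assignment in all_assignments:
    p = joint_probability(assignment)
    for node in INTERNAL:
        marginal[node][assignment[node]] += p
marginal_best = {node: max(marginal[node], key=marginal[node].get) for node in INTERNAL}

print(joint_best, marginal_best)
```

On this toy input the two reconstructions agree, but they need not in general, since the best state at one node considered alone may not belong to the best overall combination.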
ML-based methods of ancestral reconstruction tend to provide greater accuracy than MP methods in the presence of variation in rates of evolution among characters (or across sites in a genome). However, these methods are not yet able to accommodate variation in rates of evolution over time, otherwise known as heterotachy. If the rate of evolution for a specific character accelerates on a branch of the phylogeny, then the amount of evolution that has occurred on that branch will be underestimated for a given length of the branch and assuming a constant rate of evolution for that character. In addition, it is difficult to distinguish heterotachy from variation among characters in rates of evolution.
Since ML (unlike maximum parsimony) requires the investigator to specify a model of evolution, its accuracy may be affected by the use of a grossly incorrect model (model misspecification). Furthermore, ML can only provide a single reconstruction of character states (what is often referred to as a "point estimate"). When the likelihood surface is highly non-convex, comprising multiple peaks (local optima), a single point estimate cannot provide an adequate representation, and a Bayesian approach may be more suitable.
Bayesian inference
Bayesian inference uses the likelihood of observed data to update the investigator's belief, or prior distribution, to yield the posterior distribution. In the context of ancestral reconstruction, the objective is to infer the posterior probabilities of ancestral character states at each internal node of a given tree. Moreover, one can integrate these probabilities over the posterior distributions over the parameters of the evolutionary model and the space of all possible trees. This can be expressed as an application of Bayes' theorem
:
$$P(S \mid D, \theta) = \frac{P(D \mid S, \theta)\, P(S \mid \theta)}{P(D \mid \theta)} \propto P(D \mid S, \theta)\, P(S \mid \theta)\, P(\theta)$$
where ''S'' represents the ancestral states, ''D'' corresponds to the observed data, and $\theta$ represents both the evolutionary model and the phylogenetic tree. $P(D \mid S, \theta)$ is the likelihood of the observed data, which can be computed by Felsenstein's pruning algorithm as given above. $P(S \mid \theta)$ is the prior probability of the ancestral states for a given model and tree. Finally, $P(D \mid \theta)$ is the probability of the data for a given model and tree, integrated over all possible ancestral states.
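As a small worked illustration of this use of Bayes' theorem, the sketch below computes the posterior probability of each state at a single ancestor with two descendant tips, holding the model and tree fixed. The symmetric two-state model, the branch lengths and the function names are assumptions made for the example; on a real tree the likelihood term would come from the pruning algorithm rather than a direct product.

```python
import math

def p_trans(i, j, t, q=1.0):
    """Symmetric two-state transition probability over branch length t
    (an assumed model, used only for illustration)."""
    same = 0.5 + 0.5 * math.exp(-2.0 * q * t)
    return same if i == j else 1.0 - same

def posterior_ancestor(tip_states, branch_lengths, prior=(0.5, 0.5)):
    """Posterior over the ancestral state: likelihood times prior, divided by
    the probability of the data summed over all ancestral states."""
    likelihoods = []
    for state in (0, 1):
        like = 1.0
        for tip, t in zip(tip_states, branch_lengths):
            like *= p_trans(state, tip, t)
        likelihoods.append(like)
    evidence = sum(l * p for l, p in zip(likelihoods, prior))
    return [l * p / evidence for l, p in zip(likelihoods, prior)]

# Two tips, both observed in state 1, on short branches (made-up data).
post = posterior_ancestor([1, 1], [0.1, 0.1])
```

Reporting the whole list `post`, rather than only its largest entry, is what distinguishes a posterior distribution from a point estimate.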
Many have argued that Bayesian inference is the most accurate of these methods. In general, Bayesian statistical methods allow investigators to combine pre-existing information with new evidence. In the case of evolution, Bayesian inference combines the likelihood of the observed data with the likelihood that the events happened in the order they did, while recognizing the potential for error and uncertainty. It has been argued to be the most accurate method for reconstructing ancestral genetic sequences, as well as protein stability. Unlike the other two methods, Bayesian inference yields a distribution of possible trees, allowing for more accurate and easily interpretable estimates of the variance of possible outcomes.
We have given two formulations above to emphasize the two different applications of Bayes' theorem, which we discuss in the following section.
Empirical and hierarchical Bayes
One of the first implementations of a Bayesian approach to ancestral sequence reconstruction was developed by Yang and colleagues, where the maximum likelihood estimates of the evolutionary model and tree, respectively, were used to define the prior distributions. Thus, their approach is an example of an empirical Bayes method to compute the posterior probabilities of ancestral character states; this method was first implemented in the software package PAML. In terms of the above Bayesian rule formulation, the empirical Bayes method fixes the model and tree, $\theta$, to their empirical estimates obtained from the data, effectively dropping $\theta$ from the posterior, likelihood, and prior terms of the formula. Moreover, Yang and colleagues used the empirical distribution of site patterns (i.e., assignments of nucleotides to tips of the tree) in their alignment of observed nucleotide sequences in the denominator, in place of exhaustively computing the probability of the data over all possible values of ''S'' given $\theta$. Computationally, the empirical Bayes method is akin to the maximum likelihood reconstruction of ancestral states except that, rather than searching for the ML assignment of states based on their respective probability distributions at each internal node, the probability distributions themselves are reported directly.
Empirical Bayes methods for ancestral reconstruction require the investigator to assume that the evolutionary model parameters and tree are known without error. When the size or complexity of the data makes this an unrealistic assumption, it may be more prudent to adopt the fully hierarchical Bayesian approach and infer the joint posterior distribution over the ancestral character states, model, and tree. Huelsenbeck and Bollback first proposed a hierarchical Bayes method for ancestral reconstruction, using Markov chain Monte Carlo (MCMC) methods to sample ancestral sequences from this joint posterior distribution. A similar approach was also used to reconstruct the evolution of symbiosis with algae in fungal species (lichenization). For example, the Metropolis-Hastings algorithm for MCMC explores the joint posterior distribution by accepting or rejecting parameter assignments on the basis of the ratio of posterior probabilities.
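The accept/reject step can be sketched on a deliberately tiny problem. The code below is an illustrative assumption, not the method of Huelsenbeck and Bollback: it samples a single ancestral state together with one rate parameter `q` for two tips, whereas a full hierarchical analysis would also sample trees and richer models. The symmetric two-state model, the exponential prior on `q`, the proposal widths and all function names are hypothetical.

```python
import math
import random

def p_trans(i, j, t, q):
    """Symmetric two-state transition probability (assumed model)."""
    same = 0.5 + 0.5 * math.exp(-2.0 * q * t)
    return same if i == j else 1.0 - same

def log_posterior(state, q, tips, lengths, prior_rate=1.0):
    """Unnormalized log posterior of (ancestral state, rate q)."""
    if q <= 0.0:
        return -math.inf  # rates must be positive
    # Uniform prior on the state, exponential prior on q (up to a constant).
    logp = math.log(0.5) - prior_rate * q
    for tip, t in zip(tips, lengths):
        logp += math.log(max(p_trans(state, tip, t, q), 1e-300))
    return logp

def sample_ancestral_state(tips, lengths, n_iter=20000, burn_in=2000, seed=1):
    """Metropolis-Hastings over (state, q); returns the fraction of retained
    samples in which the ancestor is in state 1."""
    rng = random.Random(seed)
    state, q = 0, 1.0
    current = log_posterior(state, q, tips, lengths)
    hits = kept = 0
    for step in range(n_iter):
        # Symmetric proposals: maybe flip the state, and jitter the rate.
        new_state = 1 - state if rng.random() < 0.5 else state
        new_q = q + rng.gauss(0.0, 0.3)
        proposed = log_posterior(new_state, new_q, tips, lengths)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            state, q, current = new_state, new_q, proposed
        if step >= burn_in:
            kept += 1
            hits += state
    return hits / kept

estimate = sample_ancestral_state([1, 1], [0.2, 0.3])
```

Because the proposals are symmetric, the acceptance probability reduces to the ratio of posterior probabilities, and the returned fraction approximates the posterior probability of state 1 averaged over the uncertainty in `q`.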
Put simply, the empirical Bayes approach calculates the probabilities of various ancestral states for a specific tree and model of evolution. By expressing the reconstruction of ancestral states as a set of probabilities, one can directly quantify the uncertainty for assigning any particular state to an ancestor. On the other hand, the hierarchical Bayes approach averages these probabilities over all possible trees and models of evolution, in proportion to how likely these trees and models are, given the data that has been observed.
Whether the hierarchical Bayes method confers a substantial advantage in practice remains controversial, however. Moreover, this fully Bayesian approach is limited to analyzing relatively small numbers of sequences or taxa because the space of all possible trees rapidly becomes too vast, making it computationally infeasible for chain samples to converge in a reasonable amount of time.
Calibration
Ancestral reconstruction can be informed by the observed states in historical samples of known age, such as fossils or archival specimens. Since the accuracy of ancestral reconstruction generally decays with increasing time, the use of such specimens provides data that are closer to the ancestors being reconstructed and will most likely improve the analysis, especially when rates of character change vary through time. This concept has been validated by an experimental evolutionary study in which replicate populations of bacteriophage T7 were propagated to generate an artificial phylogeny. In revisiting these experimental data, Oakley and Cunningham found that maximum parsimony methods were unable to accurately reconstruct the known ancestral state of a continuous character (plaque size); these results were verified by computer simulation. This failure of ancestral reconstruction was attributed to a directional bias in the evolution of plaque size (from large to small plaque diameters) that required the inclusion of "fossilized" samples to address.
Studies of both mammalian carnivores and fishes have demonstrated that without incorporating fossil data, the reconstructed estimates of ancestral body sizes are unrealistically large. Moreover, Graham Slater and colleagues showed using caniform carnivorans that incorporating fossil data into prior distributions improved both the Bayesian inference of ancestral states and evolutionary model selection, relative to analyses using only contemporaneous data.
Models
Many models have been developed to estimate ancestral states of discrete and continuous characters from extant descendants. Such models assume that the evolution of a trait through time may be modelled as a stochastic process. For discrete-valued traits (such as "pollinator type"), this process is typically taken to be a Markov chain; for continuous-valued traits (such as "brain size"), the process is frequently taken to be a Brownian motion or an Ornstein-Uhlenbeck process. Using this model as the basis for statistical inference, one can then use maximum likelihood methods or Bayesian inference to estimate the ancestral states.
Discrete-state models
Suppose the trait in question may fall into one of $k$ states, labelled $1, \ldots, k$. The typical means of modelling evolution of this trait is via a continuous-time Markov chain, which may be briefly described as follows. Each state has associated with it rates of transition to all of the other states. The trait is modelled as stepping between the $k$ states; when it reaches a given state, it starts an exponential "clock" for each of the other states that it can step to. It then "races" the clocks against each other, and it takes a step towards the state whose clock is the first to ring. In such a model, the parameters are the transition rates $\mathbf{q} = \{q_{ij} : 1 \le i, j \le k,\ i \ne j\}$, which can be estimated using, for example, maximum likelihood methods, where one maximizes over the set of all possible configurations of states of the ancestral nodes.
In order to recover the state of a given ancestral node in the phylogeny (call this node $a$) by maximum likelihood, the procedure is: find the maximum likelihood estimate $\hat{\mathbf{q}}$ of $\mathbf{q}$; then compute the likelihood of each possible state for $a$, conditioning on $\mathbf{q} = \hat{\mathbf{q}}$; finally, choose the ancestral state which maximizes this. One may also use this substitution model as the basis for a Bayesian inference procedure, which would consider the posterior belief in the state of an ancestral node given some user-chosen prior.
Because such models may have as many as $k(k-1)$ parameters, overfitting may be an issue. Some common choices that reduce the parameter space are:
* ''Markov $k$-state 1 parameter model'': this model is the reverse-in-time $k$-state counterpart of the Jukes-Cantor model. In this model, all transitions have the same rate $q$, regardless of their start and end states. Some transitions may be disallowed by declaring that their rates are simply 0; this may be the case, for example, if certain states cannot be reached from other states in a single transition.
* ''Asymmetrical Markov $k$-state 2 parameter model'': in this model, the state space is ordered (so that, for example, state 1 is smaller than state 2, which is smaller than state 3), and transitions may only occur between adjacent states. This model contains two parameters $q_{\text{inc}}$ and $q_{\text{dec}}$: one for the rate of increase of state (e.g. 0 to 1, 1 to 2, etc.), and one for the rate of decrease in state (e.g. from 2 to 1, 1 to 0, etc.).
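The two reduced models can be written down as rate matrices, from which transition probabilities along a branch follow by matrix exponentiation. The sketch below is illustrative only: the function names are hypothetical, and the small pure-Python `expm` (Taylor series with scaling and squaring) stands in for the numerical library routine that would normally be used. For the one-parameter model the result can be checked against the known closed form, in which a state is retained with probability $1/k + \frac{k-1}{k} e^{-kqt}$.

```python
import math

def rate_matrix_equal(k, q):
    """k-state, one-parameter model: every off-diagonal rate equals q."""
    return [[-(k - 1) * q if i == j else q for j in range(k)] for i in range(k)]

def rate_matrix_ordered(k, q_inc, q_dec):
    """Ordered k-state, two-parameter model: only steps between adjacent states."""
    m = [[0.0] * k for _ in range(k)]
    for i in range(k):
        if i + 1 < k:
            m[i][i + 1] = q_inc
        if i - 1 >= 0:
            m[i][i - 1] = q_dec
        m[i][i] = -sum(m[i])  # rows of a rate matrix sum to zero
    return m

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def expm(rate, t, squarings=10, terms=12):
    """Approximate exp(rate * t) by a Taylor series with scaling and squaring."""
    n = len(rate)
    scale = t / (2 ** squarings)
    a = [[rate[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for m in range(1, terms + 1):
        term = [[x / m for x in row] for row in mat_mul(term, a)]  # a^m / m!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def equal_rates_closed_form(k, q, t):
    """Closed-form (stay, switch-to-one-given-state) probabilities."""
    decay = math.exp(-k * q * t)
    return 1.0 / k + (k - 1) / k * decay, 1.0 / k - decay / k

P_equal = expm(rate_matrix_equal(3, 0.5), 1.2)
P_ordered = expm(rate_matrix_ordered(4, 0.7, 0.2), 2.0)
```

In the ordered model, `P_ordered[0][3]` is positive even though no single transition links the first and last states, because several steps may occur along one branch.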
Example: Binary state speciation and extinction model
The binary state speciation and extinction model (BiSSE) is a discrete-space model that does not directly follow the framework of those mentioned above. It allows estimation of ancestral binary character states jointly with diversification rates associated with different character states; it may also be straightforwardly extended to a more general multiple-discrete-state model. In its most basic form, this model involves six parameters: two speciation rates (one each for lineages in states 0 and 1); similarly, two extinction rates; and two rates of character change. This model allows for hypothesis testing on the rates of speciation/extinction/character change, at the cost of increasing the number of parameters.
Continuous-state models
In the case where the trait instead takes non-discrete values, one must instead turn to a model where the trait evolves as some continuous process. Inference of ancestral states by maximum likelihood (or by Bayesian methods) would proceed as above, but with the likelihoods of transitions in state between adjacent nodes given by some other continuous probability distribution.
* ''Brownian motion'': in this case, if nodes $U$ and $V$ are adjacent in the phylogeny (say $U$ is the ancestor of $V$) and separated by a branch of length $t$, the likelihood of a transition from $U$ being in state $x$ to $V$ being in state $y$ is given by a Gaussian density for the change $y - x$ with mean $0$ and variance $\sigma^2 t$. In this case, there is only one parameter ($\sigma^2$), and the model assumes that the trait evolves freely without a bias toward increase or decrease, and that the rate of change is constant throughout the branches of the phylogenetic tree.
* ''Ornstein-Uhlenbeck process'': in brief, an Ornstein-Uhlenbeck process is a continuous stochastic process that behaves like a Brownian motion, but attracted toward some central value, where the strength of the attraction increases with the distance from that value. This is useful for modelling scenarios where the trait is subject to ''stabilizing'' selection around a certain value (say $0$). Under this model, the above-described transition of $U$ being in state $x$ to $V$ being in state $y$ would have a likelihood defined by the transition density of an Ornstein-Uhlenbeck process with two parameters: $\sigma^2$, which describes the variance of the driving Brownian motion, and $\alpha$, which describes the strength of its attraction to $0$. As $\alpha$ tends to $0$, the process is less and less constrained by its attraction to $0$ and the process becomes a Brownian motion. Because of this, the models may be nested, and log-likelihood ratio tests discerning which of the two models is appropriate may be carried out.
* ''Stable models of continuous character evolution:'' though Brownian motion is appealing and tractable as a model of continuous evolution, it does not permit non-neutrality in its basic form, nor does it provide for any variation in the rate of evolution over time. Instead, one may use a stable process, one whose values at fixed times are distributed as stable distributions, to model the evolution of traits. Stable processes, roughly speaking, behave as Brownian motions that also incorporate discontinuous jumps. This makes it possible to model scenarios in which short bursts of fast trait evolution are expected. In this setting, maximum likelihood methods are poorly suited due to a rugged likelihood surface and because the likelihood may be made arbitrarily large, so Bayesian methods are more appropriate.
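The first two transition densities can be written out directly. The sketch below assumes the standard parameterizations: under Brownian motion the descendant value is Gaussian around the ancestral value with variance proportional to branch length, and under an Ornstein-Uhlenbeck process the mean is pulled toward the optimum while the variance saturates. The function and parameter names (`sigma2`, `alpha`, `optimum`) are chosen for illustration.

```python
import math

def normal_pdf(y, mean, var):
    """Density of a Gaussian with the given mean and variance, evaluated at y."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bm_density(x, y, t, sigma2):
    """Brownian motion: ancestor in state x, descendant in state y, branch length t."""
    return normal_pdf(y, x, sigma2 * t)

def ou_density(x, y, t, sigma2, alpha, optimum=0.0):
    """Ornstein-Uhlenbeck process with attraction strength alpha toward `optimum`."""
    mean = optimum + (x - optimum) * math.exp(-alpha * t)
    # expm1 keeps the variance accurate when alpha * t is very small.
    var = sigma2 * -math.expm1(-2.0 * alpha * t) / (2.0 * alpha)
    return normal_pdf(y, mean, var)

bm = bm_density(1.0, 1.5, 2.0, 0.8)
ou_weak = ou_density(1.0, 1.5, 2.0, 0.8, 1e-9)    # almost no attraction
ou_strong = ou_density(2.0, 0.0, 50.0, 1.0, 1.0)  # long branch, strong attraction
```

With a very small `alpha` the Ornstein-Uhlenbeck density approaches the Brownian one, which is the nesting that makes a likelihood ratio test between the two models possible. With strong attraction and a long branch, the starting value is effectively forgotten and the density is governed by the stationary variance `sigma2 / (2 * alpha)`.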
Applications
Character evolution
Ancestral reconstruction is widely used to infer the ecological, phenotypic, or biogeographic traits associated with ancestral nodes in a phylogenetic tree. All methods of ancestral trait reconstruction have pitfalls, as they use mathematical models to predict how traits have changed in the face of large amounts of missing data. These missing data include the states of extinct species, the relative rates of evolutionary changes, knowledge of initial character states, and the accuracy of phylogenetic trees. In all cases where ancestral trait reconstruction is used, findings should be justified with an examination of the biological data that support model-based conclusions (Griffith O.W. ''et al.'').
Ancestral reconstruction allows for the study of evolutionary pathways, adaptive selection, developmental gene expression, and functional divergence of the evolutionary past. For a review of biological and computational techniques of ancestral reconstruction, see Chang ''et al.'' For criticism of ancestral reconstruction computation methods, see Williams P.D. ''et al.''
Behavior and life history evolution
In horned lizards (genus ''Phrynosoma''), viviparity (live birth) has evolved multiple times, based on ancestral reconstruction methods.
Diet reconstruction in Galapagos finches
Both phylogenetic and character data are available for the radiation of finches inhabiting the Galapagos Islands. These data allow testing of hypotheses concerning the timing and ordering of character state changes through time via ancestral state reconstruction. During the dry season, the diets of the 13 species of Galapagos finches may be sorted into three broad diet categories: those that consume grain-like foods are considered "granivores", those that ingest arthropods are termed "insectivores", and those that consume vegetation are classified as "folivores". Dietary ancestral state reconstruction using maximum parsimony recovers two major shifts from an insectivorous state: one to granivory, and one to folivory. Maximum-likelihood ancestral state reconstruction recovers broadly similar results, with one significant difference: the common ancestor of the tree finch (''Camarhynchus'') and ground finch (''Geospiza'') clades is most likely granivorous rather than insectivorous (as judged by parsimony). In this case, the difference between the ancestral states returned by maximum parsimony and maximum likelihood likely arises because ML estimates take the branch lengths of the phylogenetic tree into account.
Morphological and physiological character evolution
Phrynosomatid lizards show remarkable morphological diversity, including in the relative muscle fiber type composition in their hindlimb muscles. Ancestor reconstruction based on squared-change parsimony (equivalent to maximum likelihood under Brownian motion character evolution) indicates that horned lizards, one of the three main subclades of the lineage, have undergone a major evolutionary increase in the proportion of fast-oxidative glycolytic fibers in their iliofibularis muscles.
Mammalian body mass
In an analysis of the body mass of 1,679 placental mammal species comparing stable models of continuous character evolution to Brownian motion models, Elliot and Mooers showed that the evolutionary process describing mammalian body mass evolution is best characterized by a stable model of continuous character evolution, which accommodates rare changes of large magnitude. Under a stable model, ancestral mammals retained a low body mass through early diversification, with large increases in body mass coincident with the origin of several orders of large-bodied species (e.g. ungulates). By contrast, simulation under a Brownian motion model recovered a less realistic, order-of-magnitude larger body mass among ancestral mammals, requiring significant reductions in body size prior to the evolution of orders exhibiting small body size (e.g. Rodentia). Thus stable models recover a more realistic picture of mammalian body mass evolution by permitting large transformations to occur on a small subset of branches.
Correlated character evolution
Phylogenetic comparative methods (inferences drawn through comparison of related taxa) are often used to identify biological characteristics that do not evolve independently, which can reveal an underlying dependence. For example, the evolution of the shape of a finch's beak may be associated with its foraging behaviour. However, it is not advisable to search for these associations by the direct comparison of measurements or genetic sequences, because these observations are not independent owing to their descent from common ancestors. For discrete characters, this problem was first addressed in the framework of maximum parsimony by evaluating whether two characters tended to undergo a change on the same branches of the tree. Felsenstein identified this problem for continuous character evolution and proposed a solution similar to ancestral reconstruction, in which the phylogenetic structure of the data was accommodated statistically by directing the analysis through computation of "independent contrasts" between nodes of the tree related by non-overlapping branches.
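The computation of independent contrasts can be sketched as a single pass from the tips toward the root. The version below follows the usual textbook description of the procedure under Brownian motion; the nested-tuple tree encoding, the trait values and the function name are hypothetical choices for illustration. Each internal node is a pair of `(child, branch_length)` entries and each tip is a plain trait value.

```python
import math

def independent_contrasts(node):
    """Return (node value, extra branch length, list of standardized contrasts)."""
    if not isinstance(node, tuple):
        return float(node), 0.0, []  # a tip: observed value, nothing to add
    (left, v1), (right, v2) = node
    x1, extra1, c1 = independent_contrasts(left)
    x2, extra2, c2 = independent_contrasts(right)
    # Lengthen each branch to account for uncertainty in the child's estimate.
    v1 += extra1
    v2 += extra2
    # Standardized contrast between the two daughter lineages.
    contrast = (x1 - x2) / math.sqrt(v1 + v2)
    # Value assigned to this node: average weighted by inverse branch length.
    value = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)
    extra = v1 * v2 / (v1 + v2)
    return value, extra, c1 + c2 + [contrast]

# Hypothetical tree ((a:1, b:1):1, c:2) with trait values a=1, b=3, c=5.
tree = ((((1.0, 1.0), (3.0, 1.0)), 1.0), (5.0, 2.0))
root_value, _, contrasts = independent_contrasts(tree)
```

A tree with three tips yields two contrasts, each computed from non-overlapping branches. Repeating the calculation for a second trait and correlating the two sets of contrasts is the usual way to test for correlated evolution. The value carried to the root, `root_value`, is the inverse-variance weighted average that the procedure assigns to the root node.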
Molecular evolution
On a molecular level, amino acid residues at different locations of a protein may evolve non-independently because they have a direct physicochemical interaction, or indirectly by their interactions with a common substrate or through long-range interactions in the protein structure. Conversely, the folded structure of a protein could potentially be inferred from the distribution of residue interactions. One of the earliest applications of ancestral reconstruction, to predict the three-dimensional structure of a protein through residue contacts, was published by Shindyalov and colleagues. Phylogenies relating 67 different protein families were generated by a distance-based clustering method (unweighted pair group method with arithmetic mean, UPGMA), and ancestral sequences were reconstructed by parsimony. The authors reported a weak but significant tendency for co-evolving pairs of residues to be co-located in the known three-dimensional structure of the proteins.
The reconstruction of ancient proteins and DNA sequences has only recently become a significant scientific endeavour. The development of extensive genomic sequence databases, in conjunction with advances in biotechnology and phylogenetic inference methods, has made ancestral reconstruction cheap, fast, and scientifically practical. This concept has been applied to identify co-evolving residues in protein sequences using more advanced methods for the reconstruction of phylogenies and ancestral sequences. For example, ancestral reconstruction has been used to identify co-evolving residues in proteins encoded by RNA virus genomes, particularly in HIV.
Ancestral protein and DNA reconstruction allows for the recreation of protein and DNA evolution in the laboratory so that it can be studied directly. With respect to proteins, this allows for the investigation of the evolution of present-day molecular structure and function. Additionally, ancestral protein reconstruction can lead to the discovery of new biochemical functions that have been lost in modern proteins. It also allows insights into the biology and ecology of extinct organisms. Although the majority of ancestral reconstructions have dealt with proteins, the approach has also been used to test evolutionary mechanisms at the level of bacterial genomes and primate gene sequences.
Vaccine design
RNA viruses such as the human immunodeficiency virus (HIV) evolve at an extremely rapid rate, orders of magnitude faster than mammals or birds. For these organisms, ancestral reconstruction can be applied on a much shorter time scale; for example, in order to reconstruct the global or regional progenitor of an epidemic that has spanned decades rather than millions of years. A team around Brian Gaschen proposed that such reconstructed strains be used as targets for vaccine
design efforts, as opposed to sequences isolated from patients in the present day. Because HIV is extremely diverse, a vaccine designed to work on one patient's viral population might not work for a different patient, because the evolutionary distance between these two viruses may be large. However, their most recent common ancestor is closer to each of the two viruses than they are to each other. Thus, a vaccine designed for a common ancestor could have a better chance of being effective for a larger proportion of circulating strains. Another team took this idea further by developing a center-of-tree reconstruction method to produce a sequence whose total evolutionary distance to contemporary strains is as small as possible. Strictly speaking, this method was not ''ancestral'' reconstruction, as the center-of-tree (COT) sequence does not necessarily represent a sequence that has ever existed in the evolutionary history of the virus. However, Rolland and colleagues did find that, in the case of HIV, the COT virus was functional when synthesized. Similar experiments with synthetic ancestral sequences obtained by maximum likelihood reconstruction have likewise shown that these ancestors are both functional and immunogenic, lending some credibility to these methods. Furthermore, ancestral reconstruction can potentially be used to infer the genetic sequence of the transmitted HIV variants that have gone on to establish the next infection, with the objective of identifying distinguishing characteristics of these variants (as a non-random selection of the transmitted population of viruses) that may be targeted for vaccine design.
Genome rearrangements
Rather than inferring the ancestral DNA sequence, one may be interested in the larger-scale molecular structure and content of an ancestral genome. This problem is often approached in a combinatorial framework, by modelling genomes as permutations of genes or homologous regions. Various operations are allowed on these permutations, such as an inversion (a segment of the permutation is reversed in-place), deletion (a segment is removed), transposition (a segment is removed from one part of the permutation and spliced in somewhere else), or gain of genetic content through recombination, duplication or horizontal gene transfer. The "genome rearrangement problem", first posed by Watterson and colleagues, asks: given two genomes (permutations) and a set of allowable operations, what is the shortest sequence of operations that will transform one genome into the other? A generalization of this problem applicable to ancestral reconstruction is the "multiple genome rearrangement problem": given a set of genomes and a set of allowable operations, find (i) a binary tree with the given genomes as its leaves, and (ii) an assignment of genomes to the internal nodes of the tree, such that the total number of operations across the whole tree is minimized. This approach is similar to parsimony, except that the tree is inferred along with the ancestral sequences. Unfortunately, even the single genome rearrangement problem is NP-hard
In computational complexity theory, NP-hardness ( non-deterministic polynomial-time hardness) is the defining property of a class of problems that are informally "at least as hard as the hardest problems in NP". A simple example of an NP-hard pr ...
, although it has received much attention in mathematics and computer science (for a review, see Fertin and colleagues).
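The pairwise problem can be illustrated with a brute-force search. The following is a minimal sketch, assuming unsigned permutations and inversions as the only allowed operation; the function names and the toy genomes are hypothetical and not taken from any published tool. Because it explores the space of permutations exhaustively, it is only practical for very small genomes.

```python
from collections import deque


def inversion_neighbors(perm):
    """Yield every permutation reachable from perm by reversing one segment."""
    n = len(perm)
    for i in range(n):
        for j in range(i + 2, n + 1):  # only segments of length >= 2 change anything
            yield perm[:i] + perm[i:j][::-1] + perm[j:]


def inversion_distance(source, target):
    """Minimum number of inversions turning source into target (breadth-first search)."""
    source, target = tuple(source), tuple(target)
    if sorted(source) != sorted(target):
        raise ValueError("both genomes must contain the same genes")
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        for nxt in inversion_neighbors(perm):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise RuntimeError("target not reachable")  # cannot occur for equal gene sets


# Hypothetical toy genomes with three genes.
print(inversion_distance((3, 1, 2), (1, 2, 3)))  # prints 2
```

Breadth-first search guarantees that the first time the target is reached, the number of inversions used is minimal. The number of permutations grows factorially with genome size, which is why practical methods rely on specialised algorithms or heuristics instead.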
The reconstruction of ancestral genomes is also called karyotype reconstruction. Chromosome painting is currently the main experimental technique. Recently, researchers have developed computational methods to reconstruct the ancestral karyotype by taking advantage of comparative genomics. Furthermore, comparative genomics and ancestral genome reconstruction have been applied to identify ancient horizontal gene transfer events at the last common ancestor of a lineage (e.g. ''Candidatus'' Accumulibacter phosphatis) and thereby determine the evolutionary basis for trait acquisition.
Spatial applications
=Migration=
Ancestral reconstruction is not limited to biological traits. Spatial location is also a trait, and ancestral reconstruction methods can infer the locations of ancestors of the individuals under consideration. Such techniques were used by Lemey and colleagues to geographically trace the ancestors of 192 avian influenza A-H5N1 strains sampled from twenty localities in Europe and Asia, and of 101 rabies virus sequences sampled across twelve African countries.
Treating locations as discrete states (countries, cities, etc.) allows for the application of the discrete-state models described above. However, unlike in a model where the state space for the trait is small, there may be many locations, and transitions between certain pairs of states may rarely or never occur; for example, migration between distant locales may never happen directly if air travel between the two places does not exist, so such migrations must pass through intermediate locales first. This means that many parameters in the model could be zero or close to zero. To address this, Lemey and colleagues used a Bayesian procedure not only to estimate the parameters and ancestral states, but also to select which migration parameters are non-zero; their work suggests that this procedure leads to more efficient use of the data. They also explored prior distributions that incorporate geographical structure or hypotheses about migration dynamics, finding that those they considered had little effect on the findings.
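The effect of zero-valued migration parameters can be illustrated with a small continuous-time Markov chain. The following is a sketch under stated assumptions, not the Bayesian procedure of Lemey and colleagues: the three locations, the rate values and the function names are hypothetical, and the matrix exponential is computed with a plain truncated Taylor series plus repeated squaring.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(rates, t, terms=30, squarings=10):
    """Return P(t) = exp(Q t) for off-diagonal migration rates `rates`."""
    n = len(rates)
    # Build the generator Q: off-diagonals are rates, diagonals make rows sum to zero.
    q = [[rates[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])
    # Scale the matrix down, sum the Taylor series, then square the result back up.
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical locations A, B, C: no direct migration between A and C.
rates = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 1.0],
         [0.0, 1.0, 0.0]]
P = transition_probabilities(rates, t=0.5)
print(P[0][1], P[0][2])  # A to B is more probable than A to C, but A to C is not zero
```

Although the direct rate between A and C is zero, the probability of moving from A to C over a finite time is positive, because lineages can pass through B. With many locations most entries of the rate matrix may be zero in this way, which is what motivates selecting the non-zero parameters as part of the inference.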
Using this analysis, Lemey and colleagues found that the most likely hub of diffusion of A-H5N1 is Guangdong, with Hong Kong also receiving posterior support. Further, their results support the hypothesis of a long-standing presence of African rabies in West Africa.
=Species ranges=
Inferring historical biogeographic patterns often requires reconstructing ancestral ranges of species on phylogenetic trees. For instance, a well-resolved phylogeny of plant species in the genus ''Cyrtandra'' was used together with information on their geographic ranges to compare four methods of ancestral range reconstruction. The team compared Fitch parsimony (FP; parsimony), stochastic mapping (SM; maximum likelihood), dispersal-vicariance analysis (DIVA; parsimony), and dispersal-extinction-cladogenesis (DEC; maximum likelihood). Results indicated that both parsimony methods performed poorly, probably because parsimony methods do not consider branch lengths. Both maximum-likelihood methods performed better; however, DEC analyses, which additionally allow the incorporation of geological priors, gave more realistic inferences about range evolution in ''Cyrtandra'' than the other methods.
Another maximum likelihood method recovers the phylogeographic history of a gene by reconstructing the ancestral locations of the sampled taxa. This method assumes a spatially explicit random walk model of migration to reconstruct ancestral locations given the geographic coordinates of the individuals represented by the tips of the phylogenetic tree. When applied to a phylogenetic tree of chorus frogs ''Pseudacris feriarum'', this method recovered recent northward expansion, higher per-generation dispersal distance in the recently colonized region, a non-central ancestral location, and directional migration.
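The idea of estimating an ancestral location from tip coordinates can be sketched in a deliberately simplified setting. The example below is not the method described above. It assumes plain Brownian motion, planar coordinates and a single ancestor connected directly to every tip (a star tree), in which case the maximum likelihood location is the average of the tip coordinates weighted by inverse branch length. The function name and the coordinates are hypothetical.

```python
def ml_ancestral_location(tips):
    """Estimate an ancestor's (x, y) location from its descendant tips.

    `tips` is a list of ((x, y), branch_length) pairs. Under Brownian motion the
    variance of each tip around the ancestor is proportional to its branch length,
    so each tip is weighted by 1 / branch_length.
    """
    total_weight = 0.0
    x_sum = 0.0
    y_sum = 0.0
    for (x, y), branch_length in tips:
        if branch_length <= 0:
            raise ValueError("branch lengths must be positive")
        weight = 1.0 / branch_length
        total_weight += weight
        x_sum += weight * x
        y_sum += weight * y
    return (x_sum / total_weight, y_sum / total_weight)


# Hypothetical tips: the second tip is twice as distant in time, so it counts half as much.
print(ml_ancestral_location([((0.0, 0.0), 1.0), ((3.0, 3.0), 2.0)]))  # prints (1.0, 1.0)
```

On a full phylogeny the same weighting principle is applied recursively across internal nodes, and spatially explicit models add further parameters such as directional migration that this sketch omits.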
The first consideration of the multiple genome rearrangement problem, long before its formalization in terms of permutations, was presented by Sturtevant and Dobzhansky in 1936. They examined genomes of several strains of fruit fly from different geographic locations, and observed that one configuration, which they called "standard", was the most common throughout all the studied areas. Remarkably, they also noticed that four different strains could be obtained from the standard sequence by a single inversion, and two others could be related by a second inversion. This allowed them to hypothesize a phylogeny for the sequences, and to infer that the standard sequence was probably also the ancestral one.
Linguistic evolution
Reconstructions of the words and phonemes of ancient proto-languages such as Proto-Indo-European have been performed based on the observed analogues in present-day languages. Typically, these analyses are carried out manually using the "comparative method". First, words from different languages with a common etymology (cognates) are identified in the contemporary languages under study, analogous to the identification of orthologous biological sequences. Second, correspondences between individual sounds in the cognates are identified, a step similar to biological sequence alignment, although performed manually. Finally, likely ancestral sounds are hypothesised by manual inspection and various heuristics (such as the fact that most languages have both nasal and non-nasal vowels).
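The second step, tallying sound correspondences across cognates, can be sketched as follows. This is an illustration only: it assumes the cognates have already been identified and aligned to equal length with one character per sound, and that every cognate set covers the same languages. The language labels, the toy word forms and the function name are hypothetical.

```python
from collections import Counter


def sound_correspondences(cognate_sets):
    """Count how often each tuple of aligned sounds recurs across cognate sets.

    Each cognate set maps a language label to a pre-aligned word. Sounds are
    reported in alphabetical order of the language labels.
    """
    counts = Counter()
    for cognates in cognate_sets:
        languages = sorted(cognates)
        if len({len(cognates[lang]) for lang in languages}) != 1:
            raise ValueError("cognates must be pre-aligned to equal length")
        for sounds in zip(*(cognates[lang] for lang in languages)):
            counts[sounds] += 1
    return counts


# Hypothetical toy data for two invented languages, A and B.
toy_cognates = [
    {"A": "pata", "B": "fata"},
    {"A": "pilo", "B": "filo"},
]
counts = sound_correspondences(toy_cognates)
print(counts[("p", "f")])  # prints 2: the p/f correspondence recurs in both cognate sets
```

Correspondences that recur across many cognate sets are the regular ones that a linguist would then use to hypothesise an ancestral sound. That final step relies on judgement and heuristics and is not automated here.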
Software
There are many software packages available which can perform ancestral state reconstruction. Generally, these software packages have been developed and maintained through the efforts of scientists in related fields and released under free software licenses. The following table is not meant to be a comprehensive itemization of all available packages, but provides a representative sample of the extensive variety of packages that implement methods of ancestral reconstruction with different strengths and features.
Package descriptions
Molecular evolution
The majority of these software packages are designed for analyzing genetic sequence data. For example, PAML is a collection of programs for the phylogenetic analysis of DNA and protein sequence alignments by maximum likelihood. Ancestral reconstruction can be performed using the ''codeml'' program. In addition, LAZARUS is a collection of Python scripts that wrap the ancestral reconstruction functions of PAML for batch processing and greater ease-of-use. Software packages such as MEGA, HyPhy, and Mesquite also perform phylogenetic analysis of sequence data, but are designed to be more modular and customizable. HyPhy implements a joint maximum likelihood method of ancestral sequence reconstruction that can be readily adapted to reconstructing a more generalized range of discrete ancestral character states such as geographic locations by specifying a customized model in its batch language. Mesquite provides ancestral state reconstruction methods for both discrete and continuous characters using both maximum parsimony and maximum likelihood methods. It also provides several visualization tools for interpreting the results of ancestral reconstruction. MEGA is a modular system, too, but places greater emphasis on ease-of-use than customization of analyses. As of version 5, MEGA allows the user to reconstruct ancestral states using maximum parsimony, maximum likelihood, and empirical Bayes methods.
The Bayesian analysis of genetic sequences may confer greater robustness to model misspecification. MrBayes allows inference of ancestral states at ancestral nodes using the full hierarchical Bayesian approach. The PREQUEL program distributed in the PHAST package performs comparative evolutionary genomics using ancestral sequence reconstruction. SIMMAP stochastically maps mutations on phylogenies. BayesTraits analyses discrete or continuous characters in a Bayesian framework to evaluate models of evolution, reconstruct ancestral states, and detect correlated evolution between pairs of traits. ProtASR performs ancestral sequence reconstruction (ASR) of proteins accounting for structural constraints.
Other character types
Other software packages are more oriented towards the analysis of qualitative and quantitative traits (phenotypes). For example, the ''ape'' package in the statistical computing environment R provides methods for ancestral state reconstruction for both discrete and continuous characters through the ''ace'' function, including maximum likelihood. Phyrex implements a maximum parsimony-based algorithm to reconstruct ancestral gene expression profiles, in addition to a maximum likelihood method for reconstructing ancestral genetic sequences (by wrapping around the ''baseml'' program in PAML).
Several software packages also reconstruct phylogeography. BEAST (Bayesian Evolutionary Analysis by Sampling Trees) provides tools for reconstructing ancestral geographic locations from observed sequences annotated with location data using Bayesian MCMC sampling methods. Diversitree is an R package providing methods for ancestral state reconstruction under Mk2 (a continuous-time Markov model of binary character evolution) and BiSSE (Binary State Speciation and Extinction) models. Lagrange performs analyses on reconstruction of geographic range evolution on phylogenetic trees. Phylomapper is a statistical framework for estimating historical patterns of gene flow and ancestral geographic locations. RASP infers ancestral states using statistical dispersal-vicariance analysis, Lagrange, Bayes-Lagrange, BayArea and BBM methods. VIP infers historical biogeography by examining disjunct geographic distributions.
Genome rearrangements provide valuable information in comparative genomics between species. ANGES compares extant related genomes through ancestral reconstruction of genetic markers. BADGER uses a Bayesian approach to examining the history of gene rearrangement. Count reconstructs the evolution of the size of gene families. EREM analyses the gain and loss of genetic features encoded by binary characters. PARANA performs parsimony-based inference of ancestral biological networks that represent gene loss and duplication.
Web applications
Finally, there are several web-server based applications that allow investigators to use maximum likelihood methods for ancestral reconstruction of different character types without having to install any software. For example, Ancestors is a web server for ancestral genome reconstruction by the identification and arrangement of syntenic regions. FastML is a web server for probabilistic reconstruction of ancestral sequences by maximum likelihood that uses a gap character model for reconstructing indel variation. MLGO is a web server for maximum likelihood gene order analysis.
Future directions
The development and application of computational algorithms for ancestral reconstruction continues to be an active area of research across disciplines. For example, the reconstruction of sequence insertions and deletions (indels) has lagged behind the more straightforward application of substitution models. Bouchard-Côté and Jordan recently described a new model (the Poisson Indel Process) which represents an important advance on the archetypal Thorne-Kishino-Felsenstein model of indel evolution. In addition, the field is being driven forward by rapid advances in the area of next-generation sequencing technology, where sequences are generated from millions of nucleic acid templates by extensive parallelization of sequencing reactions in a custom apparatus. These advances have made it possible to generate a "deep" snapshot of the genetic composition of a rapidly evolving population, such as RNA viruses or tumour cells, in a relatively short amount of time. At the same time, the massive amount of data and platform-specific sequencing error profiles have created new bioinformatic challenges for processing these data for ancestral sequence reconstruction.
See also
* Evolutionary biology
* Origin of life
* Enzyme promiscuity
* Ancestral sequence reconstruction