A number of different Markov models of DNA sequence evolution have been proposed. These substitution models differ in terms of the parameters used to describe the rates at which one nucleotide replaces another during evolution. These models are frequently used in molecular phylogenetic analyses. In particular, they are used during the calculation of the likelihood of a tree (in Bayesian and maximum likelihood approaches to tree estimation) and they are used to estimate the evolutionary distance between sequences from the observed differences between the sequences.

Introduction

These models are phenomenological descriptions of the evolution of DNA as a string of four discrete states. These Markov models do not explicitly depict the mechanism of mutation nor the action of natural selection. Rather, they describe the relative rates of different changes. For example, mutational biases and purifying selection favoring conservative changes are probably both responsible for the relatively high rate of transitions compared to transversions in evolving sequences. However, the Kimura (K80) model described below only attempts to capture the effect of both forces in a parameter that reflects the relative rate of transitions to transversions.

Evolutionary analyses of sequences are conducted on a wide variety of time scales. Thus, it is convenient to express these models in terms of the instantaneous rates of change between different states (the ''Q'' matrices below). If we are given a starting (ancestral) state at one position, the model's ''Q'' matrix and a branch length expressing the expected number of changes to have occurred since the ancestor, then we can derive the probability of the descendant sequence having each of the four states. The mathematical details of this transformation from rate matrix to probability matrix are described in the mathematics of substitution models section of the substitution model page. By expressing models in terms of the instantaneous rates of change we can avoid estimating a large number of parameters for each branch on a phylogenetic tree (or each comparison if the analysis involves many pairwise sequence comparisons).

The models described on this page describe the evolution of a single site within a set of sequences. They are often used for analyzing the evolution of an entire locus by making the simplifying assumption that different sites evolve independently and are identically distributed. This assumption may be justifiable if the sites can be assumed to be evolving neutrally. If the primary effect of natural selection on the evolution of the sequences is to constrain some sites, then models of among-site rate heterogeneity can be used. This approach allows one to estimate only one matrix of relative rates of substitution, and another set of parameters describing the variance in the total rate of substitution across sites.

DNA evolution as a continuous-time Markov chain

Continuous-time Markov chains

''Continuous-time'' Markov chains have the usual transition matrices which are, in addition, parameterized by time, $t$. Specifically, if $E_1, E_2, E_3, E_4$ are the states, then the transition matrix is

P(t) = \big(P_{ij}(t)\big)

where each individual entry, $P_{ij}(t)$, refers to the probability that state $E_i$ will change to state $E_j$ in time $t$.

Example: We would like to model the substitution process in DNA sequences (''i.e.'' Jukes–Cantor, Kimura, ''etc.'') in a continuous-time fashion. The corresponding transition matrices will look like:

P(t) = \begin{pmatrix}
            p_{\mathrm{AA}}(t) & p_{\mathrm{AG}}(t) & p_{\mathrm{AC}}(t) & p_{\mathrm{AT}}(t) \\
            p_{\mathrm{GA}}(t) & p_{\mathrm{GG}}(t) & p_{\mathrm{GC}}(t) & p_{\mathrm{GT}}(t) \\
            p_{\mathrm{CA}}(t) & p_{\mathrm{CG}}(t) & p_{\mathrm{CC}}(t) & p_{\mathrm{CT}}(t) \\
            p_{\mathrm{TA}}(t) & p_{\mathrm{TG}}(t) & p_{\mathrm{TC}}(t) & p_{\mathrm{TT}}(t)
       \end{pmatrix}

where the top-left and bottom-right 2 × 2 blocks correspond to ''transition probabilities'' (apart from their diagonal entries, which are the probabilities of ending in the starting state) and the top-right and bottom-left 2 × 2 blocks correspond to ''transversion probabilities''.

Assumption: If at some time $t_0$ the Markov chain is in state $E_i$, then the probability that at time $t_0+t$ it will be in state $E_j$ depends only upon $i$, $j$ and $t$. This then allows us to write that probability as $p_{ij}(t)$.

Theorem: Continuous-time transition matrices satisfy:

P(t+\tau) = P(t)P(\tau)

Note: There is here a possible confusion between two meanings of the word ''transition''. (i) In the context of ''Markov chains'', transition is the general term for the change between two states. (ii) In the context of ''nucleotide changes in DNA sequences'', transition is a specific term for the exchange between either the two purines (A ↔ G) or the two pyrimidines (C ↔ T) (for additional details, see the article about transitions in genetics). By contrast, an exchange between one purine and one pyrimidine is called a transversion.

Deriving the dynamics of substitution

Consider a DNA sequence of fixed length ''m'' evolving in time by base replacement. Assume that the processes followed by the ''m'' sites are Markovian, independent and identically distributed, and that the process is constant over time. For a particular site, let

\mathcal{E} = \{A,\, G,\, C,\, T\}

be the set of possible states for the site, and

\mathbf{p}(t) = (p_A(t),\,  p_G(t),\,  p_C(t),\,  p_T(t))

their respective probabilities at time $t$. For two distinct $x, y \in \mathcal{E}$, let $\mu_{xy}$ be the transition rate from state $x$ to state $y$. Similarly, for any $x$, let the total rate of change from $x$ be

\mu_x = \sum_{y \neq x}\mu_{xy}\,.

The changes in the probability $p_A(t)$ for small increments of time $\Delta t$ are given by

p_A(t+\Delta t) = p_A(t) - p_A(t)\mu_A\Delta t + \sum_{x \neq A}p_x(t)\mu_{xA}\Delta t\,.

In other words (in frequentist language), the frequency of $A$'s at time $t + \Delta t$ is equal to the frequency at time $t$ minus the frequency of the ''lost'' $A$'s plus the frequency of the ''newly created'' $A$'s. Similar equations hold for the probabilities $p_G(t)$, $p_C(t)$ and $p_T(t)$. These equations can be written compactly as

\mathbf{p}(t+\Delta t) = \mathbf{p}(t) + \mathbf{p}(t)Q\Delta t\,,

where

Q = \begin{pmatrix} -\mu_A & \mu_{AG} & \mu_{AC} & \mu_{AT} \\
                            \mu_{GA} & -\mu_G  & \mu_{GC} & \mu_{GT} \\
                            \mu_{CA} & \mu_{CG} & -\mu_C  & \mu_{CT} \\
                            \mu_{TA} & \mu_{TG} & \mu_{TC} & -\mu_T \end{pmatrix}

is known as the ''rate matrix''. Note that, by definition, the sum of the entries in each row of $Q$ is equal to zero. Subtracting $\mathbf{p}(t)$ from both sides, dividing by $\Delta t$ and letting $\Delta t \to 0$, it follows that

\mathbf{p}'(t) = \mathbf{p}(t) Q\,.

For a stationary process, where $Q$ does not depend on time ''t'', this differential equation can be solved. First,

P(t) = \exp(tQ),

where $\exp(tQ)$ denotes the exponential of the matrix $tQ$. As a result,

\mathbf{p}(t) = \mathbf{p}(0)P(t) = \mathbf{p}(0)\exp(tQ) \,.
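As a rough numerical illustration of this solution, the sketch below approximates the matrix exponential in pure Python by scaling and squaring a truncated Taylor series. The rate values, the helper names (`mat_mul`, `mat_exp`, `evolve`) and the truncation settings are assumptions made for this example and are not taken from any particular phylogenetics package.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_exp(q, t, squarings=10, terms=12):
    """Approximate exp(t*Q) by scaling and squaring a truncated Taylor series."""
    n = len(q)
    scale = 2.0 ** squarings
    small = [[q[i][j] * t / scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes small**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def evolve(p0, q, t):
    """Return p(t) = p(0) exp(tQ) for a row vector p(0)."""
    p_t = mat_exp(q, t)
    return [sum(p0[i] * p_t[i][j] for i in range(len(p0))) for j in range(len(p0))]


# Illustrative rate matrix in A, G, C, T order; each row sums to zero.
Q = [[-0.9, 0.5, 0.2, 0.2],
     [0.5, -0.9, 0.2, 0.2],
     [0.2, 0.2, -0.9, 0.5],
     [0.2, 0.2, 0.5, -0.9]]

P1 = mat_exp(Q, 1.0)
p_after = evolve([1.0, 0.0, 0.0, 0.0], Q, 1.0)  # site known to start as A
```

Each row of `P1` should sum to 1, and the property $P(t+\tau) = P(t)P(\tau)$ can be spot-checked by comparing `mat_exp(Q, 1.0)` with `mat_mul(mat_exp(Q, 0.3), mat_exp(Q, 0.7))`.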

Ergodicity

If the Markov chain is irreducible, ''i.e.'' if it is always possible to go from a state $x$ to a state $y$ (possibly in several steps), then it is also ergodic. As a result, it has a unique ''stationary distribution'' $\boldsymbol{\pi} = \{\pi_x,\, x \in \mathcal{E}\}$, where $\pi_x$ corresponds to the proportion of time spent in state $x$ after the Markov chain has run for an infinite amount of time. In DNA evolution, under the assumption of a common process for each site, the stationary frequencies $\pi_A,\, \pi_G,\, \pi_C,\, \pi_T$ correspond to equilibrium base compositions. Indeed, note that since the stationary distribution $\boldsymbol{\pi}$ satisfies $\boldsymbol{\pi} Q = 0$, we see that when the current distribution $\mathbf{p}(t)$ is the stationary distribution $\boldsymbol{\pi}$ we have

\mathbf{p}'(t) = \boldsymbol{\pi} Q = 0 \,.

In other words, the frequencies $p_A(t),\, p_G(t),\, p_C(t),\, p_T(t)$ do not change.
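The convergence to the stationary distribution can be seen numerically with a minimal sketch that repeatedly applies the small-step update $\mathbf{p}(t+\Delta t) = \mathbf{p}(t) + \mathbf{p}(t)Q\Delta t$. The frequencies, the form of the example rate matrix, the step size and the function name `euler_step` are all assumptions chosen for illustration.

```python
def euler_step(p, q, dt):
    """One step of p(t + dt) = p(t) + p(t) Q dt."""
    n = len(p)
    return [p[j] + dt * sum(p[i] * q[i][j] for i in range(n)) for j in range(n)]


# Hypothetical equilibrium frequencies in A, G, C, T order.
pi = [0.1, 0.2, 0.3, 0.4]

# Example rate matrix whose rate of change to state j is pi[j];
# the diagonal entry makes each row sum to zero.
Q = [[pi[j] if i != j else pi[j] - 1.0 for j in range(4)] for i in range(4)]

# pi Q should vanish (up to rounding) if pi is stationary.
pi_Q = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]

# Start from a site that is certainly A and iterate the small-step update.
p = [1.0, 0.0, 0.0, 0.0]
for _ in range(5000):
    p = euler_step(p, Q, 0.01)
```

After enough steps `p` is numerically indistinguishable from `pi`, whatever the starting state, which is what ergodicity predicts.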

Time reversibility

Definition: A stationary Markov process is ''time reversible'' if (in the steady state) the amount of change from state $x$ to $y$ is equal to the amount of change from $y$ to $x$ (although the two states may occur with different frequencies). This means that:

\pi_x\mu_{xy} = \pi_y\mu_{yx}

Not all stationary processes are reversible; however, most commonly used DNA evolution models assume time reversibility, which is considered to be a reasonable assumption. Under the time reversibility assumption, let $s_{xy} = \mu_{xy}/\pi_y$. Dividing the reversibility condition by $\pi_x\pi_y$ then shows that:

s_{xy} = s_{yx}

Definition: The symmetric term $s_{xy}$ is called the ''exchangeability'' between states $x$ and $y$. In other words, $s_{xy}$ is the fraction of the frequency of state $x$ that is the result of transitions from state $y$ to state $x$.

Corollary: The 12 off-diagonal entries of the rate matrix $Q$ (note the off-diagonal entries determine the diagonal entries, since the rows of $Q$ sum to zero) can be completely determined by 9 numbers; these are 6 exchangeability terms and 3 stationary frequencies $\pi_x$ (since the stationary frequencies sum to 1).
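The corollary can be illustrated with a short sketch that builds a full rate matrix from 6 exchangeabilities and 3 free frequencies using $\mu_{xy} = s_{xy}\pi_y$. The numeric values and the function name `reversible_rate_matrix` are hypothetical and serve only to make the construction concrete.

```python
STATES = "AGCT"


def reversible_rate_matrix(exchangeability, pi):
    """Build Q with off-diagonal entries mu_xy = s_xy * pi_y."""
    n = len(STATES)
    q = [[0.0] * n for _ in range(n)]
    for i, x in enumerate(STATES):
        for j, y in enumerate(STATES):
            if i != j:
                q[i][j] = exchangeability[frozenset((x, y))] * pi[j]
        q[i][i] = -sum(q[i])  # rows of Q sum to zero
    return q


# Hypothetical values: 6 exchangeabilities and 3 free frequencies (the 4th follows).
s = {frozenset("AG"): 4.0, frozenset("AC"): 1.0, frozenset("AT"): 0.5,
     frozenset("GC"): 1.5, frozenset("GT"): 1.0, frozenset("CT"): 3.0}
free = [0.3, 0.2, 0.1]
pi = free + [1.0 - sum(free)]

Q = reversible_rate_matrix(s, pi)
```

Keying the exchangeabilities by unordered pairs makes the symmetry $s_{xy} = s_{yx}$ automatic, so the resulting matrix satisfies $\pi_x\mu_{xy} = \pi_y\mu_{yx}$ by construction.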

Scaling of branch lengths

By comparing extant sequences, one can determine the amount of sequence divergence. This raw measurement of divergence provides information about the number of changes that have occurred along the path separating the sequences. The simple count of differences (the Hamming distance) between sequences will often underestimate the number of substitutions because of multiple hits (see homoplasy). Trying to estimate the exact number of changes that have occurred is difficult, and usually not necessary. Instead, branch lengths (and path lengths) in phylogenetic analyses are usually expressed in the expected number of changes per site. The path length is the product of the duration of the path in time and the mean rate of substitutions. While their product can be estimated, the rate and time are not separately identifiable from sequence divergence.

The descriptions of rate matrices on this page accurately reflect the relative magnitude of different substitutions, but these rate matrices are not scaled such that a branch length of 1 yields one expected change. This scaling can be accomplished by multiplying every element of the matrix by the same factor, or simply by scaling the branch lengths. If we use β to denote the scaling factor, and ν to denote the branch length measured in the expected number of substitutions per site, then βν is used in the transition probability formulae below in place of μ''t''. Note that ν is a parameter to be estimated from data, and is referred to as the branch length, while β is simply a number that can be calculated from the rate matrix (it is not a separate free parameter).

The value of β can be found by forcing the expected rate of flux of states to 1. The diagonal entries of the rate matrix (the ''Q'' matrix) represent -1 times the rate of leaving each state. For time-reversible models, we know the equilibrium state frequencies (these are simply the $\pi_i$ parameter values for each state ''i''). Thus we can find the expected rate of change by calculating the sum of flux out of each state weighted by the proportion of sites that are expected to be in that class. Setting β to be the reciprocal of this sum will guarantee that the scaled process has an expected flux of 1:

\beta = 1/\left(-\sum_i \pi_i\mu_{ii}\right)

For example, in the Jukes–Cantor model, the scaling factor would be ''4/(3μ)'' because the rate of leaving each state is ''3μ/4''.
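The scaling factor is a one-line computation once the rate matrix and the equilibrium frequencies are known. The sketch below reproduces the Jukes–Cantor value ''4/(3μ)''; the function names `scaling_factor` and `rescale` and the value of `mu` are assumptions for this example.

```python
def scaling_factor(q, pi):
    """Return beta = 1 / (-sum_i pi_i * q_ii)."""
    flux = -sum(pi[i] * q[i][i] for i in range(len(pi)))
    return 1.0 / flux


def rescale(q, pi):
    """Multiply every entry of Q by beta so the expected flux equals 1."""
    beta = scaling_factor(q, pi)
    return [[beta * x for x in row] for row in q]


mu = 2.0  # arbitrary example value for the Jukes-Cantor rate
pi = [0.25] * 4
Q_jc = [[-3 * mu / 4 if i == j else mu / 4 for j in range(4)] for i in range(4)]

beta = scaling_factor(Q_jc, pi)   # equals 4 / (3 * mu)
Q_scaled = rescale(Q_jc, pi)      # a branch length of 1 now means one expected change
```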

Most common models of DNA evolution

JC69 model (Jukes and Cantor 1969)

JC69, the Jukes and Cantor 1969 model, is the simplest substitution model. There are several assumptions. It assumes equal base frequencies $\left(\pi_A = \pi_G = \pi_C = \pi_T = {1\over4}\right)$ and equal mutation rates. The only parameter of this model is therefore $\mu$, the overall substitution rate. As previously mentioned, this variable becomes a constant when we normalize the mean rate to 1.

Q = \begin{pmatrix} -{3\mu\over4} & {\mu\over4} & {\mu\over4} & {\mu\over4} \\ {\mu\over4} & -{3\mu\over4} & {\mu\over4} & {\mu\over4} \\ {\mu\over4} & {\mu\over4} & -{3\mu\over4} & {\mu\over4} \\ {\mu\over4} & {\mu\over4} & {\mu\over4} & -{3\mu\over4} \end{pmatrix}

P(t) = \begin{pmatrix} {1\over4} + {3\over4}e^{-t\mu} & {1\over4} - {1\over4}e^{-t\mu} & {1\over4} - {1\over4}e^{-t\mu} & {1\over4} - {1\over4}e^{-t\mu} \\\\ {1\over4} - {1\over4}e^{-t\mu} & {1\over4} + {3\over4}e^{-t\mu} & {1\over4} - {1\over4}e^{-t\mu} & {1\over4} - {1\over4}e^{-t\mu} \\\\ {1\over4} - {1\over4}e^{-t\mu} & {1\over4} - {1\over4}e^{-t\mu} & {1\over4} + {3\over4}e^{-t\mu} & {1\over4} - {1\over4}e^{-t\mu} \\\\ {1\over4} - {1\over4}e^{-t\mu} & {1\over4} - {1\over4}e^{-t\mu} & {1\over4} - {1\over4}e^{-t\mu} & {1\over4} + {3\over4}e^{-t\mu} \end{pmatrix}

When branch length, $\nu$, is measured in the expected number of changes per site then:

P_{ij}(\nu) = \left\{
\begin{array}{cc}
{1\over4} + {3\over4}e^{-4\nu/3}  & \mbox{ if } i = j   \\
{1\over4} - {1\over4}e^{-4\nu/3}  & \mbox{ if } i \neq j  
\end{array}
\right.

It is worth noticing that

\nu={3\over4}t\mu=({\mu\over4}+{\mu\over4}+{\mu\over4})t

which is the sum of the off-diagonal entries of any row (or column) of the matrix $Q$ multiplied by time. It is therefore the expected number of substitutions per site over a branch of duration $t$ when the rate of substitution equals $\mu$.

Given the proportion $p$ of sites that differ between the two sequences, the Jukes–Cantor estimate of the evolutionary distance (in terms of the expected number of changes) between two sequences is given by

\hat{d}=-{3\over4} \ln({1-{4\over3}p})=\hat{\nu}

The $p$ in this formula is frequently referred to as the $p$-distance. It is a sufficient statistic for calculating the Jukes–Cantor distance correction, but is not sufficient for the calculation of the evolutionary distance under the more complex models that follow (also note that the $p$ used in subsequent formulae is not identical to the "$p$-distance").
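A minimal sketch of the Jukes–Cantor calculations follows: the transition probabilities as a function of branch length, and the distance correction applied to an observed proportion of differing sites. The function names and the toy alignment are invented for illustration, and gaps or ambiguous bases are not handled.

```python
import math


def jc69_probability(nu, same_state):
    """P_ij(nu) under JC69, with nu in expected changes per site."""
    decay = math.exp(-4.0 * nu / 3.0)
    return 0.25 + 0.75 * decay if same_state else 0.25 - 0.25 * decay


def jc69_distance(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the JC69 correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ (no gap handling in this sketch)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


# Toy alignment invented for illustration: 2 of 10 sites differ.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jc69_distance(p)
```

The corrected distance `d` exceeds the raw proportion `p`, reflecting the multiple hits that a simple count of differences misses.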

K80 model (Kimura 1980)

K80, the Kimura 1980 model, often referred to as Kimura's two parameter model (or the K2P model), distinguishes between transitions ($A \leftrightarrow G$, i.e. from purine to purine, or $C \leftrightarrow T$, i.e. from pyrimidine to pyrimidine) and transversions (from purine to pyrimidine or vice versa). In Kimura's original description of the model, α and β were used to denote the rates of these types of substitutions, but it is now more common to set the rate of transversions to 1 and use κ to denote the transition/transversion rate ratio (as is done below). The K80 model assumes that all of the bases are equally frequent ($\pi_A = \pi_G = \pi_C = \pi_T ={1\over4}$).

Rate matrix

Q= \begin{pmatrix} {*} & {\kappa} & {1} & {1} \\ {\kappa} & {*} & {1} & {1} \\ {1} & {1} & {*} & {\kappa} \\ {1} & {1} & {\kappa} & {*}  \end{pmatrix}

with columns corresponding to $A$, $G$, $C$, and $T$, respectively; each diagonal entry, marked *, is set so that its row sums to zero.

The Kimura two-parameter distance is given by:

K = - {1\over2}\ln((1-2p-q) \sqrt{1-2q})

where ''p'' is the proportion of sites that show transitional differences and ''q'' is the proportion of sites that show transversional differences.
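The distance formula can be sketched directly, together with a simple classifier that counts transitional and transversional differences in a pair of aligned sequences. The function names and the toy alignment are assumptions for this example, and only the characters A, C, G and T are expected.

```python
import math


def k80_distance(p, q):
    """Kimura two-parameter distance from transition (p) and transversion (q) proportions."""
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("distance undefined: sequences are too divergent")
    return -0.5 * math.log(a * math.sqrt(b))


def count_differences(seq_a, seq_b):
    """Return (p, q) for two aligned sequences of A, C, G, T (no gap handling)."""
    purines = {"A", "G"}
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        if (a in purines) == (b in purines):
            transitions += 1    # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    n = len(seq_a)
    return transitions / n, transversions / n


# Toy alignment: one transition (A/G) and one transversion (A/T) in 10 sites.
p, q = count_differences("ACGTACGTAC", "GCGTTCGTAC")
K = k80_distance(p, q)
```

When one third of the differences are transitions and two thirds are transversions, as expected if all substitutions are equally likely, this distance coincides with the Jukes–Cantor correction.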

K81 model (Kimura 1981)

K81, the Kimura 1981 model, often called Kimura's three parameter model (K3P model) or the Kimura three substitution type (K3ST) model, has distinct rates for transitions (denoted by $\alpha$ in the rate matrix below) and two distinct types of transversions. The two transversion types are those that conserve the weak/strong properties of the nucleotides (i.e., $A \leftrightarrow T$ and $C \leftrightarrow G$, denoted by symbol $\gamma$) and those that conserve the amino/keto properties of the nucleotides (i.e., $A \leftrightarrow C$ and $G \leftrightarrow T$, denoted by symbol $\beta$). The K81 model assumes that all equilibrium base frequencies are equal (i.e., $\pi_A = \pi_G = \pi_C = \pi_T =0.25$).

Rate matrix

Q= \begin{pmatrix} {*} & {\alpha} & {\beta} & {\gamma} \\ {\alpha} & {*} & {\gamma} & {\beta} \\ {\beta} & {\gamma} & {*} & {\alpha} \\ {\gamma} & {\beta} & {\alpha} & {*}  \end{pmatrix}

with columns corresponding to $A$, $G$, $C$, and $T$, respectively.

The K81 model is used much less often than the K80 (K2P) model for distance estimation and it is seldom the best-fitting model in maximum likelihood phylogenetics. Despite these facts, the K81 model has continued to be studied in the context of mathematical phylogenetics. One important property is the ability to perform a Hadamard transform assuming the site patterns were generated on a tree with nucleotides evolving under the K81 model. When used in the context of phylogenetics, the Hadamard transform provides an elegant and fully invertible means to calculate expected site pattern frequencies given a set of branch lengths (or vice versa). Unlike many maximum likelihood calculations, the relative values for $\alpha$, $\beta$, and $\gamma$ can vary across branches, and the Hadamard transform can even provide evidence that the data do not fit a tree. The Hadamard transform can also be combined with a wide variety of methods to accommodate among-sites rate heterogeneity, using continuous distributions rather than the discrete approximations typically used in maximum likelihood phylogenetics (although one must sacrifice the invertibility of the Hadamard transform to use certain among-sites rate heterogeneity distributions).

F81 model (Felsenstein 1981)

F81, Felsenstein's 1981 model, is an extension of the JC69 model in which base frequencies are allowed to vary from 0.25 ($\pi_A \ne \pi_G \ne \pi_C \ne \pi_T$).

Rate matrix:

Q= \begin{pmatrix} {*} & {\pi_G} & {\pi_C} & {\pi_T} \\ {\pi_A} & {*} & {\pi_C} & {\pi_T} \\ {\pi_A} & {\pi_G} & {*} & {\pi_T} \\ {\pi_A} & {\pi_G} & {\pi_C} & {*}  \end{pmatrix}

When branch length, ν, is measured in the expected number of changes per site then:

\beta = 1/(1-\pi_A^2-\pi_C^2-\pi_G^2-\pi_T^2)

P_{ij}(\nu) = \left\{
\begin{array}{cc}
e^{-\beta\nu}+\pi_j\left(1- e^{-\beta\nu}\right) & \mbox{ if } i = j   \\
\pi_j\left(1- e^{-\beta\nu}\right) & \mbox{ if } i \neq j  
\end{array}
\right.
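These F81 expressions translate into a few lines of code. The sketch below evaluates the full 4 × 4 probability matrix for one branch length; the frequency values and function names are hypothetical, and states are indexed 0 to 3 in A, G, C, T order.

```python
import math


def f81_beta(pi):
    """Scaling factor beta = 1 / (1 - sum of squared base frequencies)."""
    return 1.0 / (1.0 - sum(x * x for x in pi))


def f81_probability(i, j, nu, pi):
    """P_ij(nu) under F81, with nu in expected changes per site."""
    decay = math.exp(-f81_beta(pi) * nu)
    change = pi[j] * (1.0 - decay)
    return decay + change if i == j else change


pi = [0.1, 0.2, 0.3, 0.4]  # hypothetical frequencies in A, G, C, T order
P = [[f81_probability(i, j, 0.5, pi) for j in range(4)] for i in range(4)]
```

With all four frequencies set to 0.25, `f81_beta` returns 4/3 and the probabilities reduce to the JC69 expressions.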

HKY85 model (Hasegawa, Kishino and Yano 1985)

HKY85, the Hasegawa, Kishino and Yano 1985 model, can be thought of as combining the extensions made in the Kimura80 and Felsenstein81 models. Namely, it distinguishes between the rate of transitions and transversions (using the κ parameter), and it allows unequal base frequencies ($\pi_A \ne \pi_G \ne \pi_C \ne \pi_T$). Felsenstein described a similar (but not equivalent) model in 1984 using a different parameterization; that latter model is referred to as the F84 model.

Rate matrix

Q= \begin{pmatrix} {*} & {\kappa\pi_G} & {\pi_C} & {\pi_T} \\ {\kappa\pi_A} & {*} & {\pi_C} & {\pi_T} \\ {\pi_A} & {\pi_G} & {*} & {\kappa\pi_T} \\ {\pi_A} & {\pi_G} & {\kappa\pi_C} & {*}  \end{pmatrix}

If we express the branch length, ''ν'', in terms of the expected number of changes per site then:

\beta  = \frac{1}{2(\pi_A + \pi_G)(\pi_C + \pi_T) + 2\kappa[(\pi_A\pi_G) + (\pi_C\pi_T)]}

P_{AA}(\nu,\kappa,\pi)  =  \left[\pi_A\left(\pi_A + \pi_G + (\pi_C + \pi_T)e^{-\beta\nu}\right) + \pi_G e^{-(1 + (\pi_A + \pi_G)(\kappa - 1.0))\beta\nu}\right]/(\pi_A + \pi_G)

P_{AC}(\nu,\kappa,\pi)  =  \pi_C\left(1.0 - e^{-\beta\nu}\right)

P_{AG}(\nu,\kappa,\pi)  =  \left[\pi_G\left(\pi_A + \pi_G + (\pi_C + \pi_T)e^{-\beta\nu}\right) - \pi_G e^{-(1 + (\pi_A + \pi_G)(\kappa - 1.0))\beta\nu}\right]/\left(\pi_A + \pi_G\right)

P_{AT}(\nu,\kappa,\pi)  =  \pi_T\left(1.0 - e^{-\beta\nu}\right)

and formulae for the other combinations of states can be obtained by substituting in the appropriate base frequencies.
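As a hedged sketch, the probabilities of the four possible descendant states for a site that starts as A can be computed in closed form from the HKY85 rate matrix. The expressions for `p_aa` and `p_ag` below follow from the eigen-decomposition of that matrix; the function names and parameter values are assumptions for this example.

```python
import math


def hky85_beta(kappa, pi_a, pi_g, pi_c, pi_t):
    """Scaling factor that makes the expected substitution rate equal to 1."""
    transversions = 2.0 * (pi_a + pi_g) * (pi_c + pi_t)
    transitions = 2.0 * kappa * (pi_a * pi_g + pi_c * pi_t)
    return 1.0 / (transversions + transitions)


def hky85_row_from_a(nu, kappa, pi_a, pi_g, pi_c, pi_t):
    """Return (P_AA, P_AG, P_AC, P_AT) for branch length nu."""
    beta = hky85_beta(kappa, pi_a, pi_g, pi_c, pi_t)
    pi_r, pi_y = pi_a + pi_g, pi_c + pi_t  # purine and pyrimidine totals
    slow = math.exp(-beta * nu)
    fast = math.exp(-(1.0 + pi_r * (kappa - 1.0)) * beta * nu)
    p_aa = (pi_a * (pi_r + pi_y * slow) + pi_g * fast) / pi_r
    p_ag = (pi_g * (pi_r + pi_y * slow) - pi_g * fast) / pi_r
    p_ac = pi_c * (1.0 - slow)
    p_at = pi_t * (1.0 - slow)
    return p_aa, p_ag, p_ac, p_at


# Hypothetical parameter values for illustration.
row = hky85_row_from_a(nu=0.2, kappa=2.0, pi_a=0.3, pi_g=0.2, pi_c=0.1, pi_t=0.4)
```

Useful sanity checks are that the four values sum to 1, that a zero branch length leaves the site in state A, and that a very long branch returns the equilibrium frequencies.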

T92 model (Tamura 1992)

T92, the Tamura 1992 model, is a mathematical method developed to estimate the number of nucleotide substitutions per site between two DNA sequences, by extending Kimura's (1980) two-parameter method to the case where a G+C content bias exists. This method is useful when there are strong transition-transversion and G+C-content biases, as in the case of ''Drosophila'' mitochondrial DNA.

T92 involves a single, compound base frequency parameter $\theta \in (0,1)$ (also noted $\pi_{GC}$), where

\theta = \pi_G + \pi_C = 1 - ( \pi_A + \pi_T )

As T92 echoes Chargaff's second parity rule (pairing nucleotides have the same frequency on a single DNA strand, G and C on the one hand, and A and T on the other hand), it follows that the four base frequencies can be expressed as a function of $\pi_{GC}$:

\pi_G = \pi_C = {\pi_{GC}\over 2}

and

\pi_A = \pi_T = {(1-\pi_{GC})\over 2}

Rate matrix

Q= \begin{pmatrix} {*} & {\kappa\pi_{GC}/2} & {\pi_{GC}/2} & {(1-\pi_{GC})/2} \\
                   {\kappa(1-\pi_{GC})/2} & {*} & {\pi_{GC}/2} & {(1-\pi_{GC})/2} \\
                   {(1-\pi_{GC})/2} & {\pi_{GC}/2} & {*} & {\kappa(1-\pi_{GC})/2} \\
                   {(1-\pi_{GC})/2} & {\pi_{GC}/2} & {\kappa\pi_{GC}/2} & {*}  \end{pmatrix}

The evolutionary distance between two DNA sequences according to this model is given by

d = -h \ln(1-{p\over h}-q)-{1\over2}(1-h)\ln(1-2q)

where $h = 2\theta(1-\theta)$, $\theta$ is the G+C content ($\pi_{GC} = \pi_G + \pi_C$), and ''p'' and ''q'' are the proportions of sites that show transitional and transversional differences, as in the K80 distance.
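The T92 distance is straightforward to evaluate once the transitional and transversional proportions and the G+C content are known. In the sketch below the function name and the input values are assumptions for illustration.

```python
import math


def t92_distance(p, q, theta):
    """Tamura 1992 distance from transition (p) and transversion (q) proportions.

    theta is the G+C content, assumed to lie strictly between 0 and 1.
    """
    h = 2.0 * theta * (1.0 - theta)
    a = 1.0 - p / h - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("distance undefined: sequences are too divergent")
    return -h * math.log(a) - 0.5 * (1.0 - h) * math.log(b)


d_balanced = t92_distance(0.10, 0.05, theta=0.5)  # no G+C bias
d_gc_poor = t92_distance(0.10, 0.05, theta=0.2)   # strong bias towards A+T
```

With `theta=0.5` the value of ''h'' is 1/2 and the formula reduces to the Kimura two-parameter distance; for the same observed differences, a biased G+C content gives a larger corrected distance.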

TN93 model (Tamura and Nei 1993)

TN93, the Tamura and Nei 1993 model, distinguishes between the two different types of transition; i.e. ($A \leftrightarrow G$) is allowed to have a different rate to ($C \leftrightarrow T$). Transversions are all assumed to occur at the same rate, but that rate is allowed to be different from both of the rates for transitions. TN93 also allows unequal base frequencies ($\pi_A \ne \pi_G \ne \pi_C \ne \pi_T$).

Rate matrix

Q= \begin{pmatrix} {*} & {\kappa_1\pi_G} & {\pi_C} & {\pi_T} \\
                   {\kappa_1\pi_A} & {*} & {\pi_C} & {\pi_T} \\
                   {\pi_A} & {\pi_G} & {*} & {\kappa_2\pi_T} \\
                   {\pi_A} & {\pi_G} & {\kappa_2\pi_C} & {*}  \end{pmatrix}
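A minimal sketch of how this rate matrix might be assembled in code is given below, with the starred diagonal entries filled in so that each row sums to zero. The function name `tn93_rate_matrix` and all numeric values are hypothetical.

```python
def tn93_rate_matrix(kappa1, kappa2, pi):
    """Unscaled TN93 rate matrix in A, G, C, T order."""
    q = [[pi[j] for j in range(4)] for _ in range(4)]  # transversion rate pi_j
    q[0][1] *= kappa1  # A -> G
    q[1][0] *= kappa1  # G -> A
    q[2][3] *= kappa2  # C -> T
    q[3][2] *= kappa2  # T -> C
    for i in range(4):
        q[i][i] = 0.0
        q[i][i] = -sum(q[i])  # the * entries: each row sums to zero
    return q


pi = [0.3, 0.2, 0.1, 0.4]  # hypothetical frequencies in A, G, C, T order
Q_tn93 = tn93_rate_matrix(4.0, 2.0, pi)
Q_hky85 = tn93_rate_matrix(3.0, 3.0, pi)  # equal kappas give the HKY85 form
```

Setting both transition parameters to the same value recovers the HKY85 rate matrix, which shows how the two models are nested.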

GTR model (Tavaré 1986)

GTR, the Generalised time-reversible model of Tavaré 1986, is the most general neutral, independent, finite-sites, time-reversible model possible. It was first described in a general form by Simon Tavaré in 1986. GTR parameters consist of an equilibrium base frequency vector, $\Pi = (\pi_A , \pi_G , \pi_C , \pi_T)$, giving the frequency at which each base occurs at each site, and the rate matrix

Q = \begin{pmatrix}
{-(\alpha\pi_G + \beta\pi_C + \gamma\pi_T)} & {\alpha\pi_G} & {\beta\pi_C} & {\gamma\pi_T} \\ 
{\alpha\pi_A} & {-(\alpha\pi_A + \delta\pi_C + \epsilon\pi_T)} & {\delta\pi_C} & {\epsilon\pi_T} \\ 
{\beta\pi_A} & {\delta\pi_G} & {-(\beta\pi_A + \delta\pi_G + \eta\pi_T)} & {\eta\pi_T} \\  
{\gamma\pi_A} & {\epsilon\pi_G} & {\eta\pi_C} & {-(\gamma\pi_A + \epsilon\pi_G + \eta\pi_C)} 
\end{pmatrix}

Where

\begin{align}
\alpha = r(A\rightarrow G) = r(G\rightarrow A)\\
\beta = r(A\rightarrow C) = r(C\rightarrow A)\\
\gamma = r(A\rightarrow T) = r(T\rightarrow A)\\
\delta = r(G\rightarrow C) = r(C\rightarrow G)\\
\epsilon = r(G\rightarrow T) = r(T\rightarrow G)\\
\eta = r(C\rightarrow T) = r(T\rightarrow C)
\end{align}

are the transition rate parameters. Therefore, GTR (for four characters, as is often the case in phylogenetics) requires 6 substitution rate parameters, as well as 4 equilibrium base frequency parameters. However, this is usually reduced to 9 parameters plus $\mu$, the overall number of substitutions per unit time. When measuring time in substitutions ($\mu = 1$) only 8 free parameters remain.

In general, to compute the number of parameters, one must count the number of entries above the diagonal in the matrix, i.e. for ''n'' trait values per site ${{n^2-n} \over 2}$, and then add ''n'' for the equilibrium base frequencies, and subtract 1 because $\mu$ is fixed. One gets

{{n^2-n} \over 2} + n - 1 = {1 \over 2}n^2 + {1 \over 2}n - 1.

For example, for an amino acid sequence (there are 20 "standard" amino acids that make up proteins), one would find there are 209 parameters. However, when studying coding regions of the genome, it is more common to work with a codon substitution model (a codon is three bases and codes for one amino acid in a protein). There are $4^3 = 64$ codons, but the rates for transitions between codons which differ by more than one base are assumed to be zero. Hence, there are

{{20 \times 19 \times 3} \over 2} + 64 - 1 = 633

parameters.
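The two counts quoted above can be reproduced by evaluating the expressions exactly as they are written in the text. The function names below are assumptions for this sketch.

```python
def gtr_parameter_count(n_states):
    """Evaluate (n^2 - n)/2 + n - 1 for n character states."""
    above_diagonal = (n_states * n_states - n_states) // 2
    return above_diagonal + n_states - 1


def codon_parameter_count():
    """Evaluate 20 * 19 * 3 / 2 + 64 - 1, the codon-model count given in the text."""
    return (20 * 19 * 3) // 2 + 64 - 1


amino_acid_count = gtr_parameter_count(20)  # 20 amino acids
codon_count = codon_parameter_count()
```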


External links

DAWG: DNA Assembly With Gaps, free software for simulating sequence evolution
