Quantitative comparative linguistics is the use of quantitative analysis as applied to comparative linguistics. Examples include the statistical fields of lexicostatistics and glottochronology, and the borrowing of phylogenetics from biology.

History

Statistical methods have been used for the purpose of quantitative analysis in comparative linguistics for more than a century. During the 1950s, the Swadesh list emerged: a standardised set of lexical concepts found in most languages, as words or phrases, that allows two or more languages to be compared and contrasted empirically.

Probably the first published quantitative historical linguistics study was by Sapir in 1916, while Kroeber and Chretien in 1937 investigated nine Indo-European (IE) languages using 74 morphological and phonological features (extended in 1939 by the inclusion of Hittite). Ross in 1950 carried out an investigation into the theoretical basis for such studies. Swadesh, using word lists, developed lexicostatistics and glottochronology in a series of papers published in the early 1950s, but these methods were widely criticised, though some of the criticisms were seen as unjustified by other scholars. Embleton published a book on "Statistics in Historical Linguistics" in 1986 which reviewed previous work and extended the glottochronological method. Dyen, Kruskal and Black carried out a study of the lexicostatistical method on a large IE database in 1992.

During the 1990s, there was renewed interest in the topic, based on the application of methods of computational phylogenetics and cladistics. Such projects often involved collaboration between linguistic scholars and colleagues with expertise in information science and/or biological anthropology. These projects often sought to arrive at an optimal phylogenetic tree (or network) to represent a hypothesis about the evolutionary ancestry of a language and perhaps its language contacts. Pioneers in these methods included the founders of the Computational Phylogenetics in Historical Linguistics (CPHL) project: Donald Ringe, Tandy Warnow, Luay Nakhleh and Steven N. Evans.

In the mid-1990s a group at the University of Pennsylvania computerised the comparative method and used a different IE database with 20 ancient languages. In the biological field several software programs were then developed which could have application to historical linguistics. In particular, a group at the University of Auckland developed a method that gave controversially old dates for IE languages. A conference on "Time-depth in Historical Linguistics" was held in August 1999, at which many applications of quantitative methods were discussed. Subsequently many papers have been published on studies of various language groups as well as comparisons of the methods. The proceedings of a 2004 conference, ''Phylogenetic Methods and the Prehistory of Languages'', were published in 2006, edited by Peter Forster and Colin Renfrew.

Studied language families

Computational phylogenetic analyses have been performed for:

* Indo-European languages: Bouckaert (2012)
* Uralic languages: Honkola (2013)
* Turkic languages: Hruschka (2014)
* Dravidian languages: Kolipakam (2018)
* Austroasiatic languages: Sidwell (2015)
* Austronesian languages: Gray (2009)
* Pama-Nyungan languages: Bowern & Atkinson (2012), Bouckaert, Bowern and Atkinson (2018)
* Bantu languages: Currie (2013), Grollemund (2015)
* Semitic languages: Kitchen (2009)
* Dené–Yeniseian languages: Sicoli & Holton (2014)
* Uto-Aztecan languages: Wheeler & Whiteley (2014)
* Mayan languages: Atkinson (2006)
* Arawakan languages: Walker & Ribeiro (2011)
* Tupi-Guarani languages: Michael (2015)
* Sino-Tibetan languages: Zhang et al. (2019), Sagart et al. (2019)

Background

The standard method for assessing language relationships has been the comparative method. However, this has a number of limitations. Not all linguistic material is suitable as input, and there are issues of the linguistic levels on which the method operates. The reconstructed languages are idealized, and different scholars can produce different results. Language family trees are often used in conjunction with the method, and "borrowings" must be excluded from the data, which is difficult when borrowing is within a family. It is often claimed that the method is limited in the time depth over which it can operate. The method is difficult to apply and there is no independent test. Thus alternative methods have been sought that are formalised, quantify the relationships and can be tested.

A goal of comparative historical linguistics is to identify instances of genetic relatedness amongst languages. The steps in quantitative analysis are (i) to devise a procedure based on theoretical grounds, on a particular model or on past experience, etc.; (ii) to verify the procedure by applying it to some data where there exists a large body of linguistic opinion for comparison (this may lead to a revision of the procedure of stage (i) or, at the extreme, to its total abandonment); (iii) to apply the procedure to data where linguistic opinions have not yet been produced, have not yet been firmly established or perhaps are even in conflict.

Applying phylogenetic methods to languages is a multi-stage process: (a) the encoding stage - getting from real languages to some expression of the relationships between them in the form of numerical or state data, so that those data can then be used as input to phylogenetic methods; (b) the representation stage - applying phylogenetic methods to extract from those numerical and/or state data a signal that is converted into some useful form of representation, usually two-dimensional graphical ones such as trees or networks, which synthesise and "collapse" what are often highly complex multi-dimensional relationships in the signal; (c) the interpretation stage - assessing those tree and network representations to extract from them what they actually mean for real languages and their relationships through time.

Types of trees and networks

An output of a quantitative historical linguistic analysis is normally a tree or a network diagram. This allows summary visualisation of the output data but is not the complete result. A tree is a connected acyclic graph, consisting of a set of vertices (also known as "nodes") and a set of edges ("branches"), each of which connects a pair of vertices. An internal node represents a linguistic ancestor in a phylogenetic tree or network. Each language is represented by a path, the paths showing the different states as it evolves. There is only one path between every pair of vertices. Unrooted trees plot the relationship between the input data without assumptions regarding their descent. A rooted tree explicitly identifies a common ancestor, often by specifying a direction of evolution or by including an "outgroup" that is known to be only distantly related to the set of languages being classified. Most trees are binary, that is, a parent has two children. A tree can always be produced even though it is not always appropriate.

A different sort of tree is one based only on language similarities and differences. In this case the internal nodes of the graph do not represent ancestors but are introduced to represent the conflict between the different splits ("bipartitions") in the data analysis. The "phenetic distance" is the sum of the weights (often represented as lengths) along the path between languages. Sometimes an additional assumption is made that these internal nodes do represent ancestors.

When languages converge, usually through word adoption ("borrowing"), a network model is more appropriate. There will be additional edges to reflect the dual parentage of a language. These edges will be bidirectional if both languages borrow from one another. A tree is thus a simple network; however, there are many other types of network. A phylogenetic network is one where the taxa are represented by nodes and their evolutionary relationships are represented by branches. Another type is that based on splits, and is a combinatorial generalisation of the split tree. A given set of splits can have more than one representation, thus internal nodes may not be ancestors and are only an "implicit" representation of evolutionary history, as distinct from the "explicit" representation of phylogenetic networks. In a splits network the phenetic distance is that of the shortest path between two languages. A further type is the reticular network, which shows incompatibilities (due, for example, to contact) as reticulations, and its internal nodes do represent ancestors. A network may also be constructed by adding contact edges to a tree. The last main type is the consensus network formed from trees. These trees may be the result of bootstrap analysis or samples from a posterior distribution.
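The path-based distance on a weighted tree described in this section can be illustrated with a minimal sketch. The tree, its edge weights and the language labels are invented for illustration, and `path_distance` is a hypothetical helper rather than a function from any phylogenetics package.

```python
from collections import defaultdict

def path_distance(edges, start, goal):
    """Sum of edge weights along the unique path between two nodes of a tree."""
    adjacency = defaultdict(list)
    for a, b, weight in edges:
        adjacency[a].append((b, weight))
        adjacency[b].append((a, weight))
    # Depth-first search; a tree has exactly one path between any two nodes.
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbour, weight in adjacency[node]:
            if neighbour != parent:
                stack.append((neighbour, node, dist + weight))
    raise ValueError("nodes are not connected")

# Hypothetical unrooted tree: internal nodes x and y, languages A to D.
EDGES = [("A", "x", 1.0), ("B", "x", 2.0), ("x", "y", 0.5),
         ("C", "y", 1.5), ("D", "y", 1.0)]

print(path_distance(EDGES, "A", "C"))  # 1.0 + 0.5 + 1.5 = 3.0
```

In a network there may be several paths between two languages, so the parent check above would no longer be sufficient and a shortest-path search would be needed instead.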

Language change

Change happens continually to languages, but not usually at a constant rate, with its cumulative effect producing splits into dialects, languages and language families. It is generally thought that morphology changes slowest and phonology quickest. As change happens, less and less evidence of the original language remains, and eventually all evidence of relatedness may be lost. Changes of one type may not affect other types; for example, sound changes do not affect cognacy. Unlike in biology, it cannot be assumed that all languages have a common origin, so relatedness must be established. In modelling it is often assumed for simplicity that the characters change independently, but this may not be the case. Besides borrowing, there can also be semantic shifts and polymorphism.

Databases

Swadesh originally published a 200-word list but later refined it into a 100-word one. A commonly used IE database is that by Dyen, Kruskal and Black, which contains data for 95 languages, though the original is known to contain a few errors. Besides the raw data it also contains cognacy judgements. The database of Ringe, Warnow and Taylor has information on 24 IE languages, with 22 phonological characters, 15 morphological characters and 333 lexical characters. Gray and Atkinson used a database of 87 languages with 2449 lexical items, based on the Dyen set with the addition of three ancient languages. They incorporated the cognacy judgements of a number of scholars. Other databases have been drawn up for African, Australian and Andean language families, amongst others.

Coding of the data may be in binary form or in multistate form. The former is often used but does result in a bias. It has been claimed that there is a constant scale factor between the two coding methods, and that allowance can be made for this. However, another study suggests that the topology may change.
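The difference between the two codings can be made concrete with a small sketch. It assumes a multistate matrix in which each meaning slot holds a cognate-class label, and recodes it as one presence/absence character per cognate class. The data layout, the labels and the function name `multistate_to_binary` are illustrative assumptions, not the format of any of the databases named above.

```python
def multistate_to_binary(matrix):
    """Recode multistate cognate characters as binary presence/absence characters.

    `matrix` maps each language to a list of cognate-class labels, one per
    meaning slot; None marks missing data.
    """
    languages = sorted(matrix)
    n_slots = len(next(iter(matrix.values())))
    columns = []  # one (slot, cognate_class) pair per binary character
    for slot in range(n_slots):
        classes = sorted({matrix[lang][slot] for lang in languages
                          if matrix[lang][slot] is not None})
        columns.extend((slot, c) for c in classes)
    binary = {}
    for lang in languages:
        row = []
        for slot, c in columns:
            state = matrix[lang][slot]
            # Missing data stays missing in every derived binary character.
            row.append(None if state is None else int(state == c))
        binary[lang] = row
    return columns, binary

# Hypothetical data: three languages, two meaning slots.
MULTISTATE = {"L1": ["a", "a"], "L2": ["a", "b"], "L3": ["b", None]}

columns, binary = multistate_to_binary(MULTISTATE)
print(columns)
print(binary)
```

Without polymorphism, exactly one of the binary characters derived from a slot is 1 for each language, so the derived characters are not independent of one another.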

Word lists

The word slots are chosen to be as culture- and borrowing-free as possible. The original Swadesh lists are most commonly used but many others have been devised for particular purposes. Often these are shorter than Swadesh's preferred 100-item list. Kessler has written a book on "The Significance of Word Lists", while McMahon and McMahon carried out studies on the effects of reconstructability and retentiveness. The effect of increasing the number of slots has been studied and a law of diminishing returns found, with about 80 being found satisfactory. However, some studies have used fewer than half this number. Generally each cognate set is represented as a different character, but differences between words can also be measured as a distance by sound changes. Distances may also be measured letter by letter.

Morphological features

Traditionally these have been seen as more important than lexical ones, and so some studies have put additional weighting on this type of character. Such features were included in the Ringe, Warnow and Taylor IE database, for example. However, other studies have omitted them.

Typological features

Examples of these features include glottalised consonants, tone systems, accusative alignment in nouns, dual number, case number correspondence, object-verb order, and first person singular pronouns. These are listed in the WALS database, though it is still only sparsely populated for many languages.

Probabilistic models

Some analysis methods incorporate a statistical model of language evolution and use the properties of the model to estimate the evolution history. Statistical models are also used for simulation of data for testing purposes. A stochastic process can be used to describe how a set of characters evolves within a language. The probability with which a character will change can depend on the branch but not all characters evolve together, nor is the rate identical on all branches. It is often assumed that each character evolves independently but this is not always the case. Within a model borrowing and parallel development (homoplasy) may also be modelled, as well as polymorphisms.

Effects of chance

Chance resemblances produce a level of noise against which the required signal of relatedness has to be found. A study was carried out by Ringe into the effects of chance on the mass comparison method.

Detection of borrowing

Loanwords can severely affect the topology of a tree, so efforts are made to exclude borrowings. However, undetected ones sometimes still exist. McMahon and McMahon, in ''Language Classification by Numbers'', showed that around 5% borrowing can affect the topology while 10% has significant effects. In networks borrowing produces reticulations. Minett and Wang examined ways of detecting borrowing automatically.

Split dating

Dating of language splits can be determined if it is known how the characters evolve along each branch of a tree. The simplest assumption is that all characters evolve at a single constant rate with time and that this is independent of the tree branch. This was the assumption made in glottochronology. However, studies soon showed that there was variation between languages, some probably due to the presence of unrecognised borrowing. A better approach is to allow rate variation, and the gamma distribution is usually used because of its mathematical convenience. Studies have also been carried out that show that the character replacement rate depends on the frequency of use. Widespread borrowing can bias divergence time estimates by making languages seem more similar and hence younger. However, this also makes the ancestor's branch length longer so that the root is unaffected.
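The single-constant-rate assumption can be written down as a short calculation. The sketch below uses the conventional glottochronological relation t = ln(c) / (2 ln(r)), where c is the proportion of shared cognates and r the proportion retained per millennium. The default value of 0.86 is the figure usually quoted for the 100-item list and should be treated as an assumption here, as should the function name.

```python
import math

def glottochronological_age(shared_proportion, retention_rate=0.86):
    """Time since a split, in millennia, under one constant replacement rate.

    shared_proportion: fraction of list items that are cognate in both languages.
    retention_rate: assumed fraction of items retained per millennium.
    """
    if not 0.0 < shared_proportion <= 1.0:
        raise ValueError("shared_proportion must be in (0, 1]")
    if not 0.0 < retention_rate < 1.0:
        raise ValueError("retention_rate must be in (0, 1)")
    # Both lineages lose items independently, hence the factor of two.
    return math.log(shared_proportion) / (2.0 * math.log(retention_rate))

# Two languages sharing 74% of the list under the assumed rate.
print(round(glottochronological_age(0.74), 2))
```

Because the estimate depends on the logarithm of the retention rate, small changes in the assumed rate shift the date considerably, which is one reason models with rate variation are preferred.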

Types of analysis

There is a need to understand how a language classification method works in order to determine its assumptions and limitations. It may only be valid under certain conditions or be suitable for small databases. The methods differ in their data requirements, their complexity and running time. The methods also differ in their optimisation criteria.

Character based models

Maximum parsimony and maximum compatibility

These two methods are similar, but the maximum parsimony method's objective is to find the tree (or network) in which the minimum number of evolutionary changes occurs. In some implementations the characters can be given weights, and then the objective is to minimise the total weighted sum of the changes. The analysis produces unrooted trees unless an outgroup or directed characters are used. Heuristics are used to find the best tree but optimisation is not guaranteed. The method is often implemented using the programs PAUP or TNT.

Maximum compatibility also uses characters, with the objective of finding the tree on which the maximum number of characters evolve without homoplasy. Again the characters can be weighted, and when this occurs the objective is to maximise the sum of the weights of compatible characters. It also produces unrooted trees unless additional information is incorporated. There are no readily available heuristics that are accurate with large databases. This method has only been used by Ringe's group.
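Both criteria can be evaluated on a fixed candidate tree with Fitch's algorithm, which counts the minimum number of state changes a character needs on that tree. The sketch below is a minimal illustration, assuming unweighted characters, a rooted binary tree written as nested tuples, and invented data; it scores one given tree and does not perform the tree search that programs such as PAUP or TNT carry out.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    `tree` is a nested tuple of leaf names; `states` maps leaf name -> state.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1  # children disagree, so one change is needed here
        return left | right

    visit(tree)
    return changes

def parsimony_length(tree, characters):
    """Total number of changes over all characters (the parsimony criterion)."""
    return sum(fitch_score(tree, c) for c in characters)

def compatible_count(tree, characters):
    """Number of characters that fit the tree without homoplasy.

    A character is homoplasy-free when it needs exactly (states - 1) changes.
    """
    return sum(fitch_score(tree, c) == len(set(c.values())) - 1
               for c in characters)

# Hypothetical tree and three binary characters.
TREE = (("A", "B"), ("C", "D"))
CHARACTERS = [
    {"A": 0, "B": 0, "C": 1, "D": 1},  # supports the (A,B) | (C,D) split
    {"A": 0, "B": 1, "C": 0, "D": 1},  # conflicts with it
    {"A": 1, "B": 1, "C": 1, "D": 1},  # constant, uninformative
]

print(parsimony_length(TREE, CHARACTERS))  # 3
print(compatible_count(TREE, CHARACTERS))  # 2
```

A search procedure would compare these two scores across many candidate trees, minimising the first for maximum parsimony or maximising the second for maximum compatibility.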

Perfect Phylogenetic Networks

This method produces an explicit phylogenetic network having an underlying tree with additional contact edges. Characters can be borrowed but evolve without homoplasy. To produce such networks, a graph-theoretic algorithm has been used.

Gray and Atkinson's method

The input lexical data is coded in binary form, with one character for each state of the original multi-state character. The method allows homoplasy and constraints on split times. A likelihood-based analysis method is used, with evolution expressed as a rate matrix. Cognate gain and loss is modelled with a gamma distribution to allow rate variation and with rate smoothing. Because of the vast number of possible trees with many languages, Bayesian inference is used to search for the optimal tree. A Markov chain Monte Carlo algorithm generates a sample of trees as an approximation to the posterior probability distribution. A summary of this distribution can be provided as a greedy consensus tree or network with support values. The method also provides date estimates.

The method is accurate when the original characters are binary, and evolve identically and independently of each other under a rates-across-sites model with gamma distributed rates; the dates are accurate when the rate of change is constant. Understanding the performance of the method when the original characters are multi-state is more complicated, since the binary encoding produces characters that are not independent, while the method assumes independence.
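The idea of expressing evolution as a rate matrix can be illustrated with the simplest case: a two-state continuous-time Markov chain in which a cognate is absent (0) or present (1), with a gain rate and a loss rate. The sketch below uses the closed-form transition probabilities of that generic chain. It is not a reconstruction of Gray and Atkinson's implementation, and the rate values and the function name are illustrative assumptions.

```python
import math

def transition_matrix(gain, loss, t):
    """Transition probabilities over time t for a two-state gain/loss process.

    Row i, column j gives P(state j at time t | state i at time 0),
    with state 0 = cognate absent and state 1 = cognate present.
    """
    total = gain + loss
    decay = math.exp(-total * t)
    p_gain = gain / total * (1.0 - decay)  # 0 -> 1
    p_loss = loss / total * (1.0 - decay)  # 1 -> 0
    return [[1.0 - p_gain, p_gain],
            [p_loss, 1.0 - p_loss]]

# Hypothetical rates per unit of branch length.
for row in transition_matrix(gain=0.2, loss=0.6, t=1.5):
    print([round(p, 4) for p in row])
```

A likelihood-based method multiplies such probabilities along the branches of a candidate tree. Rate variation can be introduced by scaling `t` with a factor drawn from a gamma distribution for each character.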

Nicholls and Gray's method

This method is an outgrowth of Gray and Atkinson's. Rather than having two parameters for a character, this method uses three: the birth rate and death rate of a cognate are specified, together with its borrowing rate. The birth rate is a Poisson random variable with a single birth of a cognate class, but separate deaths of branches are allowed (Dollo parsimony). The method does not allow homoplasy but allows polymorphism and constraints. Its major problem is that it cannot handle missing data (this issue has since been resolved by Ryder and Nicholls). Statistical techniques are used to fit the model to the data. Prior information may be incorporated and an MCMC search is made of possible reconstructions. The method has been applied to Gray and Nicholls' database and seems to give similar results.

Distance based models

These use a triangular matrix of pairwise language comparisons. The input character matrix is used to compute the distance matrix using either the Hamming distance or the Levenshtein distance. The former counts the positions at which two character sequences differ, while the latter allows the costs of the various possible transforms to be included. These methods are fast compared with wholly character-based ones. However, they do result in information loss.
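Building the triangular matrix from a character matrix can be sketched as follows, using a Hamming-style distance taken here as the proportion of comparable characters on which two languages differ. Skipping characters with missing data is one possible convention and an assumption of this sketch, as are the data and the function names.

```python
def hamming_proportion(a, b):
    """Proportion of comparable positions at which two character vectors differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        raise ValueError("no comparable characters")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(characters):
    """Upper-triangular distance matrix keyed by (language, language) pairs."""
    langs = sorted(characters)
    return {(p, q): hamming_proportion(characters[p], characters[q])
            for i, p in enumerate(langs) for q in langs[i + 1:]}

# Hypothetical binary character matrix; None marks missing data.
CHARACTERS = {
    "A": [1, 1, 0, 0],
    "B": [1, 0, 0, 0],
    "C": [0, 0, 1, None],
}

for pair, d in distance_matrix(CHARACTERS).items():
    print(pair, round(d, 3))
```

The reduction of each pair of languages to a single number is where the information loss arises: two pairs can have the same distance while differing on entirely different characters.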

UPGMA

The "Unweighted Pair Group Method with Arithmetic Mean" (UPGMA) is a clustering technique which operates by repeatedly joining the two languages that have the smallest distance between them. It operates accurately with clock-like evolution but otherwise it can be in error. This is the method used in Swadesh's original lexicostatistics.
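The repeated-joining procedure can be sketched in a few lines. This is a minimal, unoptimised illustration that returns only the joining order as a nested tuple, without branch heights. The input format (a dictionary keyed by `frozenset` pairs), the distances and the function name are assumptions made for the example.

```python
from itertools import combinations

def upgma(distances):
    """Cluster languages with UPGMA.

    `distances` maps frozenset({a, b}) -> distance for every pair of languages.
    Returns a nested tuple describing the order in which clusters were joined.
    """
    labels = sorted({name for pair in distances for name in pair})
    # Each cluster is stored as (tree, size), keyed by an integer id.
    clusters = {i: (name, 1) for i, name in enumerate(labels)}
    dist = {frozenset((i, j)): distances[frozenset((labels[i], labels[j]))]
            for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        (tree_i, size_i), (tree_j, size_j) = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            # Size-weighted mean, so every original language counts equally.
            d_ik = dist[frozenset((i, k))]
            d_jk = dist[frozenset((j, k))]
            dist[frozenset((next_id, k))] = (
                (size_i * d_ik + size_j * d_jk) / (size_i + size_j))
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    return next(iter(clusters.values()))[0]

# Hypothetical clock-like distances between four languages.
RAW = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
       ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
DISTANCES = {frozenset(pair): d for pair, d in RAW.items()}

print(upgma(DISTANCES))  # ('D', ('C', ('A', 'B')))
```

Because the closest pair is always joined first, a language that has changed unusually fast will be placed too far from its true relatives, which is the error the clock-like assumption guards against.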

Split Decomposition

This is a technique for dividing data into natural groups. The data could be characters but is more usually distance measures. The character counts or distances are used to generate the splits and to compute weights (branch lengths) for the splits. The weighted splits are then represented in a tree or network based on minimising the number of changes between each pair of taxa. There are fast algorithms for generating the collection of splits. The weights are determined from the taxon to taxon distances. Split decomposition is effective when the number of taxa is small or when the signal is not too complicated.

Neighbor joining

This method operates on distance data, computes a transformation of the input matrix and then computes the minimum distance of the pairs of languages. It operates correctly even if the languages do not evolve with a lexical clock. A weighted version of the method may also be used. The method produces an output tree. It is claimed to be the closest method to manual techniques for tree construction.
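The transformation of the input matrix can be illustrated with the selection step alone. In the usual formulation, the pair to join minimises Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), which corrects each distance for how far the two languages are from everything else. The sketch below implements only that step, on invented distances, and the function name is hypothetical.

```python
from itertools import combinations

def nj_pair(distances):
    """Return the pair of taxa that neighbor joining would join first.

    `distances` maps frozenset({a, b}) -> distance for every pair of taxa.
    """
    labels = sorted({name for pair in distances for name in pair})
    n = len(labels)
    row_sum = {a: sum(distances[frozenset((a, b))] for b in labels if b != a)
               for a in labels}

    def q(pair):
        a, b = pair
        return (n - 2) * distances[frozenset(pair)] - row_sum[a] - row_sum[b]

    return min(combinations(labels, 2), key=q)

# Hypothetical distances that are not clock-like.
RAW = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
D = {frozenset(pair): d for pair, d in RAW.items()}

print(nj_pair(D))                    # ('a', 'b')
print(sorted(min(D, key=D.get)))     # ['d', 'e'], the closest raw pair
```

In this example the smallest raw distance is between d and e, which a closest-pair method would join first, whereas the corrected criterion selects a and b.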

Neighbor-net

Neighbor-net uses a similar algorithm to neighbor joining. Unlike neighbor joining it does not fuse nodes immediately but waits until a node has been paired a second time. The three nodes are then replaced by two and the distance matrix reduced. It can handle large and complicated data sets. However, the output is a phenogram rather than a phylogram. This is the most popular network method.

Network

This was an early network method that has been used for some language analysis. It was originally developed for genetic sequences with more than one possible origin. Network collapses the alternative trees into a single network. Where there are multiple histories a reticulation (a box shape) is drawn. It generates a list of characters incompatible with a tree.

ASP

This uses a declarative knowledge representation formalism and the methods of Answer Set Programming. One such solver is CMODELS which can be used for small problems but larger ones require heuristics. Preprocessing is used to determine the informative characters. CMODELS transforms them into a propositional theory that uses a SAT solver to compute the models of this theory.

Fitch/Kitch

Fitch and Kitch are maximum likelihood based programs in PHYLIP that allow a tree to be rearranged after each addition, unlike NJ. Kitch differs from Fitch in assuming a constant rate of change throughout the tree while Fitch allows for different rates down each branch.

Separation level method

Holm introduced a method in 2000 to deal with some known problems of lexicostatistical analysis. These are the "symplesiomorphy trap", where shared archaisms are difficult to distinguish from shared innovations, and the "proportionality trap", where later changes can obscure early ones. Later he introduced a refined method, called SLD, to take account of the variable word distribution across languages. The method does not assume a constant rate of change.

Fast convergence methods

A number of fast converging analysis methods have been developed for use with large databases (>200 languages). One of these is the Disk Covering Method (DCM). This has been combined with existing methods to give improved performance. A paper on the DCM-NJ+MP method is given by the same authors in "The performance of Phylogenetic Methods on Trees of Bounded Diameter", where it is compared with the NJ method.

Resemblance based models

These models compare the letters of words rather than their phonetics. Dunn ''et al.'' studied 125 typological characters across 16 Austronesian and 15 Papuan languages. They compared their results to an MP tree and one constructed by traditional analysis. Significant differences were found. Similarly Wichmann and Saunders used 96 characters to study 63 American languages.

Computerised mass comparison

A method that has been suggested for initial inspection of a set of languages to see if they are related was mass comparison. However, this has been severely criticised and fell into disuse. Recently Kessler has resurrected a computerised version of the method but using rigorous hypothesis testing. The aim is to make use of similarities across more than two languages at a time. In another paper various criteria for comparing word lists are evaluated. It was found that the IE and Uralic families could be reconstructed but there was no evidence for a joint super-family.

Nichols' method

This method uses stable lexical fields, such as stance verbs, to try to establish long-distance relationships. Account is taken of convergence and semantic shifts to search for ancient cognates. A model is outlined and the results of a pilot study are presented.

ASJP

The Automated Similarity Judgment Program (ASJP) is similar to lexicostatistics, but the judgement of similarities is done by a computer program following a consistent set of rules. Trees are generated using standard phylogenetic methods. ASJP uses 7 vowel symbols and 34 consonant symbols. There are also various modifiers. Two words are judged similar if at least two consecutive consonants in the respective words are identical while vowels are also taken into account. The proportion of words with the same meaning judged to be similar for a pair of languages is the Lexical Similarity Percentage (LSP). The Phonological Similarity Percentage (PSP) is also calculated. PSP is then subtracted from the LSP yielding the Subtracted Similarity Percentage (SSP) and the ASJP distance is 100-SSP. Currently there are data on over 4,500 languages and dialects in the ASJP database from which a tree of the world's languages was generated.
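The arithmetic linking LSP, PSP, SSP and the final distance can be sketched as follows. The similarity judgements are supplied as ready-made booleans, since the judgement rule itself is not reproduced here. Treating the PSP as the percentage of word pairs with different meanings that are judged similar is an assumption of this sketch, and the function name and inputs are illustrative.

```python
def asjp_distance(same_meaning, different_meaning):
    """Distance 100 - (LSP - PSP) from two lists of similarity judgements.

    same_meaning: booleans, one per word pair with the same meaning.
    different_meaning: booleans, one per word pair with different meanings
        (assumed here to be the basis of the PSP).
    """
    lsp = 100.0 * sum(same_meaning) / len(same_meaning)
    psp = 100.0 * sum(different_meaning) / len(different_meaning)
    ssp = lsp - psp  # similarity above the chance background
    return 100.0 - ssp

# Hypothetical judgements: 2 of 4 same-meaning pairs and
# 1 of 10 different-meaning pairs are judged similar.
print(asjp_distance([True, True, False, False],
                    [False] * 9 + [True]))  # 60.0
```

Subtracting the PSP is intended to discount resemblance that arises simply because two languages have similar sound inventories.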

Serva and Petroni's method

This measures the orthographical distance between words to avoid the subjectivity of cognacy judgements. It determines the minimum number of operations needed to transform one word into another, normalised by the length of the longer word. A tree is constructed from the distance data by the UPGMA technique.
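The word-level measure described here is a normalised edit distance, which can be sketched directly. Averaging the word distances over a list of meanings to obtain a language distance is an assumption of this sketch, as are the function names and the sample words.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def normalised_distance(a, b):
    """Edit distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def language_distance(words_a, words_b):
    """Mean normalised distance over word pairs aligned by meaning."""
    pairs = list(zip(words_a, words_b))
    return sum(normalised_distance(a, b) for a, b in pairs) / len(pairs)

print(levenshtein("kitten", "sitting"))  # 3
print(round(normalised_distance("kitten", "sitting"), 3))
```

Normalising by the longer word keeps each word distance between 0 and 1, so that long and short words contribute comparably to the average.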

Phonetic evaluation methods

Heggarty has proposed a means of providing a measure of the degrees of difference between cognates, rather than just yes/no answers. This is based on examining many (>30) features of the phonetics of the glosses in comparison with the protolanguage. This could require a large amount of work but Heggarty claims that only a representative sample of sounds is necessary. He also examined the rate of change of the phonetics and found a large rate variation, so that it was unsuitable for glottochronology. A similar evaluation of the phonetics had earlier been carried out by Grimes and Agard for Romance languages, but this used only six points of comparison.

Evaluation of methods

Metrics

Standard mathematical techniques are available for measuring the similarity/difference of two trees. For consensus trees the Consistency Index (CI) is a measure of homoplasy. For one character it is the ratio of the minimum conceivable number of steps on any one tree (= 1 for binary trees) to the number of reconstructed steps on the tree. The CI of a tree is the sum of the character CIs divided by the number of characters. It represents the proportion of patterns correctly assigned. The Retention Index (RI) measures the amount of similarity in a character. It is the ratio (g - s) / (g - m) where g is the greatest number of steps of a character on any tree, m is the minimum number of steps on any tree, and s is the minimum steps on a particular tree. There is also a Rescaled CI which is the product of the CI and RI.

For binary trees the standard way of comparing their topology is to use the Robinson-Foulds metric. This distance is the average of the number of false positives and false negatives in terms of branch occurrence. R-F rates above 10% are considered poor matches. For other sorts of trees and for networks there is as yet no standard method of comparison.

Lists of incompatible characters are produced by some tree producing methods. These can be extremely helpful in analysing the output. Where heuristic methods are used repeatability is an issue. However, standard mathematical techniques are used to overcome this problem.
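The indices and the Robinson-Foulds comparison can be sketched for small trees written as nested tuples. The sketch follows the definitions given in this section (CI as m / s, RI as (g - s) / (g - m), and the distance as the average of false positives and false negatives among non-trivial splits). The tree encoding and the function names are assumptions made for the example.

```python
def consistency_index(m, s):
    """CI of one character: minimum conceivable steps over observed steps."""
    return m / s

def retention_index(g, m, s):
    """RI of one character: (g - s) / (g - m)."""
    return (g - s) / (g - m)

def leaf_set(node):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def bipartitions(tree):
    """Non-trivial splits of a tree, each stored as the side without the anchor."""
    taxa = leaf_set(tree)
    anchor = min(taxa)  # fixes which side of each split is recorded
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaf_set(node)
        side = taxa - below if anchor in below else below
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found

def robinson_foulds(reference, estimate):
    """Average of false negatives and false positives among the splits."""
    ref, est = bipartitions(reference), bipartitions(estimate)
    return (len(ref - est) + len(est - ref)) / 2

REFERENCE = (("A", "B"), ("C", "D"))
print(robinson_foulds(REFERENCE, ("A", ("B", ("C", "D")))))  # 0.0, same unrooted tree
print(robinson_foulds(REFERENCE, (("A", "C"), ("B", "D"))))  # 1.0
```

Dividing the result by the number of non-trivial splits in the reference tree gives a rate that can be compared against a threshold such as the 10% mentioned above.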

Comparison with previous analyses

To evaluate the methods, a well-understood family of languages with a reliable dataset is chosen. This family is often IE, but others have been used. After the methods to be compared have been applied to the database, the resulting trees are compared with the reference tree determined by traditional linguistic methods. The aim is to have no conflicts in topology (for example, no missing sub-groups) and compatible dates. The families suggested for this analysis by Nichols and Warnow are Germanic, Romance, Slavic, Common Turkic, Chinese, and Mixe–Zoque, as well as older groups such as Oceanic and IE.

Use of simulations

Although the use of real languages adds realism and provides real problems, the above method of validation suffers from the fact that the true evolution of the languages is unknown. When a set of data is generated from a simulated evolution, the correct tree is known. However, it will be a simplified version of reality. Thus both evaluation techniques should be used.
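A minimal sketch of such a simulation is given below, assuming a deliberately simple model that is not taken from any of the studies cited here: each character is binary, and on every branch it flips with a fixed probability `p_change`, ignoring branch lengths, borrowing and rate variation. The tree encoding (nested tuples with string leaves) and all names are hypothetical.

```python
import random

# Hypothetical "true" tree: nested tuples are internal nodes, strings are languages.
TRUE_TREE = ((("A", "B"), "C"), ("D", "E"))


def simulate_character(tree, state, p_change, rng):
    """Evolve one binary character down the tree; return {language: state}."""
    if isinstance(tree, str):
        return {tree: state}
    result = {}
    for child in tree:
        # On each branch the character flips with probability p_change.
        child_state = 1 - state if rng.random() < p_change else state
        result.update(simulate_character(child, child_state, p_change, rng))
    return result


def simulate_dataset(tree, n_characters, p_change, seed=0):
    """Return {language: [state, ...]} with n_characters simulated characters."""
    rng = random.Random(seed)
    matrix = {}
    for _ in range(n_characters):
        root_state = rng.randint(0, 1)
        states = simulate_character(tree, root_state, p_change, rng)
        for language, value in states.items():
            matrix.setdefault(language, []).append(value)
    return matrix
```

A reconstruction method can then be run on `simulate_dataset(TRUE_TREE, 200, 0.1)` and its output compared with `TRUE_TREE`, which is known exactly because it generated the data.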

Sensitivity analysis

To assess the robustness of a solution it is desirable to vary the input data and constraints, and observe the output. Each variable is changed slightly in turn. This analysis has been carried out in a number of cases and the methods were found to be robust, for example by Atkinson and Gray.
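The one-variable-at-a-time procedure can be sketched as follows. This is only an illustration under stated assumptions: `analyse` stands for any analysis that takes numeric keyword arguments and returns a single number (for instance an estimated root age), and the relative step of 5% is arbitrary. Neither is taken from the studies mentioned above.

```python
def one_at_a_time(analyse, baseline, step=0.05):
    """Perturb each input by -step and +step (relative) and record the effect.

    analyse  -- hypothetical function taking keyword arguments, returning a number
    baseline -- dict of input name -> baseline numeric value
    Returns (reference_output, {input name: largest absolute change in output}).
    """
    reference = analyse(**baseline)
    report = {}
    for name, value in baseline.items():
        changes = []
        for factor in (1 - step, 1 + step):
            # Change only this one input; all others stay at their baseline.
            perturbed = dict(baseline)
            perturbed[name] = value * factor
            changes.append(abs(analyse(**perturbed) - reference))
        report[name] = max(changes)
    return reference, report
```

Inputs whose perturbation barely moves the output indicate a robust result, while a large change flags an input that deserves closer scrutiny.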

Studies comparing methods

During the early 1990s, linguist Donald Ringe, with computer scientists Luay Nakhleh and Tandy Warnow, statistician Steven N. Evans and others, began collaborating on research in quantitative comparative linguistic projects. They later founded the CPHL project, the goals of which include: "producing and maintaining real linguistic datasets, in particular of Indo-European languages", "formulating statistical models that capture the evolution of historical linguistic data", "designing simulation tools and accuracy measures for generating synthetic data for studying the performance of reconstruction methods", and "developing and implementing statistically-based as well as combinatorial methods for reconstructing language phylogenies, including phylogenetic networks".

A comparison of coding methods was carried out by Rexova ''et al.'' (2003). They created a reduced data set from the Dyen database but with the addition of Hittite. They produced a standard multistate matrix where the 141 character states correspond to individual cognate classes, allowing polymorphism. They also joined some cognate classes to reduce subjectivity, with polymorphic states not allowed. Lastly they produced a binary matrix where each class of words was treated as a separate character. The matrices were analysed by PAUP. It was found that using the binary matrix produced changes near the root of the tree.

McMahon and McMahon (2003) used three PHYLIP programs (NJ, Fitch and Kitsch) on the DKB dataset. They found that the results produced were very similar. Bootstrapping was used to test the robustness of any part of the tree. Later they used subsets of the data to assess its retentiveness and reconstructability. The outputs showed topological differences which were attributed to borrowing. They then also used Network, Split Decomposition, Neighbor-net and SplitsTree on several data sets. Significant differences were found between the latter two methods. Neighbor-net was considered optimal for discerning language contact.

In 2005, Nakhleh, Warnow, Ringe and Evans carried out a comparison of six analysis methods using an Indo-European database. The methods compared were UPGMA, NJ, MP, MC, WMC and GA. The PAUP software package was used for UPGMA, NJ, and MC as well as for computing the majority consensus trees. The RWT database was used, but 40 characters were removed due to evidence of polymorphism. A screened database was then produced that excluded all characters clearly exhibiting parallel development, eliminating a further 38 features. The trees were evaluated on the basis of the number of incompatible characters and on agreement with established sub-grouping results. They found that UPGMA was clearly worst but that there was little difference between the other methods. The results depended on the data set used. It was found that weighting the characters was important, which requires linguistic judgement.

Saunders (2005) compared NJ, MP, GA and Neighbor-Net on a combination of lexical and typological data. He recommended use of the GA method, but Nichols and Warnow have some concerns about the study methodology. Cysouw ''et al.'' (2006) compared Holm's original method with NJ, Fitch, MP and SD. They found Holm's method to be less accurate than the others.

In 2013, François Barbancon, Warnow, Evans, Ringe and Nakhleh studied various tree reconstruction methods using simulated data. Their simulated data varied in the number of contact edges, the degree of homoplasy, the deviation from a lexical clock, and the deviation from the rates-across-sites assumption. It was found that the accuracy of the unweighted methods (MP, NJ, UPGMA, and GA) was consistent in all the conditions studied, with MP being the best. The accuracy of the two weighted methods (WMC and WMP) depended on the appropriateness of the weighting scheme. With low homoplasy the weighted methods generally produced the more accurate results, but inappropriate weighting could make these worse than MP or GA under moderate or high homoplasy levels.

Choosing the best model

Choice of an appropriate model is critical for the production of good phylogenetic analyses. Underparameterised or overly restrictive models may produce aberrant behaviour when their underlying assumptions are violated, while overly complex or overparameterised models require long run times and their parameters may be overfit (Sullivan and Joyce : Model selection in phylogenetics - Annual Review of Ecology, Evolution and Systematics 36 (2005)). The most common method of model selection is the "Likelihood Ratio Test", which produces an estimate of the fit between the model and the data, but as an alternative the Akaike Information Criterion or the Bayesian Information Criterion can be used. Model selection computer programs are available.
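The three criteria can be computed from the maximised log-likelihood of each candidate model, as in the following sketch. The function and parameter names are illustrative assumptions. The formulas are the standard textbook definitions, with k free parameters and n observations (for instance, the number of characters).

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_observations):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_observations) - 2 * log_likelihood


def lrt_statistic(loglik_simple, loglik_complex):
    """Likelihood ratio test statistic for two nested models.

    Under the usual regularity conditions it is compared with a chi-squared
    distribution whose degrees of freedom equal the difference in the
    number of free parameters.
    """
    return 2 * (loglik_complex - loglik_simple)
```

For instance, if a two-parameter model reaches ln L = -105 and a nested three-parameter model reaches ln L = -100, the test statistic is 10 with one degree of freedom, and the AIC values are 214 and 206 respectively, both favouring the richer model.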

Bibliography

*Atkinson, Nicholls, Welch and Gray : From words to dates - Transactions of the Philological Society 103 (2005).
*Bandelt and Dress : Split Decomposition - Molecular Phylogenetics and Evolution 1 (1992).
*Bandelt, Forster and Rohl : Median-joining networks for inferring intraspecific phylogenies - Molecular Biology and Evolution 16 (1999).
*Bryant, Filimon and Gray : Untangling our past: Languages, trees, splits and networks (in The Evolution of Cultural Diversity by Mace, Holden and Shennan, UCL 2005).
*Evans and Warnow : Unidentifiable divergence times in rates-across-sites models - IEEE/ACM Transactions on Computational Biology and Bioinformatics 1 (2005).
*Huelsenbeck and Ronquist : MrBayes, Bayesian inference of phylogeny - Bioinformatics 17 (2001).
*Huson : SplitsTree, a program for analysing and visualising evolutionary data - Bioinformatics 14(1) (1998).
*Warnow, Evans, Ringe and Nakhleh : A Stochastic Model of Language Evolution that Incorporates Homoplasy and Borrowing (in Phylogenetic Methods and the Prehistory of Languages - Forster and Renfrew, 2006).
*Efron, Halloran and Holmes : Bootstrap confidence levels for phylogenetic trees - Proceedings of the National Academy of Sciences USA 93 (1996).
*Kolaczkowski and Thornton : Performance of maximum parsimony and likelihood phylogenies when evolution is heterogeneous - Nature 431 (2004).
*Felsenstein : Cases in which parsimony and compatibility methods will be positively misleading - Systematic Zoology 27 (1978).
*Rogers : Maximum likelihood estimation of phylogenetic trees is consistent when substitution rates vary according to the invariable sites plus gamma distribution - Systematic Biology 59 (2001).