The Robinson–Foulds or symmetric difference metric, often abbreviated as the RF distance, is a simple way to calculate the distance between phylogenetic trees. It is defined as (A + B), where A is the number of partitions of data implied by the first tree but not the second tree and B is the number of partitions of data implied by the second tree but not the first tree (although some software implementations divide the RF metric by 2 and others scale the RF distance to have a maximum value of 1). The partitions are calculated for each tree by removing each branch. Thus, the number of eligible partitions for each tree is equal to the number of branches in that tree. RF distances have been criticized as biased, but they represent a relatively intuitive measure of the distances between phylogenetic trees and therefore remain widely used (the original 1981 paper describing Robinson–Foulds distances was cited more than 200 times in 2019 based on Google Scholar). Nevertheless, the biases inherent to the RF distances suggest that researchers should consider using "Generalized" Robinson–Foulds metrics that may have better theoretical and practical performance and avoid the biases and misleading attributes of the original metric.
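The definition can be illustrated with a minimal Python sketch. It assumes each tree has already been reduced to its set of partitions (splits), written out by hand here for two five-taxon trees. The helper names `canonical` and `rf_distance` are hypothetical. Only the partitions from internal branches are listed, because the partition made by cutting a leaf branch is shared by any two trees on the same taxa and so never contributes to A or B. The scaled value shown is one possible normalization, not the convention of any particular program.

```python
def canonical(side, taxa):
    """Name a split by the side that does not contain the smallest taxon,
    so that 'AB|CDE' and 'CDE|AB' compare equal."""
    side, taxa = frozenset(side), frozenset(taxa)
    return taxa - side if min(taxa) in side else side


def rf_distance(splits1, splits2):
    """Return A + B, the size of the symmetric difference of two split sets."""
    a = len(splits1 - splits2)  # partitions implied by the first tree only
    b = len(splits2 - splits1)  # partitions implied by the second tree only
    return a + b


taxa = {"A", "B", "C", "D", "E"}
# Internal-branch splits written out by hand (illustrative data) for the
# trees ((A,B),C,(D,E)) and ((A,C),B,(D,E)).
tree1 = {canonical(s, taxa) for s in [{"A", "B"}, {"D", "E"}]}
tree2 = {canonical(s, taxa) for s in [{"A", "C"}, {"D", "E"}]}

rf = rf_distance(tree1, tree2)               # A + B
halved = rf / 2                              # variant that divides by 2
normalized = rf / (len(tree1) + len(tree2))  # variant scaled to a maximum of 1
print(rf, halved, normalized)                # prints: 2 1.0 0.5
```

The two trees share the partition DE|ABC and differ in one partition each, so A = B = 1.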


Explanation

Given two unrooted trees of nodes and a set of labels (i.e., taxa) for each node (which could be empty, but only nodes with degree greater than or equal to three can be labeled by an empty set), the Robinson–Foulds metric finds the number of \alpha and \alpha^{-1} operations needed to convert one tree into the other. The number of operations defines their distance. Rooted trees can be examined by assigning a label to the leaf node. The authors define two trees to be the same if they are isomorphic and the isomorphism preserves the labeling.

The construction of the proof is based on a function called \alpha, which contracts an edge (combining its two end nodes and taking the union of their label sets). Conversely, \alpha^{-1} expands an edge (decontraction), where the label set can be split in any fashion. The \alpha function removes all edges from T_1 that are not in T_2, creating T_1 \wedge T_2, and then \alpha^{-1} is used to add the edges found only in T_2 to the tree T_1 \wedge T_2 to build T_2. The number of operations in these two procedures equals the number of edges in T_1 that are not in T_2 plus the number of edges in T_2 that are not in T_1. The sum of the operations is equivalent to a transformation from T_1 to T_2, or vice versa.
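The operation count described above can be written compactly. The notation E(T) for the set of edges of T, each identified by the partition of the labels it induces, and the counters k_1 and k_2 are assumptions introduced here for illustration; they are not taken from the original paper.

```latex
\begin{align*}
k_1 &= \lvert E(T_1) \setminus E(T_2) \rvert
    && \text{contractions } \alpha:\; T_1 \to T_1 \wedge T_2 \\
k_2 &= \lvert E(T_2) \setminus E(T_1) \rvert
    && \text{expansions } \alpha^{-1}:\; T_1 \wedge T_2 \to T_2 \\
d(T_1, T_2) &= k_1 + k_2
\end{align*}
```

For example, with T_1 = ((A,B),C,(D,E)) and T_2 = ((A,C),B,(D,E)), contracting the edge that separates {A,B} from the rest gives T_1 \wedge T_2, and expanding an edge that separates {A,C} from the rest gives T_2, so k_1 = k_2 = 1 and d(T_1, T_2) = 2.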


Properties

The RF distance corresponds to an equivalent similarity metric that reflects the resolution of the strict consensus of two trees, first used to compare trees in 1980. In their 1981 paper, Robinson and Foulds proved that the distance is in fact a metric.


Algorithms for computing the metric

In 1985, Day gave an algorithm based on perfect hashing that computes this distance in time linear in the number of nodes in the trees. A randomized algorithm that uses hash tables that are not necessarily perfect has been shown to approximate the Robinson–Foulds distance with a bounded error in sublinear time.
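The following Python sketch is not Day's algorithm and does not run in linear time. It only illustrates the general idea of storing each tree's partitions in a hash table and comparing the tables. It assumes trees are given as nested tuples of leaf-name strings. The function names are hypothetical. Because each partition is stored as a full set of leaves, the running time can grow quadratically with the number of leaves.

```python
def leaf_set(tree):
    """Return all leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial splits of a tree in a hash set."""
    taxa = leaf_set(tree)
    ref = min(taxa)  # reference taxon used to name each split consistently
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Keep only splits with at least two taxa on each side; the set
        # also merges the two halves of a root edge into a single split.
        if 2 <= len(below) <= len(taxa) - 2:
            found.add(taxa - below if ref in below else below)
        return below

    visit(tree)
    return found


def robinson_foulds(tree1, tree2):
    """Count the splits found in exactly one of the two trees."""
    if leaf_set(tree1) != leaf_set(tree2):
        raise ValueError("trees must have the same leaf labels")
    return len(splits(tree1) ^ splits(tree2))


t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))
print(robinson_foulds(t1, t2))  # prints: 2
```

A rooted-looking input such as `(("A", "B"), ("C", ("D", "E")))` yields the same splits as `t1`, since the two edges next to the root induce the same partition.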


Specific applications

In phylogenetics, the metric is often used to compute a distance between two trees. The treedist program in the PHYLIP suite offers this function, as do the RAxML_standard package, the DendroPy Python library (under the name "symmetric difference metric"), and the R packages TreeDist (`RobinsonFoulds()` function) and phangorn (`treedist()` function). For comparing groups of trees, the fastest implementations include HashRF and MrsRF. The Robinson–Foulds metric has also been used in quantitative comparative linguistics to compute distances between trees that represent how languages are related to each other.


Strengths and weaknesses

The RF metric remains widely used because counting the splits that differ between a pair of trees is, for many systematists, a relatively intuitive way to assess the differences among trees. This is the primary strength of the RF distance and the reason for its continued use in phylogenetics. Of course, the number of splits that differ between a pair of trees depends on the number of taxa in the trees, so one might argue that this unit is not meaningful. However, it is straightforward to normalize RF distances so they range between zero and one.

The RF metric nonetheless suffers from a number of theoretical and practical shortcomings:
* Relative to other metrics, it lacks sensitivity and is thus imprecise; it can take two fewer distinct values than there are taxa in a tree.
* It is rapidly saturated; very similar trees can be allocated the maximum distance value.
* Its value can be counterintuitive. One example is that moving a tip and its neighbour to a particular point on a tree generates a ''lower'' difference value than if just one of the two tips were moved to the same place.
* Its range of values can depend on tree shape: trees that contain many uneven partitions will command relatively lower distances, on average, than trees with many even partitions.
* It performs more poorly than many alternative measures in practical settings, based on simulated trees.

Another issue to consider when using RF distances is that differences in one clade may be trivial (perhaps if the clade resolves three species within a genus differently) or may be fundamental (if the clade is deep in the tree and defines two fundamental subgroups, such as mammals and birds). However, this issue is not a problem with RF distances per se; it is a more general criticism of tree distances. Regardless of the behaviour of any specific tree distance, a practicing evolutionary biologist might view some tree rearrangements as "important" and other rearrangements as "trivial". Tree distances are tools; they are most useful in the context of other information about the organisms in the trees.

These issues can be addressed by using less conservative metrics. "Generalized RF distances" recognize similarity between similar, but non-identical, splits; the original Robinson–Foulds distance does not take into account how similar two groupings are: if they are not identical, they are discarded. References for these generalized distances:
* Böcker S., Canzar S., Klau G.W. 2013. The generalized Robinson-Foulds metric. In: Darling A., Stoye J., editors. Algorithms in Bioinformatics. WABI 2013. Lecture Notes in Computer Science, vol 8126. Berlin, Heidelberg: Springer. p. 156–169.
* Bogdanowicz D., Giaro K. 2012. Matching split distance for unrooted binary phylogenetic trees. IEEE/ACM Trans. Comput. Biol. Bioinforma. 9:150–160.
* Bogdanowicz D., Giaro K. 2013. On a matching distance between rooted phylogenetic trees. Int. J. Appl. Math. Comput. Sci. 23:669–684.
* Nye T.M.W., Liò P., Gilks W.R. 2006. A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics. 22:117–119.

The best-performing generalized Robinson–Foulds distances have a basis in information theory, and measure the distance between trees in terms of the quantity of information that the trees' splits hold in common (measured in bits). The Clustering Information Distance (implemented in the R package TreeDist) is recommended as the most suitable alternative to the Robinson–Foulds distance.

An alternative approach to tree distance calculation is to use the quartet distance, which takes quartets (subsets of four leaves), rather than splits, as the basis for tree comparison.
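The saturation and low-resolution shortcomings can be illustrated with a small Python sketch. It assumes caterpillar (fully unbalanced) binary trees, whose non-trivial splits are simply the prefixes of the leaf order. The helper `caterpillar_splits` and the normalization by 2(n − 3), the largest RF value for two binary unrooted trees on n taxa, are choices made for this illustration.

```python
def caterpillar_splits(order):
    """Non-trivial splits of the caterpillar tree whose leaves appear in
    `order` (first two leaves form one end cherry, last two the other)."""
    taxa = frozenset(order)
    ref = min(taxa)  # reference taxon used to name each split consistently
    found = set()
    for k in range(2, len(order) - 1):  # prefixes of size 2 .. n - 2
        side = frozenset(order[:k])
        found.add(taxa - side if ref in side else side)
    return found


n = 8
order1 = list(range(1, n + 1))    # leaves 1, 2, ..., 8
order2 = order1[1:] + order1[:1]  # same tree with leaf 1 moved to the far end

rf = len(caterpillar_splits(order1) ^ caterpillar_splits(order2))
max_rf = 2 * (n - 3)              # two binary trees sharing no split
possible_values = list(range(0, max_rf + 1, 2))

print(rf, max_rf, rf / max_rf, len(possible_values))  # prints: 10 10 1.0 6
```

Moving a single leaf is enough to reach the maximum distance, and the normalized value is already 1. For binary trees the distance can take only the n − 2 even values from 0 to 2(n − 3), which is two fewer than the number of taxa.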


Further reading

* M. Bourque, Arbres de Steiner et reseaux dont certains sommets sont a localisation variable. PhD thesis, Université de Montréal, Montreal, Quebec, 1978. http://www.worldcat.org/title/arbres-de-steiner-et-reseaux-dont-certains-sommets-sont-a-localisation-variable/oclc/053538946
* William H. E. Day, "Optimal algorithms for comparing trees with labeled leaves", ''Journal of Classification'', Number 1, December 1985.
* V. Makarenkov and B. Leclerc, "Comparison of additive trees using circular orders", ''Journal of Computational Biology'', 7(5), 731–744, 2000. Mary Ann Liebert, Inc.