The Sørensen–Dice coefficient (see below for other names) is a
statistic
used to gauge the similarity of two
samples. It was independently developed by the
botanists
Thorvald Sørensen and
Lee Raymond Dice, who published in 1948 and 1945 respectively.
Name
The index is known by several other names, especially Sørensen–Dice index,
Sørensen index and Dice's coefficient. Other variations include the "similarity coefficient" or "index", such as Dice similarity coefficient (DSC). Common alternate spellings for Sørensen are ''Sorenson'', ''Soerenson'' and ''Sörenson'', and all three can also be seen with the ''–sen'' ending.
Other names include:
* F1 score
* Czekanowski's binary (non-quantitative) index
* Measure of genetic similarity
* Zijdenbos similarity index, referring to a 1994 paper of Zijdenbos et al.
Formula
Sørensen's original formula was intended to be applied to discrete data. Given two sets, X and Y, it is defined as
:<math>DSC = \frac{2 |X \cap Y|}{|X| + |Y|}</math>
where <math>|X|</math> and <math>|Y|</math> are the cardinalities of the two sets (i.e. the number of elements in each set).
The Sørensen index equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.
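The set definition above can be sketched in a few lines of Python. The function name `dice_coefficient` is illustrative, and returning 1.0 for two empty sets is an assumption made here, since the formula itself is undefined (0/0) in that case.

```python
def dice_coefficient(x: set, y: set) -> float:
    """Sørensen–Dice coefficient of two finite sets: 2|X ∩ Y| / (|X| + |Y|)."""
    if not x and not y:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


# Two shared elements (b, c); set sizes 3 and 4, so the score is 4/7.
score = dice_coefficient({"a", "b", "c"}, {"b", "c", "d", "e"})
print(score)
```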
When applied to Boolean data, using the definition of true positive (TP), false positive (FP), and false negative (FN), it can be written as
:<math>DSC = \frac{2 \mathit{TP}}{2 \mathit{TP} + \mathit{FP} + \mathit{FN}}</math>.
It is different from the
Jaccard index which only counts true positives once in both the numerator and denominator. DSC is the quotient of similarity and ranges between 0 and 1. It can be viewed as a
similarity measure over sets.
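As a minimal sketch of the Boolean form, the following Python computes the coefficient from true positive, false positive, and false negative counts, alongside the Jaccard index for comparison. The function names and sample labels are hypothetical, and returning 1.0 when all counts are zero is an assumption.

```python
def dice_from_counts(tp: int, fp: int, fn: int) -> float:
    """Dice coefficient 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    # Assumption: with no positives anywhere, treat the two labelings as identical.
    return 2 * tp / denom if denom else 1.0


def jaccard_from_counts(tp: int, fp: int, fn: int) -> float:
    """Jaccard index TP / (TP + FP + FN); true positives are counted once."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0


predicted = [True, True, False, True, False]
actual = [True, False, False, True, True]

tp = sum(p and a for p, a in zip(predicted, actual))
fp = sum(p and not a for p, a in zip(predicted, actual))
fn = sum(a and not p for p, a in zip(predicted, actual))

print(tp, fp, fn)                       # 2 1 1
print(dice_from_counts(tp, fp, fn))     # 4/6, about 0.667
print(jaccard_from_counts(tp, fp, fn))  # 2/4 = 0.5
```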
Similarly to the
Jaccard index, the set operations can be expressed in terms of vector operations over binary vectors a and b:
:<math>s_v = \frac{2 |\mathbf{a} \cdot \mathbf{b}|}{|\mathbf{a}|^2 + |\mathbf{b}|^2}</math>
which gives the same outcome over binary vectors and also gives a more general similarity metric over vectors in general terms.
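The vector form can be illustrated as follows, taking the coefficient as twice the absolute dot product over the sum of the squared norms. This is a sketch in plain Python; the name `dice_vectors` and the handling of two all-zero vectors are assumptions.

```python
def dice_vectors(a, b) -> float:
    """Vector form of the Dice coefficient: 2|a·b| / (|a|^2 + |b|^2)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_sq = sum(ai * ai for ai in a) + sum(bi * bi for bi in b)
    if norm_sq == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 1.0
    return 2 * abs(dot) / norm_sq


# Binary indicator vectors for the sets {0, 1, 3} and {0, 2, 3}:
# the result equals the set-based value 2*2 / (3 + 3).
print(dice_vectors([1, 1, 0, 1], [1, 0, 1, 1]))
```

On binary vectors the squared norm is simply the number of ones, which is why the result matches the set definition.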
For sets ''X'' and ''Y'' of keywords used in
information retrieval, the coefficient may be defined as twice the shared information (intersection) over the sum of cardinalities.
When taken as a
string similarity
measure, the coefficient may be calculated for two strings, ''x'' and ''y'' using
bigrams as follows:
:<math>s = \frac{2 n_t}{n_x + n_y}</math>
where ''n''<sub>''t''</sub> is the number of character bigrams found in both strings, ''n''<sub>''x''</sub> is the number of bigrams in string ''x'' and ''n''<sub>''y''</sub> is the number of bigrams in string ''y''. For example, to calculate the similarity between:
:night
:nacht
We would find the set of bigrams in each word:
:{ni, ig, gh, ht}
:{na, ac, ch, ht}
Each set has four elements, and the intersection of these two sets has only one element: ht.
Inserting these numbers into the formula, we calculate, ''s'' = (2 · 1) / (4 + 4) = 0.25.
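The bigram calculation above can be reproduced with a short Python sketch. The helper names `bigrams` and `dice_bigram_similarity` are illustrative. Bigrams are collected into sets, as in the worked example, so a bigram that repeats within a string is counted once; a multiset variant would be a different design choice.

```python
def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_bigram_similarity(x: str, y: str) -> float:
    """s = 2 * n_t / (n_x + n_y) over the bigram sets of x and y."""
    bx, by = bigrams(x), bigrams(y)
    if not bx and not by:
        # Assumption: strings too short to contain a bigram are treated as identical.
        return 1.0
    return 2 * len(bx & by) / (len(bx) + len(by))


print(sorted(bigrams("night")))                  # ['gh', 'ht', 'ig', 'ni']
print(sorted(bigrams("nacht")))                  # ['ac', 'ch', 'ht', 'na']
print(dice_bigram_similarity("night", "nacht"))  # 0.25
```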
Continuous Dice Coefficient
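For a discrete (binary) ground truth ''A'' and continuous measures ''B'' in the interval [0, 1], the following formula can be used:
:<math>cDC = \frac{2 |A \cap B|}{c\,|A| + |B|}</math>
where <math>|A \cap B| = \sum_i a_i b_i</math> and <math>|B| = \sum_i b_i</math>, and ''c'' can be computed as follows:
:<math>c = \frac{\sum_i a_i b_i}{\sum_i a_i \operatorname{sign}(b_i)}</math>
If <math>\sum_i a_i \operatorname{sign}(b_i) = 0</math>, which means no overlap between ''A'' and ''B'', ''c'' is set to 1 arbitrarily.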
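The following Python sketch illustrates one reading of the continuous coefficient, under stated assumptions: ''A'' is a binary (0/1) ground-truth vector, ''B'' holds values in [0, 1], the score is 2 Σ a_i b_i / (c Σ a_i + Σ b_i), and c is Σ a_i b_i divided by Σ a_i sign(b_i), falling back to 1 when there is no overlap. The function name and the sample vectors are hypothetical.

```python
def continuous_dice(a, b) -> float:
    """Continuous Dice for binary ground truth a and measures b in [0, 1]."""
    inter = sum(ai * bi for ai, bi in zip(a, b))  # sum of a_i * b_i
    size_a = sum(a)
    size_b = sum(b)
    # sum of a_i * sign(b_i): ground-truth elements where b is positive
    overlap_count = sum(ai for ai, bi in zip(a, b) if bi > 0)
    # No overlap between A and B: c is set to 1 arbitrarily.
    c = inter / overlap_count if overlap_count > 0 else 1.0
    denom = c * size_a + size_b
    # Assumption: two empty inputs are treated as identical.
    return 2 * inter / denom if denom else 1.0


truth = [1, 1, 0, 0]
print(continuous_dice(truth, [0.5, 0.5, 0.0, 0.0]))  # 1.0: right support, partial values
print(continuous_dice(truth, [1, 0, 1, 0]))          # 0.5: same as the ordinary Dice
```

With a binary ''B'', c equals 1 and the expression reduces to the ordinary coefficient, as the second call shows.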
Difference from Jaccard
This coefficient is not very different in form from the Jaccard index. In fact, both are equivalent in the sense that given a value for the Sørensen–Dice coefficient <math>S</math>, one can calculate the respective Jaccard index value <math>J</math> and vice versa, using the equations <math>J = S/(2-S)</math> and <math>S = 2J/(1+J)</math>.
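The conversion between the two indices, J = S/(2 − S) and S = 2J/(1 + J), can be checked on a concrete pair of sets. This is a small sketch using exact fractions; the function names and the example sets are illustrative.

```python
from fractions import Fraction


def dice_to_jaccard(s):
    """J = S / (2 - S)."""
    return s / (2 - s)


def jaccard_to_dice(j):
    """S = 2J / (1 + J)."""
    return 2 * j / (1 + j)


x, y = {1, 2, 3}, {2, 3, 4, 5}
common = len(x & y)

s = Fraction(2 * common, len(x) + len(y))  # Dice computed directly: 4/7
j = Fraction(common, len(x | y))           # Jaccard computed directly: 2/5

print(s, j)                # 4/7 2/5
print(dice_to_jaccard(s))  # 2/5
print(jaccard_to_dice(j))  # 4/7
```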
Since the Sørensen–Dice coefficient does not satisfy the triangle inequality, it can be considered a semimetric version of the Jaccard index.
The function ranges between zero and one, like Jaccard. Unlike Jaccard, the corresponding difference function
:<math>d = 1 - \frac{2 |X \cap Y|}{|X| + |Y|}</math>
is not a proper distance metric as it does not satisfy the triangle inequality (Gallagher, E.D., 1999, COMPAH Documentation, University of Massachusetts, Boston). The simplest counterexample of this is given by the three sets {a}, {b}, and {a, b}, the distance between the first two being 1, and the distance between the third and each of the others being one-third. To satisfy the triangle inequality, the sum of ''any'' two of these three sides must be greater than or equal to the remaining side. However, the distance between {a} and {a, b} plus the distance between {a, b} and {b} equals 2/3 and is therefore less than the distance between {a} and {b}, which is 1.
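The counterexample with the sets {a}, {b}, and {a, b} can be verified numerically. The sketch below uses exact fractions and, for contrast, also evaluates the Jaccard distance on the same three sets; the function names are illustrative and both functions assume non-empty inputs.

```python
from fractions import Fraction


def dice_distance(x: set, y: set) -> Fraction:
    """d = 1 - 2|X ∩ Y| / (|X| + |Y|), for non-empty sets."""
    return 1 - Fraction(2 * len(x & y), len(x) + len(y))


def jaccard_distance(x: set, y: set) -> Fraction:
    """1 - |X ∩ Y| / |X ∪ Y|, for non-empty sets."""
    return 1 - Fraction(len(x & y), len(x | y))


a, b, ab = {"a"}, {"b"}, {"a", "b"}

direct = dice_distance(a, b)                         # 1
detour = dice_distance(a, ab) + dice_distance(ab, b)  # 1/3 + 1/3 = 2/3
print(direct, detour, detour >= direct)              # 1 2/3 False

# The Jaccard distance satisfies the inequality on the same sets: 1/2 + 1/2 >= 1.
print(jaccard_distance(a, ab) + jaccard_distance(ab, b) >= jaccard_distance(a, b))  # True
```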
Applications
The Sørensen–Dice coefficient is useful for ecological community data (e.g. Looman & Campbell, 1960). Justification for its use is primarily empirical rather than theoretical (although it can be justified theoretically as the intersection of two
fuzzy sets). As compared to Euclidean distance, the Sørensen distance retains sensitivity in more heterogeneous data sets and gives less weight to outliers. Recently the Dice score (and its variations, e.g. logDice taking a logarithm of it) has become popular in computer lexicography for measuring the lexical association score of two given words.
logDice is also used as part of the Mash Distance for genome and metagenome distance estimation.
Finally, Dice is used in
image segmentation, in particular for comparing algorithm output against reference masks in medical applications.
Abundance version
The expression is easily extended to
abundance instead of presence/absence of species. This quantitative version is known by several names:
* Quantitative Sørensen–Dice index
* Quantitative Sørensen index
* Quantitative Dice index
* Bray–Curtis similarity (1 minus the ''Bray–Curtis dissimilarity'')
* Czekanowski's quantitative index
* Steinhaus index
* Pielou's percentage similarity
* 1 minus the Hellinger distance
* Proportion of specific agreement or positive agreement
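The text above does not spell out the abundance formula. As a hedged sketch, the Python below assumes the form usually given for the Bray–Curtis similarity: twice the sum of the lesser abundance of each species, divided by the total abundance across both samples. The function name and the sample counts are hypothetical.

```python
def quantitative_dice(x, y) -> float:
    """Assumed abundance form: 2 * sum(min(x_i, y_i)) / (sum(x) + sum(y)).

    x and y are non-negative abundance counts over the same ordered species list.
    """
    if len(x) != len(y):
        raise ValueError("samples must cover the same species list")
    shared = sum(min(xi, yi) for xi, yi in zip(x, y))
    total = sum(x) + sum(y)
    # Assumption: two empty samples are treated as identical.
    return 2 * shared / total if total else 1.0


site_1 = [6, 7, 4]   # hypothetical counts for three species
site_2 = [10, 0, 6]
print(quantitative_dice(site_1, site_2))  # 2 * (6 + 0 + 4) / (17 + 16) = 20/33

# With presence/absence (0/1) data it reduces to the set-based coefficient.
print(quantitative_dice([1, 1, 0], [1, 0, 1]))  # 0.5
```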
See also
* Correlation
* F1 score
* Jaccard index
* Hamming distance
* Mantel test
* Morisita's overlap index
* Most frequent k characters
* Overlap coefficient
* Renkonen similarity index (due to Olavi Renkonen)
* Tversky index
* Universal adaptive strategy theory (UAST)