The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements; this is the adjusted Rand index. The Rand index is the accuracy of determining whether a link (a pair of elements) belongs within a cluster or not.
Rand index
Definition
Given a set of $n$ elements $S = \{o_1, \ldots, o_n\}$ and two partitions of $S$ to compare, $X = \{X_1, \ldots, X_r\}$, a partition of $S$ into $r$ subsets, and $Y = \{Y_1, \ldots, Y_s\}$, a partition of $S$ into $s$ subsets, define the following:
* $a$, the number of pairs of elements in $S$ that are in the same subset in $X$ and in the same subset in $Y$
* $b$, the number of pairs of elements in $S$ that are in different subsets in $X$ and in different subsets in $Y$
* $c$, the number of pairs of elements in $S$ that are in the same subset in $X$ and in different subsets in $Y$
* $d$, the number of pairs of elements in $S$ that are in different subsets in $X$ and in the same subset in $Y$
The Rand index, $R$, is:
: $R = \frac{a+b}{a+b+c+d} = \frac{a+b}{\binom{n}{2}}$
Intuitively, $a + b$ can be considered as the number of agreements between $X$ and $Y$, and $c + d$ as the number of disagreements between $X$ and $Y$.
Since the denominator is the total number of pairs, the Rand index represents the ''frequency of occurrence'' of agreements over the total pairs, or the probability that $X$ and $Y$ will agree on a randomly chosen pair.
$\binom{n}{2}$ is calculated as $n(n-1)/2$.
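As a small worked illustration, the pair counts can be tallied by hand. The six-element set and the two partitions below are made up for this example and do not come from the source; $a$ counts pairs grouped together in both partitions, $b$ pairs separated in both, $c$ pairs together only in the first, and $d$ pairs together only in the second.

```latex
\begin{align*}
S &= \{o_1, \ldots, o_6\}, \qquad \binom{6}{2} = \frac{6 \cdot 5}{2} = 15 \text{ pairs} \\
X &= \{\{o_1, o_2, o_3\}, \{o_4, o_5, o_6\}\}, \qquad
Y = \{\{o_1, o_2\}, \{o_3, o_4\}, \{o_5, o_6\}\} \\
a &= 2 \quad (\{o_1, o_2\},\ \{o_5, o_6\}) \\
c &= 4 \quad (\{o_1, o_3\},\ \{o_2, o_3\},\ \{o_4, o_5\},\ \{o_4, o_6\}) \\
d &= 1 \quad (\{o_3, o_4\}) \\
b &= 15 - a - c - d = 8 \\
R &= \frac{a + b}{a + b + c + d} = \frac{2 + 8}{15} = \frac{2}{3}
\end{align*}
```

Here the two partitions agree on 10 of the 15 pairs, so a randomly chosen pair is treated consistently with probability 2/3.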
Similarly, one can also view the Rand index as a measure of the proportion of correct decisions made by the algorithm. It can be computed using the following formula:
:: $RI = \frac{TP + TN}{TP + FP + FN + TN}$
:where $TP$ is the number of true positives, $TN$ is the number of true negatives, $FP$ is the number of false positives, and $FN$ is the number of false negatives.
Properties
The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.
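In mathematical terms, a, b, c, d are defined as follows:
* $a = |\{(o_i, o_j) : o_i, o_j \in X_k,\ o_i, o_j \in Y_l\}|$
* $b = |\{(o_i, o_j) : o_i \in X_{k_1},\ o_j \in X_{k_2},\ o_i \in Y_{l_1},\ o_j \in Y_{l_2}\}|$
* $c = |\{(o_i, o_j) : o_i, o_j \in X_k,\ o_i \in Y_{l_1},\ o_j \in Y_{l_2}\}|$
* $d = |\{(o_i, o_j) : o_i \in X_{k_1},\ o_j \in X_{k_2},\ o_i, o_j \in Y_l\}|$
for some $1 \le i < j \le n$, subsets $X_k, X_{k_1}, X_{k_2}$ of the first partition with $1 \le k, k_1, k_2 \le r$ and $k_1 \ne k_2$, and subsets $Y_l, Y_{l_1}, Y_{l_2}$ of the second partition with $1 \le l, l_1, l_2 \le s$ and $l_1 \ne l_2$.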
Relationship with classification accuracy
The Rand index can also be viewed through the prism of binary classification accuracy over the pairs of elements in $S$. The two class labels are "$o_i$ and $o_j$ are in the same subset in $X$ and $Y$" and "$o_i$ and $o_j$ are in different subsets in $X$ and $Y$".
In that setting, $a$ is the number of pairs correctly labeled as belonging to the same subset (true positives), and $b$ is the number of pairs correctly labeled as belonging to different subsets (true negatives).
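The pairwise-classification view can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `pairwise_confusion` and `pairwise_accuracy` and the example labelings are hypothetical, and one clustering is arbitrarily treated as the reference and the other as the prediction.

```python
from itertools import combinations


def pairwise_confusion(reference, predicted):
    """Count (tp, tn, fp, fn) over all unordered pairs of elements.

    Both arguments are sequences of cluster labels, one per element.
    A pair is "positive" when its two elements share a cluster.
    """
    if len(reference) != len(predicted):
        raise ValueError("both labelings must cover the same elements")
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(reference)), 2):
        ref_same = reference[i] == reference[j]
        pred_same = predicted[i] == predicted[j]
        if pred_same and ref_same:
            tp += 1  # together in both clusterings
        elif not pred_same and not ref_same:
            tn += 1  # separated in both clusterings
        elif pred_same:
            fp += 1  # together only in the prediction
        else:
            fn += 1  # together only in the reference
    return tp, tn, fp, fn


def pairwise_accuracy(reference, predicted):
    """Rand index expressed as accuracy over element pairs."""
    tp, tn, fp, fn = pairwise_confusion(reference, predicted)
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("need at least two elements to form a pair")
    return (tp + tn) / total


# Hypothetical example: six elements, two vs. three clusters.
reference = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 2, 2]
print(pairwise_confusion(reference, predicted))
print(pairwise_accuracy(reference, predicted))
```

The loop visits every pair, so it runs in quadratic time; that is fine for illustration but slow for large data sets. Swapping the roles of the two labelings exchanges the false positive and false negative counts and leaves the accuracy unchanged.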
Adjusted Rand index
The adjusted Rand index is the corrected-for-chance version of the Rand index. Such a correction for chance establishes a baseline by using the expected similarity of all pair-wise comparisons between clusterings specified by a random model. Traditionally, the Rand index was corrected using the permutation model for clusterings (the number and size of clusters within a clustering are fixed, and all random clusterings are generated by shuffling the elements between the fixed clusters). However, the premises of the permutation model are frequently violated; in many clustering scenarios, either the number of clusters or the size distribution of those clusters varies drastically. For example, in ''k''-means clustering the number of clusters is fixed by the practitioner, but the sizes of those clusters are inferred from the data. Variations of the adjusted Rand index account for different models of random clusterings.
Though the Rand index may only yield a value between 0 and +1, the adjusted Rand index can yield negative values if the index is less than the expected index.
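The possibility of negative values can be checked with a short sketch. The code below assumes the commonly used permutation-model form of the adjusted Rand index, (index − expected index) / (maximum index − expected index), computed from co-occurrence counts; the function name `adjusted_rand_index`, the example labelings, and the convention of returning 1.0 in the degenerate 0/0 case are choices made for this illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_x, labels_y):
    """Adjusted Rand index under the permutation model (illustrative sketch)."""
    n = len(labels_x)
    if n != len(labels_y):
        raise ValueError("both labelings must cover the same elements")
    if n < 2:
        raise ValueError("need at least two elements to form a pair")

    # Co-occurrence counts: how many elements fall in each (x-cluster, y-cluster) cell.
    cells = Counter(zip(labels_x, labels_y))
    sizes_x = Counter(labels_x)
    sizes_y = Counter(labels_y)

    # Pairs grouped together in both clusterings, in X, and in Y respectively.
    index = sum(comb(count, 2) for count in cells.values())
    pairs_x = sum(comb(count, 2) for count in sizes_x.values())
    pairs_y = sum(comb(count, 2) for count in sizes_y.values())

    expected = pairs_x * pairs_y / comb(n, 2)
    maximum = (pairs_x + pairs_y) / 2
    if maximum == expected:
        # Degenerate 0/0 case (e.g. both clusterings put everything in one
        # cluster); returning 1.0 here is an assumed convention.
        return 1.0
    return (index - expected) / (maximum - expected)


# Two clusterings that split every pair the other one joins: worse than chance.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # roughly -0.5
# The same grouping under different label names: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

In the first call the unadjusted Rand index is 2/6, which lies below what random shuffling with the same cluster sizes would be expected to produce, so the adjusted value drops below zero.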
The contingency table
Given a set $S$ of $n$ elements, and two groupings or partitions (''e.g.'' clusterings) of these elements, namely $X = \{X_1, X_2, \ldots, X_r\}$ and $Y = \{Y_1, Y_2, \ldots, Y_s\}$, the overlap between $X$ and $Y$ can be summarized in a contingency table