An index of qualitative variation (IQV) is a measure of statistical dispersion in nominal distributions. There are a variety of these, but they have been relatively little-studied in the statistics literature. The simplest is the variation ratio, while more complex indices include the information entropy.
Properties
There are several types of indices used for the analysis of nominal data. Several are standard statistics that are used elsewhere: range, standard deviation, variance, mean deviation, coefficient of variation, median absolute deviation, interquartile range and quartile deviation.
In addition to these, several statistics have been developed with nominal data in mind. A number have been summarized and devised by Wilcox, who requires the following standardization properties to be satisfied:
* Variation varies between 0 and 1.
* Variation is 0 if and only if all cases belong to a single category.
* Variation is 1 if and only if cases are evenly divided across all categories.
In particular, the value of these standardized indices does not depend on the number of categories or number of samples.
For any index, the closer the distribution is to uniform, the larger the variation; the larger the differences in frequencies across categories, the smaller the variation.
Indices of qualitative variation are then analogous to information entropy, which is minimized when all cases belong to a single category and maximized in a uniform distribution. Indeed, information entropy can be used as an index of qualitative variation.
One characterization of a particular index of qualitative variation (IQV) is as a ratio of observed differences to maximum differences.
Wilcox's indexes
Wilcox gives a number of formulae for various indices of QV. The first, which he designates DM for "Deviation from the Mode", is a standardized form of the variation ratio, and is analogous to variance as deviation from the mean.
ModVR
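The formula for the variation around the mode (ModVR) is derived as follows:
: <math>M = \sum_{i=1}^K (f_m - f_i)</math>
where ''f''<sub>''m''</sub> is the modal frequency, ''K'' is the number of categories and ''f''<sub>''i''</sub> is the frequency of the ''i''th group.
This can be simplified to
: <math>M = K f_m - N</math>
where ''N'' is the total size of the sample.
Freeman's index (or variation ratio) is [Freeman LC (1965) ''Elementary applied statistics''. New York: John Wiley and Sons pp. 40–43]
: <math>v = 1 - \frac{f_m}{N}</math>
This is related to ''M'' as follows:
: <math>\frac{(f_m / N) - (1 / K)}{(N / K)(K - 1) / N} = \frac{M}{N(K - 1)}</math>
The ModVR is defined as
: <math>\operatorname{ModVR} = 1 - \frac{K f_m - N}{N(K - 1)} = \frac{K(N - f_m)}{N(K - 1)} = \frac{K v}{K - 1}</math>
where ''v'' is Freeman's index.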
Low values of ModVR correspond to a small amount of variation and high values to larger amounts of variation.
When ''K'' is large, ModVR is approximately equal to Freeman's index ''v''.
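As a minimal sketch, the two quantities in this section can be computed from a list of category counts. It assumes the usual definitions ''v'' = 1 − ''f''<sub>''m''</sub>/''N'' and ModVR = ''Kv''/(''K'' − 1); the function names are illustrative only and not taken from any package.

```python
def variation_ratio(freqs):
    """Freeman's index v = 1 - f_m / N for a list of category counts."""
    n = sum(freqs)
    return 1 - max(freqs) / n


def mod_vr(freqs):
    """ModVR = K * v / (K - 1); needs at least two categories."""
    k = len(freqs)
    return k * variation_ratio(freqs) / (k - 1)


print(mod_vr([10, 10, 10, 10]))  # cases evenly divided across categories
print(mod_vr([40, 0, 0, 0]))     # all cases in a single category
print(mod_vr([5, 3, 2]))         # an intermediate case
```

The first two calls illustrate the standardization properties listed under Properties: the index is maximal for an even split and zero when one category holds every case.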
RanVR
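This is based on the range around the mode. It is defined to be
: <math>\operatorname{RanVR} = 1 - \frac{f_m - f_l}{f_m} = \frac{f_l}{f_m}</math>
where ''f''<sub>''m''</sub> is the modal frequency and ''f''<sub>''l''</sub> is the lowest frequency.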
AvDev
This is an analog of the mean deviation. It is defined as the arithmetic mean of the absolute differences of each value from the mean.
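: <math>\operatorname{AvDev} = 1 - \frac{1}{2N} \frac{K}{K - 1} \sum_{i=1}^K \left| f_i - \frac{N}{K} \right|</math>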
MNDif
This is an analog of the mean difference - the average of the differences of all the possible pairs of variate values, taken regardless of sign. The mean difference differs from the mean and standard deviation because it is dependent on the spread of the variate values among themselves and not on the deviations from some central value.
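: <math>\operatorname{MNDif} = 1 - \frac{1}{N(K - 1)} \sum_{i=1}^{K-1} \sum_{j=i+1}^K | f_i - f_j |</math>
where ''f''<sub>''i''</sub> and ''f''<sub>''j''</sub> are the ''i''th and ''j''th frequencies respectively.
The MNDif is the Gini coefficient applied to qualitative data.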
VarNC
This is an analog of the variance.
: <math>\operatorname{VarNC} = 1 - \frac{1}{N^2} \frac{K}{K - 1} \sum_{i=1}^K \left( f_i - \frac{N}{K} \right)^2</math>
It is the same index as Mueller and Schuessler's Index of Qualitative Variation [Mueller JH, Schuessler KF (1961) Statistical reasoning in sociology. Boston: Houghton Mifflin Company. pp. 177–179] and Gibbs' ''M''2 index.
It is distributed as a chi square variable with ''K'' – 1 degrees of freedom.
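The three deviation-based indices above (AvDev, MNDif and VarNC) can be sketched together. The formulas coded here are the standardized forms usually attributed to Wilcox and are an assumption where the text does not spell them out; the function names are hypothetical.

```python
from itertools import combinations


def avdev(freqs):
    """1 - K / (2N(K-1)) * sum |f_i - N/K|."""
    n, k = sum(freqs), len(freqs)
    return 1 - k / (2 * n * (k - 1)) * sum(abs(f - n / k) for f in freqs)


def mndif(freqs):
    """1 - sum over pairs |f_i - f_j| / (N(K-1))."""
    n, k = sum(freqs), len(freqs)
    pair_sum = sum(abs(fi - fj) for fi, fj in combinations(freqs, 2))
    return 1 - pair_sum / (n * (k - 1))


def varnc(freqs):
    """1 - K / (N^2 (K-1)) * sum (f_i - N/K)^2."""
    n, k = sum(freqs), len(freqs)
    return 1 - k / (n ** 2 * (k - 1)) * sum((f - n / k) ** 2 for f in freqs)


counts = [6, 3, 1]
print(avdev(counts), mndif(counts), varnc(counts))
```

All three return 0 when every case falls in one category and 1 for an even split, so they differ only in how they weight intermediate distributions.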
StDev
Wilcox has suggested two versions of this statistic.
The first is based on AvDev.
:
The second is based on MNDif
:
HRel
This index was originally developed by Claude Shannon for use in specifying the properties of communication channels.
: <math>\operatorname{HRel} = \frac{-\sum_{i=1}^K p_i \log_2 p_i}{\log_2 K}</math>
where ''p''<sub>''i''</sub> = ''f''<sub>''i''</sub> / ''N''.
This is equivalent to information entropy divided by log<sub>2</sub>(''K'') and is useful for comparing relative variation between frequency tables of multiple sizes.
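A short sketch of HRel, assuming it is the Shannon entropy in bits divided by its maximum log<sub>2</sub>(''K''). Empty categories are treated as contributing nothing to the entropy, which is the usual convention; the function name is illustrative.

```python
import math


def hrel(freqs):
    """Relative entropy: Shannon entropy (base 2) divided by log2(K)."""
    n, k = sum(freqs), len(freqs)
    # Categories with zero count contribute 0 (limit of p * log p as p -> 0).
    h = -sum((f / n) * math.log2(f / n) for f in freqs if f > 0)
    return h / math.log2(k)


print(hrel([5, 5, 5, 5]))  # even split
print(hrel([2, 1, 1, 0]))  # uneven split over four categories
```

Dividing by log<sub>2</sub>(''K'') is what makes tables with different numbers of categories comparable.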
B index
Wilcox adapted a proposal of Kaiser
[Kaiser HF (1968) "A measure of the population quality of legislative apportionment." ''The American Political Science Review'' 62 (1) 208] based on the geometric mean and created the ''B''′ index. The ''B'' index is defined as
: <math>B = 1 - \sqrt{1 - \left[ \sqrt[K]{\prod_{i=1}^K \frac{f_i K}{N}} \right]^2}</math>
R packages
Several of these indices have been implemented in the R language.
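Gibbs' indices and related formulae
Gibbs and Poston proposed six indexes.
''M''1
The unstandardized index (''M''1) is
: <math>M1 = 1 - \sum_{i=1}^K p_i^2</math>
where ''K'' is the number of categories and ''p''<sub>''i''</sub> = ''f''<sub>''i''</sub> / ''N'' is the proportion of observations that fall in a given category ''i''.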
''M''1 can be interpreted as one minus the likelihood that a random pair of samples will belong to the same category, so this formula for IQV is a standardized likelihood of a random pair falling in the same category. This index has also been referred to as the index of differentiation, the index of sustenance differentiation and the geographical differentiation index, depending on the context in which it has been used.
''M''2
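A second index, ''M''2, is
: <math>M2 = \frac{K}{K - 1} \left( 1 - \sum_{i=1}^K p_i^2 \right)</math>
where ''K'' is the number of categories and ''p''<sub>''i''</sub> is the proportion of observations that fall in a given category ''i''. The factor of ''K''/(''K'' − 1) is for standardization.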
''M''1 and ''M''2 can be interpreted in terms of variance of a multinomial distribution (there called an "expanded binomial model"). ''M''1 is the variance of the multinomial distribution and ''M''2 is the ratio of the variance of the multinomial distribution to the variance of a binomial distribution.
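The first two Gibbs indices can be sketched as follows, assuming ''M''1 = 1 − Σ''p''<sub>''i''</sub><sup>2</sup> and ''M''2 = ''K''/(''K'' − 1) · ''M''1 as described above. The function names are illustrative.

```python
def m1(freqs):
    """Unstandardized index: one minus the sum of squared proportions."""
    n = sum(freqs)
    return 1 - sum((f / n) ** 2 for f in freqs)


def m2(freqs):
    """Standardized index: M1 rescaled by K / (K - 1) so that it peaks at 1."""
    k = len(freqs)
    return k / (k - 1) * m1(freqs)


counts = [6, 3, 1]
print(m1(counts), m2(counts))
```

With three categories the largest value ''M''1 can take is 2/3, which is why the ''K''/(''K'' − 1) factor is needed to reach 1 for an even split.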
''M''4
The ''M''4 index is
:
where ''m'' is the mean.
''M''6
The formula for ''M''6 is
:
where ''K'' is the number of categories, ''X''<sub>''i''</sub> is the number of data points in the ''i''th category, ''N'' is the total number of data points, | | is the absolute value (modulus) and
:
This formula can be simplified
:
where ''p''<sub>''i''</sub> is the proportion of the sample in the ''i''th category.
In practice ''M''1 and ''M''6 tend to be highly correlated which militates against their combined use.
Related indices
The sum
: <math>\sum_{i=1}^K p_i^2</math>
has also found application. This is known as the Simpson index in ecology and as the Herfindahl index or the Herfindahl-Hirschman index (HHI) in economics. A variant of this is known as the Hunter–Gaston index in microbiology.
In linguistics and cryptanalysis this sum is known as the repeat rate. The incidence of coincidence (''IC'') is an unbiased estimator of this statistic [Friedman WF (1925) The incidence of coincidence and its applications in cryptanalysis. Technical Paper. Office of the Chief Signal Officer. United States Government Printing Office.]
: <math>\operatorname{IC} = \sum_i \frac{f_i (f_i - 1)}{n(n - 1)}</math>
where ''f''<sub>''i''</sub> is the count of the ''i''th grapheme in the text and ''n'' is the total number of graphemes in the text.
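The difference between the repeat rate and its unbiased estimator can be illustrated on a short string. This sketch assumes each character is one grapheme and uses IC = Σ''f''<sub>''i''</sub>(''f''<sub>''i''</sub> − 1) / (''n''(''n'' − 1)); the function names are hypothetical.

```python
from collections import Counter


def repeat_rate(text):
    """Sum of squared proportions of each character."""
    counts = Counter(text)
    n = sum(counts.values())
    return sum((f / n) ** 2 for f in counts.values())


def incidence_of_coincidence(text):
    """Chance that two characters drawn without replacement are equal."""
    counts = Counter(text)
    n = sum(counts.values())
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))


sample = "aabb"
print(repeat_rate(sample), incidence_of_coincidence(sample))
```

For short texts the two values differ noticeably; they converge as the text grows.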
;''M''1
The ''M''1 statistic defined above has been proposed several times in a number of different settings under a variety of names. These include Gini's index of mutability,
[Gini CW (1912) Variability and mutability, contribution to the study of statistical distributions and relations. Studi Economico-Giuricici della R. Universita de Cagliari] Simpson's measure of diversity,
Bachi's index of linguistic homogeneity,
[Bachi R (1956) A statistical analysis of the revival of Hebrew in Israel. In: Bachi R (ed) Scripta Hierosolymitana, Vol III, Jerusalem: Magnus press pp 179–247] Mueller and Schuessler's index of qualitative variation,
[Mueller JH, Schuessler KF (1961) Statistical reasoning in sociology. Boston: Houghton Mifflin] Gibbs and Martin's index of industry diversification,
Lieberson's index, and Blau's index in sociology, psychology and management studies.
[Blau P (1977) Inequality and Heterogeneity. Free Press, New York] The formulations of all these indices are identical.
Simpson's ''D'' is defined as
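: <math>D = 1 - \sum_{i=1}^K \frac{n_i (n_i - 1)}{n(n - 1)}</math>
where ''n'' is the total sample size and ''n''<sub>''i''</sub> is the number of items in the ''i''th category.
For large ''n'' we have
: <math>D \approx 1 - \sum_{i=1}^K p_i^2</math>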
Another statistic that has been proposed is the coefficient of unalikeability which ranges between 0 and 1.
[Perry M, Kader G (2005) Variation as unalikeability. Teaching Stats 27 (2) 58–60]
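: <math>u = \frac{\sum_{i \ne j} c(x_i, x_j)}{n^2 - n}</math>
where ''n'' is the sample size and ''c''(''x'',''y'') = 1 if ''x'' and ''y'' are unalike and 0 otherwise.
For large ''n'' we have
: <math>u \approx 1 - \sum_{i=1}^K p_i^2</math>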
where ''K'' is the number of categories.
Another related statistic is the quadratic entropy
:
which is itself related to the Gini index.
;''M''2
Greenberg's monolingual non-weighted index of linguistic diversity is the ''M''2 statistic defined above.
;''M''7
Another index – the ''M''7 – was created based on the ''M''4 index of
[Lautard EH (1978) PhD thesis.]
:
where
:
and
:
where ''K'' is the number of categories, ''L'' is the number of subtypes, ''O''<sub>''ij''</sub> and ''E''<sub>''ij''</sub> are the number observed and expected respectively of subtype ''j'' in the ''i''th category, ''n''<sub>''i''</sub> is the number in the ''i''th category and ''p''<sub>''j''</sub> is the proportion of subtype ''j'' in the complete sample.
Note: This index was designed to measure women's participation in the work place: the two subtypes it was developed for were male and female.
Other single sample indices
These indices are summary statistics of the variation within the sample.
Berger–Parker index
The Berger–Parker index equals the maximum ''p''<sub>''i''</sub> value in the dataset, i.e. the proportional abundance of the most abundant type. This corresponds to the weighted generalized mean of the ''p''<sub>''i''</sub> values when ''q'' approaches infinity, and hence equals the inverse of true diversity of order infinity (1/<sup>∞</sup>''D'').
Brillouin index of diversity
This index is strictly applicable only to entire populations rather than to finite samples. It is defined as
: <math>I_B = \frac{\log(N!) - \sum_{i=1}^K \log(n_i!)}{N}</math>
where ''N'' is total number of individuals in the population, ''n''<sub>''i''</sub> is the number of individuals in the ''i''th category and ''N''! is the factorial of ''N''.
Brillouin's index of evenness is defined as
: <math>E_B = \frac{I_B}{I_{B(\max)}}</math>
where ''I''<sub>''B''(max)</sub> is the maximum value of ''I''<sub>''B''</sub>.
Hill's diversity numbers
Hill suggested a family of diversity numbers
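: <math>N_a = \left( \sum_{i=1}^K p_i^a \right)^{1/(1-a)}</math>
where ''p''<sub>''i''</sub> is the proportion in the ''i''th category.
For given values of ''a'', several of the other indices can be computed
*''a'' = 0: ''N''<sub>''a''</sub> = species richness
*''a'' → 1: ''N''<sub>''a''</sub> = exponential of Shannon's index
*''a'' = 2: ''N''<sub>''a''</sub> = 1/Simpson's index (without the small sample correction)
*''a'' → ∞: ''N''<sub>''a''</sub> = 1/Berger–Parker index
Hill also suggested a family of evenness measures
: <math>E_{a,b} = \frac{N_a}{N_b}</math>
where ''a'' > ''b''.
Hill's ''E''<sub>4</sub> is
: <math>E_4 = \frac{N_2}{N_1}</math>
Hill's ''E''<sub>5</sub> is
: <math>E_5 = \frac{N_2 - 1}{N_1 - 1}</math>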
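A minimal sketch of Hill's diversity numbers, assuming the standard definition ''N''<sub>''a''</sub> = (Σ''p''<sub>''i''</sub><sup>''a''</sup>)<sup>1/(1−''a'')</sup>. The orders ''a'' = 1 and ''a'' = ∞ are handled as limits (the exponential of Shannon's entropy and the reciprocal of the largest proportion respectively). The function name is illustrative.

```python
import math


def hill_number(props, a):
    """Hill's diversity number of order a for a list of proportions."""
    props = [p for p in props if p > 0]
    if a == 1:
        # Limit a -> 1: exponential of the Shannon entropy (natural log).
        return math.exp(-sum(p * math.log(p) for p in props))
    if math.isinf(a):
        # Limit a -> infinity: reciprocal of the largest proportion.
        return 1 / max(props)
    return sum(p ** a for p in props) ** (1 / (1 - a))


p = [0.5, 0.25, 0.25]
for order in (0, 1, 2, math.inf):
    print(order, hill_number(p, order))

# One member of the evenness family, with a = 2 and b = 1.
print(hill_number(p, 2) / hill_number(p, 1))
```

The numbers decrease as the order grows because higher orders give more weight to the most abundant categories.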
Margalef's index
: <math>I_\text{Marg} = \frac{S - 1}{\ln N}</math>
where ''S'' is the number of data types in the sample and ''N'' is the total size of the sample.
[Margalef R (1958) Temporal succession and spatial heterogeneity in phytoplankton. In: Perspectives in marine biology. Buzzati-Traverso (ed) Univ Calif Press, Berkeley pp 323–347]
Menhinick's index
: <math>I_\text{Men} = \frac{S}{\sqrt{N}}</math>
where ''S'' is the number of data types in the sample and ''N'' is the total size of the sample.
In linguistics this index is identical with the Kuraszkiewicz index (Guiraud index) where ''S'' is the number of distinct words (types) and ''N'' is the total number of words (tokens) in the text being examined.
[Kuraszkiewicz W (1951) Nakladen Wroclawskiego Towarzystwa Naukowego] [Guiraud P (1954) Les caractères statistiques du vocabulaire. Presses Universitaires de France, Paris] This index can be derived as a special case of the Generalised Torquist function.
[Panas E (2001) The Generalized Torquist: Specification and estimation of a new vocabulary-text size function. J Quant Ling 8(3) 233–252]
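Both richness indices depend only on ''S'' and ''N''. The sketch below assumes the standard forms (''S'' − 1)/ln ''N'' for Margalef's index and ''S''/√''N'' for Menhinick's index, and counts only categories that actually occur in the sample. The function names are illustrative.

```python
import math


def margalef(counts):
    """Margalef's index: (S - 1) / ln(N)."""
    s = sum(1 for c in counts if c > 0)
    return (s - 1) / math.log(sum(counts))


def menhinick(counts):
    """Menhinick's index: S / sqrt(N)."""
    s = sum(1 for c in counts if c > 0)
    return s / math.sqrt(sum(counts))


counts = [10, 5, 1]
print(margalef(counts), menhinick(counts))
```

For a text, `counts` would hold the number of tokens of each distinct word, which gives the type-token reading mentioned above.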
Q statistic
This statistic was invented by Kempton and Taylor and involves the quartiles of the sample. It is defined as
:
where ''R''<sub>1</sub> and ''R''<sub>2</sub> are the 25% and 75% quartiles respectively on the cumulative species curve, ''n''<sub>''j''</sub> is the number of species in the ''j''th category, and ''n''<sub>''Ri''</sub> is the number of species in the class where ''R''<sub>''i''</sub> falls (''i'' = 1 or 2).
Shannon–Wiener index
This is taken from information theory
:
where ''N'' is the total number in the sample and ''p''<sub>''i''</sub> is the proportion in the ''i''th category.
In ecology where this index is commonly used, ''H'' usually lies between 1.5 and 3.5 and only rarely exceeds 4.0.
An approximate formula for the standard deviation (SD) of ''H'' is
:
where ''p''<sub>''i''</sub> is the proportion made up by the ''i''th category and ''N'' is the total in the sample.
A more accurate approximate value of the variance of ''H'' (var(''H'')) is given by
[Hutcheson K (1970) A test for comparing diversities based on the Shannon formula. J Theo Biol 29: 151–154]
:
where ''N'' is the sample size and ''K'' is the number of categories.
A related index is the Pielou ''J'' defined as
: <math>J = \frac{H}{\log S}</math>
One difficulty with this index is that ''S'' is unknown for a finite sample. In practice ''S'' is usually set to the maximum present in any category in the sample.
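As a sketch, the entropy and Pielou's ''J'' can be computed from raw counts. It assumes the common form ''H'' = −Σ''p''<sub>''i''</sub> ln ''p''<sub>''i''</sub> with natural logarithms and, as a simplification, takes ''S'' to be the number of categories observed in the sample, which is only one possible choice given the difficulty noted above. The function names are illustrative.

```python
import math


def shannon(counts):
    """Shannon entropy H = -sum p_i ln p_i over non-empty categories."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def pielou(counts):
    """Pielou's J = H / ln S, with S taken as the observed category count.

    Undefined (division by zero) when only one category is observed.
    """
    s = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(s)


counts = [2, 1, 1]
print(shannon(counts), pielou(counts))
```

If the true number of categories is larger than the number observed, this choice of ''S'' overstates the evenness.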
Rényi entropy
The Rényi entropy is a generalization of the Shannon entropy to other values of ''q'' than unity. It can be expressed:
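: <math>{}^qH = \frac{1}{1 - q} \ln\left( \sum_{i=1}^K p_i^q \right)</math>
which equals
: <math>{}^qH = \ln\left( \frac{1}{\sqrt[q-1]{\sum_{i=1}^K p_i p_i^{q-1}}} \right) = \ln({}^qD)</math>
This means that taking the logarithm of true diversity based on any value of ''q'' gives the Rényi entropy corresponding to the same value of ''q''.
The value of <sup>''q''</sup>''D'' is also known as the Hill number.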
McIntosh's D and E
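: <math>D = \frac{N - \sqrt{\sum_{i=1}^K n_i^2}}{N - \sqrt{N}}</math>
where ''N'' is the total sample size and ''n''<sub>''i''</sub> is the number in the ''i''th category.
: <math>E = \frac{N - \sqrt{\sum_{i=1}^K n_i^2}}{N - \frac{N}{\sqrt{K}}}</math>
where ''K'' is the number of categories.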
Fisher's alpha
This was the first index to be derived for diversity.
: <math>K = \alpha \ln\left( 1 + \frac{N}{\alpha} \right)</math>
where ''K'' is the number of categories and ''N'' is the number of data points in the sample. Fisher's ''α'' has to be estimated numerically from the data.
The expected number of individuals in the ''r''th category where the categories have been placed in increasing size is
: <math>E(n_r) = \alpha \frac{X^r}{r}</math>
where ''X'' is an empirical parameter lying between 0 and 1. While ''X'' is best estimated numerically, an approximate value can be obtained by solving the following two equations
: <math>\frac{K}{N} = \frac{1 - X}{X} \left[ -\ln(1 - X) \right]</math>
: <math>\alpha = \frac{N(1 - X)}{X}</math>
where ''K'' is the number of categories and ''N'' is the total sample size.
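Because ''α'' has no closed form, it is found numerically. The following is a minimal sketch using bisection, assuming the log-series relation ''K'' = ''α'' ln(1 + ''N''/''α''); the bracketing interval, iteration count and function name are choices made for this illustration, not part of any published method.

```python
import math


def fisher_alpha(k, n, iterations=200):
    """Solve k = alpha * ln(1 + n / alpha) for alpha by bisection.

    Requires 0 < k < n. The left-hand side minus k is increasing in
    alpha, so a sign change brackets the unique root.
    """
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    lo, hi = 1e-9, 1e9  # assumed bracket, wide enough for typical samples
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid * math.log1p(n / mid) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


alpha = fisher_alpha(15, 100)
x = 100 / (100 + alpha)  # follows from alpha = N(1 - X) / X
print(alpha, x)
```

The last line recovers ''X'' from ''α'' by rearranging the second of the two equations above.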
The variance of ''α'' is approximately [Anscombe (1950) Sampling theory of the negative binomial and logarithmic series distributions. Biometrika 37: 358–382]
:
Strong's index
This index (''D''<sub>w</sub>) is the distance between the Lorenz curve of species distribution and the 45 degree line. It is closely related to the Gini coefficient.
In symbols it is
:
where max() is the maximum value taken over the ''N'' data points, ''K'' is the number of categories (or species) in the data set and ''c''<sub>''i''</sub> is the cumulative total up to and including the ''i''th category.
Simpson's E
This is related to Simpson's ''D'' and is defined as
:
where ''D'' is Simpson's ''D'' and ''K'' is the number of categories in the sample.
Smith & Wilson's indices
Smith and Wilson suggested a number of indices based on Simpson's ''D''.
:
:
where ''D'' is Simpson's ''D'' and ''K'' is the number of categories.
Heip's index
: <math>E = \frac{e^H - 1}{K - 1}</math>
where ''H'' is the Shannon entropy and ''K'' is the number of categories.
This index is closely related to Sheldon's index which is
: <math>E = \frac{e^H}{K}</math>
where ''H'' is the Shannon entropy and ''K'' is the number of categories.
Camargo's index
This index was created by Camargo in 1993. [Camargo JA (1993) Must dominance increase with the number of subordinate species in competitive interactions? J. Theor Biol 161 537–542]
: <math>E = 1 - \sum_{i=1}^K \sum_{j=i+1}^K \frac{| p_i - p_j |}{K}</math>
where ''K'' is the number of categories and ''p''<sub>''i''</sub> is the proportion in the ''i''th category.
Smith and Wilson's B
This index was proposed by Smith and Wilson in 1996.[Smith, Wilson (1996)]
:
where ''θ'' is the slope of the log(abundance)-rank curve.
Nee, Harvey, and Cotgreave's index
This is the slope of the log(abundance)-rank curve.
Bulla's E
There are two versions of this index - one for continuous distributions (''E''c) and the other for discrete (''E''d).
:
:
where
:
is the Schoener–Czekanowski index, ''K'' is the number of categories and ''N'' is the sample size.
Horn's information theory index
This index (''R''<sub>''ik''</sub>) is based on Shannon's entropy. It is defined as
:
where
:
:
:
:
:
:
:
In these equations ''x''<sub>''ij''</sub> and ''x''<sub>''kj''</sub> are the number of times the ''j''th data type appears in the ''i''th or ''k''th sample respectively.
Rarefaction index
In a rarefied sample a random subsample of ''n'' items is chosen from the total ''N'' items. Some groups may necessarily be absent from this subsample. Let ''X''<sub>''n''</sub> be the number of groups still present in the subsample of ''n'' items. ''X''<sub>''n''</sub> is less than ''K'', the number of categories, whenever at least one group is missing from this subsample.
The rarefaction curve ''f''(''n'') is defined as:
: <math>f(n) = E[X_n] = K - \binom{N}{n}^{-1} \sum_{i=1}^K \binom{N - N_i}{n}</math>
where ''N''<sub>''i''</sub> is the number of items in the ''i''th category.
Note that 0 ≤ ''f''(''n'') ≤ ''K''.
Furthermore,
: <math>f(0) = 0, \quad f(1) = 1, \quad f(N) = K</math>
Despite being defined at discrete values of ''n'', these curves are most frequently displayed as continuous functions.
This index is discussed further in Rarefaction (ecology).
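The rarefaction curve can be evaluated exactly with binomial coefficients. This sketch assumes the standard expectation ''f''(''n'') = ''K'' − Σ C(''N'' − ''N''<sub>''i''</sub>, ''n'') / C(''N'', ''n''), where ''N''<sub>''i''</sub> is the count in category ''i''; the function name is illustrative.

```python
from math import comb


def rarefaction(counts, n):
    """Expected number of categories present in a random subsample of size n."""
    total = sum(counts)
    k = len(counts)
    # comb(total - c, n) / comb(total, n) is the chance category c is missed.
    missed = sum(comb(total - c, n) for c in counts) / comb(total, n)
    return k - missed


counts = [3, 2, 1]
for n in range(sum(counts) + 1):
    print(n, rarefaction(counts, n))
```

The printed values start at 0, pass through 1 at ''n'' = 1 and reach ''K'' at ''n'' = ''N'', matching the boundary conditions stated above.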
Caswell's V
This is a ''z'' type statistic based on Shannon's entropy. [Caswell H (1976) Community structure: a neutral model analysis. Ecol Monogr 46: 327–354]
: <math>V = \frac{H - E(H)}{\operatorname{SD}(H)}</math>
where ''H'' is the Shannon entropy, ''E''(''H'') is the expected Shannon entropy for a neutral model of distribution and ''SD''(''H'') is the standard deviation of the entropy. The standard deviation is estimated from the formula derived by Pielou
:
where ''p''<sub>''i''</sub> is the proportion made up by the ''i''th category and ''N'' is the total in the sample.
Lloyd & Ghelardi's index
This is
:
where ''K'' is the number of categories and ''K''′ is the number of categories according to MacArthur's broken stick model yielding the observed diversity.
Average taxonomic distinctness index
This index is used to compare the relationship between hosts and their parasites. It incorporates information about the phylogenetic relationship amongst the host species.
: <math>T = 2 \frac{\sum \sum_{i < j} \omega_{ij}}{s(s - 1)}</math>
where ''s'' is the number of host species used by a parasite and ''ω''<sub>''ij''</sub> is the taxonomic distinctness between host species ''i'' and ''j''.
Index of qualitative variation
Several indices with this name have been proposed.
One of these is
:
where ''K'' is the number of categories and ''p''<sub>''i''</sub> is the proportion of the sample that lies in the ''i''th category.
Theil’s H
This index is also known as the multigroup entropy index or the information theory index. It was proposed by Theil in 1972. [Theil H (1972) Statistical decomposition analysis. Amsterdam: North-Holland Publishing Company] The index is a weighted average of the samples' entropies.
Let
:
and
where ''p''<sub>''i''</sub> is the proportion of type ''i'' in the ''a''th sample, ''r'' is the total number of samples, ''n''<sub>''i''</sub> is the size of the ''i''th sample, ''N'' is the size of the population from which the samples were obtained and ''E'' is the entropy of the population.
Indices for comparison of two or more data types within a single sample
Several of these indexes have been developed to document the degree to which different data types of interest may coexist within a geographic area.
Index of dissimilarity
Let ''A'' and ''B'' be two types of data item. Then the index of dissimilarity is
: <math>D = \frac{1}{2} \sum_{i=1}^K \left| \frac{A_i}{A} - \frac{B_i}{B} \right|</math>
where
: <math>A = \sum_{i=1}^K A_i</math>
: <math>B = \sum_{i=1}^K B_i</math>
''A''<sub>''i''</sub> is the number of data type ''A'' at sample site ''i'', ''B''<sub>''i''</sub> is the number of data type ''B'' at sample site ''i'', ''K'' is the number of sites sampled and | | is the absolute value.
This index is usually denoted ''D''. [Duncan OD, Duncan B (1955) A methodological analysis of segregation indexes. Am Sociol Review, 20: 210–217] It is closely related to the Gini index.
This index is biased as its expectation under a uniform distribution is > 0.
A modification of this index has been proposed by Gorard and Taylor.[Gorard S, Taylor C (2002b) What is segregation? A comparison of measures in terms of 'strong' and 'weak' compositional invariance. Sociology, 36(4), 875–895] Their index (GT) is
:
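The index of dissimilarity is straightforward to compute from per-site counts. This sketch assumes the Duncan and Duncan form ''D'' = ½ Σ |''A''<sub>''i''</sub>/''A'' − ''B''<sub>''i''</sub>/''B''|; the function name and the example counts are made up for illustration.

```python
def dissimilarity(a_counts, b_counts):
    """Half the summed absolute difference between the two site distributions."""
    a_total, b_total = sum(a_counts), sum(b_counts)
    return 0.5 * sum(
        abs(a / a_total - b / b_total) for a, b in zip(a_counts, b_counts)
    )


print(dissimilarity([10, 0], [0, 10]))  # the two types never share a site
print(dissimilarity([5, 5], [20, 20]))  # identical distributions across sites
print(dissimilarity([8, 2], [5, 5]))    # partial separation
```

The value can be read as the share of either type that would have to change site for the two distributions to match.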
Index of segregation
The index of segregation (''IS'') is
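: <math>IS = \frac{1}{2} \sum_{i=1}^K \left| \frac{A_i}{A} - \frac{t_i - A_i}{T - A} \right|</math>
where
: <math>A = \sum_{i=1}^K A_i</math>
: <math>T = \sum_{i=1}^K t_i</math>
and ''K'' is the number of units, and ''A''<sub>''i''</sub> and ''t''<sub>''i''</sub> are the number of data type ''A'' in unit ''i'' and the total number of all data types in unit ''i'' respectively.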
Hutchens' square root index
This index (''H'') is defined as [Hutchens RM (2004) One measure of segregation. International Economic Review 45: 555–578]
:
where ''p''<sub>''i''</sub> is the proportion of the sample composed of the ''i''th variate.
Lieberson's isolation index
This index (''L''<sub>''xy''</sub>) was invented by Lieberson in 1981.
:
where ''X''<sub>''i''</sub> and ''Y''<sub>''i''</sub> are the variables of interest at the ''i''th site, ''K'' is the number of sites examined and ''X''<sub>tot</sub> is the total number of variates of type ''X'' in the study.
Bell's index
This index is defined as
: <math>I_R = \frac{p_{xx} - p_x}{1 - p_x}</math>
where ''p''<sub>''x''</sub> is the proportion of the sample made up of variates of type ''X'' and
: <math>p_{xx} = \frac{\sum_{i=1}^K x_i p_i}{N_x}</math>
where ''N''<sub>''x''</sub> is the total number of variates of type ''X'' in the study, ''K'' is the number of samples in the study and ''x''<sub>''i''</sub> and ''p''<sub>''i''</sub> are the number of variates and the proportion of variates of type ''X'' respectively in the ''i''th sample.
Index of isolation
The index of isolation is
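: <math>II = \sum_{i=1}^K \frac{A_i}{A} \frac{A_i}{t_i}</math>
where ''K'' is the number of units in the study, ''A''<sub>''i''</sub> and ''t''<sub>''i''</sub> are the number of units of type ''A'' and the number of all units in the ''i''th sample respectively, and ''A'' is the sum of the ''A''<sub>''i''</sub>.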
A modified index of isolation has also been proposed
:
The ''MII'' lies between 0 and 1.
Gorard's index of segregation
This index (GS) is defined as
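: <math>GS = \frac{1}{2} \sum_{i=1}^K \left| \frac{A_i}{A} - \frac{t_i}{T} \right|</math>
where
: <math>A = \sum_{i=1}^K A_i</math>
: <math>T = \sum_{i=1}^K t_i</math>
and ''A''<sub>''i''</sub> and ''t''<sub>''i''</sub> are the number of data items of type ''A'' and the total number of items in the ''i''th sample.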
Index of exposure
This index is defined as
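: <math>IE = \sum_{i=1}^K \frac{A_i}{A} \frac{B_i}{t_i}</math>
where
: <math>A = \sum_{i=1}^K A_i</math>
and ''A''<sub>''i''</sub> and ''B''<sub>''i''</sub> are the number of types ''A'' and ''B'' in the ''i''th category and ''t''<sub>''i''</sub> is the total number of data points in the ''i''th category.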
Ochiai index
This is a binary form of the cosine index. [Ochiai A (1957) Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. Bull Jpn Soc Sci Fish 22: 526–530] It is used to compare presence/absence data of two data types (here ''A'' and ''B''). It is defined as
: <math>O = \frac{a}{\sqrt{(a + b)(a + c)}}</math>
where ''a'' is the number of sample units where both ''A'' and ''B'' are found, ''b'' is number of sample units where ''A'' but not ''B'' occurs and ''c'' is the number of sample units where type ''B'' is present but not type ''A''.
Kulczyński's coefficient
This coefficient was invented by Stanisław Kulczyński in 1927 [Kulczynski S (1927) Die Pflanzenassoziationen der Pieninen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres, Classe des Sciences] and is an index of association between two types (here ''A'' and ''B''). It varies in value between 0 and 1. It is defined as
: <math>K = \frac{a}{2} \left( \frac{1}{a + b} + \frac{1}{a + c} \right)</math>
where ''a'' is the number of sample units where type ''A'' and type ''B'' are present, ''b'' is the number of sample units where type ''A'' but not type ''B'' is present and ''c'' is the number of sample units where type ''B'' is present but not type ''A''.
Yule's Q
This index was invented by Yule in 1900. [Yule GU (1900) On the association of attributes in statistics. Philos Trans Roy Soc] It concerns the association of two different types (here ''A'' and ''B''). It is defined as
: <math>Q = \frac{ad - bc}{ad + bc}</math>
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''Q'' varies in value between -1 and +1. In the ordinal case ''Q'' is known as the Goodman-Kruskal ''γ''.
Because the denominator may be zero, Lienert and Sporer have recommended adding +1 to ''a'', ''b'', ''c'' and ''d''. [Lienert GA and Sporer SL (1982) Interkorrelationen seltner Symptome mittels Nullfeldkorrigierter YuleKoeffizienten. Psychologische Beitrage 24: 411–418]
Yule's Y
This index is defined as
: <math>Y = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}</math>
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present.
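Yule's two coefficients can be sketched from the four cell counts of the presence/absence table. The formulas assumed here are the standard ''Q'' = (''ad'' − ''bc'')/(''ad'' + ''bc'') and ''Y'' = (√''ad'' − √''bc'')/(√''ad'' + √''bc''); no zero-cell correction is applied, so both raise a `ZeroDivisionError` when ''ad'' and ''bc'' are both zero. The function names are illustrative.

```python
import math


def yule_q(a, b, c, d):
    """Yule's Q from the four cells of a 2x2 presence/absence table."""
    return (a * d - b * c) / (a * d + b * c)


def yule_y(a, b, c, d):
    """Yule's Y, the same contrast on square-rooted cross products."""
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)


print(yule_q(10, 5, 5, 10), yule_y(10, 5, 5, 10))
```

Both coefficients share the same sign, and ''Y'' is always closer to zero than ''Q'' for the same table.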
Baroni–Urbani–Buser coefficient
This index was invented by Baroni-Urbani and Buser in 1976. It varies between 0 and 1 in value. It is defined as
: <math>BUB = \frac{\sqrt{ad} + a}{\sqrt{ad} + a + b + c}</math>
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size.
When ''d'' = 0, this index is identical to the Jaccard index.
Hamman coefficient
This coefficient is defined as
: <math>H = \frac{(a + d) - (b + c)}{N}</math>
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size.
Rogers–Tanimoto coefficient
This coefficient is defined as
: <math>RT = \frac{a + d}{a + 2(b + c) + d}</math>
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size
Sokal–Sneath coefficient
This coefficient is defined as
: <math>SS = \frac{2(a + d)}{2(a + d) + b + c}</math>
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size.
Sokal's binary distance
This coefficient is defined as
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size.
Russel–Rao coefficient
This coefficient is defined as
: $RR = \frac{a}{N}$
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size.
Phi coefficient
This coefficient is defined as
: $\varphi = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present.
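As an illustration, several of the coefficients above can be computed directly from the four counts. The sketch below assumes the commonly cited forms: Hamman as ((''a'' + ''d'') − (''b'' + ''c''))/''N'', Rogers–Tanimoto as (''a'' + ''d'')/(''a'' + 2(''b'' + ''c'') + ''d''), Russel–Rao as ''a''/''N'' and phi as (''ad'' − ''bc'')/√((''a'' + ''b'')(''a'' + ''c'')(''b'' + ''d'')(''c'' + ''d'')). The function name `binary_coefficients` and the example counts are hypothetical.

```python
from math import sqrt


def binary_coefficients(a, b, c, d):
    """Return a few 2x2 association coefficients.

    a: samples with both A and B, b: A only, c: B only, d: neither.
    The formulas are the commonly cited forms (an assumption of this sketch).
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("at least one sample is required")
    result = {
        "hamman": ((a + d) - (b + c)) / n,
        "rogers_tanimoto": (a + d) / (a + 2 * (b + c) + d),
        "russel_rao": a / n,
    }
    # Phi is undefined when any marginal total is zero.
    denom = sqrt((a + b) * (a + c) * (b + d) * (c + d))
    result["phi"] = (a * d - b * c) / denom if denom > 0 else float("nan")
    return result


coeffs = binary_coefficients(a=10, b=5, c=5, d=20)
for name, value in coeffs.items():
    print(f"{name}: {value:.4f}")
```

Returning a dictionary keeps the coefficients side by side, which makes it easy to see how differently they weight joint absences (''d'').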
Soergel's coefficient
This coefficient is defined as
:
where ''b'' is the number of samples where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size.
Simpson's coefficient
This coefficient is defined as
:
where ''b'' is the number of samples where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A''.
Dennis' coefficient
This coefficient is defined as
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size.
Forbes' coefficient
This coefficient was proposed by Stephen Alfred Forbes
in 1907.[Forbes SA (1907) On the local distribution of certain Illinois fishes: an essay in statistical ecology. Bulletin of the Illinois State Laboratory of Natural History 7:272–303] It is defined as
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size (''N = a + b + c + d'').
A modification of this coefficient which does not require the knowledge of ''d'' has been proposed by Alroy[Alroy J (2015) A new twist on a very old binary similarity coefficient. Ecology 96 (2) 575-586]
:
where ''n'' = ''a'' + ''b'' + ''c''.
Simple match coefficient
This coefficient is defined as
: $SM = \frac{a + d}{N}$
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size.
Fossum's coefficient
This coefficient is defined as
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size.
Stile's coefficient
This coefficient is defined as
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'', ''d'' is the sample count where neither type ''A'' nor type ''B'' are present, ''n'' equals ''a'' + ''b'' + ''c'' + ''d'' and | · | is the modulus (absolute value) of the difference.
Michael's coefficient
This coefficient is defined as
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present.
Peirce's coefficient
In 1884 Charles Peirce
suggested the following coefficient
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present.
Hawkin–Dotson coefficient
In 1975 Hawkin and Dotson proposed the following coefficient
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size.
Benini coefficient
In 1901 Benini proposed the following coefficient
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'' and ''c'' is the number of samples where type ''B'' is present but not type ''A''. Min(''b'', ''c'') is the minimum of ''b'' and ''c''.
Gilbert coefficient
Gilbert proposed the following coefficient
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the sample count where neither type ''A'' nor type ''B'' are present. ''N'' is the sample size.
Gini index
The Gini index is
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'' and ''c'' is the number of samples where type ''B'' is present but not type ''A''.
Modified Gini index
The modified Gini index is
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'' and ''c'' is the number of samples where type ''B'' is present but not type ''A''.
Kuhn's index
Kuhn proposed the following coefficient in 1965
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'' and ''c'' is the number of samples where type ''B'' is present but not type ''A''. ''K'' is a normalizing parameter. ''N'' is the sample size.
This index is also known as the coefficient of arithmetic means.
Eyraud index
Eyraud proposed the following coefficient in 1936
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the number of samples where both ''A'' and ''B'' are not present.
Soergel distance
This is defined as
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the number of samples where both ''A'' and ''B'' are not present. ''N'' is the sample size.
Tanimoto index
This is defined as
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A'' and ''d'' is the number of samples where both ''A'' and ''B'' are not present. ''N'' is the sample size.
Piatetsky–Shapiro's index
This is defined as
:
where ''a'' is the number of samples where types ''A'' and ''B'' are both present, ''b'' is where type ''A'' is present but not type ''B'', ''c'' is the number of samples where type ''B'' is present but not type ''A''.
Indices for comparison between two or more samples
Czekanowski's quantitative index
This is also known as the Bray–Curtis index, Schoener's index, least common percentage index, index of affinity or proportional similarity. It is related to the Sørensen similarity index.
:
where ''x''''i'' and ''x''''j'' are the number of species in sites ''i'' and ''j'' respectively and the minimum is taken over the number of species in common between the two sites.
Canberra metric
The Canberra distance is a weighted version of the ''L''₁ (Manhattan) metric. It was introduced in 1966 and refined in 1967 by G. N. Lance and W. T. Williams. It is used to define a distance between two vectors – here two sites with ''K'' categories within each site.
The Canberra distance ''d'' between vectors p and q in a ''K''-dimensional real vector space is
: $d(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{K} \frac{|p_i - q_i|}{|p_i| + |q_i|}$
where ''p''''i'' and ''q''''i'' are the values of the ''i''th category of the two vectors.
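A minimal sketch of the Canberra distance follows, assuming the usual form in which each term is |''p''''i'' − ''q''''i''| divided by |''p''''i''| + |''q''''i''|. Terms where both values are zero are skipped here, which is a convention chosen for this example rather than something stated above; the function name is hypothetical.

```python
def canberra_distance(p, q):
    """Canberra distance between two equal-length numeric vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        denom = abs(pi) + abs(qi)
        # Assumed convention: a 0/0 term contributes nothing.
        if denom > 0:
            total += abs(pi - qi) / denom
    return total


site_1 = [1, 2, 0]
site_2 = [3, 2, 0]
distance = canberra_distance(site_1, site_2)
print(distance)
```

Because every term is scaled by the size of the values compared, differences in rare categories count as much as differences in abundant ones.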
Sorensen's coefficient of community
This is used to measure similarities between communities.
: $CC = \frac{2c}{s_1 + s_2}$
where ''s''1 and ''s''2 are the number of species in community 1 and 2 respectively and ''c'' is the number of species common to both areas.
Jaccard's index
This is a measure of the similarity between two samples:
: $J = \frac{A}{A + B + C}$
where ''A'' is the number of data points shared between the two samples and ''B'' and ''C'' are the data points found only in the first and second samples respectively.
This index was invented in 1902 by the Swiss botanist Paul Jaccard
.[Jaccard P (1902) Lois de distribution florale. Bulletin de la Société Vaudoise des Sciences Naturelles 38:67-130]
Under a random distribution the expected value of ''J'' is[Archer AW and Maples CG (1989) Response of selected binomial coefficients to varying degrees of matrix sparseness and to matrices with known data interrelationships. Mathematical Geology 21: 741–753]
:
The standard error of this index with the assumption of a random distribution is
where ''N'' is the total size of the sample.
Dice's index
This is a measure of the similarity between two samples:
: $D = \frac{2A}{2A + B + C}$
where ''A'' is the number of data points shared between the two samples and ''B'' and ''C'' are the data points found only in the first and second samples respectively.
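The Jaccard and Dice indices can be illustrated with two small samples represented as sets. This sketch assumes the standard forms ''J'' = ''A''/(''A'' + ''B'' + ''C'') and ''D'' = 2''A''/(2''A'' + ''B'' + ''C''); the function name and the example data are hypothetical.

```python
def jaccard_and_dice(first, second):
    """Return (Jaccard, Dice) similarity for two collections of items."""
    first, second = set(first), set(second)
    shared = len(first & second)        # A: items in both samples
    only_first = len(first - second)    # B: items only in the first sample
    only_second = len(second - first)   # C: items only in the second sample
    if shared + only_first + only_second == 0:
        raise ValueError("both samples are empty")
    jaccard = shared / (shared + only_first + only_second)
    dice = 2 * shared / (2 * shared + only_first + only_second)
    return jaccard, dice


j, d = jaccard_and_dice({"a", "b", "c", "d"}, {"c", "d", "e"})
print(j, d)
```

The two indices always rank pairs of samples in the same order, since ''D'' = 2''J''/(1 + ''J'').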
Match coefficient
This is a measure of the similarity between two samples:
: $M = \frac{N - B - C}{N} = 1 - \frac{B + C}{N}$
where ''N'' is the number of data points in the two samples and ''B'' and ''C'' are the data points found only in the first and second samples respectively.
Morisita's index
Morisita’s index of dispersion ( ''I''''m'' ) is the scaled probability that two points chosen at random from the whole population are in the same sample.[Morisita M (1959) Measuring the dispersion and the analysis of distribution patterns. Memoirs of the Faculty of Science, Kyushu University Series E. Biol 2:215–235] Higher values indicate a more clumped distribution.
:
An alternative formulation is
: $I_m = n \frac{\sum x^2 - \sum x}{\left(\sum x\right)^2 - \sum x}$
where ''n'' is the total sample size, ''m'' is the sample mean and ''x'' are the individual values with the sum taken over the whole sample. It is also equal to
:
where ''IMC'' is Lloyd's index of crowding.[Lloyd M (1967) Mean crowding. J Anim Ecol 36: 1–30]
This index is relatively independent of the population density but is affected by the sample size.
Morisita showed that the statistic
: $I_m \left(\sum x - 1\right) + n - \sum x$
is distributed as a chi-squared variable with ''n'' − 1 degrees of freedom.
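As a rough illustration, the index and the associated chi-squared statistic can be computed from a list of counts per sample unit. The sketch assumes the widely used form ''I'' = ''n''(Σ''x''² − Σ''x'')/((Σ''x'')² − Σ''x'') and the statistic ''I''(Σ''x'' − 1) + ''n'' − Σ''x''; the function name and data are hypothetical.

```python
def morisita_index(counts):
    """Return (index, chi_squared) for counts per sample unit."""
    n = len(counts)                      # number of sample units
    total = sum(counts)                  # sum of x
    sum_sq = sum(x * x for x in counts)  # sum of x squared
    if total < 2:
        raise ValueError("at least two individuals are required")
    index = n * (sum_sq - total) / (total * total - total)
    # Compared against chi-squared with n - 1 degrees of freedom.
    chi_squared = index * (total - 1) + n - total
    return index, chi_squared


clumped = morisita_index([0, 0, 10, 0, 0])
even = morisita_index([2, 2, 2, 2, 2])
print(clumped, even)
```

With all individuals in one unit the index reaches its maximum, the number of units; with perfectly even counts it falls below 1 and the statistic is zero.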
An alternative significance test for this index has been developed for large samples.[Pedigo LP & Buntin GD (1994) Handbook of sampling methods for arthropods in agriculture. CRC Boca Raton FL]
:
where ''m'' is the overall sample mean, ''n'' is the number of sample units and ''z'' is the normal distribution abscissa. Significance is tested by comparing the value of ''z'' against the values of the normal distribution.
Morisita's overlap index
Morisita's overlap index is used to compare overlap among samples.[Morisita M (1959) Measuring of the dispersion and analysis of distribution patterns. Memoirs of the Faculty of Science, Kyushu University, Series E Biology. 2: 215–235] The index is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats.
: $C_D = \frac{2 \sum_{i=1}^{S} x_i y_i}{(D_x + D_y) X Y}$
: ''x''''i'' is the number of times species ''i'' is represented in the total ''X'' from one sample.
: ''y''''i'' is the number of times species ''i'' is represented in the total ''Y'' from another sample.
: ''D''''x'' and ''D''''y'' are the Simpson's index values for the ''x'' and ''y'' samples respectively.
: ''S'' is the number of unique species
''C''''D'' = 0 if the two samples do not overlap in terms of species, and ''C''''D'' = 1 if the species occur in the same proportions in both samples.
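Horn introduced a modification of the index
: $C_H = \frac{2 \sum_{i=1}^{S} x_i y_i}{\left(\frac{\sum_{i=1}^{S} x_i^2}{X^2} + \frac{\sum_{i=1}^{S} y_i^2}{Y^2}\right) X Y}$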
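The overlap index and Horn's modification can be sketched as follows. The code assumes the form 2Σ''x''''i''''y''''i''/((''D''''x'' + ''D''''y'')''XY''), with Simpson's index taken as Σ''x''(''x'' − 1)/(''X''(''X'' − 1)) for the original index and as Σ''x''²/''X''² for Horn's version. These choices and the function names are assumptions of this example.

```python
def _overlap(x, y, dx, dy):
    """Shared final step: 2 * sum(x_i * y_i) / ((Dx + Dy) * X * Y)."""
    cross = sum(a * b for a, b in zip(x, y))
    return 2 * cross / ((dx + dy) * sum(x) * sum(y))


def morisita_overlap(x, y):
    """Morisita's overlap index (needs at least two individuals per sample)."""
    big_x, big_y = sum(x), sum(y)
    dx = sum(v * (v - 1) for v in x) / (big_x * (big_x - 1))
    dy = sum(v * (v - 1) for v in y) / (big_y * (big_y - 1))
    return _overlap(x, y, dx, dy)


def morisita_horn(x, y):
    """Horn's modification, using squared proportions for Dx and Dy."""
    dx = sum(v * v for v in x) / sum(x) ** 2
    dy = sum(v * v for v in y) / sum(y) ** 2
    return _overlap(x, y, dx, dy)


same_proportions = morisita_horn([1, 2, 3], [10, 20, 30])
no_overlap = morisita_overlap([5, 0], [0, 5])
print(same_proportions, no_overlap)
```

Horn's version depends only on proportions, so it can also be applied to relative abundances rather than raw counts.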
Standardised Morisita’s index
Smith-Gill developed a statistic based on Morisita’s index which is independent of both sample size and population density and bounded by −1 and +1. This statistic is calculated as follows
First determine Morisita's index ( ''I''''d'' ) in the usual fashion. Then let ''k'' be the number of units the population was sampled from. Calculate the two critical values
:
:
where χ² is the chi square value for ''n'' − 1 degrees of freedom at the 97.5% and 2.5% levels of confidence.
The standardised index ( ''I''''p'' ) is then calculated from one of the formulae below
When ''I''''d'' ≥ ''M''''c'' > 1
:
When ''M''''c'' > ''I''''d'' ≥ 1
:
When 1 > ''I''''d'' ≥ ''M''''u''
:
When 1 > ''M''''u'' > ''I''''d''
:
''I''''p'' ranges between +1 and −1 with 95% confidence intervals of ±0.5. ''I''''p'' has the value of 0 if the pattern is random; if the pattern is uniform, ''I''''p'' < 0 and if the pattern shows aggregation, ''I''''p'' > 0.
Peet's evenness indices
These indices are a measure of evenness between samples.[Peet (1974) The measurements of species diversity. Annu Rev Ecol Syst 5: 285–307]
:
:
where ''I'' is an index of diversity, ''I''max and ''I''min are the maximum and minimum values of ''I'' between the samples being compared.
Loevinger's coefficient
Loevinger has suggested a coefficient ''H'' defined as follows:
:
where ''p''max and ''p''min are the maximum and minimum proportions in the sample.
Tversky index
The Tversky index is an asymmetric measure that lies between 0 and 1.
For samples ''A'' and ''B'' the Tversky index (''S'') is
: $S(A, B) = \frac{|A \cap B|}{|A \cap B| + \alpha |A - B| + \beta |B - A|}$
The values of ''α'' and ''β'' are arbitrary. Setting both ''α'' and ''β'' to 0.5 gives Dice's coefficient. Setting both to 1 gives Tanimoto's coefficient.
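The special cases mentioned above can be checked numerically. This sketch assumes the standard set form of the index, |''A'' ∩ ''B''| / (|''A'' ∩ ''B''| + ''α''|''A'' − ''B''| + ''β''|''B'' − ''A''|); the function name and example sets are hypothetical.

```python
def tversky_index(a, b, alpha, beta):
    """Tversky index for two collections, treated as sets."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    if denom == 0:
        raise ValueError("index is undefined for these inputs")
    return common / denom


first = {1, 2, 3, 4}
second = {3, 4, 5}
dice_like = tversky_index(first, second, 0.5, 0.5)      # equals Dice
tanimoto_like = tversky_index(first, second, 1.0, 1.0)  # equals Jaccard/Tanimoto
asymmetric = (tversky_index(first, second, 1.0, 0.0),
              tversky_index(second, first, 1.0, 0.0))
print(dice_like, tanimoto_like, asymmetric)
```

The last pair shows the asymmetry: with unequal ''α'' and ''β'', swapping the two samples changes the value.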
A symmetrical variant of this index has also been proposed.[Jimenez S, Becerra C, Gelbukh (2013) SOFTCARDINALITY-CORE: Improving text overlap with distributional measures for semantic textual similarity. Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the main conference and the shared task: semantic textual similarity, pp 194–201. June 7–8, 2013, Atlanta, Georgia, USA]
:
where
:
:
Several similar indices have been proposed.
Monostori ''et al.'' proposed the SymmetricSimilarity index[Monostori K, Finkel R, Zaslavsky A, Hodasz G and Patke M (2002) Comparison of overlap detection techniques. In: Proceedings of the 2002 International Conference on Computational Science. Lecture Notes in Computer Science 2329: 51-60]
:
where ''d''(''X'') is some measure derived from ''X''.
Bernstein and Zobel have proposed the S2 and S3 indexes[Bernstein Y and Zobel J (2004) A scalable system for identifying co-derivative documents. In: Proceedings of 11th International Conference on String Processing and Information Retrieval (SPIRE) 3246: 55-67]
:
:
S3 is simply twice the SymmetricSimilarity index. Both are related to Dice's coefficient.
Metrics used
A number of metrics (distances between samples) have been proposed.
Euclidean distance
While this is usually used in quantitative work it may also be used in qualitative work. This is defined as
: $d_{jk} = \sqrt{\sum_{i} (x_{ij} - x_{ik})^2}$
where ''d''''jk'' is the distance between ''x''''ij'' and ''x''''ik''.
Gower's distance
This is defined as
:
where ''d''i is the distance between the ''i''th samples and ''w''i is the weighting given to the ''i''th distance.
Manhattan distance
While this is more commonly used in quantitative work it may also be used in qualitative work. This is defined as
: $d_{jk} = \sum_{i} |x_{ij} - x_{ik}|$
where ''d''''jk'' is the distance between ''x''''ij'' and ''x''''ik'' and | · | is the absolute value of the difference between ''x''''ij'' and ''x''''ik''.
A modified version of the Manhattan distance can be used to find a zero (root) of a polynomial of any degree using Lill's method.
Prevosti’s distance
This is related to the Manhattan distance. It was described by Prevosti ''et al.'' and was used to compare differences between chromosomes. Let ''P'' and ''Q'' be two collections of ''r'' finite probability distributions. Let these distributions have values that are divided into ''k'' categories. Then the distance ''D''PQ is
:
where ''r'' is the number of discrete probability distributions in each population, ''k''''j'' is the number of categories in distributions ''P''''j'' and ''Q''''j'' and ''p''''ji'' (respectively ''q''''ji'') is the theoretical probability of category ''i'' in distribution ''P''''j'' (''Q''''j'') in population ''P''(''Q'').
Its statistical properties were examined by Sanchez ''et al.'' who recommended a bootstrap procedure to estimate confidence intervals when testing for differences between samples.
Other metrics
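Let
: $A = \sum_{i} x_{ij}$
: $B = \sum_{i} x_{ik}$
: $J = \sum_{i} \min(x_{ij}, x_{ik})$
where min(''x'',''y'') is the lesser value of the pair ''x'' and ''y''.
Then
: $d_{jk} = A + B - 2J$
is the Manhattan distance,
: $d_{jk} = \frac{A + B - 2J}{A + B}$
is the Bray−Curtis distance,
: $d_{jk} = \frac{A + B - 2J}{A + B - J}$
is the Jaccard (or Ruzicka) distance and
: $d_{jk} = 1 - \frac{1}{2}\left(\frac{J}{A} + \frac{J}{B}\right)$
is the Kulczynski distance.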
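The four distances can be computed together from the three sums. The sketch below takes ''A'' and ''B'' as the totals of the two samples and ''J'' as the sum of the pairwise minima, and assumes non-negative abundances; the function name and data are hypothetical.

```python
def abundance_distances(xj, xk):
    """Distances between two non-negative abundance vectors via A, B and J."""
    if len(xj) != len(xk):
        raise ValueError("vectors must have the same length")
    a = sum(xj)                                   # A: total of sample j
    b = sum(xk)                                   # B: total of sample k
    j = sum(min(u, v) for u, v in zip(xj, xk))    # J: sum of pairwise minima
    if a == 0 or b == 0:
        raise ValueError("each sample needs a positive total")
    return {
        "manhattan": a + b - 2 * j,
        "bray_curtis": (a + b - 2 * j) / (a + b),
        "ruzicka": (a + b - 2 * j) / (a + b - j),
        "kulczynski": 1 - 0.5 * (j / a + j / b),
    }


dist = abundance_distances([3, 0, 2], [1, 4, 2])
print(dist)
```

For non-negative data, ''A'' + ''B'' − 2''J'' equals the sum of absolute differences because ''x'' + ''y'' − 2 min(''x'', ''y'') = |''x'' − ''y''|, which is why a single pass over the data suffices for all four.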
Similarities between texts
HaCohen-Kerner et al. have proposed a variety of metrics for comparing two or more texts.[HaCohen-Kerner Y, Tayeb A and Ben-Dror N (2010) Detection of simple plagiarism in computer science papers. In: Proceedings of the 23rd International Conference on Computational Linguistics pp 421-429]
Ordinal data
If the categories are at least ordinal then a number of other indices may be computed.
Leik's D
Leik's measure of dispersion (''D'') is one such index.[Leik R (1966) A measure of ordinal consensus. Pacific sociological review 9 (2): 85–90] Let there be ''K'' categories and let ''p''''i'' be ''f''''i''/''N'' where ''f''''i'' is the number in the ''i''th category and let the categories be arranged in ascending order. Let
: $c_a = \sum_{i=1}^{a} p_i$
where ''a'' ≤ ''K''. Let ''d''a = ''c''a if ''c''a ≤ 0.5 and ''d''a = 1 − ''c''a otherwise. Then
: $D = \frac{2 \sum_{a=1}^{K} d_a}{K - 1}$
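A short sketch of Leik's ''D'' follows, assuming the form ''D'' = 2Σ''d''a/(''K'' − 1) with ''c''a the cumulative proportion up to category ''a'' and ''d''a equal to ''c''a when ''c''a ≤ 0.5 and 1 − ''c''a otherwise. The function name and the example frequencies are hypothetical.

```python
def leik_d(freqs):
    """Leik's ordinal dispersion D from category frequencies in ascending order."""
    k = len(freqs)
    n = sum(freqs)
    if k < 2 or n <= 0:
        raise ValueError("need at least two categories and one observation")
    cumulative = 0.0
    total = 0.0
    for f in freqs:
        cumulative += f / n  # c_a: cumulative proportion up to this category
        total += cumulative if cumulative <= 0.5 else 1.0 - cumulative
    return 2.0 * total / (k - 1)


consensus = leik_d([10, 0, 0, 0, 0])   # everyone in one category
polarised = leik_d([5, 0, 0, 0, 5])    # split between the two extremes
uniform = leik_d([2, 2, 2, 2, 2])
print(consensus, polarised, uniform)
```

Under this form the index is 0 for complete consensus and 1 when responses are split evenly between the two extreme categories.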
Normalised Herfindahl measure
This is the square of the coefficient of variation divided by ''N'' − 1 where ''N'' is the sample size.
: $H = \frac{1}{N - 1} \frac{s^2}{m^2}$
where ''m'' is the mean and ''s'' is the standard deviation.
Potential-for-conflict Index
The potential-for-conflict Index (PCI) describes the ratio of scoring on either side of a rating scale’s centre point.[Manfredo M, Vaske, JJ, Teel TL (2003) The potential for conflict index: A graphic approach to practical significance of human dimensions research. Human Dimensions of Wildlife 8: 219–228] This index requires at least ordinal data. This ratio is often displayed as a bubble graph.
The PCI uses an ordinal scale with an odd number of rating points (−''n'' to +''n'') centred at 0. It is calculated as follows
:
where ''Z'' = 2''n'', | · | is the absolute value (modulus), ''r''+ is the number of responses in the positive side of the scale, ''r''− is the number of responses in the negative side of the scale, ''X''+ are the responses on the positive side of the scale, ''X''− are the responses on the negative side of the scale and
:
Theoretical difficulties are known to exist with the PCI. The PCI can be computed only for scales with a neutral center point and an equal number of response options on either side of it. Also a uniform distribution of responses does not always yield the midpoint of the PCI statistic but rather varies with the number of possible responses or values in the scale. For example, five-, seven- and nine-point scales with a uniform distribution of responses give PCIs of 0.60, 0.57 and 0.50 respectively.
The first of these problems is relatively minor as most ordinal scales with an even number of responses can be extended (or reduced) by a single value to give an odd number of possible responses. Scales can usually be recentred if this is required. The second problem is more difficult to resolve and may limit the PCI's applicability.
The PCI has been extended[Vaske JJ, Beaman J, Barreto H, Shelby LB (2010) An extension and further validation of the potential for conflict index. Leisure Sciences 32: 240–254]
:
where ''K'' is the number of categories, ''k''''i'' is the number in the ''i''th category, ''d''''ij'' is the distance between the ''i''th and ''j''th categories, and ''δ'' is the maximum distance on the scale multiplied by the number of times it can occur in the sample. For a sample with an even number of data points
:
and for a sample with an odd number of data points
:
where ''N'' is the number of data points in the sample and ''d''max is the maximum distance between points on the scale.
Vaske ''et al.'' suggest a number of possible distance measures for use with this index.
: $D_1:\ d_{ij} = |r_i - r_j| - 1$
if the signs (+ or −) of ''r''''i'' and ''r''''j'' differ. If the signs are the same ''d''''ij'' = 0.
: $D_2:\ d_{ij} = |r_i - r_j|$
: $D_3:\ d_{ij} = |r_i - r_j|^p$
where ''p'' is an arbitrary real number > 0.
:
if sign(''r''''i'' ) ≠ sign(''r''''j'' ) and ''p'' is a real number > 0. If the signs are the same then ''d''''ij'' = 0. ''m'' is ''D''1, ''D''2 or ''D''3.
The difference between ''D''1 and ''D''2 is that the first does not include neutrals in the distance while the latter does. For example, respondents scoring −2 and +1 would have a distance of 2 under ''D''1 and 3 under ''D''2.
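The worked example above can be reproduced with a few lines of code. The forms used here, ''D''1 as |''r''''i'' − ''r''''j''| − 1 when the signs differ (0 otherwise), ''D''2 as |''r''''i'' − ''r''''j''| and ''D''3 as |''r''''i'' − ''r''''j''| raised to the power ''p'', are assumptions chosen to be consistent with that example. The function names are hypothetical, and "signs differ" is taken to mean one negative and one positive response.

```python
def d1(ri, rj):
    """Distance that skips the neutral point; zero unless the signs are opposite."""
    if ri * rj >= 0:  # same sign, or at least one neutral response (assumed rule)
        return 0
    return abs(ri - rj) - 1


def d2(ri, rj):
    """Plain absolute difference, which counts the neutral point."""
    return abs(ri - rj)


def d3(ri, rj, p):
    """Absolute difference raised to a power p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    return abs(ri - rj) ** p


print(d1(-2, 1), d2(-2, 1), d3(-2, 1, 2))
```

With ''p'' = 2 the distance between −2 and +1 grows from 3 to 9, showing how a power above 1 emphasises extreme disagreements.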
The use of a power (''p'') in the distances allows for the rescaling of extreme responses. These differences can be highlighted with ''p'' > 1 or diminished with ''p'' < 1.
In simulations with variates drawn from a uniform distribution the PCI2 has a symmetric unimodal distribution. The tails of its distribution are larger than those of a normal distribution.
Vaske ''et al.'' suggest the use of a t test
to compare the values of the PCI between samples if the PCIs are approximately normally distributed.
van der Eijk's A
This measure is a weighted average of the degree of agreement in the frequency distribution.[Van der Eijk C (2001) Measuring agreement in ordered rating scales. Quality and quantity 35(3): 325–341] ''A'' ranges from −1 (perfect bimodality) to +1 (perfect unimodality). It is defined as
: $A = U \left(1 - \frac{S - 1}{K - 1}\right)$
where ''U'' is the unimodality of the distribution, ''S'' the number of categories that have nonzero frequencies and ''K'' the total number of categories.
The value of ''U'' is 1 if the distribution has any of the three following characteristics:
* all responses are in a single category
* the responses are evenly distributed among all the categories
* the responses are evenly distributed among two or more contiguous categories, with the other categories with zero responses
With distributions other than these the data must be divided into 'layers'. Within a layer the responses are either equal or zero. The categories do not have to be contiguous. A value for ''A'' for each layer (''A''''i'') is calculated and a weighted average for the distribution is determined. The weights (''w''''i'') for each layer are the number of responses in that layer. In symbols
:
A uniform distribution has ''A'' = 0; when all the responses fall into one category, ''A'' = +1.
One theoretical problem with this index is that it assumes that the intervals are equally spaced. This may limit its applicability.
Related statistics
Birthday problem
If there are ''n'' units in the sample and they are randomly distributed into ''k'' categories (''n'' ≤ ''k''), this can be considered a variant of the birthday problem
.[Von Mises R (1939) Über Aufteilungs- und Besetzungs-Wahrscheinlichkeiten. Revue de la Faculté des Sciences de l'Université d'Istanbul NS 4: 145−163] The probability (''p'') that no category contains more than one unit is
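: $p = \prod_{i=1}^{n-1} \left(1 - \frac{i}{k}\right)$
If ''k'' is large and ''n'' is small compared with $k^{2/3}$ then to a good approximation
: $p \approx \exp\left(-\frac{n^2}{2k}\right)$
This approximation follows from the exact formula as follows:
: $\log p = \sum_{i=1}^{n-1} \log\left(1 - \frac{i}{k}\right) \approx -\sum_{i=1}^{n-1} \frac{i}{k} = -\frac{n(n-1)}{2k} \approx -\frac{n^2}{2k}$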
;Sample size estimates
For ''p'' = 0.5 and ''p'' = 0.05 respectively the following estimates of ''n'' may be useful
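: $n \approx \sqrt{2k \ln 2} \approx 1.2 \sqrt{k}$
: $n \approx \sqrt{2k \ln 20} \approx 2.45 \sqrt{k}$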
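The exact product and the exponential approximation can be compared numerically. This sketch assumes the exact probability is the product of (1 − ''i''/''k'') for ''i'' = 1, ..., ''n'' − 1 and the approximation is exp(−''n''²/(2''k'')); the function names are hypothetical and the classic case ''k'' = 365 is used only as an example.

```python
from math import exp, log, sqrt


def p_all_distinct(n, k):
    """Exact probability that n units fall into n different categories out of k."""
    p = 1.0
    for i in range(1, n):
        p *= 1.0 - i / k
    return p


def p_all_distinct_approx(n, k):
    """Approximation exp(-n^2 / (2k)), reasonable when n is small relative to k."""
    return exp(-n * n / (2.0 * k))


def n_for_probability(p, k):
    """Sample size at which the approximate probability equals p."""
    return sqrt(-2.0 * k * log(p))


exact = p_all_distinct(23, 365)
approx = p_all_distinct_approx(23, 365)
n_half = n_for_probability(0.5, 365)
print(exact, approx, n_half)
```

Solving the approximation for ''n'' gives the square-root rule used for the sample size estimates; for ''k'' = 365 and ''p'' = 0.5 it returns a value between 22 and 23.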
This analysis can be extended to multiple categories. For ''p'' = 0.5 and ''p'' = 0.05 we have respectively
:
:
where ''c''''i'' is the size of the ''i''th category. This analysis assumes that the categories are independent.
If the data is ordered in some fashion then, for at least one event occurring in two categories lying within ''j'' categories of each other, a probability of 0.5 or 0.05 requires a sample size (''n'') respectively of[Sevast'yanov BA (1972) Poisson limit law for a scheme of sums of dependent random variables. (trans. S. M. Rudolfer) Theory of probability and its applications, 17: 695−699]
:
:
where ''k'' is the number of categories.
Birthday-death day problem
Whether or not there is a relation between birthdays and death days has been investigated with the statistic[Hoaglin DC, Mosteller, F and Tukey, JW (1985) Exploring data tables, trends, and shapes, New York: John Wiley]
:
where ''d'' is the number of days in the year between the birthday and the death day.
Rand index
The Rand index
is used to test whether two or more classification systems agree on a data set.
Given a set of ''n'' elements <math>S = \{o_1, \ldots, o_n\}</math> and two partitions of ''S'' to compare, <math>X = \{X_1, \ldots, X_r\}</math>, a partition of ''S'' into ''r'' subsets, and <math>Y = \{Y_1, \ldots, Y_s\}</math>, a partition of ''S'' into ''s'' subsets, define the following:
* ''a'', the number of pairs of elements in ''S'' that are in the same subset in ''X'' and in the same subset in ''Y''
* ''b'', the number of pairs of elements in ''S'' that are in different subsets in ''X'' and in different subsets in ''Y''
* ''c'', the number of pairs of elements in ''S'' that are in the same subset in ''X'' and in different subsets in ''Y''
* ''d'', the number of pairs of elements in ''S'' that are in different subsets in ''X'' and in the same subset in ''Y''
The Rand index, ''R'', is defined as
:<math>R = \frac{a + b}{a + b + c + d} = \frac{a + b}{\binom{n}{2}}</math>
Intuitively, ''a'' + ''b'' can be considered as the number of agreements between ''X'' and ''Y'', and ''c'' + ''d'' as the number of disagreements between ''X'' and ''Y''.
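The pair counting behind the Rand index can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: each partition is assumed to be given as a list of subset labels, one per element, and the function names `pair_counts` and `rand_index` are hypothetical. The four counts are pairs grouped together in both partitions (''a''), apart in both (''b''), together in the first only (''c''), and together in the second only (''d'').

```python
from itertools import combinations


def pair_counts(labels_x, labels_y):
    """Return (a, b, c, d) for two labelings of the same elements.

    labels_x[i] and labels_y[i] are the subset labels of element i
    in the first and second partition (an assumed input format).
    """
    if len(labels_x) != len(labels_y):
        raise ValueError("both labelings must cover the same elements")
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1  # together in both partitions
        elif not same_x and not same_y:
            b += 1  # apart in both partitions
        elif same_x:
            c += 1  # together in the first, apart in the second
        else:
            d += 1  # apart in the first, together in the second
    return a, b, c, d


def rand_index(labels_x, labels_y):
    """Share of element pairs on which the two partitions agree."""
    a, b, c, d = pair_counts(labels_x, labels_y)
    total = a + b + c + d  # equals n choose 2
    if total == 0:
        raise ValueError("need at least two elements")
    return (a + b) / total


x = [0, 0, 0, 1, 1, 1]
y = [0, 0, 1, 1, 2, 2]
print(pair_counts(x, y))  # (2, 8, 4, 1)
print(rand_index(x, y))   # 10 agreements out of 15 pairs
```

The loop visits every pair once, so it takes time quadratic in the number of elements; for large data sets the same counts are usually derived from a contingency table instead.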
Adjusted Rand index
The adjusted Rand index is the corrected-for-chance version of the Rand index. Though the Rand index can only yield a value between 0 and +1, the adjusted Rand index can yield negative values if the index is less than the expected index.
The contingency table
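Given a set ''S'' of ''n'' elements, and two groupings or partitions (''e.g.'' clusterings) of these points, namely <math>X = \{X_1, X_2, \ldots, X_r\}</math> and <math>Y = \{Y_1, Y_2, \ldots, Y_s\}</math>, the overlap between ''X'' and ''Y'' can be summarized in a contingency table <math>[n_{ij}]</math> where each entry <math>n_{ij}</math> denotes the number of objects in common between <math>X_i</math> and <math>Y_j</math>: <math>n_{ij} = |X_i \cap Y_j|</math>.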
Definition
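The adjusted form of the Rand index, the adjusted Rand index, is
:<math>ARI = \frac{\text{Index} - \text{Expected index}}{\text{Maximum index} - \text{Expected index}}</math>
more specifically
:<math>ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2} \left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}</math>
where <math>n_{ij}</math> are the entries of the contingency table and <math>a_i = \sum_j n_{ij}</math>, <math>b_j = \sum_i n_{ij}</math> are its row and column sums.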
Since the denominator of the unadjusted Rand index is the total number of pairs, the Rand index represents the ''frequency of occurrence'' of agreements over the total pairs, or the probability that ''X'' and ''Y'' will agree on a randomly chosen pair.
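As a rough sketch, the adjusted Rand index can be computed from the contingency table counts in Python. The input format (one subset label per element for each partition) and the function name `adjusted_rand_index` are assumptions made for illustration. The value returned for the degenerate case, where the maximum and expected indices coincide, is also an assumed convention and is not taken from the text above.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_x, labels_y):
    """Adjusted Rand index of two labelings of the same elements."""
    if len(labels_x) != len(labels_y):
        raise ValueError("both labelings must cover the same elements")
    n = len(labels_x)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        raise ValueError("need at least two elements")

    n_ij = Counter(zip(labels_x, labels_y))  # contingency table entries
    a_i = Counter(labels_x)                  # row sums
    b_j = Counter(labels_y)                  # column sums

    index = sum(comb(v, 2) for v in n_ij.values())
    sum_a = sum(comb(v, 2) for v in a_i.values())
    sum_b = sum(comb(v, 2) for v in b_j.values())

    expected = sum_a * sum_b / total_pairs
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        # Both partitions are trivial and identical (one block, or all
        # singletons). Returning 1.0 here is an assumed convention.
        return 1.0
    return (index - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # 8/33, about 0.24
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))              # -0.5
```

The second call illustrates the negative values mentioned above: the two partitions agree on fewer pairs than would be expected by chance.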
Evaluation of indices
Different indices give different values of variation, and may be used for different purposes: several are used and critiqued in the sociology literature especially.
If one wishes to simply make ordinal comparisons between samples (is one sample more or less varied than another), the choice of IQV is relatively less important, as they will often give the same ordering.
Where the data are ordinal, a method that may be of use in comparing samples is ORDANOVA.
In some cases it is useful not to standardize an index to run from 0 to 1 regardless of the number of categories or samples, but one generally does standardize it.
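The points about ordering and standardization can be illustrated with a small Python sketch. It uses two indices named in this article, the variation ratio and the information entropy; the function names, the sample data, and the choice of the natural logarithm are assumptions made for the example. Dividing the entropy by log ''k'', its maximum for ''k'' categories, is one common way to rescale it to run from 0 to 1.

```python
from collections import Counter
from math import log


def variation_ratio(sample):
    """One minus the share of observations in the modal category."""
    if not sample:
        raise ValueError("sample must not be empty")
    counts = Counter(sample)
    return 1 - max(counts.values()) / len(sample)


def entropy(sample):
    """Information entropy of the observed category shares, in nats."""
    if not sample:
        raise ValueError("sample must not be empty")
    n = len(sample)
    return -sum((c / n) * log(c / n) for c in Counter(sample).values())


def standardized_entropy(sample, k):
    """Entropy divided by log(k), its maximum for k possible categories."""
    if k < 2:
        raise ValueError("need at least two categories")
    return entropy(sample) / log(k)


s1 = list("AAAAAABB")  # concentrated in one category
s2 = list("AAABBBCC")  # spread over three categories

# Both indices rank s1 as less varied than s2.
print(variation_ratio(s1), variation_ratio(s2))
print(entropy(s1), entropy(s2))
# With k = 3 possible categories the standardized values lie in [0, 1].
print(standardized_entropy(s1, 3), standardized_entropy(s2, 3))
```

Here the two indices agree on the ordering of the samples even though their numerical values differ, which is the situation described above for ordinal comparisons.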
See also
* ANOSIM
* Baker’s gamma index
* Categorical data
* Diversity index
* Fowlkes–Mallows index
* Goodman and Kruskal's gamma
* Information entropy
* Logarithmic distribution
* PERMANOVA
* Robinson–Foulds metric
* Shepard diagram
* SIMPER
* Statistical dispersion
* Variation ratio
* Whipple's index
Notes
References
* Wilcox, Allen R. (June 1973). "Indices of Qualitative Variation and Political Measurement". ''The Western Political Quarterly''. 26 (2): 325–343. doi:10.2307/446831. JSTOR 446831.