In statistics, the uncertainty coefficient, also called proficiency, entropy coefficient or Theil's U, is a measure of nominal association. It was first introduced by Henri Theil and is based on the concept of information entropy.
Definition
Suppose we have samples of two discrete random variables, ''X'' and ''Y''. From the joint distribution, P_{X,Y}(x, y), we can calculate the conditional distributions, P_{X|Y}(x|y) and P_{Y|X}(y|x); computing the various entropies then lets us determine the degree of association between the two variables.
The entropy of a single distribution is given as:
:H(X) = -\sum_x P_X(x) \log P_X(x)
while the conditional entropy is given as:
:H(X|Y) = -\sum_{x,y} P_{X,Y}(x,y) \log P_{X|Y}(x|y)
The uncertainty coefficient or proficiency is defined as:
:U(X|Y) = \frac{H(X) - H(X|Y)}{H(X)} = \frac{I(X;Y)}{H(X)}
and tells us: given ''Y'', what fraction of the bits of ''X'' can we predict? In this case we can think of ''X'' as containing the total information, and of ''Y'' as allowing one to predict part of such information.
The above expression makes clear that the uncertainty coefficient is a normalised mutual information ''I(X;Y)''. In particular, the uncertainty coefficient ranges in [0, 1], since ''I(X;Y)'' ≤ ''H(X)'' and both ''I(X;Y)'' and ''H(X)'' are nonnegative.
Note that the value of ''U'' (but not ''H''!) is independent of the base of the ''log'' since all logarithms are proportional.
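To make the definition concrete, the coefficient can be computed directly from a joint probability table. The sketch below uses NumPy; the table ''P'' is a made-up joint distribution, chosen purely for illustration:

```python
import numpy as np

# Hypothetical joint distribution P_{X,Y}(x, y): 3 values of X (rows), 2 of Y (columns).
P = np.array([[0.25, 0.05],
              [0.10, 0.30],
              [0.15, 0.15]])

def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_X = entropy(P.sum(axis=1))   # marginal entropy H(X)
H_Y = entropy(P.sum(axis=0))   # marginal entropy H(Y)
H_XY = entropy(P.ravel())      # joint entropy H(X,Y)
H_X_given_Y = H_XY - H_Y       # chain rule: H(X|Y) = H(X,Y) - H(Y)

U = (H_X - H_X_given_Y) / H_X  # uncertainty coefficient U(X|Y)
```

Because ''U'' is a ratio of entropies, the same value results whichever logarithm base `entropy()` uses.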
The uncertainty coefficient is useful for measuring the validity of a statistical classification algorithm and has the advantage over simpler accuracy measures such as precision and recall in that it is not affected by the relative fractions of the different classes, i.e., ''P''(''x'').
It also has the unique property that it won't penalize an algorithm for predicting the wrong classes, so long as it does so consistently (i.e., it simply rearranges the classes). This is useful in evaluating
clustering algorithms since cluster labels typically have no particular ordering.
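This relabelling invariance can be checked numerically. In the sketch below the label sequences are invented for illustration, and the coefficient is estimated from empirical frequencies:

```python
import numpy as np
from collections import Counter

# Hypothetical cluster labels: y_b is y_a with classes 0 and 1 swapped.
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
y_a    = [0, 0, 1, 1, 1, 1, 1, 0]
y_b    = [1, 1, 0, 0, 0, 0, 0, 1]

def uncertainty_coefficient(x, y):
    """Empirical U(X|Y) in bits, via I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    n = len(x)
    def H(counts):
        p = np.array(list(counts.values())) / n
        return -np.sum(p * np.log2(p))
    I = H(Counter(x)) + H(Counter(y)) - H(Counter(zip(x, y)))
    return I / H(Counter(x))

u_a = uncertainty_coefficient(y_true, y_a)
u_b = uncertainty_coefficient(y_true, y_b)
```

Swapping the predicted class labels permutes the columns of the empirical contingency table without changing any of the entropies, so `u_a` and `u_b` coincide.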
Variations
The uncertainty coefficient is not symmetric with respect to the roles of ''X'' and ''Y''. The roles can be reversed and a symmetrical measure thus defined as a weighted average between the two:
:U(X, Y) = \frac{H(X)\,U(X|Y) + H(Y)\,U(Y|X)}{H(X) + H(Y)} = 2\,\frac{H(X) + H(Y) - H(X,Y)}{H(X) + H(Y)}
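The weighted average simplifies because each term H(X)·U(X|Y) equals the mutual information I(X;Y), so the symmetric measure reduces to 2·I(X;Y)/(H(X)+H(Y)). A short sketch, again using an illustrative made-up joint distribution, verifies this identity:

```python
import numpy as np

# Hypothetical joint distribution P_{X,Y}(x, y), for illustration only.
P = np.array([[0.25, 0.05],
              [0.10, 0.30],
              [0.15, 0.15]])

def entropy(p):
    """Shannon entropy in bits over the positive entries of p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_X = entropy(P.sum(axis=1))
H_Y = entropy(P.sum(axis=0))
H_XY = entropy(P.ravel())
I = H_X + H_Y - H_XY               # mutual information I(X;Y)

U_x_given_y = I / H_X              # U(X|Y)
U_y_given_x = I / H_Y              # U(Y|X)
# weighted average of the two asymmetric coefficients
U_sym = (H_X * U_x_given_y + H_Y * U_y_given_x) / (H_X + H_Y)
```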
Although normally applied to discrete variables, the uncertainty coefficient can be extended to continuous variables using density estimation.
See also
*Mutual information
*Rand index
*F-score
*Binary classification
References
External links
libagf: includes software for calculating uncertainty coefficients.