The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the ''bit'', or more correctly the shannon, based on the binary logarithm. Although ''bit'' is more frequently used in place of ''shannon'', its name is not distinguished from the bit as used in data processing to refer to a binary value or stream regardless of its entropy (information content). Other units include the nat, based on the natural logarithm, and the hartley, based on the base 10 or common logarithm.
In what follows, an expression of the form p \log p is considered by convention to be equal to zero whenever p is zero. This is justified because \lim_{p \rightarrow 0^+} p \log p = 0 for any logarithmic base.
Self-information
Shannon derived a measure of information content called the self-information or "surprisal" of a message m:
:I(m) = \log \left( \frac{1}{p(m)} \right) = -\log(p(m))
where p(m) = \Pr(M = m) is the probability that message m is chosen from all possible choices in the message space M. The base of the logarithm only affects a scaling factor and, consequently, the units in which the measured information content is expressed. If the logarithm is base 2, the measure of information is expressed in units of shannons or more often simply "bits" (a bit in other contexts is rather defined as a "binary digit", whose average information content is at most 1 shannon).
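As an illustrative sketch (not part of the original presentation), the self-information of a message can be computed directly from its probability; the function name and the base-2 default below are assumptions made only for this example:

 import math

 def self_information(p, base=2):
     """Return I(m) = -log(p(m)) for a message of probability p (in shannons/bits for base 2)."""
     if p <= 0 or p > 1:
         raise ValueError("p must be a probability in (0, 1]")
     return -math.log(p, base)

 # A message with probability 1/8 carries 3 bits of information.
 print(self_information(1/8))   # 3.0
 # A certain event (p = 1) carries no information.
 print(self_information(1.0))   # 0.0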
Information from a source is gained by a recipient only if the recipient did not already have that information to begin with. Messages that convey information about a certain (P = 1) event (or one which is ''known'' with certainty, for instance, through a back-channel) provide no information, as the above equation indicates. Infrequently occurring messages contain more information than more frequently occurring messages.
It can also be shown that a compound message of two (or more) unrelated messages would have a quantity of information that is the sum of the measures of information of each message individually. That can be derived using this definition by considering a compound message m \& n providing information regarding the values of two random variables M and N using a message which is the concatenation of the elementary messages ''m'' and ''n'', each of whose information content is given by I(m) and I(n) respectively. If the messages ''m'' and ''n'' each depend only on M and N, and the processes M and N are independent, then since p(m \& n) = p(m)\,p(n) (the definition of statistical independence) it is clear from the above definition that I(m \& n) = I(m) + I(n).
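A quick numerical check of this additivity, under the assumption of independent messages (the probabilities below are hypothetical, chosen only for illustration):

 import math

 p_m, p_n = 0.5, 0.25              # assumed probabilities of independent messages m and n
 I = lambda p: -math.log2(p)       # self-information in bits

 # Since P(m&n) = P(m)P(n) for independent messages,
 # I(m&n) = -log2(P(m)P(n)) = I(m) + I(n).
 print(I(p_m * p_n))               # 3.0
 print(I(p_m) + I(p_n))            # 1.0 + 2.0 = 3.0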
An example: The weather forecast broadcast is: "Tonight's forecast: Dark. Continued darkness until widely scattered light in the morning." This message contains almost no information. However, a forecast of a snowstorm would certainly contain information since such an event does not happen every evening. There would be an even greater amount of information in an accurate forecast of snow for a warm location, such as Miami. The amount of information in a forecast of snow for a location where it never snows (an impossible event) is the highest (infinity).
Entropy
The entropy of a discrete message space \mathrm{M} is a measure of the amount of uncertainty one has about which message will be chosen. It is defined as the average self-information of a message m from that message space:
:H(\mathrm{M}) = \mathbb{E}\left[I(M)\right] = \sum_{m \in \mathrm{M}} p(m) I(m) = -\sum_{m \in \mathrm{M}} p(m) \log p(m)
where \mathbb{E}[\cdot] denotes the expected value operation.
An important property of entropy is that it is maximized when all the messages in the message space are equiprobable (e.g. p(m) = 1/|\mathrm{M}|). In this case H(\mathrm{M}) = \log |\mathrm{M}|.
Sometimes the function H is expressed in terms of the probabilities of the distribution:
:H(p_1, p_2, \ldots, p_k) = -\sum_{i=1}^{k} p_i \log p_i,
where each p_i \geq 0 and \sum_{i=1}^{k} p_i = 1.
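A minimal sketch of this definition, using the 0 \log 0 = 0 convention noted above (the function name and example distributions are assumptions for illustration):

 import math

 def entropy(probs, base=2):
     """H(p_1, ..., p_k) = -sum p_i log p_i, with 0 log 0 taken as 0."""
     return -sum(p * math.log(p, base) for p in probs if p > 0)

 # Entropy is maximized by the uniform distribution: log2(4) = 2 bits.
 print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
 # A skewed distribution has lower entropy.
 print(entropy([0.7, 0.1, 0.1, 0.1]))       # about 1.357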
An important special case of this is the binary entropy function:
:H_\mathrm{b}(p) = H(p, 1-p) = -p \log p - (1-p) \log (1-p)
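A short self-contained check (a sketch, with values chosen only for illustration) confirms that the binary entropy function peaks at p = 1/2:

 import math

 def binary_entropy(p):
     """H_b(p) = -p log2 p - (1-p) log2 (1-p), with H_b(0) = H_b(1) = 0."""
     if p in (0.0, 1.0):
         return 0.0
     return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

 print(binary_entropy(0.5))   # 1.0 bit: a fair coin flip is maximally uncertain
 print(binary_entropy(0.9))   # about 0.469 bits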
Joint entropy
The joint entropy of two discrete random variables X and Y is defined as the entropy of the joint distribution of X and Y:
:H(X, Y) = \mathbb{E}_{X, Y}\left[-\log p(x, y)\right] = -\sum_{x, y} p(x, y) \log p(x, y)
If X and Y are independent, then the joint entropy is simply the sum of their individual entropies.
(Note: The joint entropy should not be confused with the
cross entropy, despite similar notations.)
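A sketch of the joint-entropy definition for a small joint distribution given as a table of probabilities p(x, y) (the example distributions are made up for illustration):

 import math

 def joint_entropy(pxy):
     """H(X, Y) = -sum_{x,y} p(x,y) log2 p(x,y) for a dict {(x, y): probability}."""
     return -sum(p * math.log2(p) for p in pxy.values() if p > 0)

 # Independent fair bits: H(X, Y) = H(X) + H(Y) = 1 + 1 = 2 bits.
 independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
 print(joint_entropy(independent))   # 2.0

 # Perfectly correlated bits: H(X, Y) = 1 bit, less than the sum of the individual entropies.
 correlated = {(0, 0): 0.5, (1, 1): 0.5}
 print(joint_entropy(correlated))    # 1.0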
Conditional entropy (equivocation)
Given a particular value of a random variable Y, the conditional entropy of X given Y = y is defined as:
:H(X|y) = \mathbb{E}_{X|Y}\left[-\log p(x|y)\right] = -\sum_{x \in X} p(x|y) \log p(x|y)
where p(x|y) = \frac{p(x, y)}{p(y)} is the conditional probability of x given y.
The conditional entropy of X given Y, also called the equivocation of X about Y, is then given by:
:H(X|Y) = \mathbb{E}_Y\left[H(X|y)\right] = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = \sum_{x, y} p(x, y) \log \frac{p(y)}{p(x, y)}
This uses the conditional expectation from probability theory.
A basic property of the conditional entropy is that:
:H(X|Y) = H(X, Y) - H(Y).
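The identity H(X|Y) = H(X, Y) - H(Y) gives a convenient way to compute the equivocation from a joint table; a minimal sketch with an assumed toy distribution:

 import math
 from collections import defaultdict

 def entropy(probs):
     return -sum(p * math.log2(p) for p in probs if p > 0)

 # Assumed joint distribution p(x, y), for illustration only.
 pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

 # Marginal p(y), then H(X|Y) = H(X, Y) - H(Y).
 py = defaultdict(float)
 for (x, y), p in pxy.items():
     py[y] += p

 H_xy = entropy(pxy.values())
 H_y = entropy(py.values())
 print(H_xy - H_y)   # H(X|Y), about 0.722 bits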
Kullback–Leibler divergence (information gain)
The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions: a "true" probability distribution p, and an arbitrary probability distribution q. If we compress data in a manner that assumes q is the distribution underlying some data, when, in reality, p is the correct distribution, Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression, or, mathematically,
:D_{\mathrm{KL}}\left(p(X) \| q(X)\right) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.
It is in some sense the "distance" from q to p, although it is not a true metric due to its not being symmetric.
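A direct sketch of the definition (the two distributions below are assumed for illustration and share the same finite alphabet):

 import math

 def kl_divergence(p, q):
     """D_KL(p || q) = sum_x p(x) log2(p(x)/q(x)), in bits; assumes q(x) > 0 wherever p(x) > 0."""
     return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

 p = [0.5, 0.25, 0.25]        # "true" distribution
 q = [1/3, 1/3, 1/3]          # assumed (coding) distribution
 print(kl_divergence(p, q))   # about 0.085 extra bits per symbol
 print(kl_divergence(q, p))   # about 0.082 bits: not symmetric, so not a true metric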
Mutual information (transinformation)
It turns out that one of the most useful and important measures of information is the mutual information, or transinformation. This is a measure of how much information can be obtained about one random variable by observing another. The mutual information of X relative to Y (which represents conceptually the average amount of information about X that can be gained by observing Y) is given by:
:I(X; Y) = \sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log \frac{p(x|y)}{p(x)} = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
A basic property of the mutual information is that:
:I(X; Y) = H(X) - H(X|Y).
That is, knowing Y, we can save an average of I(X; Y) bits in encoding X compared to not knowing Y. Mutual information is symmetric:
:I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y).
Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) of the posterior probability distribution of X given the value of Y to the prior distribution on X:
:I(X; Y) = \mathbb{E}_{p(y)}\left[D_{\mathrm{KL}}\left(p(X|Y=y) \| p(X)\right)\right].
In other words, this is a measure of how much, on average, the probability distribution on X will change if we are given the value of Y. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:
:I(X; Y) = D_{\mathrm{KL}}\left(p(X, Y) \| p(X)\, p(Y)\right).
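A sketch computing I(X; Y) as the divergence from the product of the marginals to the joint distribution (the joint tables are assumed for illustration):

 import math
 from collections import defaultdict

 def mutual_information(pxy):
     """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ) for a dict {(x, y): probability}."""
     px, py = defaultdict(float), defaultdict(float)
     for (x, y), p in pxy.items():
         px[x] += p
         py[y] += p
     return sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

 # Independent variables carry no mutual information.
 print(mutual_information({(x, y): 0.25 for x in (0, 1) for y in (0, 1)}))   # 0.0
 # Perfectly correlated bits share 1 bit of information.
 print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))                       # 1.0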
Mutual information is closely related to the
log-likelihood ratio test in the context of contingency tables and the
multinomial distribution and to
Pearson's χ2 test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.
Differential entropy
The basic measures of discrete entropy have been extended by analogy to continuous spaces by replacing sums with integrals and probability mass functions with probability density functions. Although, in both cases, mutual information expresses the number of bits of information common to the two sources in question, the analogy does ''not'' imply identical properties; for example, differential entropy may be negative.
The differential analogies of entropy, joint entropy, conditional entropy, and mutual information are defined as follows:
:h(X) = -\int_X f(x) \log f(x) \,dx
:h(X, Y) = -\int_Y \int_X f(x, y) \log f(x, y) \,dx\,dy
:h(X|y) = -\int_X f(x|y) \log f(x|y) \,dx
:h(X|Y) = -\int_Y \int_X f(x, y) \log f(x|y) \,dx\,dy
:I(X; Y) = \int_Y \int_X f(x, y) \log \frac{f(x, y)}{f(x)\, f(y)} \,dx\,dy
where f(x, y) is the joint density function, f(x) and f(y) are the marginal distributions, and f(x|y) is the conditional distribution.
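As a rough numerical sketch (the densities and integration ranges are assumptions, not from the original text), the differential entropy of a continuous density can be approximated by numerically integrating -f(x) \log f(x); for a Gaussian with standard deviation \sigma it should approach \tfrac{1}{2}\log_2(2\pi e \sigma^2), and for small \sigma it becomes negative:

 import math

 def differential_entropy(f, lo, hi, n=100_000):
     """Approximate h(X) = -integral of f(x) log2 f(x) dx by a midpoint Riemann sum over [lo, hi]."""
     dx = (hi - lo) / n
     total = 0.0
     for i in range(n):
         x = lo + (i + 0.5) * dx
         fx = f(x)
         if fx > 0:
             total -= fx * math.log2(fx) * dx
     return total

 def gaussian(sigma):
     return lambda x: math.exp(-x * x / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

 # Closed form: 0.5 * log2(2 * pi * e * sigma^2)
 print(differential_entropy(gaussian(1.0), -10, 10))   # about 2.047
 print(differential_entropy(gaussian(0.1), -1, 1))     # about -1.275: differential entropy can be negative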
See also
* Information theory