
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons (bits), nats or hartleys) obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.
Not limited to real-valued random variables and linear dependence like the correlation coefficient, MI is more general and determines how different the joint distribution of the pair (X, Y) is from the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI).
The quantity was defined and analyzed by Claude Shannon in his landmark paper "A Mathematical Theory of Communication", although he did not call it "mutual information". This term was coined later by Robert Fano. Mutual information is also known as information gain.
Definition
Let (X, Y) be a pair of random variables with values over the space 𝒳 × 𝒴. If their joint distribution is P_(X,Y) and the marginal distributions are P_X and P_Y, the mutual information is defined as
:<math>\operatorname{I}(X; Y) = D_{\mathrm{KL}}\left( P_{(X,Y)} \parallel P_X \otimes P_Y \right),</math>
where D_KL is the Kullback–Leibler divergence, and P_X ⊗ P_Y is the outer product distribution which assigns probability P_X(x)·P_Y(y) to each (x, y).
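For discrete variables this definition translates directly into a small computation. The sketch below is illustrative only: it assumes a made-up 2×3 joint probability table (the variable names are arbitrary) and computes I(X; Y) as the Kullback–Leibler divergence between the joint distribution and the outer product of its marginals.
<syntaxhighlight lang="python">
import numpy as np
from scipy.special import rel_entr  # rel_entr(p, q) = p * log(p / q), elementwise

# Hypothetical joint probability mass function P_(X,Y) on a 2 x 3 alphabet
# (rows index x, columns index y); the entries sum to 1.
p_xy = np.array([[0.10, 0.20, 0.15],
                 [0.25, 0.05, 0.25]])

p_x = p_xy.sum(axis=1)      # marginal distribution P_X
p_y = p_xy.sum(axis=0)      # marginal distribution P_Y
outer = np.outer(p_x, p_y)  # outer-product distribution P_X ⊗ P_Y

# I(X;Y) = D_KL( P_(X,Y) || P_X ⊗ P_Y )
mi = rel_entr(p_xy, outer).sum()
print(mi)
</syntaxhighlight>
Because rel_entr uses the natural logarithm, the value is in nats; dividing by log 2 would convert it to bits.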
Expressed in terms of the entropy H and the conditional entropy of the random variables X and Y, one also has (see relation to conditional and joint entropy):
:<math>\operatorname{I}(X; Y) = \operatorname{H}(X) - \operatorname{H}(X \mid Y) = \operatorname{H}(Y) - \operatorname{H}(Y \mid X).</math>
Notice, as per property of the Kullback–Leibler divergence, that I(X; Y) is equal to zero precisely when the joint distribution coincides with the product of the marginals, i.e. when X and Y are independent (and hence observing Y tells you nothing about X). Moreover, I(X; Y) is non-negative: it is a measure of the price for encoding (X, Y) as a pair of independent random variables when in reality they are not.
If the natural logarithm is used, the unit of mutual information is the nat. If the log base 2 is used, the unit of mutual information is the shannon, also known as the bit. If the log base 10 is used, the unit of mutual information is the hartley, also known as the ban or the dit.
In terms of PMFs for discrete distributions
The mutual information of two jointly discrete random variables X and Y is calculated as a double sum:
:<math>\operatorname{I}(X; Y) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} P_{(X,Y)}(x, y) \log \left( \frac{P_{(X,Y)}(x, y)}{P_X(x)\, P_Y(y)} \right),</math>
where P_(X,Y) is the joint probability ''mass'' function of X and Y, and P_X and P_Y are the marginal probability mass functions of X and Y respectively.
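The double sum above can be written out literally as nested loops. The following sketch assumes a hypothetical joint probability mass function stored as a nested dictionary (the helper name is arbitrary); passing a different logarithm changes only the unit of the result, as discussed above.
<syntaxhighlight lang="python">
import math

# Hypothetical joint PMF given as a nested dict: p_xy[x][y]
p_xy = {
    0: {0: 0.10, 1: 0.20, 2: 0.15},
    1: {0: 0.25, 1: 0.05, 2: 0.25},
}

p_x = {x: sum(row.values()) for x, row in p_xy.items()}   # marginal of X
p_y = {}                                                  # marginal of Y
for row in p_xy.values():
    for y, p in row.items():
        p_y[y] = p_y.get(y, 0.0) + p

def mutual_information(p_xy, p_x, p_y, log=math.log):
    """Double sum  sum_y sum_x p(x,y) * log( p(x,y) / (p(x) p(y)) )."""
    total = 0.0
    for x, row in p_xy.items():
        for y, p in row.items():
            if p > 0:                    # terms with p(x,y) = 0 contribute nothing
                total += p * log(p / (p_x[x] * p_y[y]))
    return total

print(mutual_information(p_xy, p_x, p_y))               # nats (natural log)
print(mutual_information(p_xy, p_x, p_y, math.log2))    # shannons / bits
print(mutual_information(p_xy, p_x, p_y, math.log10))   # hartleys
</syntaxhighlight>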
In terms of PDFs for continuous distributions
In the case of jointly continuous random variables, the double sum is replaced by a double integral:
:<math>\operatorname{I}(X; Y) = \int_{\mathcal{Y}} \int_{\mathcal{X}} P_{(X,Y)}(x, y) \log \left( \frac{P_{(X,Y)}(x, y)}{P_X(x)\, P_Y(y)} \right) \, dx \, dy,</math>
where P_(X,Y) is now the joint probability ''density'' function of X and Y, and P_X and P_Y are the marginal probability density functions of X and Y respectively.
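For continuous variables the integral generally has to be evaluated numerically or estimated from samples. As a sanity check, the sketch below numerically integrates the integrand for a bivariate normal pair with correlation ρ = 0.6, whose mutual information has the known closed form −½ ln(1 − ρ²) nats; the integration limits ±8 are a truncation chosen for this example.
<syntaxhighlight lang="python">
import numpy as np
from scipy.integrate import dblquad
from scipy.stats import multivariate_normal, norm

rho = 0.6                                        # correlation of the example pair
joint = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

def integrand(y, x):                             # dblquad passes (inner, outer) = (y, x)
    pxy = joint.pdf([x, y])
    return pxy * np.log(pxy / (norm.pdf(x) * norm.pdf(y)))

# Truncate the doubly infinite integral to [-8, 8]^2, where the density is negligible.
mi_numeric, _ = dblquad(integrand, -8, 8, lambda x: -8, lambda x: 8)
mi_closed = -0.5 * np.log(1 - rho**2)            # known closed form, in nats

print(mi_numeric, mi_closed)                     # both approximately 0.223
</syntaxhighlight>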
Motivation
Intuitively, mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other. For example, if X and Y are independent, then knowing X does not give any information about Y and vice versa, so their mutual information is zero. At the other extreme, if X is a deterministic function of Y and Y is a deterministic function of X, then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa. As a result, the mutual information is the same as the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X). A very special case of this is when X and Y are the same random variable.
Mutual information is a measure of the inherent dependence expressed in the joint distribution of X and Y relative to the marginal distributions of X and Y under the assumption of independence. Mutual information therefore measures dependence in the following sense: I(X; Y) = 0 if and only if X and Y are independent random variables. This is easy to see in one direction: if X and Y are independent, then P_(X,Y)(x, y) = P_X(x)·P_Y(y), and therefore:
:<math>\log \left( \frac{P_{(X,Y)}(x, y)}{P_X(x)\, P_Y(y)} \right) = \log 1 = 0.</math>
Moreover, mutual information is nonnegative (i.e. I(X; Y) ≥ 0; see below) and symmetric (i.e. I(X; Y) = I(Y; X); see below).
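Both extremes described above are easy to check numerically. The sketch below uses two made-up joint distributions: one built as an outer product of marginals (independence), and one in which Y is a deterministic copy of X; the helper name is arbitrary.
<syntaxhighlight lang="python">
import numpy as np
from scipy.special import rel_entr
from scipy.stats import entropy

def mi(p_xy):
    """Mutual information (in nats) of a joint PMF given as a 2-D array."""
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    return rel_entr(p_xy, np.outer(p_x, p_y)).sum()

p_x = np.array([0.2, 0.3, 0.5])              # hypothetical marginal of X

independent = np.outer(p_x, [0.6, 0.4])      # joint = product of marginals
copy = np.diag(p_x)                          # Y is a deterministic copy of X

print(mi(independent))             # ~0      : independence gives zero MI
print(mi(copy), entropy(p_x))      # both ~1.03: MI equals the entropy H(X)
</syntaxhighlight>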
Properties
Nonnegativity
Using Jensen's inequality on the definition of mutual information we can show that I(X; Y) is non-negative, i.e.
:<math>\operatorname{I}(X; Y) \ge 0.</math>
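For the jointly discrete case with strictly positive probabilities, the argument can be sketched as follows: applying Jensen's inequality to the concave logarithm,
:<math>\begin{align}
-\operatorname{I}(X; Y) &= \sum_{x, y} P_{(X,Y)}(x, y) \log \frac{P_X(x)\, P_Y(y)}{P_{(X,Y)}(x, y)} \\
&\le \log \sum_{x, y} P_{(X,Y)}(x, y)\, \frac{P_X(x)\, P_Y(y)}{P_{(X,Y)}(x, y)} = \log \sum_{x, y} P_X(x)\, P_Y(y) = \log 1 = 0,
\end{align}</math>
so I(X; Y) ≥ 0, with equality exactly when the joint distribution equals the product of the marginals.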
Symmetry
:<math>\operatorname{I}(X; Y) = \operatorname{I}(Y; X).</math>
The proof follows from the relationship with entropy, as shown below.
Supermodularity under independence
If C is independent of (A, B), then
:<math>\operatorname{I}(Y; A, B, C) - \operatorname{I}(Y; A, B) \ge \operatorname{I}(Y; A, C) - \operatorname{I}(Y; A).</math>
Relation to conditional and joint entropy
Mutual information can be equivalently expressed as:
:<math>\begin{align}
\operatorname{I}(X; Y) &\equiv \operatorname{H}(X) - \operatorname{H}(X \mid Y) \\
&\equiv \operatorname{H}(Y) - \operatorname{H}(Y \mid X) \\
&\equiv \operatorname{H}(X) + \operatorname{H}(Y) - \operatorname{H}(X, Y) \\
&\equiv \operatorname{H}(X, Y) - \operatorname{H}(X \mid Y) - \operatorname{H}(Y \mid X)
\end{align}</math>
where H(X) and H(Y) are the marginal entropies, H(X | Y) and H(Y | X) are the conditional entropies, and H(X, Y) is the joint entropy of X and Y.
Notice the analogy to the union, difference, and intersection of two sets: in this respect, all the formulas given above are apparent from the Venn diagram reported at the beginning of the article.
In terms of a communication channel in which the output Y is a noisy version of the input X, these relations are summarised in the figure:

Because I(X; Y) is non-negative, it follows that H(X) ≥ H(X | Y). Here we give the detailed deduction of I(X; Y) = H(Y) − H(Y | X) for the case of jointly discrete random variables:
:<math>\begin{align}
\operatorname{I}(X; Y) &= \sum_{x, y} P_{(X,Y)}(x, y) \log \frac{P_{(X,Y)}(x, y)}{P_X(x)\, P_Y(y)} \\
&= \sum_{x, y} P_{(X,Y)}(x, y) \log \frac{P_{(X,Y)}(x, y)}{P_X(x)} - \sum_{x, y} P_{(X,Y)}(x, y) \log P_Y(y) \\
&= \sum_{x, y} P_X(x)\, P_{Y \mid X = x}(y) \log P_{Y \mid X = x}(y) - \sum_{y} P_Y(y) \log P_Y(y) \\
&= -\sum_{x} P_X(x) \operatorname{H}(Y \mid X = x) + \operatorname{H}(Y) \\
&= \operatorname{H}(Y) - \operatorname{H}(Y \mid X).
\end{align}</math>
The proofs of the other identities above are similar. The proof of the general case (not just discrete) is similar, with integrals replacing sums.
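These identities are straightforward to confirm numerically. The sketch below assumes a hypothetical joint probability table, computes the marginal, conditional, and joint entropies directly from their definitions, and checks that all four expressions for I(X; Y) agree; the helper names are arbitrary.
<syntaxhighlight lang="python">
import numpy as np

# Hypothetical joint PMF of (X, Y) as a 2-D array (rows: x, columns: y).
p_xy = np.array([[0.10, 0.20, 0.15],
                 [0.25, 0.05, 0.25]])

def H(p):
    """Shannon entropy (in nats) of a probability array, with 0*log(0) taken as 0."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# Conditional entropies from the conditional distributions,
# e.g. H(Y|X) = sum_x P_X(x) * H(Y | X = x).
H_y_given_x = sum(p_x[i] * H(p_xy[i, :] / p_x[i]) for i in range(p_xy.shape[0]))
H_x_given_y = sum(p_y[j] * H(p_xy[:, j] / p_y[j]) for j in range(p_xy.shape[1]))

# All four expressions below give the same mutual information I(X;Y), in nats.
print(H(p_x) - H_x_given_y)                      # H(X) - H(X|Y)
print(H(p_y) - H_y_given_x)                      # H(Y) - H(Y|X)
print(H(p_x) + H(p_y) - H(p_xy))                 # H(X) + H(Y) - H(X,Y)
print(H(p_xy) - H_x_given_y - H_y_given_x)       # H(X,Y) - H(X|Y) - H(Y|X)
</syntaxhighlight>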
Intuitively, if entropy H(Y) is regarded as a measure of uncertainty about a random variable, then H(Y | X) is a measure of what X does ''not'' say about Y. This is "the amount of uncertainty remaining about Y after X is known", and thus the right side of the second of these equalities can be read as "the amount of uncertainty in Y, minus the amount of uncertainty in Y which remains after X is known", which is equivalent to "the amount of uncertainty in Y which is removed by knowing X". This corroborates the intuitive meaning of mutual information as the amount of information (that is, reduction in uncertainty) that knowing either variable provides about the other.
Note that in the discrete case H(X | X) = 0 and therefore H(X) = I(X; X). Thus I(X; X) ≥ I(X; Y), and one can formulate the basic principle that a variable contains at least as much information about itself as any other variable can provide.
Relation to Kullback–Leibler divergence
For jointly discrete or jointly continuous pairs (X, Y), mutual information is the Kullback–Leibler divergence of the joint distribution P_(X,Y) from the product of the marginal distributions, P_X · P_Y, that is,
:<math>\operatorname{I}(X; Y) = D_{\mathrm{KL}}\left( P_{(X,Y)} \parallel P_X P_Y \right).</math>
Furthermore, let P_(X|Y=y)(x) = P_(X,Y)(x, y) / P_Y(y) be the conditional mass or density function. Then, we have the identity
:<math>\operatorname{I}(X; Y) = \mathbb{E}_Y \left[ D_{\mathrm{KL}}\!\left( P_{X \mid Y} \parallel P_X \right) \right].</math>
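This identity can be checked on a discrete example. The sketch below (again with a made-up joint table) averages the Kullback–Leibler divergence of each conditional distribution P_(X|Y=y) from P_X, weighted by P_Y(y), and compares the result with the value obtained directly from the definition.
<syntaxhighlight lang="python">
import numpy as np
from scipy.special import rel_entr

# Hypothetical joint PMF of (X, Y): rows index x, columns index y.
p_xy = np.array([[0.10, 0.20, 0.15],
                 [0.25, 0.05, 0.25]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# E_Y[ D_KL( P_(X|Y) || P_X ) ]: average over y of the divergence of each
# conditional column P_(X|Y=y) from the marginal P_X, weighted by P_Y(y).
mi_from_identity = sum(
    p_y[j] * rel_entr(p_xy[:, j] / p_y[j], p_x).sum()
    for j in range(p_xy.shape[1])
)

# Direct definition: D_KL( P_(X,Y) || P_X P_Y ).
mi_direct = rel_entr(p_xy, np.outer(p_x, p_y)).sum()

print(mi_from_identity, mi_direct)   # the two values agree (up to rounding)
</syntaxhighlight>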