In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but with particular mathematical advantages in the setting of information theory. The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome. As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable. The Shannon information is closely related to ''entropy'', which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average". This is the average amount of self-information an observer would expect to gain about a random variable when measuring it. The information content can be expressed in various units of information, of which the most common is the "bit" (more correctly called the ''shannon''), as explained below.


Definition

Claude Shannon's definition of self-information was chosen to meet several axioms:
# An event with probability 100% is perfectly unsurprising and yields no information.
# The less probable an event is, the more surprising it is and the more information it yields.
# If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.
The detailed derivation is below, but it can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given a real number b > 1 and an event x with probability P, the information content is defined as
\mathrm{I}(x) := -\log_b[\Pr(x)] = -\log_b(P).
The base ''b'' corresponds to the scaling factor above. Different choices of ''b'' correspond to different units of information: when b = 2, the unit is the shannon (symbol Sh), often called a 'bit'; when b = e, the unit is the natural unit of information (symbol nat); and when b = 10, the unit is the hartley (symbol Hart).

Formally, given a random variable X with probability mass function p_X(x), the self-information of measuring X as outcome x is defined as
\operatorname{I}_X(x) := -\log[p_X(x)] = \log\left(\frac{1}{p_X(x)}\right).
The use of the notation I_X(x) for self-information above is not universal. Since the notation I(X;Y) is also often used for the related quantity of mutual information, many authors use a lowercase h_X(x) for self-entropy instead, mirroring the use of the capital H(X) for the entropy.
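As a minimal sketch of this definition, the following Python helper (the function name and unit labels are illustrative, not from any standard library) converts an event probability into its information content in a chosen unit:

```python
import math

# Logarithm bases for the common units of information.
_BASES = {"shannon": 2, "nat": math.e, "hartley": 10}

def self_information(p: float, unit: str = "shannon") -> float:
    """Information content -log_b(p) of an event with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p == 0.0:
        return math.inf   # an impossible event is "infinitely surprising"
    if p == 1.0:
        return 0.0        # a certain event carries no information
    return -math.log(p, _BASES[unit])

print(self_information(0.5))            # 1.0 shannon (bit)
print(self_information(0.5, "nat"))     # ~0.693 nat
print(self_information(1 / 1_000_000))  # ~19.93 Sh: rare events are very surprising
```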


Properties


Monotonically decreasing function of probability

For a given probability space, measurements of rarer events are intuitively more "surprising", and yield more information content, than more common values. Thus, self-information is a strictly decreasing monotonic function of probability, sometimes called an "antitonic" function.

While standard probabilities are represented by real numbers in the interval [0, 1], self-informations are represented by extended real numbers in the interval [0, \infty]. In particular, we have the following, for any choice of logarithmic base:
* If a particular event has a 100% probability of occurring, then its self-information is -\log(1) = 0: its occurrence is "perfectly non-surprising" and yields no information.
* If a particular event has a 0% probability of occurring, then its self-information is -\log(0) = \infty: its occurrence is "infinitely surprising".
From this, we can get a few general properties:
* Intuitively, more information is gained from observing an unexpected event; it is "surprising".
** For example, if there is a one-in-a-million chance of Alice winning the lottery, her friend Bob will gain significantly more information from learning that she won than that she lost on a given day. (See also ''Lottery mathematics''.)
* This establishes an implicit relationship between the self-information of a random variable and its variance.


Relationship to log-odds

The Shannon information is closely related to the log-odds. In particular, given some event x, suppose that p(x) is the probability of x occurring, and that p(\lnot x) = 1 - p(x) is the probability of x not occurring. Then we have the following definition of the log-odds:
\text{log-odds}(x) = \log\left(\frac{p(x)}{p(\lnot x)}\right)
This can be expressed as a difference of two Shannon informations:
\text{log-odds}(x) = \mathrm{I}(\lnot x) - \mathrm{I}(x)
In other words, the log-odds can be interpreted as the level of surprise when the event ''doesn't'' happen, minus the level of surprise when the event ''does'' happen.
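A quick numerical check of this identity (a sketch; the probability 0.8 and the helper name are arbitrary choices for illustration):

```python
import math

def info(p: float) -> float:
    """Self-information in shannons (bits)."""
    return -math.log2(p)

p = 0.8  # assumed probability of the event x
log_odds = math.log2(p / (1 - p))

# The log-odds equal the surprise of "not x" minus the surprise of "x".
assert math.isclose(log_odds, info(1 - p) - info(p))
print(log_odds)  # ~2 bits, since 0.8 / 0.2 = 4 and log2(4) = 2
```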


Additivity of independent events

The information content of two independent events is the sum of each event's information content. This property is known as additivity in mathematics, and sigma additivity in particular in measure and probability theory. Consider two independent random variables X, Y with probability mass functions p_X(x) and p_Y(y) respectively. The joint probability mass function is
p_{X,Y}(x, y) = \Pr(X = x,\, Y = y) = p_X(x)\, p_Y(y)
because X and Y are independent. The information content of the outcome (X, Y) = (x, y) is
\begin{align}
\operatorname{I}_{X,Y}(x, y) &= -\log_2\left[p_{X,Y}(x, y)\right] = -\log_2\left[p_X(x)\, p_Y(y)\right] \\
&= -\log_2\left[p_X(x)\right] - \log_2\left[p_Y(y)\right] \\
&= \operatorname{I}_X(x) + \operatorname{I}_Y(y).
\end{align}
See ''Two independent, identically distributed dice'' below for an example.

The corresponding property for likelihoods is that the log-likelihood of independent events is the sum of the log-likelihoods of each event. Interpreting log-likelihood as "support" or negative surprisal (the degree to which an event supports a given model: a model is supported by an event to the extent that the event is unsurprising, given the model), this states that independent events add support: the information that the two events together provide for statistical inference is the sum of their independent information.
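A short sketch of this property, checking additivity for one pair of independent outcomes (the specific probabilities, a biased coin and a fair die, are illustrative assumptions):

```python
import math

def info(p: float) -> float:
    """Self-information in shannons (bits)."""
    return -math.log2(p)

# Two independent events: a biased coin landing heads and a fair die showing a 3.
p_heads = 0.3
p_three = 1 / 6

p_joint = p_heads * p_three  # independence: the joint probability is the product
assert math.isclose(info(p_joint), info(p_heads) + info(p_three))
print(info(p_joint))  # ~4.32 Sh = ~1.74 Sh + ~2.58 Sh
```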


Relationship to entropy

The Shannon entropy of the random variable X above is defined as
\begin{align}
\Eta(X) &= \sum_{x} -p_X(x) \log p_X(x) \\
&= \sum_{x} p_X(x) \operatorname{I}_X(x) \\
&= \operatorname{E}[\operatorname{I}_X(X)],
\end{align}
by definition equal to the expected information content of measurement of X. The expectation is taken over the discrete values over its support.

Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies \Eta(X) = \operatorname{I}(X; X), where \operatorname{I}(X; X) is the mutual information of X with itself. For continuous random variables the corresponding concept is differential entropy.
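A minimal sketch of this relationship, computing the entropy of a small toy distribution (an assumed example, not from the article) as the probability-weighted average of the self-informations:

```python
import math

# An assumed toy distribution over three outcomes.
pmf = {"a": 0.5, "b": 0.25, "c": 0.25}

def info(p: float) -> float:
    """Self-information in shannons."""
    return -math.log2(p)

# Entropy = expected self-information over the support.
entropy = sum(p * info(p) for p in pmf.values())
print(entropy)  # 1.5 Sh: 0.5*1 + 0.25*2 + 0.25*2
```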


Notes

This measure has also been called surprisal, as it represents the "surprise" of seeing the outcome (a highly improbable outcome is very surprising). This term (as a log-probability measure) was coined by Myron Tribus in his 1961 book ''Thermostatics and Thermodynamics''. See R. B. Bernstein and R. D. Levine (1972), "Entropy and Chemical Change. I. Characterization of Product (and Reactant) Energy Distributions in Reactive Molecular Collisions: Information and Entropy Deficiency", ''The Journal of Chemical Physics'' 57, 434–44; and Myron Tribus (1961), ''Thermostatics and Thermodynamics: An Introduction to Energy, Information and States of Matter, with Engineering Applications'' (D. Van Nostrand, New York), pp. 64–6.

When the event is a random realization (of a variable), the self-information of the variable is defined as the expected value of the self-information of the realization. Self-information is an example of a proper scoring rule.


Examples


Fair coin toss

Consider the Bernoulli trial of tossing a fair coin X. The probabilities of the events of the coin landing as heads \text{H} and tails \text{T} (see fair coin and obverse and reverse) are one half each, p_X(\text{H}) = p_X(\text{T}) = \tfrac{1}{2} = 0.5. Upon measuring the variable as heads, the associated information gain is
\operatorname{I}_X(\text{H}) = -\log_2 p_X(\text{H}) = -\log_2\!\tfrac{1}{2} = 1,
so the information gain of a fair coin landing as heads is 1 shannon. Likewise, the information gain of measuring tails \text{T} is
\operatorname{I}_X(\text{T}) = -\log_2 p_X(\text{T}) = -\log_2\!\tfrac{1}{2} = 1 \text{ Sh}.


Fair die roll

Suppose we have a fair six-sided die. The value of a die roll is a discrete uniform random variable X \sim \mathrm{DU}[1, 6] with probability mass function
p_X(k) = \begin{cases} \tfrac{1}{6}, & k \in \{1, 2, 3, 4, 5, 6\} \\ 0, & \text{otherwise.} \end{cases}
The probability of rolling a 4 is p_X(4) = \tfrac{1}{6}, as for any other valid roll. The information content of rolling a 4 is thus
\operatorname{I}_X(4) = -\log_2 p_X(4) = -\log_2\!\tfrac{1}{6} \approx 2.585 \text{ Sh}
of information.
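A one-line numerical check of this value, a sketch using Python's standard math module:

```python
import math

# Information content, in shannons, of any single outcome of a fair six-sided die.
print(-math.log2(1 / 6))  # ~2.585 Sh, i.e. log2(6)
```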


Two independent, identically distributed dice

Suppose we have two independent, identically distributed random variables X, Y \sim \mathrm{DU}[1, 6], each corresponding to an independent fair six-sided die roll. The joint distribution of X and Y is
\begin{align}
p_{X,Y}(x, y) &= \Pr(X = x,\, Y = y) = p_X(x)\, p_Y(y) \\
&= \begin{cases} \tfrac{1}{36}, & x, y \in [1, 6] \cap \mathbb{Z} \\ 0, & \text{otherwise.} \end{cases}
\end{align}
The information content of the random variate (X, Y) = (2, 4) is
\begin{align}
\operatorname{I}_{X,Y}(2, 4) &= -\log_2\!\left[p_{X,Y}(2, 4)\right] = \log_2 36 = 2\log_2 6 \\
&\approx 5.169925 \text{ Sh},
\end{align}
and can also be calculated by additivity of events:
\begin{align}
\operatorname{I}_{X,Y}(2, 4) &= -\log_2\!\left[p_{X,Y}(2, 4)\right] = -\log_2\!\tfrac{1}{6} - \log_2\!\tfrac{1}{6} \\
&= 2\log_2 6 \\
&\approx 5.169925 \text{ Sh}.
\end{align}
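A short sketch verifying that the two routes give the same number:

```python
import math

p_joint = 1 / 36  # probability of the specific ordered pair (2, 4)

via_joint = -math.log2(p_joint)
via_additivity = -math.log2(1 / 6) - math.log2(1 / 6)

assert math.isclose(via_joint, via_additivity)
print(via_joint)  # ~5.169925 Sh
```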


Information from frequency of rolls

If we receive information about the value of the dice without knowledge of which die had which value, we can formalize the approach with so-called counting variables
C_k := \delta_k(X) + \delta_k(Y) = \begin{cases} 0, & \neg\,(X = k \vee Y = k) \\ 1, & X = k\, \veebar\, Y = k \\ 2, & X = k\, \wedge\, Y = k \end{cases}
for k \in \{1, 2, 3, 4, 5, 6\}; then \sum_{k=1}^{6} C_k = 2 and the counts have the multinomial distribution
\begin{align}
f(c_1, \ldots, c_6) &= \Pr(C_1 = c_1 \text{ and } \dots \text{ and } C_6 = c_6) \\
&= \begin{cases} \dfrac{2!}{c_1! c_2! c_3! c_4! c_5! c_6!} \left(\dfrac{1}{6}\right)^2, & \text{when } \sum_{i=1}^6 c_i = 2 \\ 0, & \text{otherwise} \end{cases} \\
&= \begin{cases} \dfrac{1}{18}, & \text{when two of the } c_k \text{ equal } 1 \\ \dfrac{1}{36}, & \text{when exactly one } c_k = 2 \\ 0, & \text{otherwise.} \end{cases}
\end{align}
To verify this, the 6 outcomes (X, Y) \in \{(k, k)\}_{k=1}^{6} = \{(1,1), (2,2), (3,3), (4,4), (5,5), (6,6)\} correspond to the event C_k = 2, with a total probability of \tfrac{1}{6}. These are the only outcomes that are faithfully preserved without knowing which die rolled which value, because both values are the same. Without knowledge to distinguish the dice rolling the other numbers, the other \binom{6}{2} = 15 combinations correspond to one die rolling one number and the other die rolling a different number, each having probability \tfrac{1}{18}. Indeed, 6 \cdot \tfrac{1}{36} + 15 \cdot \tfrac{1}{18} = 1, as required.

Unsurprisingly, the information content of learning that both dice were rolled as the same particular number is greater than the information content of learning that one die was one number and the other was a different number. Take for example the events A_k = \{X = Y = k\} and B_{j,k} = \{c_j = 1\} \cap \{c_k = 1\} for j \ne k, 1 \leq j, k \leq 6. For example, A_2 = \{X = 2 \text{ and } Y = 2\} and B_{3,4} = \{(3, 4), (4, 3)\}. The information contents are
\operatorname{I}(A_2) = -\log_2\!\tfrac{1}{36} = 5.169925 \text{ Sh}
\operatorname{I}(B_{3,4}) = -\log_2\!\tfrac{1}{18} = 4.169925 \text{ Sh}.
Let \text{Same} = \bigcup_{i=1}^{6} A_i be the event that both dice rolled the same value and \text{Diff} = \overline{\text{Same}} be the event that the dice differed. Then \Pr(\text{Same}) = \tfrac{1}{6} and \Pr(\text{Diff}) = \tfrac{5}{6}. The information contents of the events are
\operatorname{I}(\text{Same}) = -\log_2\!\tfrac{1}{6} = 2.5849625 \text{ Sh}
\operatorname{I}(\text{Diff}) = -\log_2\!\tfrac{5}{6} = 0.2630344 \text{ Sh}.
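The following sketch enumerates all 36 ordered outcomes of the two dice and confirms the probabilities and information contents quoted above:

```python
import math
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely ordered pairs

p_same = sum(1 for x, y in outcomes if x == y) / 36   # 6/36  = 1/6
p_diff = 1 - p_same                                   # 30/36 = 5/6

def info(p: float) -> float:
    return -math.log2(p)  # self-information in shannons

print(info(1 / 36))   # ~5.170 Sh: both dice show one particular value, e.g. A_2
print(info(1 / 18))   # ~4.170 Sh: two particular distinct values in either order, e.g. B_{3,4}
print(info(p_same))   # ~2.585 Sh: the dice agree
print(info(p_diff))   # ~0.263 Sh: the dice differ
```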


Information from sum of two dice

The probability mass or density function (collectively the probability measure) of the sum of two independent random variables is the convolution of each probability measure. In the case of independent fair six-sided die rolls, the random variable Z = X + Y has probability mass function
p_Z(z) = p_X(x) * p_Y(y) = \frac{6 - |z - 7|}{36},
where * represents the discrete convolution. The outcome Z = 5 has probability p_Z(5) = \frac{4}{36} = \frac{1}{9}. Therefore, the information asserted is
\operatorname{I}_Z(5) = -\log_2\!\tfrac{1}{9} = \log_2 9 \approx 3.169925 \text{ Sh}.
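A sketch that computes the distribution of the sum by discrete convolution and checks the quoted value:

```python
import math

die = [1 / 6] * 6  # pmf of one fair die over faces 1..6

# Discrete convolution: pmf of Z = X + Y over the sums 2..12.
p_Z = {z: 0.0 for z in range(2, 13)}
for x in range(1, 7):
    for y in range(1, 7):
        p_Z[x + y] += die[x - 1] * die[y - 1]

print(p_Z[5])              # 4/36 = 1/9
print(-math.log2(p_Z[5]))  # ~3.169925 Sh
```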


General discrete uniform distribution

Generalizing the example above, consider a general discrete uniform random variable (DURV) X \sim \mathrm{DU}[a, b] with a, b \in \mathbb{Z} and b \ge a. For convenience, define N := b - a + 1. The probability mass function is
p_X(k) = \begin{cases} \tfrac{1}{N}, & k \in [a, b] \cap \mathbb{Z} \\ 0, & \text{otherwise.} \end{cases}
In general, the values of the DURV need not be integers, or, for the purposes of information theory, even uniformly spaced; they need only be equiprobable. The information gain of any observation X = k is
\operatorname{I}_X(k) = -\log_2\!\tfrac{1}{N} = \log_2 N \text{ Sh}.


Special case: constant random variable

If b = a above, X degenerates to a constant random variable with probability distribution deterministically given by X = b and probability measure the Dirac measure p_X(k) = \delta_b(k). The only value X can take is deterministically b, so the information content of any measurement of X is
\operatorname{I}_X(b) = -\log_2 1 = 0.
In general, there is no information gained from measuring a known value.


Categorical distribution

Generalizing all of the above cases, consider a categorical discrete random variable with support \mathcal{S} = \bigl\{s_i\bigr\}_{i=1}^{N} and probability mass function given by
p_X(k) = \begin{cases} p_i, & k = s_i \in \mathcal{S} \\ 0, & \text{otherwise.} \end{cases}
For the purposes of information theory, the values s \in \mathcal{S} do not have to be numbers; they can be any mutually exclusive events on a measure space of finite measure that has been normalized to a probability measure p. Without loss of generality, we can assume the categorical distribution is supported on the set [N] = \left\{1, 2, \ldots, N\right\}; the mathematical structure is isomorphic in terms of probability theory and therefore information theory as well. The information of the outcome X = x is given by
\operatorname{I}_X(x) = -\log_2 p_X(x).
From these examples, it is possible to calculate the information content of any set of independent DRVs with known distributions by additivity.
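A sketch of this general case with an assumed, made-up categorical distribution over non-numeric outcomes (the outcome labels and probabilities are purely illustrative):

```python
import math

# An assumed categorical distribution over weather outcomes.
pmf = {"sunny": 0.6, "cloudy": 0.3, "rain": 0.1}

def info(outcome: str) -> float:
    """Self-information of an outcome, in shannons."""
    return -math.log2(pmf[outcome])

for outcome in pmf:
    print(outcome, round(info(outcome), 3))
# sunny  0.737 Sh  (most likely, least surprising)
# cloudy 1.737 Sh
# rain   3.322 Sh  (least likely, most surprising)
```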


Derivation

By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero. Only when the advance knowledge of the content of the message by the receiver is less than 100% certain does the message actually convey information. For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin: "''Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning.''" Assuming that one does not reside near the polar regions, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.

Accordingly, the amount of self-information contained in a message conveying content informing an occurrence of event \omega_n depends only on the probability of that event:
\operatorname{I}(\omega_n) = f(\operatorname{P}(\omega_n))
for some function f(\cdot) to be determined below. If \operatorname{P}(\omega_n) = 1, then \operatorname{I}(\omega_n) = 0. If \operatorname{P}(\omega_n) < 1, then \operatorname{I}(\omega_n) > 0.

Further, by definition, the measure of self-information is nonnegative and additive. If a message informing of event C is the intersection of two independent events A and B, then the information of event C occurring is that of the compound message of both independent events A and B occurring. The quantity of information of compound message C would be expected to equal the sum of the amounts of information of the individual component messages A and B respectively:
\operatorname{I}(C) = \operatorname{I}(A \cap B) = \operatorname{I}(A) + \operatorname{I}(B).
Because of the independence of events A and B, the probability of event C is
\operatorname{P}(C) = \operatorname{P}(A \cap B) = \operatorname{P}(A) \cdot \operatorname{P}(B).
However, applying function f(\cdot) results in
\begin{align}
\operatorname{I}(C) &= \operatorname{I}(A) + \operatorname{I}(B) \\
f(\operatorname{P}(C)) &= f(\operatorname{P}(A)) + f(\operatorname{P}(B)) \\
&= f\big(\operatorname{P}(A) \cdot \operatorname{P}(B)\big)
\end{align}
Thanks to work on Cauchy's functional equation, the only monotone functions f(\cdot) satisfying f(x \cdot y) = f(x) + f(y) are the logarithm functions \log_b(x). The only operational difference between logarithms of different bases is that of different scaling constants, so we may assume f(x) = K \log(x), where \log is the natural logarithm. Since the probabilities of events are always between 0 and 1 and the information associated with these events must be nonnegative, this requires that K < 0.

Taking into account these properties, the self-information \operatorname{I}(\omega_n) associated with outcome \omega_n with probability \operatorname{P}(\omega_n) is defined as:
\operatorname{I}(\omega_n) = -\log(\operatorname{P}(\omega_n)) = \log\left(\frac{1}{\operatorname{P}(\omega_n)}\right)
The smaller the probability of event \omega_n, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of \operatorname{I}(\omega_n) is bits. This is the most common practice. When using the natural logarithm of base e, the unit will be the nat. For the base 10 logarithm, the unit of information is the hartley.

As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a coin would be 4 bits (probability 1/16), and the information content associated with getting a result other than the one specified would be ~0.09 bits (probability 15/16). See above for detailed examples.
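A sketch checking the closing illustration numerically:

```python
import math

p_all_heads = (1 / 2) ** 4   # one specific sequence of 4 fair coin tosses
p_other = 1 - p_all_heads    # any other sequence

print(-math.log2(p_all_heads))  # 4.0 bits
print(-math.log2(p_other))      # ~0.093 bits
```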


See also

* Surprisal analysis


References


Further reading

* C. E. Shannon, "A Mathematical Theory of Communication", ''Bell System Technical Journal'', Vol. 27, pp. 379–423 (Part I), 1948.


External links


* Examples of surprisal measures
* Bayesian Theory of Surprise: http://ilab.usc.edu/surprise/