In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.

The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome. As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable.

The Shannon information is closely related to ''entropy'', which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average". This is the average amount of self-information an observer would expect to gain about a random variable when measuring it.

The information content can be expressed in various units of information, of which the most common is the "bit" (more formally called the ''shannon''), as explained below. The term 'perplexity' has been used in language modelling to quantify the uncertainty inherent in a set of prospective events.


Definition

Claude Shannon's definition of self-information was chosen to meet several axioms:
1. An event with probability 100% is perfectly unsurprising and yields no information.
2. The less probable an event is, the more surprising it is and the more information it yields.
3. If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.

The detailed derivation is below, but it can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given a real number b > 1 and an event x with probability P, the information content is defined as follows:
\mathrm{I}(x) := -\log_b\left[\Pr(x)\right] = -\log_b(P).
The base ''b'' corresponds to the scaling factor above. Different choices of ''b'' correspond to different units of information: when b = 2, the unit is the shannon (symbol Sh), often called a 'bit'; when b = e, the unit is the natural unit of information (symbol nat); and when b = 10, the unit is the hartley (symbol Hart).

Formally, given a discrete random variable X with probability mass function p_X(x), the self-information of measuring X as outcome x is defined as
\operatorname{I}_X(x) := -\log\left[p_X(x)\right] = \log\left(\frac{1}{p_X(x)}\right).
The use of the notation I_X(x) for self-information above is not universal. Since the notation I(X;Y) is also often used for the related quantity of mutual information, many authors use a lowercase h_X(x) for self-entropy instead, mirroring the use of the capital H(X) for the entropy.
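As a minimal illustration of the definition, the following Python sketch computes the information content of an event from its probability in any of the three common units. The function name and structure are illustrative and not from the source.

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Information content -log_b(p) of an event with probability p.

    base=2 gives shannons (bits), base=math.e gives nats, base=10 gives hartleys.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return 0.0 if p == 1.0 else -math.log(p, base)

print(self_information(1.0))           # 0.0 Sh: a certain event carries no information
print(self_information(0.5))           # 1.0 Sh
print(self_information(1 / 6))         # ~2.585 Sh
print(self_information(0.5, math.e))   # ~0.693 nat
print(self_information(0.5, 10))       # ~0.301 Hart
```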


Properties


Monotonically decreasing function of probability

For a given probability space, the measurement of rarer events is intuitively more "surprising", and yields more information content, than that of more common events. Thus, self-information is a strictly decreasing monotonic function of the probability, sometimes called an "antitonic" function.

While standard probabilities are represented by real numbers in the interval [0, 1], self-informations are represented by extended real numbers in the interval [0, \infty]. In particular, we have the following, for any choice of logarithmic base:
* If a particular event has a 100% probability of occurring, then its self-information is -\log(1) = 0: its occurrence is "perfectly non-surprising" and yields no information.
* If a particular event has a 0% probability of occurring, then its self-information is -\log(0) = \infty: its occurrence is "infinitely surprising".

From this, we can get a few general properties:
* Intuitively, more information is gained from observing an unexpected event; it is "surprising".
** For example, if there is a one-in-a-million chance of Alice winning the lottery, her friend Bob will gain significantly more information from learning that she won than that she lost on a given day, as the sketch below illustrates. (See also ''Lottery mathematics''.)
* This establishes an implicit relationship between the self-information of a random variable and its variance.
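To make the lottery example concrete, a short sketch (using the illustrative one-in-a-million figure from the text) compares the surprisal of winning with the surprisal of losing:

```python
import math

p_win = 1e-6                     # one-in-a-million chance that Alice wins
p_lose = 1 - p_win

info_win = -math.log2(p_win)     # ~19.93 Sh: very surprising, highly informative
info_lose = -math.log2(p_lose)   # ~0.0000014 Sh: almost no information gained

print(f"win:  {info_win:.2f} Sh")
print(f"lose: {info_lose:.8f} Sh")
```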


Relationship to log-odds

The Shannon information is closely related to the log-odds. In particular, given some event x, suppose that p(x) is the probability of x occurring, and that p(\lnot x) = 1 - p(x) is the probability of x not occurring. Then we have the following definition of the log-odds:
\text{log-odds}(x) = \log\left(\frac{p(x)}{p(\lnot x)}\right).
This can be expressed as a difference of two Shannon informations:
\text{log-odds}(x) = \mathrm{I}(\lnot x) - \mathrm{I}(x).
In other words, the log-odds can be interpreted as the level of surprise when the event ''doesn't'' happen, minus the level of surprise when the event ''does'' happen.
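A brief numerical check of this identity, using an arbitrary example probability of 0.8 (the value is illustrative, not from the source):

```python
import math

p = 0.8                              # example probability of the event
log_odds = math.log(p / (1 - p))     # log-odds in nats
surprisal_not = -math.log(1 - p)     # I(not x)
surprisal_yes = -math.log(p)         # I(x)

# The log-odds equals the surprisal of the event not happening
# minus the surprisal of the event happening.
assert math.isclose(log_odds, surprisal_not - surprisal_yes)
print(log_odds)                      # ~1.386 nat
```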


Additivity of independent events

The information content of two independent events is the sum of each event's information content. This property is known as additivity in mathematics, and sigma additivity in particular in measure and probability theory. Consider two independent random variables X, Y with probability mass functions p_X(x) and p_Y(y) respectively. The joint probability mass function is
p_{X,Y}(x, y) = \Pr(X = x, Y = y) = p_X(x)\,p_Y(y)
because X and Y are independent. The information content of the outcome (X, Y) = (x, y) is
\begin{align}
\operatorname{I}_{X,Y}(x, y) &= -\log_2\left[p_{X,Y}(x, y)\right] = -\log_2\left[p_X(x)\,p_Y(y)\right] \\
&= -\log_2\left[p_X(x)\right] - \log_2\left[p_Y(y)\right] \\
&= \operatorname{I}_X(x) + \operatorname{I}_Y(y).
\end{align}
See ''Two independent, identically distributed dice'' below for an example.

The corresponding property for likelihoods is that the log-likelihood of independent events is the sum of the log-likelihoods of each event. Interpreting log-likelihood as "support" or negative surprisal (the degree to which an event supports a given model: a model is supported by an event to the extent that the event is unsurprising, given the model), this states that independent events add support: the information that the two events together provide for statistical inference is the sum of their independent information.
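The additivity property can be checked numerically. The sketch below uses two hypothetical independent events with probabilities 1/8 and 1/2 (values chosen for illustration only):

```python
import math

p_a, p_b = 1 / 8, 1 / 2              # probabilities of two independent events
info_a = -math.log2(p_a)             # 3 Sh
info_b = -math.log2(p_b)             # 1 Sh
info_joint = -math.log2(p_a * p_b)   # surprisal of both events occurring

# For independent events, the joint surprisal is the sum of the individual surprisals.
assert math.isclose(info_joint, info_a + info_b)
print(info_joint)                    # 4.0 Sh
```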


Relationship to entropy

The Shannon entropy of the random variable X above is defined as
\begin{align}
\Eta(X) &= \sum_x -p_X(x)\log p_X(x) \\
&= \sum_x p_X(x)\operatorname{I}_X(x) \\
&= \operatorname{E}\left[\operatorname{I}_X(X)\right],
\end{align}
by definition equal to the expected information content of measurement of X. The expectation is taken over the discrete values in its support.

Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies \Eta(X) = \operatorname{I}(X; X), where \operatorname{I}(X; X) is the mutual information of X with itself.

For continuous random variables the corresponding concept is differential entropy.
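The following sketch computes the entropy of a discrete distribution directly as the probability-weighted average of the self-information of each outcome. The function name and the three-outcome distribution are illustrative assumptions, not from the source:

```python
import math

def entropy_bits(pmf: dict) -> float:
    """Shannon entropy H(X) = E[I_X(X)], in shannons, of a discrete pmf."""
    return sum(p * -math.log2(p) for p in pmf.values() if p > 0)

# A biased three-outcome example distribution (values chosen for illustration).
pmf = {"a": 0.5, "b": 0.25, "c": 0.25}
print(entropy_bits(pmf))   # 1.5 Sh: the expected surprisal of one draw
```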


Notes

This measure has also been called surprisal, as it represents the "surprise" of seeing the outcome (a highly improbable outcome is very surprising). This term (as a log-probability measure) was introduced by Edward W. Samson in his 1951 report "Fundamental natural concepts of information theory". An early appearance in the physics literature is in Myron Tribus' 1961 book ''Thermostatics and Thermodynamics: An Introduction to Energy, Information and States of Matter, with Engineering Applications'' (D. Van Nostrand, New York), pp. 64–6.

When the event is a random realization (of a variable), the self-information of the variable is defined as the expected value of the self-information of the realization.


Examples


Fair coin toss

Consider the Bernoulli trial of tossing a fair coin X. The probabilities of the events of the coin landing as heads \text{H} and tails \text{T} (see fair coin and obverse and reverse) are one half each, p_X(\text{H}) = p_X(\text{T}) = \tfrac{1}{2} = 0.5. Upon measuring the variable as heads, the associated information gain is
\operatorname{I}_X(\text{H}) = -\log_2 p_X(\text{H}) = -\log_2\!\tfrac{1}{2} = 1,
so the information gain of a fair coin landing as heads is 1 shannon. Likewise, the information gain of measuring tails \text{T} is
\operatorname{I}_X(\text{T}) = -\log_2 p_X(\text{T}) = -\log_2\!\tfrac{1}{2} = 1 \text{ Sh}.


Fair die roll

Suppose we have a fair six-sided die. The value of a die roll is a discrete uniform random variable X \sim \mathrm{DU}[1, 6] with probability mass function
p_X(k) = \begin{cases} \frac{1}{6}, & k \in \{1, 2, 3, 4, 5, 6\} \\ 0, & \text{otherwise.} \end{cases}
The probability of rolling a 4 is p_X(4) = \frac{1}{6}, as for any other valid roll. The information content of rolling a 4 is thus
\operatorname{I}_X(4) = -\log_2 p_X(4) = -\log_2\!\tfrac{1}{6} \approx 2.585 \text{ Sh}
of information.


Two independent, identically distributed dice

Suppose we have two independent, identically distributed random variables X, Y \sim \mathrm{DU}[1, 6], each corresponding to an independent fair 6-sided die roll. The joint distribution of X and Y is
\begin{align}
p_{X,Y}(x, y) &= \Pr(X = x, Y = y) = p_X(x)\,p_Y(y) \\
&= \begin{cases} \frac{1}{36}, & x, y \in [1, 6] \cap \mathbb{Z} \\ 0, & \text{otherwise.} \end{cases}
\end{align}
The information content of the random variate (X, Y) = (2, 4) is
\operatorname{I}_{X,Y}(2, 4) = -\log_2\!\tfrac{1}{36} = \log_2 36 = 2\log_2 6 \approx 5.169925 \text{ Sh},
and can also be calculated by additivity of events:
\operatorname{I}_{X,Y}(2, 4) = -\log_2\!\left[p_X(2)\,p_Y(4)\right] = -\log_2\!\tfrac{1}{6} - \log_2\!\tfrac{1}{6} = 2\log_2 6 \approx 5.169925 \text{ Sh}.
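The same figure can be checked by brute-force enumeration of the 36 equally likely ordered outcomes (a sketch for verification, not part of the source):

```python
import math
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 ordered (x, y) pairs
p = sum(1 for o in outcomes if o == (2, 4)) / len(outcomes)

print(p)                # 1/36 ≈ 0.02778
print(-math.log2(p))    # ≈ 5.1699 Sh, matching 2 * log2(6)
```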


Information from frequency of rolls

If we receive information about the value of the dice without knowledge of which die had which value, we can formalize the approach with so-called counting variables
C_k := \delta_k(X) + \delta_k(Y) = \begin{cases} 0, & \neg\,(X = k \vee Y = k) \\ 1, & X = k\, \veebar\, Y = k \\ 2, & X = k\, \wedge\, Y = k \end{cases}
for k \in \{1, 2, 3, 4, 5, 6\}. Then \sum_{k=1}^{6} C_k = 2 and the counts have the multinomial distribution
\begin{align}
f(c_1, \ldots, c_6) &= \Pr(C_1 = c_1 \text{ and } \dots \text{ and } C_6 = c_6) \\
&= \begin{cases} \displaystyle{2 \choose c_1, \ldots, c_6} \frac{1}{36}, & \text{when } \sum_{i=1}^{6} c_i = 2 \\ 0, & \text{otherwise} \end{cases} \\
&= \begin{cases} \frac{1}{18}, & \text{when two of the } c_k \text{ equal } 1 \\ \frac{1}{36}, & \text{when exactly one } c_k = 2 \\ 0, & \text{otherwise.} \end{cases}
\end{align}
To verify this, the 6 outcomes (X, Y) \in \left\{(k, k)\right\}_{k=1}^{6} = \left\{(1,1), (2,2), (3,3), (4,4), (5,5), (6,6)\right\} correspond to the event C_k = 2 and have a total probability of \tfrac{6}{36} = \tfrac{1}{6}. These are the only outcomes that are faithfully preserved without knowing which die rolled which value, because both values are the same. Without knowledge to distinguish the dice rolling the other numbers, the other \binom{6}{2} = 15 combinations correspond to one die rolling one number and the other die rolling a different number, each having probability \tfrac{1}{18}. Indeed, 6 \cdot \tfrac{1}{36} + 15 \cdot \tfrac{1}{18} = 1, as required.

Unsurprisingly, the information content of learning that both dice were rolled as the same particular number is more than the information content of learning that one die was one number and the other was a different number. Take for example the events A_k = \{(X, Y) = (k, k)\} and B_{j,k} = \{c_j = 1\} \cap \{c_k = 1\} for j \ne k, 1 \leq j, k \leq 6. For example, A_2 = \{X = 2 \text{ and } Y = 2\} and B_{3,4} = \{(3, 4), (4, 3)\}. The information contents are
\operatorname{I}(A_2) = -\log_2\!\tfrac{1}{36} \approx 5.169925 \text{ Sh}, \qquad \operatorname{I}\left(B_{3,4}\right) = -\log_2\!\tfrac{1}{18} \approx 4.169925 \text{ Sh}.
Let \text{Same} = \bigcup_{i=1}^{6} A_i be the event that both dice rolled the same value and \text{Diff} = \overline{\text{Same}} be the event that the dice differed. Then \Pr(\text{Same}) = \tfrac{1}{6} and \Pr(\text{Diff}) = \tfrac{5}{6}. The information contents of the events are
\operatorname{I}(\text{Same}) = -\log_2\!\tfrac{1}{6} \approx 2.5849625 \text{ Sh}, \qquad \operatorname{I}(\text{Diff}) = -\log_2\!\tfrac{5}{6} \approx 0.2630344 \text{ Sh}.
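These probabilities and information contents can be verified by enumerating all 36 ordered outcomes (an illustrative check, not part of the source):

```python
import math
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))                       # 36 ordered (x, y) pairs

p_same = sum(1 for x, y in outcomes if x == y) / 36                    # 6/36 = 1/6
p_diff = 1 - p_same                                                    # 30/36 = 5/6
p_pair_22 = sum(1 for o in outcomes if o == (2, 2)) / 36               # 1/36  (event A_2)
p_unordered_34 = sum(1 for o in outcomes if set(o) == {3, 4}) / 36     # 2/36 = 1/18  (event B_{3,4})

print(-math.log2(p_pair_22))       # ≈ 5.1699 Sh
print(-math.log2(p_unordered_34))  # ≈ 4.1699 Sh
print(-math.log2(p_same))          # ≈ 2.5850 Sh
print(-math.log2(p_diff))          # ≈ 0.2630 Sh
```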


Information from sum of dice

The probability mass or density function (collectively probability measure) of the sum of two independent random variables is the convolution of each probability measure. In the case of independent fair 6-sided dice rolls, the random variable Z = X + Y has probability mass function p_Z(z) = p_X(x) * p_Y(y), where * represents the discrete convolution. The outcome Z = 5 has probability p_Z(5) = \frac{4}{36} = \frac{1}{9}. Therefore, the information asserted is
\operatorname{I}_Z(5) = -\log_2\!\tfrac{1}{9} = \log_2 9 \approx 3.169925 \text{ Sh}.
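A small sketch computes the pmf of the sum Z = X + Y by enumerating and tallying outcomes (equivalent to the discrete convolution of the two uniform pmfs) and then evaluates the surprisal of Z = 5:

```python
import math
from collections import Counter
from itertools import product

# Tally the sum of every ordered pair of fair die rolls.
pmf_z = Counter(x + y for x, y in product(range(1, 7), repeat=2))
total = sum(pmf_z.values())        # 36

p_z5 = pmf_z[5] / total            # 4/36 = 1/9
print(p_z5)                        # ≈ 0.1111
print(-math.log2(p_z5))            # ≈ 3.1699 Sh
```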


General discrete uniform distribution

Generalizing the example above, consider a general discrete uniform random variable (DURV) X \sim \mathrm{DU}[a, b],\ a, b \in \mathbb{Z},\ b \ge a. For convenience, define N := b - a + 1. The probability mass function is
p_X(k) = \begin{cases} \frac{1}{N}, & k \in [a, b] \cap \mathbb{Z} \\ 0, & \text{otherwise.} \end{cases}
In general, the values of the DURV need not be integers, or for the purposes of information theory even uniformly spaced; they need only be equiprobable. The information gain of any observation X = k is
\operatorname{I}_X(k) = -\log_2\!\tfrac{1}{N} = \log_2 N \text{ Sh}.


Special case: constant random variable

If b = a above, X degenerates to a constant random variable with probability distribution deterministically given by X = b and probability measure the Dirac measure p_X(k) = \delta_b(k). The only value X can take is deterministically b, so the information content of any measurement of X is
\operatorname{I}_X(b) = -\log_2 1 = 0.
In general, there is no information gained from measuring a known value.


Categorical distribution

Generalizing all of the above cases, consider a categorical discrete random variable with support \mathcal{S} = \bigl\{s_i\bigr\}_{i=1}^{N} and probability mass function given by
p_X(k) = \begin{cases} p_i, & k = s_i \in \mathcal{S} \\ 0, & \text{otherwise.} \end{cases}
For the purposes of information theory, the values s \in \mathcal{S} do not have to be numbers; they can be any mutually exclusive events on a measure space of finite measure that has been normalized to a probability measure p. Without loss of generality, we can assume the categorical distribution is supported on the set [N] = \left\{1, 2, \ldots, N\right\}; the mathematical structure is isomorphic in terms of probability theory and therefore information theory as well. The information of the outcome X = x is given by
\operatorname{I}_X(x) = -\log_2 p_X(x).
From these examples, it is possible to calculate the information of any set of independent DRVs with known distributions by additivity.


Derivation

By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero. Only when the advance knowledge of the content of the message by the receiver is less than 100% certain does the message actually convey information. For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin:
''Weather forecast for tonight: dark.'' ''Continued dark overnight, with widely scattered light by morning.''
Assuming that one does not reside near the polar regions, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.

Accordingly, the amount of self-information contained in a message conveying the occurrence of an event \omega_n depends only on the probability of that event:
\operatorname I(\omega_n) = f(\operatorname P(\omega_n))
for some function f(\cdot) to be determined below. If \operatorname P(\omega_n) = 1, then \operatorname I(\omega_n) = 0. If \operatorname P(\omega_n) < 1, then \operatorname I(\omega_n) > 0.

Further, by definition, the measure of self-information is nonnegative and additive. If a message informing of event C is the intersection of two independent events A and B, then the information of event C occurring is that of the compound message of both independent events A and B occurring. The quantity of information of the compound message C would be expected to equal the sum of the amounts of information of the individual component messages A and B respectively:
\operatorname I(C) = \operatorname I(A \cap B) = \operatorname I(A) + \operatorname I(B).
Because of the independence of events A and B, the probability of event C is
\operatorname P(C) = \operatorname P(A \cap B) = \operatorname P(A) \cdot \operatorname P(B).
However, applying the function f(\cdot) results in
\begin{align}
\operatorname I(C) & = \operatorname I(A) + \operatorname I(B) \\
f(\operatorname P(C)) & = f(\operatorname P(A)) + f(\operatorname P(B)) \\
& = f\big(\operatorname P(A) \cdot \operatorname P(B)\big).
\end{align}
Thanks to work on Cauchy's functional equation, the only monotone functions f(\cdot) having the property that f(x \cdot y) = f(x) + f(y) are the logarithm functions \log_b(x). The only operational difference between logarithms of different bases is that of different scaling constants, so we may assume
f(x) = K \log(x),
where \log is the natural logarithm. Since the probabilities of events are always between 0 and 1 and the information associated with these events must be nonnegative, this requires that K < 0.

Taking into account these properties, the self-information \operatorname I(\omega_n) associated with outcome \omega_n with probability \operatorname P(\omega_n) is defined as:
\operatorname I(\omega_n) = -\log(\operatorname P(\omega_n)) = \log \left(\frac{1}{\operatorname P(\omega_n)} \right).
The smaller the probability of event \omega_n, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of \operatorname I(\omega_n) is the shannon. This is the most common practice. When using the natural logarithm of base e, the unit will be the nat. For the base 10 logarithm, the unit of information is the hartley.

As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a coin would be 4 shannons (probability 1/16), and the information content associated with getting a result other than the one specified would be ~0.09 shannons (probability 15/16). See above for detailed examples.
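The closing illustration can be reproduced numerically (a sketch; the figures are those quoted above):

```python
import math

p_all_heads = (1 / 2) ** 4       # probability of one specific sequence of 4 fair tosses
p_other = 1 - p_all_heads        # probability of any other result

print(-math.log2(p_all_heads))   # 4.0 Sh
print(-math.log2(p_other))       # ≈ 0.093 Sh
```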


See also

* Kolmogorov complexity
* Surprisal analysis


References


Further reading

* C.E. Shannon, "A Mathematical Theory of Communication", ''Bell System Technical Journal'', Vol. 27, pp. 379–423 (Part I), 1948.


External links


* Examples of surprisal measures
* Bayesian Theory of Surprise: http://ilab.usc.edu/surprise/