Information theory is the scientific study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. The field is at the intersection of probability theory, statistics, computer science, statistical mechanics, information engineering, and electrical engineering.
A key measure in information theory is entropy. Entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process. For example, identifying the outcome of a fair coin flip (with two equally likely outcomes) provides less information (lower entropy) than specifying the outcome from a roll of a die (with six equally likely outcomes). Some other important measures in information theory are mutual information, channel capacity, error exponents, and relative entropy. Important sub-fields of information theory include source coding, algorithmic complexity theory, algorithmic information theory and information-theoretic security.
Applications of fundamental topics of information theory include source coding/data compression (e.g. for ZIP files), and channel coding/error detection and correction (e.g. for DSL). Its impact has been crucial to the success of the Voyager missions to deep space, the invention of the compact disc, the feasibility of mobile phones and the development of the Internet. The theory has also found applications in other areas, including statistical inference, cryptography, neurobiology, perception, linguistics, the evolution and function of molecular codes (bioinformatics), thermal physics, molecular dynamics, quantum computing, black holes, information retrieval, intelligence gathering, plagiarism detection, pattern recognition, anomaly detection and even art creation.
Overview
Information theory studies the transmission, processing, extraction, and utilization of information. Abstractly, information can be thought of as the resolution of uncertainty. In the case of communication of information over a noisy channel, this abstract concept was formalized in 1948 by Claude Shannon in a paper entitled ''
A Mathematical Theory of Communication'', in which information is thought of as a set of possible messages, and the goal is to send these messages over a noisy channel, and to have the receiver reconstruct the message with low probability of error, in spite of the channel noise. Shannon's main result, the
noisy-channel coding theorem, showed that, in the limit of many channel uses, the rate of information that is asymptotically achievable is equal to the channel capacity, a quantity dependent merely on the statistics of the channel over which the messages are sent.
Coding theory is concerned with finding explicit methods, called ''codes'', for increasing the efficiency and reducing the error rate of data communication over noisy channels to near the channel capacity. These codes can be roughly subdivided into data compression (source coding) and
error-correction (channel coding) techniques. In the latter case, it took many years to find the methods Shannon's work proved were possible.
A third class of information theory codes is cryptographic algorithms (both
codes and
ciphers). Concepts, methods and results from coding theory and information theory are widely used in cryptography and
cryptanalysis. ''See the article
ban (unit) for a historical application.''
Historical background
The landmark event ''establishing'' the discipline of information theory and bringing it to immediate worldwide attention was the publication of Claude E. Shannon's classic paper "A Mathematical Theory of Communication" in the ''
Bell System Technical Journal'' in July and October 1948.
Prior to this paper, limited information-theoretic ideas had been developed at
Bell Labs, all implicitly assuming events of equal probability.
Harry Nyquist's 1924 paper, ''Certain Factors Affecting Telegraph Speed'', contains a theoretical section quantifying "intelligence" and the "line speed" at which it can be transmitted by a communication system, giving the relation W = K log m (recalling the Boltzmann constant), where ''W'' is the speed of transmission of intelligence, ''m'' is the number of different voltage levels to choose from at each time step, and ''K'' is a constant.
Ralph Hartley's 1928 paper, ''Transmission of Information'', uses the word ''information'' as a measurable quantity, reflecting the receiver's ability to distinguish one sequence of symbols from any other, thus quantifying information as H = log S^n = n log S, where ''S'' was the number of possible symbols, and ''n'' the number of symbols in a transmission. The unit of information was therefore the decimal digit, which since has sometimes been called the hartley in his honor as a unit, scale, or measure of information.
Alan Turing in 1940 used similar ideas as part of the statistical analysis of the breaking of the German Second World War Enigma ciphers.
Much of the mathematics behind information theory with events of different probabilities was developed for the field of thermodynamics by Ludwig Boltzmann and J. Willard Gibbs. Connections between information-theoretic entropy and thermodynamic entropy, including the important contributions by Rolf Landauer in the 1960s, are explored in ''Entropy in thermodynamics and information theory''.
In Shannon's revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with the assertion:
:"The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point."
With it came the ideas of
* the information entropy and redundancy of a source, and its relevance through the source coding theorem;
* the mutual information, and the channel capacity of a noisy channel, including the promise of perfect loss-free communication given by the noisy-channel coding theorem;
* the practical result of the Shannon–Hartley law for the channel capacity of a Gaussian channel; as well as
* the bit: a new way of seeing the most fundamental unit of information.
Quantities of information
Information theory is based on probability theory and statistics, where quantified information is usually described in terms of bits. Information theory often concerns itself with measures of information of the distributions associated with random variables. One of the most important measures is called entropy, which forms the building block of many other measures. Entropy allows quantification of the amount of information in a single random variable. Another useful concept is mutual information, defined on two random variables, which describes the amount of information shared between those variables and can be used to describe their correlation. The former quantity is a property of the probability distribution of a random variable and gives a limit on the rate at which data generated by independent samples with the given distribution can be reliably compressed. The latter is a property of the joint distribution of two random variables, and is the maximum rate of reliable communication across a noisy
channel in the limit of long block lengths, when the channel statistics are determined by the joint distribution.
The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. A common unit of information is the bit, based on the binary logarithm. Other units include the nat, which is based on the natural logarithm, and the decimal digit, which is based on the common logarithm.
In what follows, an expression of the form p \log p is considered by convention to be equal to zero whenever p = 0. This is justified because \lim_{p \rightarrow 0^+} p \log p = 0 for any logarithmic base.
Entropy of an information source
Based on the probability mass function of each source symbol to be communicated, the Shannon entropy, in units of bits (per symbol), is given by
: H = - \sum_i p_i \log_2 (p_i)
where p_i is the probability of occurrence of the i-th possible value of the source symbol. This equation gives the entropy in the units of "bits" (per symbol) because it uses a logarithm of base 2, and this base-2 measure of entropy has sometimes been called the shannon in his honor. Entropy is also commonly computed using the natural logarithm (base e, where e is Euler's number), which produces a measurement of entropy in nats per symbol and sometimes simplifies the analysis by avoiding the need to include extra constants in the formulas. Other bases are also possible, but less commonly used. For example, a logarithm of base 2^8 = 256 will produce a measurement in bytes per symbol, and a logarithm of base 10 will produce a measurement in decimal digits (or hartleys) per symbol.
Intuitively, the entropy H(X) of a discrete random variable X is a measure of the amount of ''uncertainty'' associated with the value of X when only its distribution is known.
The entropy of a source that emits a sequence of ''N'' symbols that are independent and identically distributed (iid) is N·H bits (per message of ''N'' symbols). If the source data symbols are identically distributed but not independent, the entropy of a message of length ''N'' will be less than N·H.
If one transmits 1000 bits (0s and 1s), and the value of each of these bits is known to the receiver (has a specific value with certainty) ahead of transmission, it is clear that no information is transmitted. If, however, each bit is independently equally likely to be 0 or 1, 1000 shannons of information (more often called bits) have been transmitted. Between these two extremes, information can be quantified as follows. If
\mathbb{X} is the set of all messages \{x_1, \dots, x_n\} that could be, and p(x) is the probability of some x \in \mathbb{X}, then the entropy, H, of X is defined:
: H(X) = \mathbb{E}_X [I(x)] = -\sum_{x \in \mathbb{X}} p(x) \log p(x)
(Here, I(x) is the self-information, which is the entropy contribution of an individual message, and \mathbb{E}_X is the expected value.) A property of entropy is that it is maximized when all the messages in the message space are equiprobable, p(x) = 1/n; i.e., most unpredictable, in which case H(X) = \log n.
The special case of information entropy for a random variable with two outcomes is the binary entropy function, usually taken to the logarithmic base 2, thus having the shannon (Sh) as unit:
: H_{\mathrm b}(p) = - p \log_2 p - (1-p) \log_2 (1-p)
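These formulas translate directly into code. The sketch below (Python standard library only; the function names are illustrative, not from any particular package) applies the convention that terms with p = 0 contribute nothing:

```python
import math

def shannon_entropy(probs, base=2):
    """Entropy of a distribution, in units set by `base` (2 = bits/shannons)."""
    # Terms with p = 0 contribute 0, by the convention p log p -> 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def binary_entropy(p):
    """H_b(p) = -p log2 p - (1-p) log2 (1-p), in shannons."""
    return shannon_entropy([p, 1 - p])

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(shannon_entropy([1/6] * 6))    # fair die: ~2.585 bits
print(binary_entropy(0.11))          # biased coin: ~0.5 bits
```

Passing a different base gives the entropy in nats (base e) or hartleys (base 10) instead of shannons.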
Joint entropy
The ''joint entropy'' of two discrete random variables X and Y is merely the entropy of their pairing: (X, Y). This implies that if X and Y are independent, then their joint entropy is the sum of their individual entropies.
For example, if (X, Y) represents the position of a chess piece, with X the row and Y the column, then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece.
: H(X, Y) = \mathbb{E}_{X,Y} [-\log p(x, y)] = - \sum_{x, y} p(x, y) \log p(x, y)
Despite similar notation, joint entropy should not be confused with ''cross entropy''.
Conditional entropy (equivocation)
The ''conditional entropy'' or ''conditional uncertainty'' of X given random variable Y (also called the ''equivocation'' of X about Y) is the average conditional entropy over Y:
: H(X \mid Y) = \mathbb{E}_Y [H(X \mid y)] = -\sum_{y \in \mathbb{Y}} p(y) \sum_{x \in \mathbb{X}} p(x \mid y) \log p(x \mid y) = -\sum_{x, y} p(x, y) \log p(x \mid y)
Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that:
: H(X \mid Y) = H(X, Y) - H(Y)
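The identity above is easy to verify numerically. The following sketch uses a small hypothetical joint distribution (chosen only for illustration; Python standard library) to compute H(X | Y) both directly and as H(X,Y) - H(Y):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A hypothetical joint distribution p(x, y) over X in {0, 1}, Y in {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

H_XY = H(p_xy.values())   # joint entropy H(X, Y)
p_y = {y: sum(p for (x, yy), p in p_xy.items() if yy == y) for y in (0, 1)}
H_Y = H(p_y.values())     # marginal entropy H(Y)

# Conditional entropy directly: -sum p(x,y) log2 p(x|y), with p(x|y) = p(x,y)/p(y).
H_X_given_Y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items())

print(H_X_given_Y, H_XY - H_Y)   # the two values agree
```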
Mutual information (transinformation)
''Mutual information'' measures the amount of information that can be obtained about one random variable by observing another. It is important in communication where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of X relative to Y is given by:
: I(X; Y) = \mathbb{E}_{X,Y} [SI(x, y)] = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
where SI (''S''pecific mutual ''I''nformation) is the pointwise mutual information.
A basic property of the mutual information is that
: I(X; Y) = H(X) - H(X \mid Y)
That is, knowing ''Y'', we can save an average of I(X; Y) bits in encoding ''X'' compared to not knowing ''Y''.
Mutual information is
symmetric:
: I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y)
Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the
posterior probability distribution of ''X'' given the value of ''Y'' and the prior distribution on ''X'':
: I(X; Y) = \mathbb{E}_{p(y)} \bigl[ D_{\mathrm{KL}}\bigl( p(X \mid Y = y) \,\|\, p(X) \bigr) \bigr]
In other words, this is a measure of how much, on the average, the probability distribution on ''X'' will change if we are given the value of ''Y''. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:
: I(X; Y) = D_{\mathrm{KL}}\bigl( p(X, Y) \,\|\, p(X)\, p(Y) \bigr)
Mutual information is closely related to the
log-likelihood ratio test in the context of contingency tables and the
multinomial distribution and to
Pearson's χ² test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.
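As a numerical sketch in the same style (a toy joint distribution, used only for illustration; Python standard library), mutual information computed from the pointwise definition agrees with the entropy identity above:

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) and its marginals.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {x: sum(p for (xx, y), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (x, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# Pointwise definition: sum over p(x,y) log2( p(x,y) / (p(x) p(y)) ).
I_def = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

# Entropy identity: I(X;Y) = H(X) + H(Y) - H(X,Y).
I_id = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())

print(I_def, I_id)   # both give the same value, in shannons (bits)
```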
Kullback–Leibler divergence (information gain)
The ''
Kullback–Leibler divergence'' (or ''information divergence'', ''information gain'', or ''relative entropy'') is a way of comparing two distributions: a "true" probability distribution p(X), and an arbitrary probability distribution q(X). If we compress data in a manner that assumes q(X) is the distribution underlying some data, when, in reality, p(X) is the correct distribution, the Kullback–Leibler divergence is the number of average additional bits per datum necessary for compression. It is thus defined
: D_{\mathrm{KL}}\bigl( p(X) \,\|\, q(X) \bigr) = \sum_{x \in X} -p(x) \log q(x) \, - \, \sum_{x \in X} -p(x) \log p(x) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}
Although it is sometimes used as a 'distance metric', KL divergence is not a true
metric since it is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric).
Another interpretation of the KL divergence is the "unnecessary surprise" introduced by a prior from the truth: suppose a number ''X'' is about to be drawn randomly from a discrete set with probability distribution p(x). If Alice knows the true distribution p(x), while Bob believes (has a prior) that the distribution is q(x), then Bob will be more surprised than Alice, on average, upon seeing the value of ''X''. The KL divergence is the (objective) expected value of Bob's (subjective) surprisal minus Alice's surprisal, measured in bits if the ''log'' is in base 2. In this way, the extent to which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him.
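The compression reading of the KL divergence can be checked numerically: coding with a wrong distribution q costs, on average, the cross entropy of p and q, which exceeds the entropy of p by exactly the divergence. A minimal sketch with illustrative toy distributions (Python standard library only):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # true distribution
q = [1/3, 1/3, 1/3]     # assumed (wrong) distribution

# Average code length under q, minus the entropy of p, equals D_KL(p || q).
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
entropy = -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(kl_divergence(p, q), cross_entropy - entropy)  # identical values
# Note the asymmetry: D_KL(p||q) != D_KL(q||p) in general.
print(kl_divergence(q, p))
```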
Directed information
''Directed information'', I(X^n \to Y^n), is an information theory measure that quantifies the information flow from the random process X^n = \{X_1, X_2, \dots, X_n\} to the random process Y^n = \{Y_1, Y_2, \dots, Y_n\}. The term ''directed information'' was coined by James Massey and is defined as
: I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1}),
where I(X^i; Y_i \mid Y^{i-1}) is the conditional mutual information I(X_1, X_2, \ldots, X_i; Y_i \mid Y_1, Y_2, \ldots, Y_{i-1}).
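For instance, for n = 2 the definition unrolls (a small worked expansion; note that I(X^1; Y_1 \mid Y^0) is just I(X_1; Y_1)) to
: I(X^2 \to Y^2) = I(X_1; Y_1) + I(X_1, X_2; Y_2 \mid Y_1).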
Unlike mutual information, directed information is not symmetric. The quantity I(X^n \to Y^n) measures the information bits that are transmitted causally from X^n to Y^n. Directed information has many applications in problems where causality plays an important role, such as the capacity of channels with feedback, the capacity of discrete memoryless networks with feedback, gambling with causal side information, compression with causal side information, real-time control communication settings, and statistical physics.
Other quantities
Other important information theoretic quantities include Rényi entropy (a generalization of entropy), differential entropy (a generalization of quantities of information to continuous distributions), and the conditional mutual information.
Coding theory
Coding theory is one of the most important and direct applications of information theory. It can be subdivided into source coding theory and channel coding theory. Using a statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source.
* Data compression (source coding): There are two formulations for the compression problem:
** lossless data compression: the data must be reconstructed exactly;
** lossy data compression: allocates bits needed to reconstruct the data, within a specified fidelity level measured by a distortion function. This subset of information theory is called ''rate–distortion theory''.
* Error-correcting codes (channel coding): While data compression removes as much redundancy as possible, an error-correcting code adds just the right kind of redundancy (i.e., error correction) needed to transmit the data efficiently and faithfully across a noisy channel; a toy example follows this list.
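As a toy illustration of channel coding (the simplest possible example, not a construction discussed in this article), the 3-fold repetition code adds redundancy so that any single flipped bit in a block of three is corrected by majority vote. A minimal Python sketch over a binary symmetric channel:

```python
import random

def encode(bits):
    """Repetition code R3: send each bit three times."""
    return [b for bit in bits for b in (bit, bit, bit)]

def noisy_channel(bits, flip_prob, rng):
    """Binary symmetric channel: flip each bit independently with flip_prob."""
    return [b ^ (rng.random() < flip_prob) for b in bits]

def decode(bits):
    """Majority vote over each block of three received bits."""
    return [int(sum(bits[i:i + 3]) >= 2) for i in range(0, len(bits), 3)]

rng = random.Random(0)
message = [rng.randint(0, 1) for _ in range(20)]
received = decode(noisy_channel(encode(message), 0.1, rng))
print(sum(m != r for m, r in zip(message, received)), "errors after decoding")
```

The repetition code pays a heavy price in rate (1/3); the noisy-channel coding theorem guarantees that far more efficient codes exist at any rate below the channel capacity.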
This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems that justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the
broadcast channel) or intermediary "helpers" (the relay channel), or more general networks, compression followed by transmission may no longer be optimal.
Source theory
Any process that generates successive messages can be considered a ''source'' of information. A memoryless source is one in which each message is an independent identically distributed random variable, whereas the properties of ergodicity and stationarity impose less restrictive constraints. All such sources are stochastic. These terms are well studied in their own right outside information theory.
Rate
Information ''rate'' is the average entropy per symbol. For memoryless sources, this is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is
: r = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, X_{n-3}, \ldots);
that is, the conditional entropy of a symbol given all the previous symbols generated. For the more general case of a process that is not necessarily stationary, the ''average rate'' is
: r = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n);
that is, the limit of the joint entropy per symbol. For stationary sources, these two expressions give the same result.
Information rate is then defined as
: r = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n).
It is common in information theory to speak of the "rate" or "entropy" of a language. This is appropriate, for example, when the source of information is English prose. The rate of a source of information is related to its redundancy and how well it can be compressed, the subject of ''source coding''.
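For a stationary Markov source, the conditional-entropy limit above reduces to H(X_n | X_{n-1}), which can be computed directly from the transition matrix. The following is a minimal sketch under that assumption (a hypothetical two-state chain; Python standard library only):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical two-state Markov chain: P[i][j] = Pr(next = j | current = i).
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution pi solves pi = pi P; for a 2-state chain it is
# proportional to (P[1][0], P[0][1]).
pi = [P[1][0] / (P[0][1] + P[1][0]), P[0][1] / (P[0][1] + P[1][0])]

# Entropy rate r = sum_i pi_i * H(row i) = H(X_n | X_{n-1}).
rate = sum(pi_i * H(row) for pi_i, row in zip(pi, P))
print(rate, "bits per symbol")   # lower than H(pi): memory reduces the rate
```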
Channel capacity
Communications over a channel is the primary motivation of information theory. However, channels often fail to produce exact reconstruction of a signal; noise, periods of silence, and other forms of signal corruption often degrade quality.
Consider the communications process over a discrete channel. A simple model of the process is shown below:
: [Diagram: information source → transmitter (encoder) → channel (subject to noise) → receiver (decoder) → destination]