Information theory is the mathematical study of the quantification, storage, and communication of information. The field was established and formalized by Claude Shannon in the 1940s, though early contributions were made in the 1920s through the works of Harry Nyquist and Ralph Hartley. It is at the intersection of electronic engineering, mathematics, statistics, computer science, neurobiology, physics, and electrical engineering.
A key measure in information theory is entropy. Entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process. For example, identifying the outcome of a fair coin flip (which has two equally likely outcomes) provides less information (lower entropy, less uncertainty) than identifying the outcome from a roll of a die (which has six equally likely outcomes). Some other important measures in information theory are mutual information, channel capacity, error exponents, and relative entropy. Important sub-fields of information theory include source coding, algorithmic complexity theory, algorithmic information theory and information-theoretic security.
Applications of fundamental topics of information theory include source coding/data compression (e.g. for ZIP files), and channel coding/error detection and correction (e.g. for DSL). Its impact has been crucial to the success of the Voyager missions to deep space, the invention of the compact disc, the feasibility of mobile phones and the development of the Internet and artificial intelligence.
The theory has also found applications in other areas, including statistical inference, cryptography, neurobiology, perception, signal processing, linguistics, the evolution and function of molecular codes (bioinformatics), thermal physics, molecular dynamics, black holes, quantum computing, information retrieval, intelligence gathering, plagiarism detection, pattern recognition, anomaly detection, the analysis of music, art creation, imaging system design, the study of outer space, the dimensionality of space, and epistemology.
Overview
Information theory studies the transmission, processing, extraction, and utilization of information. Abstractly, information can be thought of as the resolution of uncertainty. In the case of communication of information over a noisy channel, this abstract concept was formalized in 1948 by Claude Shannon in a paper entitled ''A Mathematical Theory of Communication'', in which information is thought of as a set of possible messages, and the goal is to send these messages over a noisy channel, and to have the receiver reconstruct the message with low probability of error, in spite of the channel noise. Shannon's main result, the noisy-channel coding theorem, showed that, in the limit of many channel uses, the rate of information that is asymptotically achievable is equal to the channel capacity, a quantity dependent merely on the statistics of the channel over which the messages are sent.
Coding theory is concerned with finding explicit methods, called ''codes'', for increasing the efficiency and reducing the error rate of data communication over noisy channels to near the channel capacity. These codes can be roughly subdivided into data compression (source coding) and
error-correction (channel coding) techniques. In the latter case, it took many years to find the methods Shannon's work proved were possible.
A third class of information theory codes are cryptographic algorithms (both codes and ciphers). Concepts, methods and results from coding theory and information theory are widely used in cryptography and cryptanalysis, such as the ban, a unit of information.
Historical background
The landmark event ''establishing'' the discipline of information theory and bringing it to immediate worldwide attention was the publication of Claude E. Shannon's classic paper "A Mathematical Theory of Communication" in the ''
Bell System Technical Journal'' in July and October 1948. Historian
James Gleick rated the paper as the most important development of 1948, noting that the paper was "even more profound and more fundamental" than the
transistor. Shannon came to be known as the "father of information theory".
Shannon outlined some of his initial ideas of information theory as early as 1939 in a letter to
Vannevar Bush.
Prior to this paper, limited information-theoretic ideas had been developed at
Bell Labs, all implicitly assuming events of equal probability.
Harry Nyquist's 1924 paper, ''Certain Factors Affecting Telegraph Speed'', contains a theoretical section quantifying "intelligence" and the "line speed" at which it can be transmitted by a communication system, giving the relation W = K \log m (recalling the Boltzmann constant), where ''W'' is the speed of transmission of intelligence, ''m'' is the number of different voltage levels to choose from at each time step, and ''K'' is a constant.
Ralph Hartley's 1928 paper, ''Transmission of Information'', uses the word ''information'' as a measurable quantity, reflecting the receiver's ability to distinguish one sequence of symbols from any other, thus quantifying information as H = \log S^n = n \log S, where ''S'' was the number of possible symbols, and ''n'' the number of symbols in a transmission. The unit of information was therefore the decimal digit, which has since sometimes been called the hartley in his honor as a unit or scale or measure of information.
Alan Turing
in 1940 used similar ideas as part of the statistical analysis of the breaking of the German Enigma ciphers during the Second World War.
Much of the mathematics behind information theory with events of different probabilities was developed for the field of thermodynamics by Ludwig Boltzmann and J. Willard Gibbs. Connections between information-theoretic entropy and thermodynamic entropy, including the important contributions by Rolf Landauer in the 1960s, are explored in ''Entropy in thermodynamics and information theory''.
In Shannon's revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with the assertion:
:"''The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point.''"
With it came the ideas of:
* the information entropy and redundancy of a source, and its relevance through the source coding theorem;
* the mutual information, and the channel capacity of a noisy channel, including the promise of perfect loss-free communication given by the noisy-channel coding theorem;
* the practical result of the
Shannon–Hartley law for the channel capacity of a
Gaussian channel; as well as
* the
bit—a new way of seeing the most fundamental unit of information.
Quantities of information
Information theory is based on
probability theory and statistics, where quantified information is usually described in terms of bits. Information theory often concerns itself with measures of information of the distributions associated with random variables. One of the most important measures is called entropy, which forms the building block of many other measures. Entropy allows quantification of the amount of information in a single random variable. Another useful concept is mutual information, defined on two random variables, which describes the amount of information the two variables have in common and can be used to describe their correlation. The former quantity is a property of the probability distribution of a random variable and gives a limit on the rate at which data generated by independent samples with the given distribution can be reliably compressed. The latter is a property of the joint distribution of two random variables, and is the maximum rate of reliable communication across a noisy
channel in the limit of long block lengths, when the channel statistics are determined by the joint distribution.
The choice of logarithmic base in the following formulae determines the
unit of information entropy that is used. A common unit of information is the bit or shannon, based on the binary logarithm. Other units include the nat, which is based on the natural logarithm, and the decimal digit, which is based on the common logarithm.
In what follows, an expression of the form p \log p is considered by convention to be equal to zero whenever p = 0. This is justified because \lim_{p \to 0^+} p \log p = 0 for any logarithmic base.
Entropy of an information source
Based on the probability mass function of each source symbol to be communicated, the Shannon entropy ''H'', in units of bits (per symbol), is given by
:H = -\sum_{i} p_i \log_2 (p_i)
where ''p_i'' is the probability of occurrence of the ''i''-th possible value of the source symbol. This equation gives the entropy in the units of "bits" (per symbol) because it uses a logarithm of base 2, and this base-2 measure of entropy has sometimes been called the shannon in his honor. Entropy is also commonly computed using the natural logarithm (base ''e'', where ''e'' is Euler's number), which produces a measurement of entropy in nats per symbol and sometimes simplifies the analysis by avoiding the need to include extra constants in the formulas. Other bases are also possible, but less commonly used. For example, a logarithm of base 2^8 = 256 will produce a measurement in bytes per symbol, and a logarithm of base 10 will produce a measurement in decimal digits (or hartleys) per symbol.
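The following minimal Python sketch (the four-symbol source distribution is invented for illustration and is not taken from the text above) computes the same entropy in bits, nats, and hartleys simply by changing the base of the logarithm:
<syntaxhighlight lang="python">
import math

def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution, in units set by the log base."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Hypothetical source emitting four symbols with these probabilities.
source = [0.5, 0.25, 0.125, 0.125]

print(entropy(source, base=2))       # 1.75   bits (shannons) per symbol
print(entropy(source, base=math.e))  # ~1.213 nats per symbol
print(entropy(source, base=10))      # ~0.527 hartleys (decimal digits) per symbol
</syntaxhighlight>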
Intuitively, the entropy H(X) of a discrete random variable X is a measure of the amount of ''uncertainty'' associated with the value of X when only its distribution is known.
The entropy of a source that emits a sequence of N symbols that are independent and identically distributed (iid) is N \cdot H bits (per message of N symbols). If the source data symbols are identically distributed but not independent, the entropy of a message of length N will be less than N \cdot H.

If one transmits 1000 bits (0s and 1s), and the value of each of these bits is known to the receiver (has a specific value with certainty) ahead of transmission, it is clear that no information is transmitted. If, however, each bit is independently equally likely to be 0 or 1, 1000 shannons of information (more often called bits) have been transmitted. Between these two extremes, information can be quantified as follows. If
\mathbb{X} is the set of all messages \{x_1, \dots, x_n\} that X could be, and p(x) is the probability of some x \in \mathbb{X}, then the entropy, H, of X is defined:
:H(X) = \mathbb{E}_X [I(x)] = -\sum_{x \in \mathbb{X}} p(x) \log p(x)
(Here, I(x) is the self-information, which is the entropy contribution of an individual message, and \mathbb{E}_X is the expected value.) A property of entropy is that it is maximized when all the messages in the message space are equiprobable, p(x) = 1/n; i.e., most unpredictable, in which case H(X) = \log n.
The special case of information entropy for a random variable with two outcomes is the binary entropy function, usually taken to the logarithmic base 2, thus having the
shannon (Sh) as unit:
:H_\mathrm{b}(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)
Joint entropy
The joint entropy of two discrete random variables X and Y is merely the entropy of their pairing: (X, Y). This implies that if X and Y are independent, then their joint entropy is the sum of their individual entropies.
For example, if (X, Y) represents the position of a chess piece—X the row and Y the column—then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece.
:H(X, Y) = \mathbb{E}_{X,Y} [-\log p(x, y)] = -\sum_{x, y} p(x, y) \log p(x, y)
Despite similar notation, joint entropy should not be confused with cross-entropy.
Conditional entropy (equivocation)
The conditional entropy or ''conditional uncertainty'' of ''X'' given random variable ''Y'' (also called the ''equivocation'' of ''X'' about ''Y'') is the average conditional entropy over ''Y'':
:H(X|Y) = \mathbb{E}_Y [H(X|y)] = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = -\sum_{x, y} p(x, y) \log p(x|y)
Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that:
:H(X|Y) = H(X, Y) - H(Y)
Mutual information (transinformation)
''Mutual information'' measures the amount of information that can be obtained about one random variable by observing another. It is important in communication where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of ''X'' relative to ''Y'' is given by:
:I(X; Y) = \mathbb{E}_{X,Y} [SI(x, y)] = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
where SI (''S''pecific mutual Information) is the
pointwise mutual information.
A basic property of the mutual information is that
:I(X; Y) = H(X) - H(X|Y)
That is, knowing ''Y'', we can save an average of I(X; Y) bits in encoding ''X'' compared to not knowing ''Y''.
Mutual information is
symmetric:
:I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y)
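These identities can be checked numerically. The sketch below (the joint distribution over two binary variables is a made-up example, not from the text) computes the joint entropy, the conditional entropy, and the mutual information, and verifies that the expectation form of I(X;Y) agrees with H(X) + H(Y) − H(X,Y):
<syntaxhighlight lang="python">
import math

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) over two binary variables.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

p_x = {x: sum(p for (a, _), p in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (_, b), p in p_xy.items() if b == y) for y in (0, 1)}

H_xy = H(p_xy.values())                 # joint entropy H(X, Y)
H_x, H_y = H(p_x.values()), H(p_y.values())
H_x_given_y = H_xy - H_y                # conditional entropy H(X | Y)
I_xy = H_x + H_y - H_xy                 # mutual information via the symmetric identity

# The same mutual information computed directly as the expectation of the
# pointwise term log p(x, y) / (p(x) p(y)):
I_direct = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

print(round(I_xy, 6), round(I_direct, 6))  # both ~0.124511 bits: the two routes agree
</syntaxhighlight>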
Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the
posterior probability distribution of ''X'' given the value of ''Y'' and the prior distribution on ''X'':
:I(X; Y) = \mathbb{E}_{p(y)} [D_\mathrm{KL}(p(X|Y=y) \parallel p(X))]
In other words, this is a measure of how much, on the average, the probability distribution on ''X'' will change if we are given the value of ''Y''. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:
:I(X; Y) = D_\mathrm{KL}(p(X, Y) \parallel p(X)\, p(Y))
Mutual information is closely related to the
log-likelihood ratio test in the context of contingency tables and the
multinomial distribution
and to
Pearson's χ2 test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.
Kullback–Leibler divergence (information gain)
The ''
Kullback–Leibler divergence'' (or ''information divergence'', ''information gain'', or ''relative entropy'') is a way of comparing two distributions: a "true" probability distribution p(X), and an arbitrary probability distribution q(X). If we compress data in a manner that assumes q(X) is the distribution underlying some data, when, in reality, p(X) is the correct distribution, the Kullback–Leibler divergence is the number of average additional bits per datum necessary for compression. It is thus defined
:D_\mathrm{KL}(p(X) \parallel q(X)) = \sum_{x \in X} -p(x) \log q(x) \, - \, \sum_{x \in X} -p(x) \log p(x) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}
Although it is sometimes used as a 'distance metric', KL divergence is not a true metric since it is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric).
Another interpretation of the KL divergence is the "unnecessary surprise" introduced by a prior from the truth: suppose a number ''X'' is about to be drawn randomly from a discrete set with probability distribution p(x). If Alice knows the true distribution p(x), while Bob believes (has a prior) that the distribution is q(x), then Bob will be more surprised than Alice, on average, upon seeing the value of ''X''. The KL divergence is the (objective) expected value of Bob's (subjective) surprisal minus Alice's surprisal, measured in bits if the ''log'' is in base 2. In this way, the extent to which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him.
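As a small illustration (the distributions ''p'' and ''q'' are invented), the sketch below evaluates the divergence both from its definition and as the coding overhead, i.e. the cross-entropy under the wrong model ''q'' minus the entropy of the true distribution ''p'', and confirms that the divergence is not symmetric:
<syntaxhighlight lang="python">
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """Average bits per symbol when coding data from p with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.7, 0.2, 0.1]      # hypothetical "true" distribution
q = [1/3, 1/3, 1/3]      # mistaken uniform model

print(kl_divergence(p, q))                          # ~0.428 extra bits per symbol
print(cross_entropy(p, q) - entropy(p))             # same number, as coding overhead
print(kl_divergence(p, q) != kl_divergence(q, p))   # True: KL divergence is not symmetric
</syntaxhighlight>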
Directed information
''Directed information'', I(X^n \to Y^n), is an information theory measure that quantifies the information flow from the random process X^n = \{X_1, X_2, \dots, X_n\} to the random process Y^n = \{Y_1, Y_2, \dots, Y_n\}. The term ''directed information'' was coined by James Massey and is defined as
:I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1}),
where I(X^i; Y_i \mid Y^{i-1}) is the conditional mutual information I(X_1, X_2, \dots, X_i; Y_i \mid Y_1, Y_2, \dots, Y_{i-1}).
In contrast to ''mutual'' information, ''directed'' information is not symmetric. The quantity I(X^n \to Y^n) measures the information bits that are transmitted causally from X^n to Y^n. Directed information has many applications in problems where causality plays an important role, such as the capacity of channels with feedback, the capacity of discrete memoryless networks with feedback, gambling with causal side information, compression with causal side information, real-time control communication settings, and statistical physics.
Other quantities
Other important information theoretic quantities include the Rényi entropy and the Tsallis entropy (generalizations of the concept of entropy), differential entropy (a generalization of quantities of information to continuous distributions), and the conditional mutual information. Also, pragmatic information has been proposed as a measure of how much information has been used in making a decision.
Coding theory

Coding theory is one of the most important and direct applications of information theory. It can be subdivided into source coding theory and channel coding theory. Using a statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source.
* Data compression (source coding): There are two formulations for the compression problem:
** lossless data compression: the data must be reconstructed exactly;
** lossy data compression: allocates bits needed to reconstruct the data, within a specified fidelity level measured by a distortion function. This subset of information theory is called ''rate–distortion theory''.
* Error-correcting codes (channel coding): While data compression removes as much redundancy as possible, an error-correcting code adds just the right kind of redundancy (i.e., error correction) needed to transmit the data efficiently and faithfully across a noisy channel.
This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems that justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel) or intermediary "helpers" (the relay channel), or more general networks, compression followed by transmission may no longer be optimal.
Source theory
Any process that generates successive messages can be considered a source of information. A memoryless source is one in which each message is an independent identically distributed random variable, whereas the properties of ergodicity and stationarity impose less restrictive constraints. All such sources are stochastic. These terms are well studied in their own right outside information theory.
Rate
Information ''rate'' is the average entropy per symbol. For memoryless sources, this is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is:
:r = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, X_{n-3}, \ldots);
that is, the conditional entropy of a symbol given all the previous symbols generated. For the more general case of a process that is not necessarily stationary, the ''average rate'' is:
:r = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \dots, X_n);
that is, the limit of the joint entropy per symbol. For stationary sources, these two expressions give the same result.
The information rate is defined as:
:
It is common in information theory to speak of the "rate" or "entropy" of a language. This is appropriate, for example, when the source of information is English prose. The rate of a source of information is related to its redundancy and how well it can be compressed, the subject of source coding.
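For a concrete example of an entropy rate, the following sketch (the two-state Markov source and its transition probabilities are hypothetical) evaluates the conditional-entropy limit above, which for a first-order Markov chain reduces to the entropy of the next symbol averaged over the stationary distribution:
<syntaxhighlight lang="python">
import math

def H(probs):
    """Entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical two-state Markov source: transition[i][j] = P(next = j | current = i).
transition = [[0.9, 0.1],
              [0.4, 0.6]]

# Stationary distribution pi solves pi = pi * P; for a 2-state chain it is
# proportional to (P[1][0], P[0][1]).
pi = [transition[1][0], transition[0][1]]
pi = [x / sum(pi) for x in pi]          # [0.8, 0.2]

# Entropy rate: average over states of the entropy of the outgoing transition row.
rate = sum(pi[i] * H(transition[i]) for i in range(2))

print(rate)        # ~0.57 bits per symbol
print(H(pi))       # ~0.72 bits: the iid entropy of the marginal, which is higher
</syntaxhighlight>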
Channel capacity
Communications over a channel is the primary motivation of information theory. However, channels often fail to produce exact reconstruction of a signal; noise, periods of silence, and other forms of signal corruption often degrade quality.
Consider the communications process over a discrete channel. A simple model of the process is shown below:
:Message → Transmitter (encoder) → Channel p(y|x) → Receiver (decoder) → Estimated message
Here ''X'' represents the space of messages transmitted, and ''Y'' the space of messages received during a unit time over our channel. Let p(y|x) be the conditional probability distribution function of ''Y'' given ''X''. We will consider p(y|x) to be an inherent fixed property of our communications channel (representing the nature of the ''noise'' of our channel). Then the joint distribution of ''X'' and ''Y'' is completely determined by our channel and by our choice of f(x), the marginal distribution of messages we choose to send over the channel. Under these constraints, we would like to maximize the rate of information, or the ''signal'', we can communicate over the channel. The appropriate measure for this is the mutual information, and this maximum mutual information is called the channel capacity and is given by:
:C = \sup_{f} I(X; Y)
This capacity has the following property related to communicating at information rate ''R'' (where ''R'' is usually bits per symbol). For any information rate ''R'' < ''C'' and coding error ''ε'' > 0, for large enough ''N'', there exists a code of length ''N'' and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ''ε''; that is, it is always possible to transmit with arbitrarily small block error. In addition, for any rate ''R'' > ''C'', it is impossible to transmit with arbitrarily small block error.
''Channel coding'' is concerned with finding such nearly optimal codes that can be used to transmit data over a noisy channel with a small coding error at a rate near the channel capacity.
Capacity of particular channel models
* A continuous-time analog communications channel subject to Gaussian noise—see Shannon–Hartley theorem.
* A binary symmetric channel (BSC) with crossover probability ''p'' is a binary input, binary output channel that flips the input bit with probability ''p''. The BSC has a capacity of 1 - H_\mathrm{b}(p) bits per channel use, where H_\mathrm{b} is the binary entropy function to the base-2 logarithm.
* A binary erasure channel (BEC) with erasure probability ''p'' is a binary input, ternary output channel. The possible channel outputs are 0, 1, and a third symbol 'e' called an erasure. The erasure represents complete loss of information about an input bit. The capacity of the BEC is 1 - p bits per channel use (a numerical sketch of both capacities is given below).
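The sketch below (parameter values chosen only for illustration) evaluates these two capacity formulas; for instance, a BSC with crossover probability 0.11 has a capacity of roughly 0.5 bits per channel use, so by the noisy-channel coding theorem any rate below about 0.5 is achievable with arbitrarily small error while higher rates are not:
<syntaxhighlight lang="python">
import math

def binary_entropy(p):
    """H_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1 - binary_entropy(p)

def bec_capacity(p):
    """Capacity of a binary erasure channel with erasure probability p."""
    return 1 - p

print(bsc_capacity(0.11))  # ~0.50 bits per channel use
print(bsc_capacity(0.5))   # 0.0: the output is independent of the input
print(bec_capacity(0.25))  # 0.75 bits per channel use
</syntaxhighlight>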
Channels with memory and directed information
In practice many channels have memory. Namely, at time i the channel is given by the conditional probability P(y_i \mid x_i, x_{i-1}, \ldots, x_1, y_{i-1}, \ldots, y_1). It is often more convenient to use the notation x^i = (x_i, x_{i-1}, \ldots, x_1), so that the channel becomes P(y_i \mid x^i, y^{i-1}). In such a case the capacity is given by the mutual information rate when there is no feedback available, and by the directed information rate whether or not there is feedback (if there is no feedback the directed information equals the mutual information).
Fungible information
Fungible information is the information for which the means of encoding is not important. Classical information theorists and computer scientists are mainly concerned with information of this sort. It is sometimes referred to as speakable information.
Applications to other fields
Intelligence uses and secrecy applications
Information theoretic concepts apply to cryptography and cryptanalysis. Turing's information unit, the ban, was used in the Ultra project, breaking the German Enigma machine code and hastening the end of World War II in Europe. Shannon himself defined an important concept now called the unicity distance. Based on the redundancy of the plaintext, it attempts to give a minimum amount of ciphertext necessary to ensure unique decipherability.
Information theory leads us to believe it is much more difficult to keep secrets than it might first appear. A brute force attack can break systems based on asymmetric key algorithms or on most commonly used methods of symmetric key algorithms (sometimes called secret key algorithms), such as block ciphers. The security of all such methods comes from the assumption that no known attack can break them in a practical amount of time.
Information theoretic security refers to methods such as the one-time pad that are not vulnerable to such brute force attacks. In such cases, the positive conditional mutual information between the plaintext and ciphertext (conditioned on the key) can ensure proper transmission, while the unconditional mutual information between the plaintext and ciphertext remains zero, resulting in absolutely secure communications. In other words, an eavesdropper would not be able to improve his or her guess of the plaintext by gaining knowledge of the ciphertext but not of the key. However, as in any other cryptographic system, care must be used to correctly apply even information-theoretically secure methods; the Venona project was able to crack the one-time pads of the Soviet Union due to their improper reuse of key material.
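A toy one-bit example (not a usable cipher; the biased message distribution is invented) illustrates why the one-time pad is information-theoretically secure: when the key is uniform and used once, the mutual information between plaintext and ciphertext is zero, even though the plaintext itself is highly predictable:
<syntaxhighlight lang="python">
import math

def mutual_information(p_xy):
    """I(X;Y) in bits from a dict {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

# Hypothetical one-bit message distribution (biased, so it carries structure)
# and a uniform one-bit key; ciphertext c = m XOR k.
p_m = {0: 0.9, 1: 0.1}
p_k = {0: 0.5, 1: 0.5}

joint_mc = {}  # joint distribution of (message, ciphertext)
for m, pm in p_m.items():
    for k, pk in p_k.items():
        c = m ^ k
        joint_mc[(m, c)] = joint_mc.get((m, c), 0) + pm * pk

print(mutual_information(joint_mc))  # ~0.0 bits: the ciphertext alone reveals nothing about m
</syntaxhighlight>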
Pseudorandom number generation
Pseudorandom number generators are widely available in computer language libraries and application programs. They are, almost universally, unsuited to cryptographic use as they do not evade the deterministic nature of modern computer equipment and software. A class of improved random number generators is termed cryptographically secure pseudorandom number generators, but even they require random seeds external to the software to work as intended. These can be obtained via extractors, if done carefully. The measure of sufficient randomness in extractors is min-entropy, a value related to Shannon entropy through Rényi entropy; Rényi entropy is also used in evaluating randomness in cryptographic systems. Although related, the distinctions among these measures mean that a random variable with high Shannon entropy is not necessarily satisfactory for use in an extractor, and so for cryptographic use.
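The gap between these measures can be made concrete with a small sketch (the skewed distribution is invented for illustration): the Shannon entropy of the source looks comfortably high, while the min-entropy, which is what an extractor or a key-guessing adversary actually cares about, is only one bit:
<syntaxhighlight lang="python">
import math

def shannon_entropy(probs):
    """H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def min_entropy(probs):
    """H_min(X) = -log2 of the probability of the single most likely outcome."""
    return -math.log2(max(probs))

# Hypothetical randomness source: one outcome occurs half the time,
# the remaining mass is spread uniformly over 512 other outcomes.
probs = [0.5] + [0.5 / 512] * 512

print(shannon_entropy(probs))  # 5.5 bits: looks like plenty of randomness
print(min_entropy(probs))      # 1.0 bit: an attacker guesses right half the time
</syntaxhighlight>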
Seismic exploration
One early commercial application of information theory was in the field of seismic oil exploration. Work in this field made it possible to strip off and separate the unwanted noise from the desired seismic signal. Information theory and digital signal processing offer a major improvement of resolution and image clarity over previous analog methods.
Semiotics
Semioticians Doede Nauta and Winfried Nöth both considered Charles Sanders Peirce as having created a theory of information in his works on semiotics. Nauta defined semiotic information theory as the study of "''the internal processes of coding, filtering, and information processing.''"
Concepts from information theory such as redundancy and code control have been used by semioticians such as Umberto Eco to explain ideology as a form of message transmission whereby a dominant social class emits its message by using signs that exhibit a high degree of redundancy such that only one message is decoded among a selection of competing ones.
Integrated process organization of neural information
Quantitative information theoretic methods have been applied in cognitive science to analyze the integrated process organization of neural information in the context of the binding problem in cognitive neuroscience. In this context, either an information-theoretical measure, such as (Gerald Edelman and Giulio Tononi's functional clustering model and dynamic core hypothesis (DCH)) or (Tononi's integrated information theory (IIT) of consciousness), is defined (on the basis of a reentrant process organization, i.e. the synchronization of neurophysiological activity between groups of neuronal populations), or the measure of the minimization of free energy on the basis of statistical methods (Karl J. Friston's free energy principle (FEP), an information-theoretical measure which states that every adaptive change in a self-organized system leads to a minimization of free energy, and the Bayesian brain hypothesis).
Miscellaneous applications
Information theory also has applications in the search for extraterrestrial intelligence, black holes, bioinformatics, and gambling.
See also
* Algorithmic probability
* Bayesian inference
* Communication theory
* Constructor theory – a generalization of information theory that includes quantum information
* Formal science
* Inductive probability
* Info-metrics
* Minimum message length
* Minimum description length
* Philosophy of information
Applications
* Active networking
* Cryptanalysis
* Cryptography
* Cybernetics
* Entropy in thermodynamics and information theory
* Gambling
* Intelligence (information gathering)
* Seismic exploration
History
* Hartley, R.V.L.
* History of information theory
* Shannon, C.E.
* Timeline of information theory
* Yockey, H.P.
* Andrey Kolmogorov
Theory
* Coding theory
* Detection theory
* Estimation theory
* Fisher information
* Information algebra
* Information asymmetry
* Information field theory
* Information geometry
* Information theory and measure theory
* Kolmogorov complexity
* List of unsolved problems in information theory
* Logic of information
* Network coding
* Philosophy of information
* Quantum information science
* Source coding
Concepts
* Ban (unit)
* Channel capacity
* Communication channel
* Communication source
* Conditional entropy
* Covert channel
* Data compression
* Decoder
* Differential entropy
* Fungible information
* Information fluctuation complexity
* Information entropy
* Joint entropy
* Kullback–Leibler divergence
* Mutual information
* Pointwise mutual information (PMI)
* Receiver (information theory)
* Redundancy (information theory)
* Rényi entropy
* Self-information
* Unicity distance
* Variety (cybernetics)
* Hamming distance
* Perplexity
References
Further reading
The classic work
* Shannon, C.E. (1948), "A Mathematical Theory of Communication", ''Bell System Technical Journal'', 27, pp. 379–423 & 623–656, July & October, 1948 (PDF).
* R.V.L. Hartley, "Transmission of Information", ''Bell System Technical Journal'', July 1928.
* Andrey Kolmogorov (1968), "Three approaches to the quantitative definition of information", ''International Journal of Computer Mathematics'', 2, pp. 157–168.
Other journal articles
* J. L. Kelly Jr., "A New Interpretation of Information Rate", ''Bell System Technical Journal'', Vol. 35, July 1956, pp. 917–26.
* R. Landauer, "Information is Physical", ''Proc. Workshop on Physics and Computation PhysComp'92'' (IEEE Comp. Sci. Press, Los Alamitos, 1993) pp. 1–4.
Textbooks on information theory
* Alajaji, F. and Chen, P.N. An Introduction to Single-User Information Theory. Singapore: Springer, 2018.
* Arndt, C. ''Information Measures, Information and its Description in Science and Engineering'' (Springer Series: Signals and Communication Technology), 2004,
* Gallager, R. ''Information Theory and Reliable Communication.'' New York: John Wiley and Sons, 1968.
* Goldman, S. ''Information Theory''. New York: Prentice Hall, 1953. New York: Dover, 1968, 2005.
* Csiszar, I, Korner, J. ''Information Theory: Coding Theorems for Discrete Memoryless Systems'' Akademiai Kiado: 2nd edition, 1997.
* MacKay, David J. C. ''Information Theory, Inference, and Learning Algorithms''. Cambridge: Cambridge University Press, 2003.
* Mansuripur, M. ''Introduction to Information Theory''. New York: Prentice Hall, 1987.
* McEliece, R. ''The Theory of Information and Coding''. Cambridge, 2002.
* Pierce, J. R. "An introduction to information theory: symbols, signals and noise". Dover (2nd Edition). 1961 (reprinted by Dover 1980).
* Stone, JV. Chapter 1 of book "Information Theory: A Tutorial Introduction", University of Sheffield, England, 2014.
* Yeung, RW. ''A First Course in Information Theory''. Kluwer Academic/Plenum Publishers, 2002.
* Yeung, RW. ''Information Theory and Network Coding''. Springer, 2008.
Other books
* Leon Brillouin, ''Science and Information Theory'', Mineola, N.Y.: Dover, [1956, 1962] 2004.
* A. I. Khinchin, ''Mathematical Foundations of Information Theory'', New York: Dover, 1957.
* H. S. Leff and A. F. Rex, Editors, ''Maxwell's Demon: Entropy, Information, Computing'', Princeton University Press, Princeton, New Jersey (1990).
* Robert K. Logan. ''What is Information? - Propagating Organization in the Biosphere, the Symbolosphere, the Technosphere and the Econosphere'', Toronto: DEMO Publishing.
* Tom Siegfried, ''The Bit and the Pendulum'', Wiley, 2000.
* Charles Seife, ''Decoding the Universe'', Viking, 2006.
* Jeremy Campbell, ''Grammatical Man'', Touchstone/Simon & Schuster, 1982,
* Henri Theil, ''Economics and Information Theory'', Rand McNally & Company - Chicago, 1967.
* Escolano, Suau, Bonev, ''Information Theory in Computer Vision and Pattern Recognition'', Springer, 2009.
* Vlatko Vedral, ''Decoding Reality: The Universe as Quantum Information'', Oxford University Press 2010.
External links
* Lambert F. L. (1999), ''Journal of Chemical Education''.
* IEEE Information Theory Society and ITSOC Monographs, Surveys, and Reviews