information theory Information theory is the scientific study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. ...

, Shannon's source coding theorem (or noiseless coding theorem) establishes the limits to possible

data compression In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compressio ...

, and the operational meaning of the

Shannon entropy Shannon may refer to: People * Shannon (given name) * Shannon (surname) * Shannon (American singer), stage name of singer Shannon Brenda Greene (born 1958) * Shannon (South Korean singer), British-South Korean singer and actress Shannon Arrum W ...

. Named after

Claude Shannon Claude Elwood Shannon (April 30, 1916 – February 24, 2001) was an American mathematician, electrical engineer, and cryptographer known as a "father of information theory". As a 21-year-old master's degree student at the Massachusetts I ...

, the source coding theorem shows that (in the limit, as the length of a stream of independent and identically-distributed random variable (i.i.d.) data tends to infinity) it is impossible to compress the data such that the code rate (average number of bits per symbol) is less than the Shannon entropy of the source, without it being virtually certain that information will be lost. However it is possible to get the code rate arbitrarily close to the Shannon entropy, with negligible probability of loss. The source coding theorem for symbol codes places an upper and a lower bound on the minimal possible expected length of codewords as a function of the

entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodyna ...

of the input word (which is viewed as a

random variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the po ...

) and of the size of the target alphabet.

Statements

''Source coding'' is a mapping from (a sequence of) symbols from an information

source Source may refer to: Research * Historical document * Historical source * Source (intelligence) or sub source, typically a confidential provider of non open-source intelligence * Source (journalism), a person, publication, publishing institute o ...

to a sequence of alphabet symbols (usually bits) such that the source symbols can be exactly recovered from the binary bits (lossless source coding) or recovered within some distortion (lossy source coding). This is the concept behind

Source coding theorem

In information theory, the source coding theorem (Shannon 1948) informally states that (MacKay 2003, pg. 81, Cover 2006, Chapter 5):

i.i.d. In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. This property is us ...
random variables each with
entropy Entropy is a scientific concept, as well as a measurable physical property, that is most commonly associated with a state of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodyna ...
can be compressed into more than
bit The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...
s with negligible risk of information loss, as ; but conversely, if they are compressed into fewer than bits it is virtually certain that information will be lost.

The

NH(X)

coded sequence represents the compressed message in a biunivocal way, under the assumption that the decoder knows the source. From a practical point of view, this hypothesis is not always true. Consequently, when the entropy encoding is applied the transmitted message is

NH(X)+(inf. source)

. Usually, the information that characterizes the source is inserted at the beginning of the transmitted message.

Source coding theorem for symbol codes

Let denote two finite alphabets and let and denote the set of all finite words from those alphabets (respectively). Suppose that is a random variable taking values in and let be a uniquely decodable code from to where . Let denote the random variable given by the length of codeword . If is optimal in the sense that it has the minimal expected word length for , then (Shannon 1948): :

< \frac +1

Where

\mathbb

denotes the

expected value In probability theory, the expected value (also called expectation, expectancy, mathematical expectation, mean, average, or first moment) is a generalization of the weighted average. Informally, the expected value is the arithmetic mean of a ...

operator.

Proof: Source coding theorem

Given is an

i.i.d. In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent. This property is us ...

source, its

time series In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Ex ...

is i.i.d. with

in the discrete-valued case and

differential entropy Differential entropy (also referred to as continuous entropy) is a concept in information theory that began as an attempt by Claude Shannon to extend the idea of (Shannon) entropy, a measure of average surprisal of a random variable, to continuo ...

in the continuous-valued case. The Source coding theorem states that for any , i.e. for any rate larger than the

of the source, there is large enough and an encoder that takes i.i.d. repetition of the source, , and maps it to binary bits such that the source symbols are recoverable from the binary bits with probability of at least . Proof of Achievability. Fix some , and let :

p(x_1, \ldots, x_n) = \Pr \left_1 = x_1, \cdots, X_n = x_n \right

The

typical set In information theory, the typical set is a set of sequences whose probability is close to two raised to the negative power of the entropy of their source distribution. That this set has total probability close to one is a consequence of the asympt ...

, , is defined as follows: :

A_n^\varepsilon =\left\.

The

Asymptotic Equipartition Property In information theory, the asymptotic equipartition property (AEP) is a general property of the output samples of a stochastic source. It is fundamental to the concept of typical set used in theories of data compression. Roughly speaking, the th ...

(AEP) shows that for large enough , the probability that a sequence generated by the source lies in the typical set, , as defined approaches one. In particular, for sufficiently large ,

P((X_1,X_2,\cdots,X_n) \in A_n^\varepsilon)

can be made arbitrarily close to 1, and specifically, greater than

1-\varepsilon

(See AEP for a proof). The definition of typical sets implies that those sequences that lie in the typical set satisfy: :

2^ \leq p \left (x_1, \cdots, x_n \right ) \leq 2^

Note that: *The probability of a sequence

(X_1,X_2,\cdots X_n)

being drawn from is greater than . *

\left,  A_n^\varepsilon \ \leq 2^

, which follows from the left hand side (lower bound) for

p(x_1,x_2,\cdots x_n)

. *

\left,  A_n^\varepsilon \ \geq (1-\varepsilon) 2^

, which follows from upper bound for

p(x_1,x_2,\cdots x_n)

and the lower bound on the total probability of the whole set . Since

\left,  A_n^\varepsilon \ \leq 2^, n(H(X)+\varepsilon)

bits are enough to point to any string in this set. The encoding algorithm: The encoder checks if the input sequence lies within the typical set; if yes, it outputs the index of the input sequence within the typical set; if not, the encoder outputs an arbitrary digit number. As long as the input sequence lies within the typical set (with probability at least ), the encoder doesn't make any error. So, the probability of error of the encoder is bounded above by . Proof of Converse. The converse is proved by showing that any set of size smaller than (in the sense of exponent) would cover a set of probability bounded away from .

Proof: Source coding theorem for symbol codes

For let denote the word length of each possible . Define

q_i = a^/C

, where is chosen so that . Then :

\begin
H(X) &=    -\sum_^n p_i \log_2 p_i \\
     &\leq -\sum_^n p_i \log_2 q_i \\
     &=    -\sum_^n p_i \log_2 a^ + \sum_^n p_i \log_2 C \\
     &=    -\sum_^n p_i \log_2 a^ + \log_2 C \\
     &\leq -\sum_^n - s_i p_i \log_2 a \\
     &=  \mathbb S \log_2 a \\
\end

where the second line follows from Gibbs' inequality and the fifth line follows from Kraft's inequality: :

C = \sum_^n a^ \leq 1

so . For the second inequality we may set :

s_i = \lceil - \log_a p_i \rceil

so that :

- \log_a p_i \leq s_i < -\log_a p_i + 1

and so :

a^ \leq p_i

and :

\sum  a^ \leq \sum p_i = 1

and so by Kraft's inequality there exists a prefix-free code having those word lengths. Thus the minimal satisfies :

\begin
\mathbb S & = \sum p_i s_i \\
& < \sum p_i \left( -\log_a p_i +1 \right) \\
& = \sum - p_i \frac +1 \\
& = \frac +1 \\
\end

Extension to non-stationary independent sources

Fixed Rate lossless source coding for discrete time non-stationary independent sources

Define typical set as: :

A_n^\varepsilon = \left \.

Then, for given , for large enough, . Now we just encode the sequences in the typical set, and usual methods in source coding show that the cardinality of this set is smaller than

2^

. Thus, on an average, bits suffice for encoding with probability greater than , where and can be made arbitrarily small, by making larger.

References

{{Reflist, refs= C.E. Shannon,
A Mathematical Theory of Communication
, ''

Bell System Technical Journal The ''Bell Labs Technical Journal'' is the in-house scientific journal for scientists of Nokia Bell Labs, published yearly by the IEEE society. The managing editor is Charles Bahr. The journal was originally established as the ''Bell System Tech ...

'', vol. 27, pp. 379–423, 623-656, July, October, 1948 David J. C. MacKay.
Information Theory, Inference, and Learning Algorithms
' Cambridge: Cambridge University Press, 2003. {{ISBN, 0-521-64298-1 {{cite book , last = Cover , first = Thomas M. , title = Elements of Information Theory , chapter = Chapter 5: Data Compression , year = 2006 , publisher = John Wiley & Sons , isbn = 0-471-24195-4, url=https://www.google.com/books/edition/Elements_of_Information_Theory/VWq5GG6ycxMC?hl=en, pages=103–142 Information theory Coding theory Data compression Presentation layer protocols Mathematical theorems in theoretical computer science Articles containing proofs