The chain rule for Kolmogorov complexity is an analogue of the chain rule for information entropy, which states:

: <math>H(X,Y) = H(X) + H(Y \mid X)</math>
That is, the combined randomness of two sequences ''X'' and ''Y'' is the sum of the randomness of ''X'' plus whatever randomness is left in ''Y'' once we know ''X''.
This follows immediately from the definitions of conditional and joint entropy, and the fact from probability theory that the joint probability is the product of the marginal and conditional probability:

: <math>P(X,Y) = P(X) \, P(Y \mid X)</math>
: <math>\Rightarrow \log P(X,Y) = \log P(X) + \log P(Y \mid X)</math>
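As a concrete check of the entropy identity, the following Python sketch computes ''H''(''X'',''Y''), ''H''(''X'') and ''H''(''Y'' | ''X'') for an arbitrary toy joint distribution (the distribution ''p_xy'' below is made up purely for illustration) and confirms that the two sides agree:

<syntaxhighlight lang="python">
import math

# Toy joint distribution P(X, Y) over X in {0, 1} and Y in {0, 1, 2}
# (an arbitrary example; any valid joint distribution works).
p_xy = {
    (0, 0): 0.20, (0, 1): 0.10, (0, 2): 0.10,
    (1, 0): 0.05, (1, 1): 0.30, (1, 2): 0.25,
}

def H(dist):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal distribution P(X)
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

# H(Y | X) = sum over x of P(x) * H(Y | X = x), with P(y | x) = P(x, y) / P(x)
h_y_given_x = 0.0
for x, px in p_x.items():
    cond = {y: p / px for (xx, y), p in p_xy.items() if xx == x}
    h_y_given_x += px * H(cond)

print(H(p_xy))                # H(X, Y)
print(H(p_x) + h_y_given_x)   # H(X) + H(Y | X) -- equal up to rounding
</syntaxhighlight>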
The equivalent statement for Kolmogorov complexity does not hold exactly; it is true only up to a logarithmic term:

: <math>K(x,y) = K(x) + K(y \mid x) + O(\log K(x,y))</math>
(An exact version, ''KP''(''x'', ''y'') = ''KP''(''x'') + ''KP''(''y'' | ''x''*) + ''O''(1), holds for the prefix complexity ''KP'', where ''x''* is a shortest program for ''x''.)
It states that the shortest program printing ''X'' and ''Y'' is obtained by concatenating a shortest program printing ''X'' with a program printing ''Y'' given ''X'', plus at most a logarithmic factor. This result implies that algorithmic mutual information, an analogue of mutual information for Kolmogorov complexity, is symmetric: ''I''(''x'':''y'') = ''I''(''y'':''x'') + ''O''(log ''K''(''x'',''y'')) for all ''x'', ''y''.
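This can be seen as follows (a sketch, using the plain-complexity definition ''I''(''x'':''y'') = ''K''(''y'') − ''K''(''y'' | ''x''); other common definitions agree with it up to logarithmic terms). Substituting the chain rule for ''K''(''y'' | ''x'') gives

: <math>I(x:y) = K(y) - K(y \mid x) = K(x) + K(y) - K(x,y) + O(\log K(x,y)),</math>

and the right-hand side does not change when ''x'' and ''y'' are exchanged, hence ''I''(''x'':''y'') = ''I''(''y'':''x'') + ''O''(log ''K''(''x'',''y'')).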
Proof
The ≤ direction is obvious: we can write a program to produce ''x'' and ''y'' by concatenating a program to produce ''x'', a program to produce ''y'' given access to ''x'', and (whence the log term) the length of one of the programs, so that we know where to separate the two programs for ''x'' and ''y'' (log(''K''(''x'', ''y'')) upper-bounds this length).
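The bookkeeping behind this concatenation can be illustrated with a small Python sketch (the bitstrings ''prog_x'' and ''prog_y_given_x'' are hypothetical placeholders, not actual programs); the point is that announcing where the first program ends costs only about log<sub>2</sub> ''K''(''x'',''y'') extra bits:

<syntaxhighlight lang="python">
# Hypothetical programs, represented as bitstrings (placeholders for illustration).
prog_x = "1011001"          # a shortest program printing x
prog_y_given_x = "11100"    # a program printing y when given x

def pack(p1, p2, total_len_bound):
    """Concatenate p1 and p2, prefixing the length of p1 in a fixed number of bits.

    The prefix needs only ceil(log2(total_len_bound + 1)) bits, which is the
    source of the O(log K(x, y)) term in the chain rule's <= direction.
    """
    width = max(1, total_len_bound.bit_length())
    header = format(len(p1), f"0{width}b")
    return header, header + p1 + p2

def unpack(encoded, total_len_bound):
    """Recover p1 and p2 from the packed string, knowing only the length bound."""
    width = max(1, total_len_bound.bit_length())
    n1 = int(encoded[:width], 2)
    return encoded[width:width + n1], encoded[width + n1:]

bound = len(prog_x) + len(prog_y_given_x)          # plays the role of K(x, y)
header, packed = pack(prog_x, prog_y_given_x, bound)
assert unpack(packed, bound) == (prog_x, prog_y_given_x)
print(len(packed), "=", len(prog_x), "+", len(prog_y_given_x), "+", len(header))
</syntaxhighlight>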
For the ≥ direction, it suffices to show that for all ''k'', ''l'' such that ''k'' + ''l'' = ''K''(''x'', ''y'') we have that either

: <math>K(x \mid k,l) \le k + O(1)</math>

or

: <math>K(y \mid x,k,l) \le l + O(1).</math>
Consider the list (''a''<sub>1</sub>,''b''<sub>1</sub>), (''a''<sub>2</sub>,''b''<sub>2</sub>), ..., (''a''<sub>''e''</sub>,''b''<sub>''e''</sub>) of all pairs (''a'',''b'') produced by programs of length exactly ''K''(''x'',''y'') (hence ''K''(''a'',''b'') ≤ ''K''(''x'',''y'')). Note that this list
* contains the pair (''x'',''y''),
* can be enumerated given ''k'' and ''l'' (by running all programs of length ''K''(''x'',''y'') in parallel),
* has at most 2<sup>''K''(''x'',''y'')</sup> elements (because there are at most 2<sup>''n''</sup> programs of length ''n'').
First, suppose that ''x'' appears less than 2<sup>''l''</sup> times as first element. We can specify ''y'' given ''x'', ''k'', ''l'' by enumerating (''a''<sub>1</sub>,''b''<sub>1</sub>), (''a''<sub>2</sub>,''b''<sub>2</sub>), ... and then selecting (''x'',''y'') in the sub-list of pairs (''x'',''b''). By assumption, the index of (''x'',''y'') in this sub-list is less than 2<sup>''l''</sup> and hence there is a program for ''y'' given ''x'', ''k'', ''l'' of length ''l'' + ''O''(1).
Now, suppose that ''x'' appears at least 2<sup>''l''</sup> times as first element. This can happen for at most 2<sup>''K''(''x'',''y'')−''l''</sup> = 2<sup>''k''</sup> different strings. These strings can be enumerated given ''k'', ''l'' and hence ''x'' can be specified by its index in this enumeration. The corresponding program for ''x'' has size ''k'' + ''O''(1). Theorem proved.
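The case analysis can be tried out on a finite toy example (a self-contained Python sketch; the list of pairs below is arbitrary stand-in data rather than the output of real programs of length ''K''(''x'',''y'')):

<syntaxhighlight lang="python">
import random
from collections import Counter

# Toy parameters playing the roles of k and l, with k + l standing in for K(x, y).
k, l = 3, 4

# Arbitrary stand-in data for the enumerated list of pairs (a_1, b_1), ..., (a_e, b_e);
# in the proof this list is produced by running all programs of length K(x, y).
random.seed(0)
pairs = [(random.randrange(2 ** k), random.randrange(2 ** l)) for _ in range(2 ** (k + l))]

counts = Counter(a for a, _ in pairs)   # how often each string occurs as first element

def describe(x, y):
    """Return which case of the proof applies, and the short index it yields."""
    if counts[x] < 2 ** l:
        # Case 1: y is given by its index among pairs whose first element is x,
        # an index below 2**l, i.e. at most l bits (given x, k, l).
        sub_list = [b for a, b in pairs if a == x]
        return "index of y given x", sub_list.index(y)
    # Case 2: x is one of at most 2**(k+l) / 2**l = 2**k "heavy" first elements,
    # so its index in that enumerable list takes at most k bits (given k, l).
    heavy = sorted(a for a, c in counts.items() if c >= 2 ** l)
    return "index of x", heavy.index(x)

x, y = pairs[0]                          # the pair playing the role of (x, y)
case, index = describe(x, y)
print(case, "=", index)
</syntaxhighlight>

In either case the reported index fits in ''l'' or ''k'' bits respectively, which is what the two displayed inequalities require.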