Information distance is the distance between two finite objects (represented as
computer
A computer is a machine that can be programmed to carry out sequences of arithmetic or logical operations ( computation) automatically. Modern digital electronic computers can perform generic sets of operations known as programs. These prog ...
files) expressed as the number of bits in the shortest program which transforms one object into the other one or vice versa on a
universal computer. This is an extension of
Kolmogorov complexity
In algorithmic information theory (a subfield of computer science and mathematics), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produ ...
.
[A.N. Kolmogorov, Three approaches to the quantitative definition of information, Problems Inform. Transmission, 1:1(1965), 1–7](_blank)
/ref> The Kolmogorov complexity of a ''single'' finite object is the information in that object; the information distance between a ''pair'' of finite objects is the minimum information required to go from one object to the other or vice versa.
Information distance was first defined and investigated in [M. Li, P.M.B. Vitanyi, Theory of Thermodynamics of Computation, Proc. IEEE Physics of Computation Workshop, Dallas, Texas, USA, 1992, 42–46](_blank)
/ref> based on thermodynamic
Thermodynamics is a branch of physics that deals with heat, work, and temperature, and their relation to energy, entropy, and the physical properties of matter and radiation. The behavior of these quantities is governed by the four laws of ...
principles, see also. Subsequently, it achieved final form in.[C.H. Bennett, P. Gacs, M. Li, P.M.B. Vitanyi, W. Zurek, Information distance, IEEE Transactions on Information Theory, 44:4(1998), 1407–1423](_blank)
/ref> It is applied in the normalized compression distance Normalized compression distance (NCD) is a way of measuring the similarity between two objects, be it two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, two genomes, to name a few. Suc ...
and the normalized Google distance.
Properties
Formally the information distance between and is defined by
:
with a finite binary program for the fixed universal computer
with as inputs finite binary strings . In it is proven that
with
:
where is the Kolmogorov complexity
In algorithmic information theory (a subfield of computer science and mathematics), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produ ...
defined by of the prefix type. This is the important quantity.
Universality
Let be the class of upper semicomputable distance
Distance is a numerical or occasionally qualitative measurement of how far apart objects or points are. In physics or everyday usage, distance may refer to a physical length or an estimation based on other criteria (e.g. "two counties over"). ...
s that satisfy the density
Density (volumetric mass density or specific mass) is the substance's mass per unit of volume. The symbol most often used for density is ''ρ'' (the lower case Greek letter rho), although the Latin letter ''D'' can also be used. Mathematicall ...
condition
:
This excludes irrelevant distances such as for ;
it takes care that if the distance growth then the number of objects within that distance of a given object grows.
If then up to a constant additive term.
The probabilistic expressions of the distance is the first cohomological class in information symmetric cohomology, which may be conceived as a universality property.
Metricity
The distance is a metric
Metric or metrical may refer to:
* Metric system, an internationally adopted decimal system of measurement
* An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement
Mathematics
In mathe ...
up to an additive
term in the metric (in)equalities. The probabilistic version of the metric is indeed unique has shown by Han in 1981.
Maximum overlap
If , then there is a program of length that converts to , and a program of length such that the program converts to . (The programs are of the self-delimiting format which means that one can decide where one program ends and the other begins in concatenation
In formal language theory and computer programming, string concatenation is the operation of joining character strings end-to-end. For example, the concatenation of "snow" and "ball" is "snowball". In certain formalisations of concatenat ...
of the programs.) That is, the shortest programs to convert between two objects can be made maximally overlapping: For it can be divided into a program that converts object to object , and another program which concatenated with the first converts to while the concatenation
In formal language theory and computer programming, string concatenation is the operation of joining character strings end-to-end. For example, the concatenation of "snow" and "ball" is "snowball". In certain formalisations of concatenat ...
of these two programs is a shortest program to convert between these objects.
Minimum overlap
The programs to convert between objects and can also be made minimal overlapping.
There exists a program of length up to an additive term of that maps to and has small complexity when is known (). Interchanging the two objects we have the other program Having in mind the parallelism between Shannon information theory and Kolmogorov complexity
In algorithmic information theory (a subfield of computer science and mathematics), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produ ...
theory, one can say that this result is parallel to the Slepian-Wolf and Körner–Imre Csiszár–Marton theorems.
Applications
Theoretical
The result of An.A. Muchnik on minimum overlap above is an important theoretical application showing
that certain codes exist: to go to finite target object from any object there is a program which almost only
depends on the target object! This result is fairly precise and the error term cannot be significantly improved. Information distance was material in the textbook, it occurs in the Encyclopedia on Distances.
Practical
To determine the similarity of objects such as genomes, languages, music, internet attacks and worms, software programs, and so on, information distance is normalized and the Kolmogorov complexity
In algorithmic information theory (a subfield of computer science and mathematics), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produ ...
terms approximated by real-world compressors (the Kolmogorov complexity is a lower bound to the length in bits of a compressed version of the object). The result is the normalized compression distance Normalized compression distance (NCD) is a way of measuring the similarity between two objects, be it two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, two genomes, to name a few. Suc ...
(NCD) between the objects. This pertains to objects given as computer files like the genome of a mouse or text of a book. If the objects are just given by name such as `Einstein' or `table' or the name of a book or the name `mouse', compression does not make sense. We need outside information about what the name means. Using a data base (such as the internet) and a means to search the database (such as a search engine like Google) provides this information. Every search engine on a data base that provides aggregate page counts can be used in the normalized Google distance (NGD).
A python package for computing all information distances and volumes, multivariate mutual information, conditional mutual information, joint entropies, total correlations, in a dataset of n variables is available .
References
Related literature
* {{Citation , title=General Topology I: Basic Concepts and Constructions Dimension Theory , last1=Arkhangel'skii , first1=A. V. , last2=Pontryagin , first2=L. S. , year=1990 , isbn=3-540-18178-4 , publisher=Springer
Springer or springers may refer to:
Publishers
* Springer Science+Business Media, aka Springer International Publishing, a worldwide publishing group founded in 1842 in Germany formerly known as Springer-Verlag.
** Springer Nature, a multinationa ...
, series=Encyclopaedia of Mathematical Sciences
Statistical distance