Solomonoff's theory of inductive inference
Solomonoff's theory of inductive inference proves that, under its common-sense assumptions (axioms), the best possible scientific model is the shortest algorithm that generates the empirical data under consideration. In addition to the choice of data, other assumptions are that, to avoid the post-hoc fallacy, the programming language must be chosen prior to the data, and that the environment being observed is generated by an unknown algorithm. This is also called a theory of induction. Due to its basis in the dynamical (state-space model) character of algorithmic information theory, it encompasses ''statistical'' as well as dynamical information criteria for model selection. It was introduced by Ray Solomonoff, based on probability theory and theoretical computer science.

In essence, Solomonoff's induction derives the posterior probability of any computable theory, given a sequence of observed data. This posterior probability is derived from Bayes' rule and some ''universal'' prior, that is, a prior that assigns a positive probability to any computable theory. Solomonoff proved that this induction is incomputable (or, more precisely, lower semi-computable), but noted that "this incomputability is of a very benign kind" and that it "in no way inhibits its use for practical prediction", as it can be approximated from below ever more accurately with more computational resources. It is only "incomputable" in the benign sense that no scientific consensus is able to prove that the best current scientific theory is the best of all possible theories. However, Solomonoff's theory does provide an objective criterion for deciding among the current scientific theories explaining a given set of observations.

Solomonoff's induction naturally formalizes Occam's razor by assigning larger prior credences to theories that require a shorter algorithmic description. (J.J. McCall, "Induction: From Kolmogorov and Solomonoff to De Finetti and Back to Kolmogorov", ''Metroeconomica'', 2004; D. Stork, "Foundations of Occam's razor and parsimony in learning", NIPS 2001 Workshop, 2001; A.N. Soklakov, "Occam's razor as a formal basis for a physical theory", ''Foundations of Physics Letters'', 2002; M. Hutter, "On the existence and convergence of computable universal priors", Algorithmic Learning Theory, 2003.)
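The weighting can be made concrete with a toy computation. The following sketch (not Solomonoff's construction itself, which sums over all programs of a universal machine; the theory names and description lengths are invented for illustration) gives each candidate theory the prior weight 2^{-l} for a description of length l bits:

```python
# A minimal sketch of length-based priors. The theories and their
# description lengths below are hypothetical, chosen only to illustrate
# how shorter descriptions receive larger prior credence.

def length_based_prior(description_lengths):
    """Prior credence proportional to 2^(-description length in bits)."""
    weights = {name: 2.0 ** -l for name, l in description_lengths.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

prior = length_based_prior({"T_short": 10, "T_medium": 20, "T_long": 40})
print(prior)  # T_short carries almost all of the prior mass
```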


Origin


Philosophical

The theory has philosophical foundations and was introduced by Ray Solomonoff around 1960. It is a mathematically formalized combination of Occam's razor and Epicurus' Principle of Multiple Explanations. (Ming Li and Paul Vitanyi, ''An Introduction to Kolmogorov Complexity and Its Applications''. Springer-Verlag, N.Y., 2008, p. 339 ff.) All computable theories which perfectly describe previous observations are used to calculate the probability of the next observation, with more weight put on the shorter computable theories. Marcus Hutter's universal artificial intelligence builds upon this to calculate the expected value of an action.


Principle

Solomonoff's induction has been argued to be the computational formalization of pure Bayesianism. To understand it, recall that Bayesianism derives the posterior probability \mathbb P[T \mid D] of a theory T given data D by applying Bayes' rule, which yields

:\mathbb P[T \mid D] = \frac{\mathbb P[D \mid T] \, \mathbb P[T]}{\mathbb P[D \mid T] \, \mathbb P[T] + \sum_A \mathbb P[D \mid A] \, \mathbb P[A]}

where theories A are alternatives to theory T. For this equation to make sense, the quantities \mathbb P[D \mid T] and \mathbb P[D \mid A] must be well-defined for all theories T and A. In other words, any theory must define a probability distribution over observable data D. Solomonoff's induction essentially boils down to demanding that all such probability distributions be computable.

Interestingly, the set of computable probability distributions is a subset of the set of all programs, which is countable. Similarly, the sets of observable data considered by Solomonoff were finite. Without loss of generality, we can thus consider that any observable data is a finite bit string. As a result, Solomonoff's induction can be defined by only invoking discrete probability distributions.

Solomonoff's induction then allows one to make probabilistic predictions of future data F, by simply obeying the laws of probability. Namely, we have

:\mathbb P[F \mid D] = \mathbb E_T\big[\mathbb P[F \mid T, D]\big] = \sum_T \mathbb P[F \mid T, D] \, \mathbb P[T \mid D].

This quantity can be interpreted as the average of the predictions \mathbb P[F \mid T, D] of all theories T, given past data D, weighted by their posterior credences \mathbb P[T \mid D].
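The following toy sketch makes the two formulas concrete for a finite set of hypothetical theories (every name and probability value below is invented for illustration; Solomonoff's induction proper sums over all computable theories):

```python
# Toy Bayesian mixture over a finite set of hypothetical theories.

def posterior(prior, likelihood):
    """P[T|D] = P[D|T] P[T] / sum_A P[D|A] P[A]  (Bayes' rule)."""
    evidence = sum(likelihood[t] * prior[t] for t in prior)
    return {t: likelihood[t] * prior[t] / evidence for t in prior}

def predict(post, future_given):
    """P[F|D] = sum_T P[F|T,D] P[T|D]: predictions averaged over theories."""
    return sum(future_given[t] * post[t] for t in post)

prior = {"T1": 0.5, "T2": 0.3, "T3": 0.2}          # prior credences P[T]
likelihood = {"T1": 0.9, "T2": 0.2, "T3": 0.5}     # P[D|T] for observed D
future_given = {"T1": 0.8, "T2": 0.1, "T3": 0.4}   # P[F|T,D]

post = posterior(prior, likelihood)
print(post)                          # posterior credences P[T|D]
print(predict(post, future_given))   # mixture prediction P[F|D]
```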


Mathematical

The proof of the "razor" is based on the known mathematical properties of a probability distribution over a countable set. These properties are relevant because the infinite set of all programs is a denumerable set. The sum S of the probabilities of all programs must be exactly equal to one (as per the definition of probability); thus the probabilities must roughly decrease as we enumerate the infinite set of all programs, otherwise S will be strictly greater than one. To be more precise, for every \epsilon > 0, there is some length ''l'' such that the probability of all programs longer than ''l'' is at most \epsilon. This does not, however, preclude very long programs from having very high probability.

Fundamental ingredients of the theory are the concepts of algorithmic probability and Kolmogorov complexity. The universal prior probability of any prefix ''p'' of a computable sequence ''x'' is the sum of the probabilities of all programs (for a universal computer) that compute something starting with ''p''. Given some ''p'' and any computable but unknown probability distribution from which ''x'' is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of ''x'' in optimal fashion.
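As a sketch of how this sum can be approximated from below, the code below enumerates all bitstring "programs" up to a length bound and adds the weight 2^(-n) of each length-n program whose output starts with the prefix ''p''. The `run` interpreter is a toy stand-in invented here, not a real universal machine; a real construction uses a prefix-free universal machine so that the total weight of all programs is at most 1, which this toy does not guarantee.

```python
from itertools import product

def run(program: str) -> str:
    # Toy stand-in for a universal computer: "outputs" the program twice.
    return program * 2

def lower_bound_universal_prior(prefix: str, max_len: int) -> float:
    total = 0.0
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            program = "".join(bits)
            if run(program).startswith(prefix):
                total += 2.0 ** -n  # weight of one length-n program
    return total  # grows monotonically toward the full sum as max_len grows

print(lower_bound_universal_prior("01", max_len=8))
```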


Mathematical guarantees


Solomonoff's completeness

The remarkable property of Solomonoff's induction is its completeness. In essence, the completeness theorem guarantees that the expected cumulative errors made by the predictions based on Solomonoff's induction are upper-bounded by the Kolmogorov complexity of the (stochastic) data-generating process. The errors can be measured using the Kullback–Leibler divergence or the square of the difference between the induction's prediction and the probability assigned by the (stochastic) data-generating process.
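One standard way to state such a bound is via the cumulative Kullback–Leibler divergence. As a sketch (with \mu the computable data-generating measure, M Solomonoff's universal prior, and K(\mu) the Kolmogorov complexity of a program computing \mu):

:\sum_{t=1}^{n} \mathbb E_\mu\!\left[ D_{\mathrm{KL}}\big(\mu(\cdot \mid x_{<t}) \parallel M(\cdot \mid x_{<t})\big) \right] = \mathbb E_\mu\!\left[ \ln \frac{\mu(x_{1:n})}{M(x_{1:n})} \right] \le K(\mu) \ln 2

for all n, since universality gives M(x) \ge 2^{-K(\mu)} \mu(x) for every finite sequence x. The total expected error thus never exceeds a constant determined by the complexity of the data-generating process.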


Solomonoff's uncomputability

Unfortunately, Solomonoff also proved that Solomonoff's induction is uncomputable. In fact, he showed that computability and completeness are mutually exclusive: any complete theory must be uncomputable. The proof of this is derived from a game between the induction and the environment. Essentially, any computable induction can be tricked by a computable environment, by choosing the computable environment that negates the computable induction's prediction. This fact can be regarded as an instance of the ''no free lunch theorem''.
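The diagonalization game can be sketched as follows (the predictor here is an arbitrary placeholder; any computable predictor would do):

```python
# Given any computable predictor, build a computable environment that always
# emits the bit the predictor considers less likely, so the predictor errs
# at every step.

def predictor(history: str) -> float:
    # Placeholder computable induction: returns P(next bit = 1).
    return history.count("1") / len(history) if history else 0.5

def adversarial_environment(predictor, n_steps: int) -> str:
    """Computable environment that negates the predictor's best guess."""
    history = ""
    for _ in range(n_steps):
        p_one = predictor(history)
        history += "0" if p_one >= 0.5 else "1"  # emit the unlikelier bit
    return history

print(adversarial_environment(predictor, 20))  # predictor is wrong every step
```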


Modern applications


Artificial intelligence

Though Solomonoff's inductive inference is not computable, several AIXI-derived algorithms approximate it in order to make it run on a modern computer. The more computing power they are given, the closer their predictions are to the predictions of inductive inference (their mathematical limit is Solomonoff's inductive inference).

Another direction of inductive inference is based on E. Mark Gold's model of learning in the limit from 1967, which has since been developed into more and more models of learning. The general scenario is the following: given a class ''S'' of computable functions, is there a learner (that is, a recursive functional) which, for any input of the form (''f''(0), ''f''(1), ..., ''f''(''n'')), outputs a hypothesis (an index ''e'' with respect to a previously agreed-on acceptable numbering of all computable functions; the indexed function may be required to be consistent with the given values of ''f'')? A learner ''M'' learns a function ''f'' if almost all its hypotheses are the same index ''e'', which generates the function ''f''; ''M'' learns ''S'' if ''M'' learns every ''f'' in ''S''. Basic results are that all recursively enumerable classes of functions are learnable, while the class REC of all computable functions is not learnable. Many related models have been considered, and the learning of classes of recursively enumerable sets from positive data has been a topic studied from Gold's pioneering paper in 1967 onwards. A far-reaching extension of Gold's approach is developed by Schmidhuber's theory of generalized Kolmogorov complexities, which are kinds of super-recursive algorithms.
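The learning-in-the-limit scenario can be sketched with an enumeration learner; this toy version assumes the class is given as an explicit finite list of total functions rather than an acceptable numbering of all computable functions:

```python
# Gold-style "learning by enumeration": output the index of the first
# hypothesis consistent with (f(0), ..., f(n-1)). On ever-longer data from
# a target in the class, the output stabilizes on a correct index.

def learner(hypotheses, observations):
    """Return the index of the first consistent hypothesis, or None."""
    for e, h in enumerate(hypotheses):
        if all(h(x) == y for x, y in enumerate(observations)):
            return e
    return None  # no consistent hypothesis in the class

# Hypothetical class S: a few total computable functions.
S = [lambda x: 0, lambda x: x, lambda x: x * x]
f = S[2]  # target function to learn
for n in range(1, 5):
    data = [f(x) for x in range(n)]
    print(n, learner(S, data))  # hypotheses stabilize on index 2
```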


See also

* Algorithmic information theory
* Bayesian inference
* Inductive inference
* Inductive probability
* Mill's methods
* Minimum description length
* Minimum message length
* For a philosophical viewpoint, see: Problem of induction and New riddle of induction


References


Sources

* Burgin, M. (2005), ''Super-recursive Algorithms'', Monographs in Computer Science, Springer.
* Burgin, M., "How We Know What Technology Can Do", ''Communications of the ACM'', v. 44, No. 11, 2001, pp. 82–88.
* Burgin, M.; Eberbach, E., "Universality for Turing Machines, Inductive Turing Machines and Evolutionary Algorithms", ''Fundamenta Informaticae'', v. 91, No. 1, 2009, pp. 53–77.
* Burgin, M.; Eberbach, E., "On Foundations of Evolutionary Computation: An Evolutionary Automata Approach", in ''Handbook of Research on Artificial Immune Systems and Natural Computing: Applying Complex Adaptive Technologies'' (Hongwei Mo, Ed.), IGI Global, Hershey, Pennsylvania, 2009, pp. 342–360.
* Burgin, M.; Eberbach, E., "Evolutionary Automata: Expressiveness and Convergence of Evolutionary Computation", ''Computer Journal'', v. 55, No. 9, 2012, pp. 1023–1029.
* Burgin, M.; Klinger, A., "Experience, Generations, and Limits in Machine Learning", ''Theoretical Computer Science'', v. 317, No. 1/3, 2004, pp. 71–91.
* Davis, Martin (2006), "The Church–Turing Thesis: Consensus and opposition", Proceedings, Computability in Europe 2006, Lecture Notes in Computer Science, 3988, pp. 125–132.
* Gasarch, W.; Smith, C. H. (1997), "A survey of inductive inference with an emphasis on queries", ''Complexity, Logic, and Recursion Theory'', Lecture Notes in Pure and Appl. Math., 187, Dekker, New York, pp. 225–260.
* Hay, Nick, "Universal Semimeasures: An Introduction", CDMTCS Research Report Series, University of Auckland, Feb. 2007.
* Jain, Sanjay; Osherson, Daniel; Royer, James; Sharma, Arun, ''Systems that Learn: An Introduction to Learning Theory'' (second edition), MIT Press, 1999.
* Li, Ming; Vitanyi, Paul, ''An Introduction to Kolmogorov Complexity and Its Applications'', 2nd Edition, Springer Verlag, 1997.
* Osherson, Daniel; Stob, Michael; Weinstein, Scott, ''Systems That Learn, An Introduction to Learning Theory for Cognitive and Computer Scientists'', MIT Press, 1986.
* Solomonoff, Ray (June 1964), "A Formal Theory of Inductive Inference Part II", ''Information and Control'', 7 (2): 224–254. http://world.std.com/~rjs/1964pt2.pdf


External links


* Algorithmic probability – Scholarpedia