In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to
probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power of a test set.
Occam learnability implies PAC learning, and for a wide variety of
concept classes, the converse is also true: PAC learnability implies Occam learnability.
Introduction
Occam learning is named after Occam's razor, which is a principle stating that, given all other things being equal, a shorter explanation for observed data should be favored over a lengthier explanation. The theory of Occam learning is a formal and mathematical justification for this principle. It was first shown by Blumer, et al.
that Occam learning implies PAC learning, which is the standard model of learning in computational learning theory. In other words, ''parsimony'' (of the output hypothesis) implies ''predictive power''.
Definition of Occam learning
The succinctness of a concept <math>c</math> in concept class <math>\mathcal{C}</math> can be expressed by the length <math>size(c)</math> of the shortest bit string that can represent <math>c</math> in <math>\mathcal{C}</math>. Occam learning connects the succinctness of a learning algorithm's output to its predictive power on unseen data.
Let <math>\mathcal{C}</math> and <math>\mathcal{H}</math> be concept classes containing target concepts and hypotheses respectively. Then, for constants <math>\alpha \ge 0</math> and <math>0 \le \beta < 1</math>, a learning algorithm <math>L</math> is an <math>(\alpha,\beta)</math>-Occam algorithm for <math>\mathcal{C}</math> using <math>\mathcal{H}</math> iff, given a set <math>S = \{x_1, \dots, x_m\}</math> of <math>m</math> samples labeled according to a concept <math>c \in \mathcal{C}</math>, <math>L</math> outputs a hypothesis <math>h \in \mathcal{H}</math> such that
* <math>h</math> is consistent with <math>c</math> on <math>S</math> (that is, <math>h(x) = c(x)</math> for all <math>x \in S</math>), and
* <math>size(h) \le (n \cdot size(c))^{\alpha} m^{\beta}</math>
[Kearns, M. J., & Vazirani, U. V. (1994). ''An Introduction to Computational Learning Theory'', chapter 2. MIT Press.][Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). ''Occam's razor''. Information Processing Letters, 24(6), 377–380.]
where <math>n</math> is the maximum length of any sample <math>x \in S</math>. An Occam algorithm is called ''efficient'' if it runs in time polynomial in <math>n</math>, <math>m</math>, and <math>size(c)</math>. We say a concept class <math>\mathcal{C}</math> is ''Occam learnable'' with respect to a hypothesis class <math>\mathcal{H}</math> if there exists an efficient Occam algorithm for <math>\mathcal{C}</math> using <math>\mathcal{H}</math>.
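As a concrete illustration (an example of ours, not taken from the references above), consider the classic elimination algorithm for monotone conjunctions over <math>n</math> Boolean variables: starting from the conjunction of all variables, delete every variable that is 0 in some positive example. The output is always consistent with the target and contains at most <math>n</math> literals, independent of the number of samples <math>m</math>, so under the simplifying assumption that size is measured in literals it behaves as an Occam algorithm with <math>\beta = 0</math>. The Python sketch below (the function name is our own) shows the idea.

<syntaxhighlight lang="python">
def learn_monotone_conjunction(samples):
    """Elimination algorithm for monotone conjunctions.

    samples: list of (x, label) pairs, where x is a tuple of 0/1 values and
    label is 0 or 1.  Assuming the samples are labeled by some monotone
    conjunction, returns the set of variable indices whose conjunction is
    consistent with every sample.
    """
    n = len(samples[0][0])
    hypothesis = set(range(n))            # start with the conjunction of all variables
    for x, label in samples:
        if label == 1:                     # positive examples rule out variables set to 0
            hypothesis -= {i for i in range(n) if x[i] == 0}
    return hypothesis                      # at most n literals, regardless of the number of samples

# Target concept: x0 AND x2 over n = 4 variables.
samples = [((1, 1, 1, 0), 1), ((1, 0, 1, 1), 1), ((0, 1, 1, 1), 0)]
print(sorted(learn_monotone_conjunction(samples)))   # [0, 2]
</syntaxhighlight>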
The relation between Occam and PAC learning
Occam learnability implies PAC learnability, as the following theorem of Blumer, et al.
shows:
Theorem (''Occam learning implies PAC learning'')
Let <math>L</math> be an efficient <math>(\alpha,\beta)</math>-Occam algorithm for <math>\mathcal{C}</math> using <math>\mathcal{H}</math>. Then there exists a constant <math>a > 0</math> such that for any <math>0 < \epsilon, \delta < 1</math> and any distribution <math>\mathcal{D}</math>, given <math>m \ge a \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \left( \frac{(n \cdot size(c))^{\alpha}}{\epsilon} \right)^{\frac{1}{1-\beta}} \right)</math> samples drawn from <math>\mathcal{D}</math> and labelled according to a concept <math>c \in \mathcal{C}</math> of length <math>n</math> bits each, the algorithm <math>L</math> will output a hypothesis <math>h \in \mathcal{H}</math> such that <math>error(h) \le \epsilon</math> with probability at least <math>1 - \delta</math>.
Here, <math>error(h)</math> is with respect to the concept <math>c</math> and distribution <math>\mathcal{D}</math>. This implies that the algorithm <math>L</math> is also a PAC learner for the concept class <math>\mathcal{C}</math> using hypothesis class <math>\mathcal{H}</math>.
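For instance (our own specialization, not part of the original statement), an algorithm with <math>\alpha = 1</math> and <math>\beta = 0</math>, such as the conjunction learner sketched above, yields the familiar PAC sample complexity
<math>m = O\!\left(\frac{1}{\epsilon}\left(\log\frac{1}{\delta} + n \cdot size(c)\right)\right),</math>
since the second term of the bound reduces to <math>(n \cdot size(c))/\epsilon</math>.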
A slightly more general formulation is as follows:
Theorem (''Occam learning implies PAC learning, cardinality version'')
Let <math>0 < \epsilon, \delta < 1</math>. Let <math>L</math> be an algorithm such that, given <math>m</math> samples drawn from a fixed but unknown distribution <math>\mathcal{D}</math> and labeled according to a concept <math>c \in \mathcal{C}</math> of length <math>n</math> bits each, <math>L</math> outputs a hypothesis <math>h \in \mathcal{H}_{n,m}</math> that is consistent with the labeled samples. Then, there exists a constant <math>b</math> such that if <math>m \ge \frac{b}{\epsilon}\left(\log |\mathcal{H}_{n,m}| + \log \frac{1}{\delta}\right)</math>, then <math>L</math> is guaranteed to output a hypothesis <math>h \in \mathcal{H}_{n,m}</math> such that <math>error(h) \le \epsilon</math> with probability at least <math>1 - \delta</math>.
While the above theorems show that Occam learning is sufficient for PAC learning, they do not say anything about ''necessity''. Board and Pitt show that, for a wide variety of concept classes, Occam learning is in fact necessary for PAC learning. They proved that for any concept class that is ''polynomially closed under exception lists'', PAC learnability implies the existence of an Occam algorithm for that concept class. Concept classes that are polynomially closed under exception lists include Boolean formulas, circuits,
deterministic finite automata, decision lists, decision trees, and other geometrically-defined concept classes.
A concept class <math>\mathcal{C}</math> is polynomially closed under exception lists if there exists a polynomial-time algorithm <math>A</math> such that, when given the representation of a concept <math>c \in \mathcal{C}</math> and a finite list <math>E</math> of ''exceptions'', it outputs a representation of a concept <math>c' \in \mathcal{C}</math> such that the concepts <math>c</math> and <math>c'</math> agree except on the set <math>E</math>.
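To make this closure property concrete, here is a minimal sketch (our own illustration; the data representation and function names are hypothetical) for the special case of decision lists: each exception point can be handled by prepending a rule whose condition matches exactly that point and whose output is the flipped label, so only <math>|E|</math> rules of length <math>n</math> are added and the construction runs in polynomial time.

<syntaxhighlight lang="python">
def evaluate(decision_list, x, default=0):
    """Return the output of the first rule whose condition matches x."""
    for condition, output in decision_list:
        if all(x[i] == v for i, v in condition.items()):
            return output
    return default

def add_exceptions(decision_list, exceptions):
    """Sketch of exception-list closure for decision lists.

    decision_list: list of (condition, output) rules, where condition maps
        variable index -> required 0/1 value.
    exceptions: list of input tuples on which the new concept must disagree
        with the old one.
    """
    patched = [({i: v for i, v in enumerate(x)}, 1 - evaluate(decision_list, x))
               for x in exceptions]
    return patched + decision_list       # only |exceptions| extra rules are prepended

# Original concept: output 1 iff x0 == 1; flip its value on the point (1, 0).
c = [({0: 1}, 1)]
c_prime = add_exceptions(c, [(1, 0)])
print(evaluate(c_prime, (1, 0)), evaluate(c_prime, (1, 1)))   # 0 1
</syntaxhighlight>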
Proof that Occam learning implies PAC learning
We first prove the cardinality version. Call a hypothesis <math>h \in \mathcal{H}_{n,m}</math> ''bad'' if <math>error(h) \ge \epsilon</math>, where again <math>error(h)</math> is with respect to the true concept <math>c</math> and the underlying distribution <math>\mathcal{D}</math>. The probability that a set of samples <math>S</math> is consistent with a fixed bad hypothesis <math>h</math> is at most <math>(1-\epsilon)^m</math>, by the independence of the samples. By the union bound, the probability that there exists a bad hypothesis in <math>\mathcal{H}_{n,m}</math> consistent with <math>S</math> is at most <math>|\mathcal{H}_{n,m}|(1-\epsilon)^m</math>, which is at most <math>\delta</math> if <math>m \ge \frac{1}{\epsilon}\left(\ln |\mathcal{H}_{n,m}| + \ln \frac{1}{\delta}\right)</math>. This concludes the proof of the second theorem above.
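In slightly more detail (a standard step spelled out here for completeness), using the inequality <math>(1-\epsilon)^m \le e^{-\epsilon m}</math>, the union bound gives
<math>|\mathcal{H}_{n,m}|(1-\epsilon)^m \;\le\; |\mathcal{H}_{n,m}|\, e^{-\epsilon m} \;\le\; \delta \quad\text{whenever}\quad m \ge \frac{1}{\epsilon}\left(\ln |\mathcal{H}_{n,m}| + \ln \frac{1}{\delta}\right).</math>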
Using the second theorem, we can prove the first theorem. Since we have an <math>(\alpha,\beta)</math>-Occam algorithm, any hypothesis output by <math>L</math> can be represented by at most <math>(n \cdot size(c))^{\alpha} m^{\beta}</math> bits, and thus <math>|\mathcal{H}_{n,m}| \le 2^{(n \cdot size(c))^{\alpha} m^{\beta}}</math>. The condition of the cardinality version is then satisfied if we set <math>m \ge a \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \left( \frac{(n \cdot size(c))^{\alpha}}{\epsilon} \right)^{\frac{1}{1-\beta}} \right)</math> for some constant <math>a > 0</math>. Thus, by the cardinality version theorem, <math>L</math> will output a consistent hypothesis <math>h</math> with <math>error(h) \le \epsilon</math> with probability at least <math>1 - \delta</math>. This concludes the proof of the first theorem above.
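As a quick numerical illustration (our own; the constant <math>a</math> is unspecified by the theorem, so <math>a = 1</math> is used purely for concreteness), the sample bound can be evaluated directly:

<syntaxhighlight lang="python">
import math

def occam_sample_bound(alpha, beta, eps, delta, n, size_c, a=1.0):
    """Evaluate the sample bound m >= a * ((1/eps) * log(1/delta)
    + ((n * size_c)**alpha / eps)**(1 / (1 - beta))) from the theorem.
    The constant a is unknown; a = 1 is used only for illustration."""
    return a * ((1 / eps) * math.log(1 / delta)
                + ((n * size_c) ** alpha / eps) ** (1 / (1 - beta)))

# An (alpha, beta) = (1, 0) Occam algorithm, e.g. the conjunction learner above.
print(occam_sample_bound(alpha=1, beta=0, eps=0.1, delta=0.05, n=20, size_c=3))  # about 630
</syntaxhighlight>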
Improving sample complexity for common problems
Though Occam and PAC learnability are equivalent, the Occam framework can be used to produce tighter bounds on the sample complexity of classical problems including conjunctions,
conjunctions with few relevant variables, and decision lists.
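For example (our own illustration of how the cardinality version yields such bounds), if an Occam algorithm for conjunctions always outputs a conjunction with at most <math>k</math> literals over <math>n</math> variables, then the hypothesis class it uses satisfies <math>|\mathcal{H}_{n,m}| \le (2n+1)^k</math>, and the cardinality version of the theorem gives a sample complexity of
<math>m = O\!\left(\frac{1}{\epsilon}\left(k \log n + \log\frac{1}{\delta}\right)\right),</math>
compared with the roughly <math>n/\epsilon</math> samples needed when the output conjunction may mention all <math>n</math> variables.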
Extensions
Occam algorithms have also been shown to be successful for PAC learning in the presence of errors, probabilistic concepts, function learning and Markovian non-independent examples.
[Aldous, D., & Vazirani, U. (1990). ''A Markovian extension of Valiant's learning model''. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science (pp. 392–396). IEEE.]
See also
* Structural risk minimization
* Computational learning theory
References
{{reflist|30em}}
Further reading
* Blumer, A.; Ehrenfeucht, A.; Haussler, D.; Warmuth, M. K. (1989). ''Learnability and the Vapnik–Chervonenkis dimension''. Journal of the ACM, 36(4):929–965.
Theoretical computer science
Computational learning theory