Sample complexity

The sample complexity of a machine learning algorithm represents the number of training samples that it needs in order to successfully learn a target function. More precisely, the sample complexity is the number of training samples that we need to supply to the algorithm, so that the function returned by the algorithm is within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1.

There are two variants of sample complexity:
* The weak variant fixes a particular input-output distribution;
* The strong variant takes the worst-case sample complexity over all input-output distributions.

The no-free-lunch theorem, discussed below, proves that, in general, the strong sample complexity is infinite, i.e. there is no algorithm that can learn the globally optimal target function using a finite number of training samples. However, if we are only interested in a particular class of target functions (e.g., only linear functions), then the sample complexity is finite, and it depends linearly on the VC dimension of the class of target functions.


Definition

Let X be a space which we call the input space, and Y be a space which we call the output space, and let Z denote the product X \times Y. For example, in the setting of binary classification, X is typically a finite-dimensional vector space and Y is the set \{-1, 1\}.

Fix a hypothesis space \mathcal H of functions h \colon X \to Y. A learning algorithm over \mathcal H is a computable map from Z^* to \mathcal H. In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from X to Y. Typical learning algorithms include empirical risk minimization, without or with Tikhonov regularization.

Fix a loss function \mathcal L \colon Y \times Y \to \R_{\geq 0}, for example, the square loss \mathcal L(y, y') = (y - y')^2, where y' = h(x). For a given distribution \rho on X \times Y, the expected risk of a hypothesis (a function) h \in \mathcal H is

:\mathcal E(h) := \mathbb E_\rho[\mathcal L(h(x), y)] = \int_{X \times Y} \mathcal L(h(x), y) \, d\rho(x, y).

In our setting, we have h = \mathcal A(S_n), where \mathcal A is a learning algorithm and S_n = ((x_1, y_1), \ldots, (x_n, y_n)) \sim \rho^n is a sequence of vectors which are all drawn independently from \rho. Define the optimal risk

:\mathcal E^*_{\mathcal H} = \inf_{h \in \mathcal H} \mathcal E(h).

Set h_n = \mathcal A(S_n) for each n. Note that h_n is a random variable and depends on the random variable S_n, which is drawn from the distribution \rho^n. The algorithm \mathcal A is called consistent if \mathcal E(h_n) converges in probability to \mathcal E^*_{\mathcal H}. In other words, for all \epsilon, \delta > 0, there exists a positive integer N such that, for all n \geq N, we have

:\Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_{\mathcal H} \geq \epsilon] < \delta.

The sample complexity of \mathcal A is then the minimum N for which this holds, as a function of \rho, \epsilon, and \delta. We write the sample complexity as N(\rho, \epsilon, \delta) to emphasize that this value of N depends on \rho, \epsilon, and \delta. If \mathcal A is not consistent, then we set N(\rho, \epsilon, \delta) = \infty. If there exists an algorithm for which N(\rho, \epsilon, \delta) is finite, then we say that the hypothesis space \mathcal H is learnable.

In other words, the sample complexity N(\rho, \epsilon, \delta) defines the rate of consistency of the algorithm: given a desired accuracy \epsilon and confidence \delta, one needs to sample N(\rho, \epsilon, \delta) data points to guarantee that the risk of the output function is within \epsilon of the best possible, with probability at least 1 - \delta.

In probably approximately correct (PAC) learning, one is concerned with whether the sample complexity is ''polynomial'', that is, whether N(\rho, \epsilon, \delta) is bounded by a polynomial in 1/\epsilon and 1/\delta. If N(\rho, \epsilon, \delta) is polynomial for some learning algorithm, then one says that the hypothesis space \mathcal H is PAC-learnable. Note that this is a stronger notion than being learnable.
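These definitions can be made concrete with a small simulation. The sketch below (all distributions, grid values, and helper names are illustrative assumptions, not part of any standard API) uses a finite hypothesis space of threshold classifiers and empirical risk minimization; as n grows, the true risk of the ERM output h_n should approach the optimal risk in the class.

```python
import random

# Finite hypothesis space H: threshold classifiers h_t(x) = 1 if x >= t else -1,
# for thresholds t on a grid (an assumed toy class for illustration).
thresholds = [i / 10 for i in range(11)]

def h(t, x):
    return 1 if x >= t else -1

def sample(n, rng):
    # Assumed distribution rho: x uniform on [0, 1], label +1 iff x >= 0.35.
    return [((x := rng.random()), 1 if x >= 0.35 else -1) for _ in range(n)]

def empirical_risk(t, data):
    # Fraction of training samples misclassified by h_t (0-1 loss).
    return sum(1 for x, y in data if h(t, x) != y) / len(data)

def erm(data):
    # Empirical risk minimization: pick the hypothesis with lowest training error.
    return min(thresholds, key=lambda t: empirical_risk(t, data))

rng = random.Random(0)
for n in (10, 100, 1000):
    t_hat = erm(sample(n, rng))
    # True risk of h_t under rho is |t - 0.35|; the best in H achieves 0.05,
    # so the excess risk below is nonnegative and should shrink with n.
    excess = abs(t_hat - 0.35) - 0.05
    print(n, t_hat, round(excess, 3))
```

Here the strong notion is irrelevant: the distribution is fixed, so this only illustrates the weak (distribution-specific) variant of sample complexity.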


Unrestricted hypothesis space: infinite sample complexity

One can ask whether there exists a learning algorithm so that the sample complexity is finite in the strong sense, that is, there is a bound on the number of samples needed so that the algorithm can learn any distribution over the input-output space with a specified target error. More formally, one asks whether there exists a learning algorithm \mathcal A such that, for all \epsilon, \delta > 0, there exists a positive integer N such that for all n \geq N, we have

:\sup_\rho \left( \Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_{\mathcal H} \geq \epsilon] \right) < \delta,

where h_n = \mathcal A(S_n), with S_n = ((x_1, y_1), \ldots, (x_n, y_n)) \sim \rho^n as above. The no-free-lunch theorem says that without restrictions on the hypothesis space \mathcal H, this is not the case: there always exist "bad" distributions for which the sample complexity is arbitrarily large. Thus, in order to make statements about the rate of convergence of the quantity

:\sup_\rho \left( \Pr_{\rho^n}[\mathcal E(h_n) - \mathcal E^*_{\mathcal H} \geq \epsilon] \right),

one must either
*constrain the space of probability distributions \rho, e.g. via a parametric approach, or
*constrain the space of hypotheses \mathcal H, as in distribution-free approaches.
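The flavor of the "bad distribution" argument can be illustrated numerically (this is a sketch of the intuition, not a proof of the theorem; the domain size, sample size, and memorizing learner are all assumptions chosen for the demo). When labels are assigned uniformly at random over a large input space, a learner that memorizes its training samples can do no better than chance on unseen inputs:

```python
import random

rng = random.Random(1)
domain = list(range(10_000))
# Adversarial-style target: each input gets an independent random label,
# so training samples carry no information about unseen inputs.
truth = {x: rng.choice([-1, 1]) for x in domain}

train = rng.sample(domain, 100)
memo = {x: truth[x] for x in train}  # a learner that memorizes its samples

def predict(x):
    return memo.get(x, 1)  # arbitrary default guess on unseen inputs

unseen = [x for x in domain if x not in memo]
err = sum(1 for x in unseen if predict(x) != truth[x]) / len(unseen)
print(round(err, 2))  # chance level on the 9,900 unseen inputs
```

Any other fixed prediction rule would fare the same in expectation on this target, which is why finite strong sample complexity fails without restricting \mathcal H or \rho.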


Restricted hypothesis space: finite sample-complexity

The latter approach leads to concepts such as VC dimension and Rademacher complexity, which control the complexity of the space \mathcal H. A smaller hypothesis space introduces more bias into the inference process, meaning that \mathcal E^*_{\mathcal H} may be greater than the best possible risk in a larger space. However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce more uniformly consistent functions. This trade-off leads to the concept of regularization.

It is a theorem from VC theory that the following three statements are equivalent for a hypothesis space \mathcal H:
# \mathcal H is PAC-learnable.
# The VC dimension of \mathcal H is finite.
# \mathcal H is a uniform Glivenko-Cantelli class.

This gives a way to prove that certain hypothesis spaces are PAC-learnable, and by extension, learnable.


An example of a PAC-learnable hypothesis space

Let X = \R^d, Y = \{-1, 1\}, and let \mathcal H be the space of affine classifiers on X, that is, functions of the form x \mapsto \sgn(\langle w, x \rangle + b) for some w \in \R^d, b \in \R. This is the linear classification with offset learning problem. For example, in the plane (d = 2), any three non-collinear points can be shattered, but the four vertices of a square cannot, since no affine function can be positive on two diagonally opposite vertices and negative on the remaining two. In general, the VC dimension of \mathcal H is d + 1, so it is finite. It follows by the above characterization of PAC-learnable classes that \mathcal H is PAC-learnable, and by extension, learnable.
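The shattering claim for the square can be checked by brute force. The sketch below (the grid search over (w, b) is an assumption; it is coarse but sufficient for this particular point set) enumerates all 16 labelings of the square's vertices and reports which ones no affine classifier on the grid can realize:

```python
import itertools

# The four vertices of the unit square in R^2, in a fixed order.
points = [(0, 0), (1, 0), (0, 1), (1, 1)]

def separable(labels):
    # Search a coarse grid of half-integer (w1, w2, b) for an affine
    # classifier x -> sign(w1*x1 + w2*x2 + b) realizing the labeling.
    grid = [i / 2 for i in range(-8, 9)]
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all((1 if w1 * x + w2 * y + b >= 0 else -1) == l
               for (x, y), l in zip(points, labels)):
            return True
    return False

# Only the two XOR labelings (diagonally opposite vertices positive)
# should fail, confirming that the square is not shattered.
failures = [ls for ls in itertools.product([-1, 1], repeat=4)
            if not separable(ls)]
print(failures)
```

The two failing labelings are exactly the XOR patterns, matching the geometric argument above; the other 14 labelings all admit half-integer separators, which is why the coarse grid suffices here.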


Sample-complexity bounds

Suppose \mathcal H is a class of binary functions (functions to \{-1, 1\}). Then, \mathcal H is (\epsilon, \delta)-PAC-learnable with a sample of size

:N = O\left(\frac{VC(\mathcal H) + \ln\frac{1}{\delta}}{\epsilon}\right),

where VC(\mathcal H) is the VC dimension of \mathcal H. Moreover, any (\epsilon, \delta)-PAC-learning algorithm for \mathcal H must have sample complexity

:N = \Omega\left(\frac{VC(\mathcal H) + \ln\frac{1}{\delta}}{\epsilon}\right).

Thus, the sample complexity is a linear function of the VC dimension of the hypothesis space.

Suppose \mathcal H is a class of real-valued functions with range in [0, T]. Then, \mathcal H is (\epsilon, \delta)-PAC-learnable with a sample of size

:N = O\left(T^2 \frac{PD(\mathcal H) \ln\frac{T}{\epsilon} + \ln\frac{1}{\delta}}{\epsilon^2}\right),

where PD(\mathcal H) is Pollard's pseudo-dimension of \mathcal H.
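As a quick numerical sketch of the binary-class upper bound: the O(·) hides a constant that depends on the proof, so the helper below takes it as an assumed parameter C (C = 1 is a placeholder, not a derived value); the point is only how N scales with VC(\mathcal H), \epsilon, and \delta.

```python
from math import ceil, log

def pac_upper_bound(vc, eps, delta, C=1.0):
    # Sample size of the form N = C * (VC(H) + ln(1/delta)) / eps.
    # C is an assumed constant; only the scaling is meaningful.
    return ceil(C * (vc + log(1 / delta)) / eps)

# Affine classifiers on R^2 have VC dimension d + 1 = 3.
print(pac_upper_bound(3, eps=0.05, delta=0.01))   # -> 153 (up to the constant C)

# Halving epsilon doubles the bound; tightening delta costs only ln(1/delta).
print(pac_upper_bound(3, eps=0.025, delta=0.01))
print(pac_upper_bound(3, eps=0.05, delta=0.001))
```

Note the contrast with the real-valued bound, which scales as 1/\epsilon^2 rather than 1/\epsilon.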


Other settings

In addition to the supervised learning setting, sample complexity is relevant to semi-supervised learning problems, including active learning, where the algorithm can ask for labels of specifically chosen inputs in order to reduce the cost of obtaining many labels. The concept of sample complexity also shows up in reinforcement learning, online learning, and unsupervised algorithms, e.g. for dictionary learning.


Efficiency in robotics

A high sample complexity means that many samples are needed, for example when running a Monte Carlo tree search, which amounts to a model-free brute-force search in the state space. In contrast, a highly efficient algorithm has a low sample complexity. Possible techniques for reducing the sample complexity are metric learning and model-based reinforcement learning.

