Early Stopping

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a model with an iterative method, such as gradient descent. Such methods update the model to make it better fit the training data with each iteration. Up to a point, this improves the model's performance on data outside of the training set (e.g., the validation set). Past that point, however, improving the model's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to overfit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.


Background

This section presents some of the basic machine-learning concepts required for a description of early stopping methods.


Overfitting

Machine learning algorithms train a model based on a finite set of training data. During this training, the model is evaluated based on how well it predicts the observations contained in the training set. In general, however, the goal of a machine learning scheme is to produce a model that generalizes, that is, that predicts previously unseen observations. Overfitting occurs when a model fits the data in the training set well, while incurring a larger generalization error.
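As a concrete illustration, the following sketch (assuming only NumPy; the data set, noise level and polynomial degrees are arbitrary choices for the example) fits polynomials of increasing degree to a few noisy samples of a sine curve. The training error keeps shrinking as the model gains capacity, while the error on a held-out sample eventually grows, which is the overfitting described above.

import numpy as np

# Noisy samples of a sine curve: a small training set and a larger held-out set.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 1.0, 10))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 10)
x_valid = np.sort(rng.uniform(0.0, 1.0, 100))
y_valid = np.sin(2 * np.pi * x_valid) + rng.normal(0.0, 0.2, 100)

# Least-squares polynomial fits of increasing degree: training error falls,
# but the held-out (generalization) error typically rises for high degrees.
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    valid_mse = np.mean((np.polyval(coeffs, x_valid) - y_valid) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, held-out MSE {valid_mse:.3f}")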


Regularization

Regularization, in the context of machine learning, refers to the process of modifying a learning algorithm so as to prevent overfitting. This generally involves imposing some sort of smoothness constraint on the learned model. This smoothness may be enforced explicitly, by fixing the number of parameters in the model, or by augmenting the cost function, as in Tikhonov regularization. Tikhonov regularization, along with principal component regression and many other regularization schemes, falls under the umbrella of spectral regularization, regularization characterized by the application of a filter. Early stopping also belongs to this class of methods.
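As a concrete instance of augmenting the cost function, the sketch below (NumPy assumed; the synthetic data and the penalty values are illustrative) implements Tikhonov (ridge) regularization for linear least squares. The penalty \lambda \lVert w \rVert^2 added to the squared-error cost leads to the closed-form solution (X^\top X + \lambda I)^{-1} X^\top y, and larger penalties shrink the learned coefficients toward zero.

import numpy as np

def ridge_fit(X, y, lam):
    # Minimizer of ||X w - y||^2 + lam * ||w||^2 (Tikhonov regularization).
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(0.0, 0.1, 50)

for lam in (0.0, 0.1, 10.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:5.1f}  ||w|| = {np.linalg.norm(w):.3f}")  # norm shrinks as lam grows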


Gradient descent methods

Gradient descent methods are first-order, iterative optimization methods. Each iteration updates an approximate solution to the optimization problem by taking a step in the direction of the negative of the gradient of the objective function. By choosing the step size appropriately, such a method can be made to converge to a local minimum of the objective function. Gradient descent is used in machine learning by defining a ''loss function'' that reflects the error of the learner on the training set and then minimizing that function.
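The sketch below (NumPy assumed; the step size, iteration count and synthetic data are illustrative) shows gradient descent applied to a least-squares training loss: each iteration takes a step against the gradient, and with a suitable step size the iterates approach a minimizer.

import numpy as np

def gradient_descent(X, y, step_size=0.1, n_iters=200):
    # Minimize the training loss L(w) = (1/2m) * ||X w - y||^2 by gradient descent.
    m, n = X.shape
    w = np.zeros(n)                       # start at the origin
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / m      # gradient of the loss at the current iterate
        w -= step_size * grad             # step in the direction of the negative gradient
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.1, 100)
print(gradient_descent(X, y))             # close to the true coefficients [2, -1, 0.5]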


Early stopping based on analytical results


Early stopping in statistical learning theory

Early stopping can be used to regularize non-parametric regression problems encountered in machine learning. For a given input space, X, output space, Y, and samples drawn from an unknown probability measure, \rho, on Z = X \times Y, the goal of such problems is to approximate a ''regression function'', f_\rho, given by

: f_\rho(x) = \int_Y y \, d\rho(y\mid x), \quad x \in X,

where \rho(y\mid x) is the conditional distribution at x induced by \rho. One common choice for approximating the regression function is to use functions from a reproducing kernel Hilbert space. These spaces can be infinite-dimensional, in which case they can supply solutions that overfit training sets of arbitrary size. Regularization is, therefore, especially important for these methods. One way to regularize non-parametric regression problems is to apply an early stopping rule to an iterative procedure such as gradient descent. The early stopping rules proposed for these problems are based on analysis of upper bounds on the generalization error as a function of the iteration number. They yield prescriptions for the number of iterations to run that can be computed prior to starting the solution process.


Example: Least-squares loss

(Adapted from Yao, Rosasco and Caponnetto, 2007) Let X \subseteq \mathbb{R}^n and Y = \mathbb{R}. Given a set of samples

:\mathbf{z} = \{(x_i, y_i) \in X \times Y : i = 1, \dots, m\} \in Z^m,

drawn independently from \rho, minimize the functional

: \mathcal{E}(f) = \int_{X \times Y} \left(f(x) - y\right)^2 \, d\rho

where f is a member of the reproducing kernel Hilbert space \mathcal{H}. That is, minimize the expected risk for a least-squares loss function. Since \mathcal{E} depends on the unknown probability measure \rho, it cannot be used for computation. Instead, consider the following empirical risk

: \mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i=1}^m \left(f(x_i) - y_i\right)^2.

Let f_t and f_t^{\mathbf{z}} be the ''t''-th iterates of gradient descent applied to the expected and empirical risks, respectively, where both iterations are initialized at the origin and both use the step size \gamma_t. The f_t form the ''population iteration'', which converges to f_\rho but cannot be used in computation, while the f_t^{\mathbf{z}} form the ''sample iteration'', which usually converges to an overfitting solution.

We want to control the difference between the expected risk of the sample iteration and the minimum expected risk, that is, the expected risk of the regression function:

:\mathcal{E}(f_t^{\mathbf{z}}) - \mathcal{E}(f_\rho)

This difference can be rewritten as the sum of two terms: the difference in expected risk between the sample and population iterations, and that between the population iteration and the regression function:

:\mathcal{E}(f_t^{\mathbf{z}}) - \mathcal{E}(f_\rho) = \left[\mathcal{E}(f_t^{\mathbf{z}}) - \mathcal{E}(f_t)\right] + \left[\mathcal{E}(f_t) - \mathcal{E}(f_\rho)\right]

This equation presents a bias-variance tradeoff, which is then solved to give an optimal stopping rule that may depend on the unknown probability distribution. That rule has associated probabilistic bounds on the generalization error. For the analysis leading to the early stopping rule and bounds, the reader is referred to the original article. In practice, data-driven methods, e.g. cross-validation, can be used to obtain an adaptive stopping rule.
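The sketch below (NumPy assumed; the Gaussian kernel, its width, the step size and the synthetic data are illustrative choices, not those of Yao, Rosasco and Caponnetto) computes the sample iteration for kernel least squares: gradient descent on the empirical risk, initialized at the origin, with the iteration number t acting as the regularization parameter. A held-out sample stands in for the unknown expected risk.

import numpy as np

def gaussian_kernel(A, B, width=0.2):
    # k(a, b) = exp(-||a - b||^2 / (2 * width^2)) evaluated for all pairs of rows.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, (40, 1))
y = np.sin(3 * x[:, 0]) + rng.normal(0.0, 0.3, 40)   # noisy samples z drawn from rho
x_test = np.linspace(-1.0, 1.0, 200)[:, None]
y_test = np.sin(3 * x_test[:, 0])                     # values of the regression function f_rho

K = gaussian_kernel(x, x)
K_test = gaussian_kernel(x_test, x)
m = len(y)
c = np.zeros(m)                                       # f_t^z = sum_i c_i k(x_i, .), starts at the origin
step = m / np.linalg.eigvalsh(K).max()                # constant step size keeping the iteration stable

for t in range(1, 501):
    c -= step * (K @ c - y) / m                       # one gradient step on the empirical risk
    if t in (5, 50, 500):
        test_err = np.mean((K_test @ c - y_test) ** 2)
        print(f"t = {t:3d}  held-out error {test_err:.4f}")  # proxy for E(f_t^z) - E(f_rho)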


Early stopping in boosting

Boosting refers to a family of algorithms in which a set of weak learners (learners that are only slightly correlated with the true process) are combined to produce a strong learner. It has been shown, for several boosting algorithms (including AdaBoost), that regularization via early stopping can provide guarantees of consistency, that is, that the result of the algorithm approaches the true solution as the number of samples goes to infinity.


L2-boosting

Boosting methods have close ties to the gradient descent methods described above, and can be regarded as boosting methods based on the L_2 loss: ''L2Boost''.
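A minimal sketch of this idea follows (NumPy assumed; the stump base learner, shrinkage factor and number of rounds are illustrative choices). Each round fits a regression stump to the current residuals of the squared (L_2) loss and adds a shrunken copy to the ensemble; capping the number of rounds is the early stopping that regularizes the fit.

import numpy as np

def fit_stump(x, r):
    # Regression stump on 1-D inputs: threshold and the two level values that
    # minimize the squared error against the residuals r.
    best = (x[0], r.mean(), r.mean(), np.inf)
    for thr in np.unique(x):
        left, right = r[x <= thr], r[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[3]:
            best = (thr, left.mean(), right.mean(), err)
    return best[:3]

def predict_stump(stump, x):
    thr, left_val, right_val = stump
    return np.where(x <= thr, left_val, right_val)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 80)
y = np.sin(3 * x) + rng.normal(0.0, 0.2, 80)

n_rounds, shrinkage = 20, 0.5          # stopping after few rounds regularizes the ensemble
ensemble, pred = [], np.zeros_like(y)
for _ in range(n_rounds):
    stump = fit_stump(x, y - pred)     # fit the current L2 residuals
    ensemble.append(stump)
    pred += shrinkage * predict_stump(stump, x)

print("training MSE after", n_rounds, "rounds:", np.mean((pred - y) ** 2))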


Validation-based early stopping

These early stopping rules work by splitting the original training set into a new training set and a validation set. The error on the validation set is used as a proxy for the generalization error in determining when overfitting has begun. These methods are employed in the training of many iterative machine learning algorithms, including neural networks.

Prechelt gives the following summary of a naive implementation of holdout-based early stopping:

1. Split the training data into a training set and a validation set, e.g. in a 2-to-1 proportion.
2. Train only on the training set and evaluate the per-example error on the validation set once in a while, e.g. after every fifth epoch.
3. Stop training as soon as the error on the validation set is higher than it was the last time it was checked.
4. Use the weights the network had in that previous step as the result of the training run.

Cross-validation is an alternative that is applicable to non-time-series scenarios. Cross-validation involves splitting multiple partitions of the data into training and validation sets, instead of using a single partition.

Even this simple procedure is complicated in practice by the fact that the validation error may fluctuate during training, producing multiple local minima. This complication has led to the creation of many ad hoc rules for deciding when overfitting has truly begun.
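The sketch below (NumPy assumed; the linear model, step size, and patience rule are illustrative, not Prechelt's exact procedure) shows the basic mechanism: train iteratively on the training split, track the error on the validation split as a proxy for the generalization error, remember the parameters from the best epoch, and stop once the validation error has failed to improve for a fixed number of epochs. Such a "patience" threshold is one of the ad hoc rules mentioned above for coping with a fluctuating validation error.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0.0, 0.5, 200)
X_tr, y_tr = X[:150], y[:150]        # new, smaller training set
X_va, y_va = X[150:], y[150:]        # held-out validation set

w = np.zeros(X.shape[1])
best_w, best_err = w.copy(), np.inf
patience, bad_epochs = 5, 0

for epoch in range(1000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= 0.05 * grad                                    # one training step (gradient descent)
    val_err = np.mean((X_va @ w - y_va) ** 2)           # proxy for the generalization error
    if val_err < best_err:
        best_err, best_w, bad_epochs = val_err, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                      # stop: no improvement for `patience` epochs
            break

print(f"stopped after epoch {epoch} with validation MSE {best_err:.4f}")
# best_w holds the parameters from the epoch with the lowest validation error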


See also

* Overfitting, which early stopping is one of the methods used to prevent
* Generalization error
* Regularization (mathematics)
* Statistical learning theory
* Boosting (machine learning)
* Cross-validation, in particular using a "validation set"
* Neural networks

