Double descent in statistics and machine learning is the phenomenon where a model with a small number of parameters and a model with an extremely large number of parameters both have a small training error, but a model whose number of parameters is about the same as the number of data points used to train the model will have a much greater test error than one with a much larger number of parameters. This phenomenon has been considered surprising, as it contradicts assumptions about overfitting in classical machine learning.
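A minimal numerical sketch of the phenomenon (illustrative only, not from any cited source; the sample sizes, noise level, and tanh random-feature map are arbitrary choices) fits minimum-norm least squares on random nonlinear features. The test error typically peaks when the number of features p is near the number of training points and falls again for much larger p:

# Illustrative sketch of double descent: minimum-norm least squares on
# random tanh features. Test error tends to peak when the number of
# features p is near the number of training points, then decreases
# again in the heavily overparameterized regime.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 30, 2000, 5
beta = rng.standard_normal(d)                     # ground-truth coefficients

X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train = X_train @ beta + 0.5 * rng.standard_normal(n_train)
y_test = X_test @ beta                            # noiseless test targets

for p in [5, 10, 20, 28, 30, 32, 40, 100, 500]:   # number of random features
    W = rng.standard_normal((d, p)) / np.sqrt(d)  # random projection
    F_train = np.tanh(X_train @ W)                # nonlinear random features
    F_test = np.tanh(X_test @ W)
    # pinv gives the least-squares solution for p <= n_train and the
    # minimum-norm interpolating solution for p > n_train.
    w = np.linalg.pinv(F_train) @ y_train
    mse = float(np.mean((F_test @ w - y_test) ** 2))
    print(f"p = {p:4d}   test MSE = {mse:.3f}")

Exact numbers vary with the random seed, but the qualitative shape (small error for small p, a spike near p = n_train, and a second descent beyond it) is the double descent curve described above.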
History
Early observations of what would later be called double descent in specific models date back to 1989.
The term "double descent" was coined by Belkin et. al.
in 2019,
when the phenomenon gained popularity as a broader concept exhibited by many models. The latter development was prompted by a perceived contradiction between the conventional wisdom that too many parameters in the model result in a significant overfitting error (an extrapolation of the
bias–variance tradeoff
In statistics and machine learning, the bias–variance tradeoff describes the relationship between a model's complexity, the accuracy of its predictions, and how well it can make predictions on previously unseen data that were not used to train ...
),
and the empirical observations in the 2010s that some modern machine learning techniques tend to perform better with larger models.
Theoretical models
Double descent occurs in linear regression with isotropic Gaussian covariates and isotropic Gaussian noise.
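A sketch of this setting (the sample size, noise level, and signal normalization below are illustrative assumptions): with isotropic Gaussian covariates, E[xxᵀ] = I, so the excess test risk of an estimate w is exactly ||w − β||², and the spike near p ≈ n followed by a second descent for p ≫ n can be read off directly:

# Double descent in linear regression with isotropic Gaussian covariates
# and noise (illustrative parameters). Since E[xx^T] = I, the excess
# test risk of an estimate w is exactly ||w - beta||^2.
import numpy as np

rng = np.random.default_rng(1)
n, sigma, trials = 100, 0.5, 20                    # samples, noise, averages

for p in [20, 50, 80, 95, 100, 105, 150, 400, 2000]:
    risk = 0.0
    for _ in range(trials):
        beta = rng.standard_normal(p) / np.sqrt(p)     # ||beta|| ~ 1
        X = rng.standard_normal((n, p))                # isotropic covariates
        y = X @ beta + sigma * rng.standard_normal(n)  # isotropic noise
        w = np.linalg.pinv(X) @ y      # OLS if p < n, min-norm if p > n
        risk += float(np.sum((w - beta) ** 2))
    print(f"p/n = {p/n:5.2f}   mean excess risk = {risk / trials:.3f}")

The risk diverges as p/n approaches 1, where the design matrix becomes ill-conditioned, and decreases again in the overparameterized regime.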
A model of double descent at the thermodynamic limit has been analyzed using the replica trick, and the result has been confirmed numerically.
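For context, the replica trick rests on the identity (with Z a partition function)

\ln Z = \lim_{n \to 0} \frac{Z^{n} - 1}{n}
\qquad \text{or equivalently} \qquad
\ln Z = \lim_{n \to 0} \frac{\partial Z^{n}}{\partial n},

which allows the quenched average of \ln Z over the disorder to be computed from the moments of Z^{n} for integer n, analytically continued to n \to 0.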
Empirical examples
The scaling behavior of double descent has been found to follow a broken neural scaling law functional form.[Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws". International Conference on Learning Representations (ICLR), 2023.]
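Up to notational choices, the broken neural scaling law of Caballero et al. models a performance metric y (for example, test loss) as a function of a scale variable x (for example, parameter count or dataset size) as a smoothly broken power law

y = a + b\,x^{-c_0} \prod_{i=1}^{n} \left( 1 + \left( \frac{x}{d_i} \right)^{1/f_i} \right)^{-c_i f_i},

where a is a limiting value, b and c_0 describe the initial power-law segment, and each break i shifts the local exponent by c_i around the scale d_i with smoothness f_i; non-monotonic behavior such as double descent appears when a break changes the sign of the local slope.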
See also
* Grokking (machine learning)
References
External links
* Understanding "Deep Double Descent" at evhub.
Model selection
Machine learning
Statistical classification