Model Collapse

Model collapse is a phenomenon where machine learning models gradually degrade due to errors coming from uncurated training on the outputs of another model, such as prior versions of itself. Such outputs are known as synthetic data. It is a possible mechanism for mode collapse.

Shumailov et al. coined the term and described two specific stages of the degradation, early model collapse and late model collapse:

* In early model collapse, the model begins losing information about the tails of the distribution, mostly affecting minority data. Later work highlighted that early model collapse is hard to notice, since overall performance may appear to improve while the model loses performance on minority data.
* In late model collapse, the model loses a significant proportion of its performance, confusing concepts and losing most of its variance.


Mechanism

Using synthetic data as training data can lead to issues with the quality and reliability of the trained model. Model collapse occurs for three main reasons:

1. functional approximation errors
2. sampling errors
3. learning errors

Importantly, it happens in even the simplest of models, where not all of the error sources are present. In more complex models the errors often compound, leading to faster collapse.
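The sampling error named above can be illustrated with a toy discrete model. The following is a minimal sketch and is not taken from the cited work: the function name `resample_generations`, the category labels, and all parameter values are hypothetical. Each generation draws a finite sample from the current categorical model and refits it by counting, so a category that happens to receive zero draws has probability zero in every later generation.

```python
import random
from collections import Counter


def resample_generations(probs, sample_size, generations, seed=0):
    """Repeatedly sample from a categorical model and refit it by counting.

    `probs` maps category -> probability for generation 0.
    Returns the list of fitted models, one dict per generation.
    """
    rng = random.Random(seed)
    categories = list(probs)
    current = dict(probs)
    history = [current]
    for _ in range(generations):
        weights = [current[c] for c in categories]
        # Finite sample from the current model: the only error source here.
        draws = rng.choices(categories, weights=weights, k=sample_size)
        counts = Counter(draws)
        # Refit by relative frequency; a zero count gives probability zero,
        # and a zero-weight category can never be drawn again.
        current = {c: counts[c] / sample_size for c in categories}
        history.append(current)
    return history


if __name__ == "__main__":
    start = {"common": 0.90, "uncommon": 0.09, "rare": 0.01}
    history = resample_generations(start, sample_size=50, generations=200)
    for gen in (0, 1, 10, 50, 200):
        print(gen, history[gen])
```

In this toy setting the number of categories with non-zero probability can only stay the same or shrink, which mirrors the loss of distribution tails described for early model collapse.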


Disagreement over real-world impact

Some researchers and commentators on model collapse warn that the phenomenon could fundamentally threaten future generative AI development: as AI-generated data is shared on the Internet, it will inevitably end up in future training datasets, which are often crawled from the Internet. If training on "slop" (large quantities of unlabeled synthetic data) inevitably leads to model collapse, this could pose a difficult problem.

However, other researchers have more recently disagreed with this argument, showing that if synthetic data accumulates alongside human-generated data, model collapse is avoided. The researchers argue that data accumulating over time is a more realistic description of reality than deleting all existing data every year, and that the real-world impact of model collapse may not be as catastrophic as feared.

An alternative branch of the literature investigates the use of machine learning detectors and watermarking to identify model-generated data and filter it out.
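The difference between replacing data and accumulating it can be sketched on a one-dimensional Gaussian. This is a toy illustration under stated assumptions, not a reproduction of the cited experiments: the helper names `fit_gaussian` and `final_sigma`, the standard-normal "real" data, and the batch sizes are all hypothetical. In the "replace" setting each generation is fit only on the latest synthetic batch. In the "accumulate" setting each generation is fit on the original data plus every synthetic batch produced so far.

```python
import math
import random


def fit_gaussian(samples):
    """Return (sample mean, unbiased sample standard deviation)."""
    n = len(samples)
    mean = math.fsum(samples) / n
    var = math.fsum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, math.sqrt(var)


def final_sigma(generations, batch_size, accumulate, seed=0):
    """Fitted standard deviation after `generations` rounds of refitting."""
    rng = random.Random(seed)
    # Generation 0: "real" data, assumed here to be standard normal.
    pool = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
    mu, sigma = fit_gaussian(pool)
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(batch_size)]
        # Accumulate: keep everything seen so far. Replace: keep only the new batch.
        pool = pool + synthetic if accumulate else synthetic
        mu, sigma = fit_gaussian(pool)
    return sigma


if __name__ == "__main__":
    for mode in (False, True):
        label = "accumulate" if mode else "replace"
        print(label, final_sigma(generations=400, batch_size=10, accumulate=mode))
```

With small batches and many generations, the "replace" run typically ends with a fitted standard deviation close to zero, while the "accumulate" run stays near its starting scale. The exact numbers depend on the seed and parameters.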


Mathematical models of the phenomenon


1D Gaussian model

In 2024, a first attempt was made at illustrating collapse for the simplest possible model: a one-dimensional normal distribution fit using unbiased estimators of mean and variance, computed on samples from the previous generation.

To make this more precise, say that the original data follows a normal distribution $X^0 \sim \mathcal{N}(\mu,\sigma^2)$, and that we possess $M_0$ samples $X^0_j$ for $j \in \{1, \dots, M_0\}$. Denoting a general sample $X^i_j$ as sample $j \in \{1, \dots, M_i\}$ at generation $i$, the next-generation model is estimated using the sample mean and variance:

$$\mu_{i+1} = \frac{1}{M_i}\sum_j X^i_j; \quad \sigma_{i+1}^2 = \frac{1}{M_i-1}\sum_j \left(X^i_j-\mu_{i+1}\right)^2.$$

This leads to a conditionally normal next-generation model $X^{i+1}_j \mid \mu_{i+1},\,\sigma_{i+1} \sim \mathcal{N}(\mu_{i+1},\sigma_{i+1}^2)$. In theory, this is enough to calculate the full distribution of $X^i_j$. However, even after the first generation, the full distribution is no longer normal: it follows a variance-gamma distribution.

To continue the analysis, instead of writing the probability density function at each generation, it is possible to explicitly construct the samples in terms of independent random variables using Cochran's theorem. To be precise, $\mu_1$ and $\sigma_1$ are independent, with $\mu_1 \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{M_0}\right)$ and $(M_0-1)\,\sigma_1^2 \sim \sigma^2\,\Gamma\left(\frac{M_0-1}{2}, \frac12\right)$, following a Gamma distribution. Denoting with $Z$ Gaussian random variables distributed according to $\mathcal{N}(0, 1)$ and with $S^i$ random variables distributed as $\frac{1}{M_{i-1}-1}\Gamma\left(\frac{M_{i-1}-1}{2}, \frac12\right)$, it turns out to be possible to write samples at each generation as

$$X^0_j = \mu + \sigma Z^0_j,$$

$$X^1_j = \mu + \frac{\sigma}{\sqrt{M_0}}Z^1 + \sigma\sqrt{S^1}Z^1_j,$$

and more generally

$$X^n_j = \mu + \frac{\sigma}{\sqrt{M_0}}Z^1 + \frac{\sigma}{\sqrt{M_1}}\sqrt{S^1}Z^2 + \dots + \frac{\sigma}{\sqrt{M_{n-1}}}\sqrt{S^1\times\dots\times S^{n-1}}Z^n + \sigma\sqrt{S^1\times\dots\times S^n}Z^n_j.$$

Note that these are not joint distributions, as $Z^n$ and $S^n$ depend directly on $Z^{n-1}_j$, but when considering $X^n_j$ on its own the formula above provides all the information about the full distribution.

To analyse the model collapse, we can first calculate the variance and mean of samples at generation $n$. This tells us what kind of distributions we expect to arrive at after $n$ generations. It is possible to find the exact value in closed form, but the mean and variance of the square root of a gamma distribution are expressed in terms of gamma functions, making the result quite clunky. Instead, it is possible to expand all results to second order in each of $1/M_i$, assuming each sample size to be large. It is then possible to show that

$$\frac{1}{\sigma^2}\operatorname{Var}(X^n_j) = \frac{1}{M_0}+\frac{1}{M_1}+ \dots + \frac{1}{M_{n-1}}+1 + \mathcal{O}\left(M_i^{-2}\right).$$

If all sample sizes $M_i = M$ are constant, this diverges linearly as $n\to\infty$:

$$\operatorname{Var}(X^n_j) = \sigma^2\left(1+\frac{n}{M}\right); \quad \mathbb{E}(X^n_j) = \mu.$$

This is the same scaling as for a one-dimensional Gaussian random walk. However, divergence of the variance of $X^n_j$ does not directly provide any information about the corresponding estimates of $\mu_{n+1}$ and $\sigma_{n+1}$, particularly how different they are from the original $\mu$ and $\sigma$. It turns out to be possible to calculate the distance between the true distribution and the approximated distribution at step $n+1$, using the Wasserstein-2 distance (which is also sometimes referred to as risk). For two one-dimensional Gaussians this distance has the closed form $\mathbb{W}^2_2\left(\mathcal{N}(\mu,\sigma^2),\mathcal{N}(\mu',\sigma'^2)\right) = (\mu-\mu')^2 + (\sigma-\sigma')^2$, and its mean and variance are

$$\mathbb{E}\left[\mathbb{W}^2_2\left(\mathcal{N}(\mu,\sigma^2),\mathcal{N}(\mu_{n+1},\sigma^2_{n+1})\right)\right] = \frac{3}{2}\sigma^2\left(\frac{1}{M_0}+\frac{1}{M_1}+ \dots + \frac{1}{M_n}\right)+\mathcal{O}\left(M_i^{-2}\right),$$

$$\operatorname{Var}\left[\mathbb{W}^2_2\left(\mathcal{N}(\mu,\sigma^2),\mathcal{N}(\mu_{n+1},\sigma^2_{n+1})\right)\right] = \frac{1}{2}\sigma^4\left(\frac{3}{M_0^2}+\frac{3}{M_1^2}+ \dots + \frac{3}{M_n^2} + \sum_{i\neq j}\frac{4}{M_iM_j}\right)+\mathcal{O}\left(M_i^{-3}\right).$$

This directly shows why model collapse occurs in this simple model. Due to errors from re-sampling the approximated distribution, each generation ends up corresponding to a new step in a random walk of model parameters. For a constant sample size at each generation, the average distance from the starting point diverges. For the end distribution approximation to be accurate, or for the distance to be finite, the sampling rate $M_i$ needs to increase superlinearly, i.e. one needs to collect increasingly more samples over time, perhaps quadratically. However, even in that case the expected distance after $n$ steps remains non-zero, and the only case in which it does end up being zero is when sampling is infinite at each step.

Overall, this only shows how far on average one ends up from the original distribution, but the process can only "terminate" if the estimated variance at a certain generation becomes small enough, effectively turning the distribution into a delta function. This is shown to occur for a general Gaussian model in the subsection below. Empirical investigation has confirmed this theoretical analysis.
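The linear growth of $\operatorname{Var}(X^n_j)$ for constant sample size can be checked numerically. The sketch below is an illustration under stated assumptions rather than the original authors' code: the function names `sample_generation_n` and `empirical_variance` and the parameter values are hypothetical. It uses the Cochran's theorem construction described above, drawing the next mean and variance directly instead of materialising all $M$ samples per generation. Note that Python's `random.gammavariate(alpha, beta)` takes a scale parameter, so a chi-squared variable with $M-1$ degrees of freedom is `gammavariate((M - 1) / 2, 2.0)`.

```python
import math
import random


def sample_generation_n(n, m, mu=0.0, sigma=1.0, rng=random):
    """Draw one sample X^n_j after n refits with constant sample size m."""
    mean, var = mu, sigma ** 2
    for _ in range(n):
        # Sample mean of m draws: Normal(mean, var / m).
        new_mean = mean + math.sqrt(var / m) * rng.gauss(0.0, 1.0)
        # Unbiased sample variance: var * chi2(m - 1) / (m - 1),
        # independent of the sample mean by Cochran's theorem.
        chi2 = rng.gammavariate((m - 1) / 2.0, 2.0)
        var = var * chi2 / (m - 1)
        mean = new_mean
    # One draw from the generation-n model.
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)


def empirical_variance(n, m, chains, seed=0):
    """Variance of X^n_j estimated across independent chains."""
    rng = random.Random(seed)
    xs = [sample_generation_n(n, m, rng=rng) for _ in range(chains)]
    centre = math.fsum(xs) / chains
    return math.fsum((x - centre) ** 2 for x in xs) / (chains - 1)


if __name__ == "__main__":
    m = 50
    for n in (0, 5, 10, 20):
        predicted = 1.0 + n / m  # sigma^2 * (1 + n / M) with sigma = 1
        print(n, round(empirical_variance(n, m, chains=10000), 3), predicted)
```

The estimate should track $\sigma^2(1 + n/M)$ up to Monte Carlo noise, which grows with $n$ because the distribution of $X^n_j$ becomes heavier-tailed.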


N-D Gaussian model

Furthermore, in the case of a multidimensional model with fully synthetic data, exact collapse can be shown.


Linear regression

In the case of a linear regression model, scaling laws and bounds on learning can be obtained.


Statistical language model

In the case of a linear softmax classifier for next-token prediction, exact bounds on learning with even a partially synthetic dataset can be obtained.


Impact on large language models

In the context of large language models (LLMs), research found that training LLMs on predecessor-generated text, that is, on synthetic data produced by previous models, causes a consistent decrease in the lexical, syntactic, and semantic diversity of the model outputs through successive iterations. The effect is especially pronounced for tasks demanding high levels of creativity.
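One simple way to track lexical diversity across generations is a distinct-n ratio, the share of n-grams in a set of outputs that are unique. The sketch below is only an assumption about how such a measurement could be set up: the function name `distinct_n` and the whitespace tokenisation are hypothetical, and the cited research may use different metrics.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams across `texts` that are unique.

    Tokens are lower-cased and split on whitespace, a deliberate
    simplification. Returns 0.0 when there are no n-grams at all.
    """
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


if __name__ == "__main__":
    varied = ["the cat sat on the mat", "a dog ran across the field"]
    repetitive = ["the cat sat on the mat", "the cat sat on the mat"]
    print(distinct_n(varied), distinct_n(repetitive))
```

A falling ratio over successive model generations would be one sign of shrinking lexical diversity. It says nothing about syntactic or semantic diversity, which need separate measures.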


See also

* Generation loss
* Generative artificial intelligence

