In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, and training cost.
Introduction
In general, a neural model can be characterized by four parameters: size of the model, size of the training dataset, cost of training, and error rate after training. Each of these four variables can be precisely defined as a real number, and they are empirically found to be related by simple statistical laws, called "scaling laws". These are usually written as N, D, C, L (number of parameters, dataset size, computing cost, loss).
Size of the model
In most cases, the size of the model is simply the number of parameters. However, one complication arises with the use of sparse models, such as mixture-of-experts models. In sparse models, during every inference, only a fraction of the parameters are used. In comparison, most other kinds of neural networks, such as Transformer networks, always use all their parameters during every inference.
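A minimal sketch (assuming PyTorch; the layer below is a toy, not any production mixture-of-experts design) of why sparse models complicate "model size": every expert counts toward the total parameter count, but each input activates only its top-k experts.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):  # x: (batch, d_model)
        # Each input is routed to its top-k experts; the remaining
        # experts' parameters are never touched during this inference.
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for b in range(x.size(0)):
            for slot in range(self.k):
                e = idx[b, slot].item()
                out[b] = out[b] + weights[b, slot] * self.experts[e](x[b])
        return out

moe = TinyMoE()
y = moe(torch.randn(4, 64))  # each input uses only k of n_experts experts
per_expert = sum(p.numel() for p in moe.experts[0].parameters())
router_p = sum(p.numel() for p in moe.router.parameters())
print("total parameters:    ", router_p + len(moe.experts) * per_expert)
print("active per inference:", router_p + moe.k * per_expert)
```

For such a model, "number of parameters" is ambiguous: the total count and the count active per inference differ by roughly a factor of n_experts/k.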
Size of the training dataset
The size of the training dataset is usually quantified by the number of data points it contains. Larger training datasets are typically preferred, as they provide a richer and more diverse source of information for the model to learn from. This in turn can lead to improved generalization performance when the model is applied to unseen data.
[Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.] However, increasing the size of the training dataset also increases the computational resources and time required for model training.
With the "pretrain, then finetune" method used in most large language models (LLMs), there are two kinds of training dataset: the pretraining dataset and the finetuning dataset. Their sizes have different effects on model performance. Generally, the finetuning dataset is less than 1% the size of the pretraining dataset. In some cases, a small amount of high-quality data suffices for finetuning, and more data does not improve performance.
Cost of training
The cost of training is typically measured in terms of time (how long it takes to train the model) and computational resources (how much processing power and memory are required). The cost of training can be significantly reduced with efficient training algorithms, optimized software libraries, and parallel computing on specialized hardware such as GPUs or TPUs.
The cost of training a neural model is a function of several factors, including the size of the model, the size of the training dataset, the complexity of the training algorithm, and the computational resources available. In particular, doubling the training dataset does not necessarily double the cost of training, because one may train the model several times over the same dataset (each pass being an "epoch").
Performance
The performance of a neural model is evaluated based on its ability to accurately predict the output given the input data. Common metrics for evaluating model performance include:
* accuracy, precision, recall, and F1 score for classification tasks;
* mean squared error (MSE) or mean absolute error (MAE) for regression tasks;
* negative log-likelihood per token (logarithm of perplexity) for language modeling (see the sketch after this list);
* Elo rating in a competition against other models, such as gameplay or preference by a human judge.
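As a brief sketch of the language-modeling metric above: per-token negative log-likelihood is the mean of −log p over the model's predicted probabilities for the true tokens, and perplexity is its exponential. The probabilities below are made up for illustration.

```python
import math

# Hypothetical probabilities the model assigned to each true next token.
token_probs = [0.25, 0.10, 0.60, 0.05]

# Mean negative log-likelihood per token, in nats.
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(nll)  # perplexity is exp of the mean NLL
print(f"NLL/token = {nll:.3f} nats, perplexity = {perplexity:.2f}")
```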
Performance can be improved by using more data, training larger models, using different training algorithms, regularizing the model to prevent overfitting, and stopping training early using a validation set.
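A minimal sketch of the last of these, early stopping with a validation set: training halts once the validation loss has not improved for a set number of epochs (the "patience"). The loss sequence below is synthetic, standing in for per-epoch validation losses from a real run.

```python
# Synthetic per-epoch validation losses for illustration only.
val_losses = [2.8, 2.1, 1.7, 1.55, 1.56, 1.58, 1.60]

patience, best, bad = 2, float("inf"), 0
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, bad = loss, 0      # improvement: reset the patience counter
    else:
        bad += 1
        if bad >= patience:      # no improvement for `patience` epochs: stop
            print(f"stopping at epoch {epoch}; best validation loss {best}")
            break
```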
Examples
(Hestness, Narang, et al., 2017)
The 2017 paper is a common reference point for neural scaling laws fitted by statistical analysis on experimental data. Previous works before the 2000s, as cited in the paper, were either theoretical or orders of magnitude smaller in scale. Whereas previous works generally found the scaling exponent to scale like L ∝ D^(−α), with α ∈ {0.5, 1, 2}, the paper found that α ∈ [0.07, 0.35].
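As an illustration of how such exponents are fitted by statistical analysis: a power law L = a · D^(−α) is linear in log-log space, so α can be recovered as the negated slope of a least-squares line fit. The data below are synthetic, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.array([1e4, 1e5, 1e6, 1e7, 1e8])  # dataset sizes
L = 5.0 * D ** -0.30 * np.exp(rng.normal(0, 0.02, D.shape))  # noisy loss

# Fit log L = -alpha * log D + log a by least squares.
slope, intercept = np.polyfit(np.log(D), np.log(L), 1)
print(f"fitted exponent alpha = {-slope:.3f}")  # close to the true 0.30
```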