
In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, and training cost.
Introduction
In general, a deep learning model can be characterized by four parameters: model size, training dataset size, training cost, and the post-training error rate (e.g., the test set error rate). Each of these variables can be defined as a real number, usually written as N, D, C, L (respectively: parameter count, dataset size, computing cost, and loss).
A neural scaling law is a theoretical or empirical statistical law relating these parameters. There are also other parameters that follow scaling laws of their own.
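For illustration, an empirical scaling law is often fit as a power law of the form L(N) = a · N^(−α), following the power-law definition above. The sketch below fits such a law in log-log space; the loss measurements and the recovered constants are made up purely for illustration:

```python
import numpy as np

# Hypothetical (model size, test loss) measurements.
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = np.array([4.2, 3.3, 2.6, 2.0])

# A pure power law L(N) = a * N^(-alpha) becomes a straight line in
# log-log space, so least squares on the logs recovers the exponent.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
alpha, a = -slope, np.exp(intercept)
print(f"L(N) ≈ {a:.1f} * N^(-{alpha:.3f})")
```

Published fits often add an irreducible-loss constant to this form, but the log-log fit above is the simplest instance of the idea.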
Size of the model
In most cases, the model's size is simply the number of parameters. However, one complication arises with the use of sparse models, such as mixture-of-experts models. With sparse models, during inference, only a fraction of their parameters are used. In comparison, most other kinds of neural networks, such as transformer models, always use all of their parameters during inference.
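As a toy illustration of the difference, the sketch below counts stored versus active parameters for a single mixture-of-experts feed-forward block in which each token is routed to only a few experts; the layer shapes and routing fan-out are hypothetical:

```python
def moe_param_counts(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Count parameters of one mixture-of-experts feed-forward block.

    Each expert holds two projection matrices (d_model x d_ff and
    d_ff x d_model); only top_k of the n_experts run for a given token.
    """
    per_expert = 2 * d_model * d_ff
    total = n_experts * per_expert   # parameters stored in the model
    active = top_k * per_expert      # parameters actually used per token
    return total, active

total, active = moe_param_counts(d_model=1024, d_ff=4096, n_experts=8, top_k=2)
print(f"stored: {total:,}  active per token: {active:,}")
# stored: 67,108,864  active per token: 16,777,216
```

Which of the two counts to use as the "model size" variable is exactly the complication noted above.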
Size of the training dataset
The size of the training dataset is usually quantified by the number of data points within it. Larger training datasets are typically preferred, as they provide a richer and more diverse source of information from which the model can learn. This can lead to improved generalization performance when the model is applied to new, unseen data.
[Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.] However, increasing the size of the training dataset also increases the computational resources and time required for model training.
With the "pretrain, then finetune" method used for most large language models (LLMs), there are two kinds of training dataset: the ''pretraining'' dataset and the ''finetuning'' dataset. Their sizes have different effects on model performance. Generally, the finetuning dataset is less than 1% the size of the pretraining dataset.
In some cases, a small amount of high quality data suffices for finetuning, and more data does not necessarily improve performance.
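As a back-of-the-envelope illustration of that size imbalance (both corpus sizes below are hypothetical):

```python
# Hypothetical sizes illustrating the typical pretrain/finetune imbalance.
pretrain_tokens = 300e9  # a web-scale pretraining corpus
finetune_tokens = 50e6   # a small, curated finetuning set
print(f"finetune / pretrain = {finetune_tokens / pretrain_tokens:.4%}")
# finetune / pretrain = 0.0167%
```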
Cost of training

Training cost is typically measured in terms of time (how long it takes to train the model) and computational resources (how much processing power and memory are required). The cost of training can be significantly reduced with efficient training algorithms, optimized software libraries, and parallel computing on specialized hardware such as GPUs or TPUs.
The cost of training a neural network model is a function of several factors, including model size, training dataset size, the complexity of the training algorithm, and the computational resources available.
In particular, doubling the training dataset size does not necessarily double the cost of training, because one may train the model several times over the same dataset (each pass being an "epoch").
Performance


The performance of a neural network model is evaluated based on its ability to accurately predict the output given some input data. Common metrics for evaluating model performance include:
* Negative log-likelihood per token (logarithm of perplexity) for language modeling (see the sketch after this list);
* Accuracy, precision, recall, and F1 score for classification tasks;
* Mean squared error (MSE) or mean absolute error (MAE) for regression tasks;
* Elo rating in a competition against other models, such as gameplay or preference by a human judge.
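As an example of the first metric above, the per-token negative log-likelihood follows directly from the probabilities a model assigns to the observed tokens; the probabilities here are made up for illustration:

```python
import math

def nll_per_token(token_probs: list[float]) -> float:
    """Average negative log-likelihood per token; perplexity is its exp()."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Probabilities a hypothetical language model assigned to each observed token.
probs = [0.25, 0.5, 0.1, 0.4]
nll = nll_per_token(probs)
print(f"NLL/token = {nll:.3f}, perplexity = {math.exp(nll):.3f}")
# NLL/token = 1.325, perplexity = 3.761
```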
Performance can be improved by using more data, larger models, different training algorithms, regularizing the model to prevent overfitting, and early stopping using a validation set.
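As a sketch of the last of these techniques, early stopping tracks a validation metric and halts once it stops improving; train_one_epoch and validation_loss are hypothetical callables, and the patience value is arbitrary:

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs: int = 100, patience: int = 5) -> float:
    """Stop training once the validation loss has not improved for
    `patience` consecutive epochs; return the best loss observed."""
    best, epochs_without_improvement = float("inf"), 0
    for _ in range(max_epochs):
        train_one_epoch()
        loss = validation_loss()
        if loss < best:
            best, epochs_without_improvement = loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best
```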
When the performance is a number bounded within the range of