Neural scaling law
In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, and training cost.


Introduction

In general, a deep learning model can be characterized by four parameters: model size, training dataset size, training cost, and the post-training error rate (e.g., the test set error rate). Each of these variables can be defined as a real number, usually written as N, D, C, L (respectively: parameter count, dataset size, computing cost, and loss). A neural scaling law is a theoretical or empirical statistical law between these parameters. There are also other parameters, with other scaling laws of their own.


Size of the model

In most cases, the model's size is simply the number of parameters. However, one complication arises with the use of sparse models, such as mixture-of-experts models. With sparse models, during inference, only a fraction of their parameters are used. In comparison, most other kinds of neural networks, such as transformer models, always use all their parameters during inference.
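
For illustration, a minimal sketch (in Python, with hypothetical layer sizes) of how total and active parameter counts diverge for a mixture-of-experts model:

 def moe_param_counts(n_experts, top_k, expert_params, shared_params):
     """Total vs. active parameters for a sparse mixture-of-experts model.
     Only top_k of n_experts expert blocks are used per token, so the
     parameters used at inference are far fewer than those stored."""
     total = shared_params + n_experts * expert_params
     active = shared_params + top_k * expert_params
     return total, active
 
 # Hypothetical sizes: 64 experts of 10M parameters each, 2 routed per token.
 total, active = moe_param_counts(64, 2, 10_000_000, 50_000_000)
 print(f"total={total:,}, active per token={active:,}")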


Size of the training dataset

The size of the training dataset is usually quantified by the number of data points within it. Larger training datasets are typically preferred, as they provide a richer and more diverse source of information from which the model can learn, which can lead to improved generalization performance on new, unseen data (Goodfellow, Bengio & Courville, 2016, ''Deep Learning'', MIT Press). However, increasing the size of the training dataset also increases the computational resources and time required for model training. With the "pretrain, then finetune" method used for most large language models, there are two kinds of training dataset: the ''pretraining'' dataset and the ''finetuning'' dataset. Their sizes have different effects on model performance. Generally, the finetuning dataset is less than 1% the size of the pretraining dataset. In some cases, a small amount of high-quality data suffices for finetuning, and more data does not necessarily improve performance.


Cost of training

Training cost is typically measured in terms of time (how long it takes to train the model) and computational resources (how much processing power and memory are required). The cost of training can be significantly reduced with efficient training algorithms, optimized software libraries, and parallel computing on specialized hardware such as GPUs or TPUs. The cost of training a neural network model is a function of several factors, including model size, training dataset size, the training algorithm complexity, and the computational resources available. In particular, doubling the training dataset size does not necessarily double the cost of training, because one may train the model several times over the same dataset (each pass being an "epoch"); see the cost sketch below.


Performance

The performance of a neural network model is evaluated based on its ability to accurately predict the output given some input data. Common metrics for evaluating model performance include:
* negative log-likelihood per token (logarithm of perplexity) for language modeling;
* accuracy, precision, recall, and F1 score for classification tasks;
* mean squared error (MSE) or mean absolute error (MAE) for regression tasks;
* Elo rating in a competition against other models, such as gameplay or preference by a human judge.

Performance can be improved by using more data, larger models, different training algorithms, regularizing the model to prevent overfitting, and early stopping using a validation set. When the performance is a number bounded within the range of [0, 1], such as accuracy or precision, it often scales as a sigmoid function of cost, as in the fitting sketch below.
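
A minimal sketch of fitting such a sigmoidal trend, assuming the metric is logistic in log-compute and using synthetic observations in place of real measurements:

 import numpy as np
 from scipy.optimize import curve_fit
 
 def sigmoid_in_log_compute(c, k, log10_c_mid, top):
     """Logistic curve in log10(compute), saturating at `top`."""
     return top / (1.0 + np.exp(-k * (np.log10(c) - log10_c_mid)))
 
 # Synthetic (compute, accuracy) points, for illustration only.
 C = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
 acc = np.array([0.12, 0.25, 0.55, 0.80, 0.90])
 
 (k, log10_c_mid, top), _ = curve_fit(sigmoid_in_log_compute, C, acc,
                                      p0=[1.0, 20.0, 1.0])
 print(f"steepness={k:.2f}, midpoint~1e{log10_c_mid:.1f} FLOPs, ceiling={top:.2f}")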


Examples


(Hestness, Narang, et al., 2017)

The 2017 paper is a common reference point for neural scaling laws fitted by statistical analysis on experimental data. Previous works before the 2000s, as cited in the paper, were either theoretical or orders of magnitude smaller in scale. Whereas previous works generally found the loss to scale as L \propto D^{-\alpha} with \alpha \in \{0.5, 1, 2\}, the paper found that \alpha \in [0.07, 0.35].

Of the factors they varied, only the task can change the exponent \alpha. Changing the architecture, optimizers, regularizers, and loss functions would only change the proportionality factor, not the exponent. For example, for the same task, one architecture might have L = 1000 D^{-\alpha} while another might have L = 500 D^{-\alpha}. They also found that, for a given architecture, the number of parameters necessary to reach the lowest level of loss, given a fixed dataset size, grows like N \propto D^{\beta} for another exponent \beta.

They studied machine translation with LSTMs (\alpha \sim 0.13), generative language modelling with LSTMs (\alpha \in [0.06, 0.09], \beta \approx 0.7), ImageNet classification with ResNet (\alpha \in [0.3, 0.5], \beta \approx 0.6), and speech recognition with two hybrid (LSTMs complemented by either CNNs or an attention decoder) architectures (\alpha \approx 0.3). A power-law exponent of this kind can be estimated directly from (D, L) measurements, as in the sketch below.
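
On a log–log plot, L = c D^{-\alpha} is a straight line, so \alpha can be estimated by linear regression on logarithms. A minimal sketch with synthetic data:

 import numpy as np
 
 # Synthetic losses following L = 500 * D**-0.3, with small noise.
 rng = np.random.default_rng(0)
 D = np.logspace(6, 9, 10)
 L = 500 * D**-0.3 * np.exp(rng.normal(0, 0.02, D.size))
 
 # log L = log c - alpha * log D: a straight-line fit in log space.
 slope, intercept = np.polyfit(np.log(D), np.log(L), 1)
 print(f"alpha ~ {-slope:.3f}, prefactor ~ {np.exp(intercept):.0f}")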


(Henighan, Kaplan, et al., 2020)

A 2020 analysis studied statistical relations between C, N, D, L over a wide range of values and found similar scaling laws, over the range of N \in [10^3, 10^9], across many orders of magnitude of compute C, and over multiple modalities (text, video, image, text-to-image, etc.). In particular, the scaling laws it found are (Table 1 of the paper):
* For each modality, they fixed one of the two variables C, N and varied the other one (D is varied along with it using D = C/6N); the achievable test loss satisfies L = L_0 + \left(\frac{x_0}{x}\right)^\alpha, where x is the varied variable and L_0, x_0, \alpha are parameters to be found by statistical fitting. The parameter \alpha is the most important one.
** When N is the varied variable, \alpha ranges from 0.037 to 0.24 depending on the model modality. This corresponds to the \alpha = 0.34 from the Chinchilla scaling paper.
** When C is the varied variable, \alpha ranges from 0.048 to 0.19 depending on the model modality. This corresponds to the \beta = 0.28 from the Chinchilla scaling paper.
* Given a fixed computing budget, the optimal model parameter count is consistently around N_{opt}(C) = 9.0 \times 10^{-7} \, C^{0.7}. The prefactor 9.0 \times 10^{-7} varies by a factor of up to 10 for different modalities, and the exponent 0.7 varies from 0.64 to 0.75 for different modalities. This exponent corresponds to the \approx 0.5 from the Chinchilla scaling paper.
* It is "strongly suggested" (but not statistically checked) that D_{opt}(C) \propto N_{opt}(C)^{0.4} \propto C^{0.28}. This exponent corresponds to the \approx 0.5 from the Chinchilla scaling paper.

The scaling law L = L_0 + (C_0/C)^{0.048} was confirmed during the training of GPT-3 (Figure 3.1 of the GPT-3 paper). Such saturating power laws are typically fitted by nonlinear least squares, as in the sketch below.
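
Unlike a pure power law, the saturating form L = L_0 + (x_0/x)^\alpha is not linear in log-log space because of the irreducible term L_0, so nonlinear least squares is the usual tool. A minimal sketch with synthetic data (the parameter values are illustrative, not Henighan et al.'s fits):

 import numpy as np
 from scipy.optimize import curve_fit
 
 def saturating_power_law(x, L0, log10_x0, alpha):
     """L = L0 + (x0/x)**alpha; x0 is fitted on a log scale for stability."""
     return L0 + (10.0**log10_x0 / x) ** alpha
 
 # Synthetic losses over a sweep of model sizes N.
 N = np.logspace(3, 9, 12)
 L = saturating_power_law(N, 1.7, 13.5, 0.07)
 
 (L0, log10_x0, alpha), _ = curve_fit(saturating_power_law, N, L,
                                      p0=[1.0, 12.0, 0.1])
 print(f"L0={L0:.2f}, x0~1e{log10_x0:.1f}, alpha={alpha:.3f}")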


Chinchilla scaling (Hoffmann, et al., 2022)

One particular scaling law ("Chinchilla scaling") states that, for a large language model (LLM) autoregressively trained for one epoch, with a cosine learning rate schedule, we have:
\begin{cases} C = C_0 N D \\ L = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + L_0 \end{cases}
where the variables are
* C is the cost of training the model, in FLOPs;
* N is the number of parameters in the model;
* D is the number of tokens in the training set;
* L is the average negative log-likelihood loss per token (nats/token), achieved by the trained LLM on the test dataset, where
** L_0 represents the loss of an ideal generative process on the test data;
** \frac{A}{N^\alpha} captures the fact that a Transformer language model with N parameters underperforms the ideal generative process;
** \frac{B}{D^\beta} captures the fact that a model trained on D tokens underperforms the ideal generative process;
and the statistical parameters are
* C_0 = 6, meaning that it costs 6 FLOPs per parameter to train on one token. This is estimated by Kaplan et al. Note that training cost is much higher than inference cost, as training entails both forward and backward passes, whereas inference costs 1 to 2 FLOPs per parameter to infer on one token.
* \alpha = 0.34, \beta = 0.28, A = 406.4, B = 410.7, L_0 = 1.69, although Besiroglu et al. claim that the statistical estimation is slightly off, and should be \alpha = 0.35, \beta = 0.37, A = 482.01, B = 2085.43, L_0 = 1.82.

The statistical laws were fitted over experimental data with N \in [7 \times 10^7, 1.6 \times 10^{10}], D \in [5 \times 10^9, 5 \times 10^{11}], and the corresponding compute budgets C. Since there are 4 variables related by 2 equations, imposing 1 additional constraint and 1 additional optimization objective allows us to solve for all four variables. In particular, for any fixed C, we can uniquely solve for the choice of all 4 variables that minimizes L. This provides us with the optimal D_{opt}(C), N_{opt}(C) for any fixed C:
N_{opt}(C) = G\left(\frac{C}{6}\right)^a, \quad D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^b, \quad \text{where} \quad G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}, \quad a = \frac{\beta}{\alpha+\beta}, \quad b = \frac{\alpha}{\alpha+\beta}.
Plugging in the numerical values, we obtain the "Chinchilla efficient" model size and training dataset size, as well as the test loss achievable:
\begin{cases} N_{opt}(C) = 0.6 \; C^{0.45} \\ D_{opt}(C) = 0.3 \; C^{0.55} \\ L_{opt}(C) = 1070 \; C^{-0.154} + 1.7 \end{cases}
Similarly, we may find the optimal training dataset size and training compute budget for any fixed model parameter size, and so on. There are other estimates for the "Chinchilla efficient" model size and training dataset size. The above is based on a statistical model of L = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + L_0. One can also directly fit a statistical law for D_{opt}(C), N_{opt}(C) without going through this detour, for which one obtains:
\begin{cases} N_{opt}(C) = 0.1 \; C^{0.5} \\ D_{opt}(C) = 1.7 \; C^{0.5} \end{cases}
A numerical sketch of the closed-form allocation follows.


Discrepancy

The Chinchilla scaling law analysis for training transformer language models suggests that, for a given training compute budget (C), to achieve the minimal pretraining loss for that budget, the number of model parameters (N) and the number of training tokens (D) should be scaled in equal proportions, N_{opt}(C) \propto C^{0.5}, D_{opt}(C) \propto C^{0.5}. This conclusion differs from the analysis conducted by Kaplan et al., which found that N should be increased more quickly than D, with N_{opt}(C) \propto C^{0.73}, D_{opt}(C) \propto C^{0.27}. This discrepancy can primarily be attributed to the two studies using different methods for measuring model size. Kaplan et al.:
* did not count the parameters in the token embedding layer, which when analyzed at smaller model sizes leads to biased coefficients;
* studied smaller models than the Chinchilla group, magnifying the effect;
* assumed that the irreducible loss L_\infty = 0.
Secondary effects also arise due to differences in hyperparameter tuning and learning rate schedules. Kaplan et al.:
* used a warmup schedule that was too long for smaller models, making them appear less efficient;
* did not fully tune optimization hyperparameters.


Beyond Chinchilla scaling

As Chinchilla scaling has been the reference point for many large-scale training runs, there has been a concurrent effort to go "beyond Chinchilla scaling", meaning to modify some of the training pipeline in order to obtain the same loss with less effort, or to deliberately train for longer than what is "Chinchilla optimal". Usually, the goal is to make the scaling law exponent larger, so that the same loss can be reached with much less compute. For instance, filtering data can make the scaling law exponent larger.

Another strand of research studies how to deal with limited data, as according to Chinchilla scaling laws, the training dataset size for the largest language models already approaches what is available on the internet. One study found that augmenting the dataset with a mix of "denoising objectives" constructed from the dataset improves performance. Another studies optimal scaling when all available data is already exhausted (such as in rare languages), so one must train multiple epochs over the same dataset (whereas Chinchilla scaling requires only one epoch). The Phi series of small language models were trained on textbook-like data generated by large language models, for which data is limited only by the amount of compute available.

Chinchilla optimality was defined as "optimal for training compute", whereas in actual production-quality models there will be a lot of inference after training is complete. "Overtraining" during training means better performance during inference. LLaMA models were overtrained for this reason. Subsequent studies discovered scaling laws in the overtraining regime, for dataset sizes up to 32x more than Chinchilla-optimal.


Broken neural scaling laws (BNSL)

A 2022 analysis found that many scaling behaviors of artificial neural networks follow a smoothly broken power law functional form:
y = a + \bigg(b x^{-c_0}\bigg) \prod_{i=1}^n \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}
in which x refers to the quantity being scaled (i.e. C, N, D, number of training steps, number of inference steps, or model input size) and y refers to the ''downstream'' (or upstream) performance evaluation metric of interest (e.g. prediction error, cross entropy, calibration error, AUROC, BLEU score percentage, F1 score, reward, Elo rating, solve rate, or FID score) in zero-shot, prompted, or fine-tuned settings. The parameters a, b, c_0, c_1, \ldots, c_n, d_1, \ldots, d_n, f_1, \ldots, f_n are found by statistical fitting.

On a log–log plot, when f_i is not too large and a is subtracted out from the y-axis, this functional form looks like a series of linear segments connected by arcs; the n transitions between the segments are called "breaks", hence the name ''broken neural scaling laws (BNSL)''. The scenarios in which the scaling behaviors of artificial neural networks were found to follow this functional form include large-scale
vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, AI capabilities, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, arithmetic, emergent abilities, double descent, supervised learning, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent).

The architectures for which the scaling behaviors of artificial neural networks were found to follow this functional form include residual neural networks, transformers, MLPs, MLP-mixers, recurrent neural networks, convolutional neural networks, graph neural networks, U-nets, encoder-decoder (and encoder-only) (and decoder-only) models, ensembles (and non-ensembles), MoE (mixture of experts) (and non-MoE) models, and sparse pruned (and non-sparse unpruned) models. A direct transcription of the functional form is sketched below.
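
A direct transcription of the functional form as reconstructed above; the parameter values in the example are arbitrary, and in practice a, b, c_i, d_i, f_i would come from nonlinear least-squares fitting:

 import numpy as np
 
 def bnsl(x, a, b, c0, c, d, f):
     """Broken neural scaling law: a power law b*x**-c0 whose exponent
     shifts by c_i at each break location d_i, with sharpness f_i."""
     y = b * x ** (-c0)
     for c_i, d_i, f_i in zip(c, d, f):
         y *= (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
     return a + y
 
 # One break at x = 1e6; all values purely illustrative.
 x = np.logspace(3, 9, 7)
 print(bnsl(x, a=0.1, b=50.0, c0=0.2, c=[0.3], d=[1e6], f=[0.5]))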


Inference scaling

Other than scaling up training compute, one can also scale up inference compute (or "test-time compute"). As an example, the Elo rating of AlphaGo improves steadily as it is allowed to spend more time on its Monte Carlo tree search per play. For AlphaGo Zero, increasing Elo by 120 requires either 2x model size and training, or 2x test-time search (lecture at the Paul G. Allen School, May 23, 2024). Similarly, a language model for solving competition-level coding challenges, AlphaCode, consistently improved (log-linearly) in performance with more search time. For Hex, 10x training-time compute trades for 15x test-time compute. For ''Libratus'' for heads-up no-limit Texas hold 'em, and ''Cicero'' for ''Diplomacy'', and many other abstract games of partial information, inference-time searching improves performance at a similar tradeoff ratio, for up to a 100,000x effective increase in training-time compute. In 2024, the OpenAI o1 report documented that o1's performance consistently improved with both increased train-time compute and test-time compute, and gave numerous examples of test-time compute scaling in mathematics, scientific reasoning, and coding tasks.

One method for scaling up test-time compute is process-based supervision, where a model generates a step-by-step reasoning chain to answer a question, and another model (either human or AI) provides a reward score on some of the intermediate steps, not just the final answer. Process-based supervision can be scaled arbitrarily by using synthetic reward scores without another model, for example, by running Monte Carlo rollouts and scoring each step in the reasoning according to how likely it is to lead to the right answer (see the sketch below). Another method uses revision models, which are models trained to solve a problem multiple times, each time revising the previous attempt.
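
A schematic sketch of the synthetic process-reward idea: score an intermediate reasoning step by the fraction of Monte Carlo rollouts from that step that reach a correct answer. Here, continue_solution and is_correct are hypothetical stand-ins for a language-model sampler and an answer checker:

 import random
 
 def mc_step_score(prefix, continue_solution, is_correct, n_rollouts=32):
     """Synthetic process reward: the empirical probability that random
     completions of this partial reasoning chain end in a correct answer."""
     wins = sum(is_correct(continue_solution(prefix)) for _ in range(n_rollouts))
     return wins / n_rollouts
 
 # Toy stand-ins: a chain containing "good step" completes correctly 80% of the time.
 toy_continue = lambda prefix: prefix + ["final answer"]
 toy_correct = lambda chain: random.random() < (0.8 if "good step" in chain else 0.2)
 print(mc_step_score(["good step"], toy_continue, toy_correct))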


Other examples


Vision transformers

Vision transformers, similar to language transformers, exhibit scaling laws. A 2022 study trained vision transformers with parameter counts N \in [5 \times 10^6, 2 \times 10^9], on image sets of sizes D \in [3 \times 10^7, 3 \times 10^9], for compute C \in [0.2, 10^4] (in units of TPUv3-core-days). After training, each model was finetuned on the ImageNet training set. Let L be the error probability of the finetuned model on the ImageNet test set. They found \min_{N, D} L = 0.09 + \frac{0.26}{(C + 0.01)^{0.35}}; a small evaluator for this fitted form is sketched below.
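
A small evaluator for the fitted form above (C in TPUv3-core-days; constants as quoted in this section):

 def vit_finetune_error(c):
     """Fitted ImageNet finetuning error as a function of pretraining
     compute c, in TPUv3-core-days; saturates at 9% error."""
     return 0.09 + 0.26 / (c + 0.01) ** 0.35
 
 for c in (0.2, 10.0, 1e4):
     print(c, round(vit_finetune_error(c), 3))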


Neural machine translation

Ghorbani, Behrooz et al. studied scaling laws for neural machine translation (specifically, English as source and German as target) in encoder-decoder Transformer models, trained until convergence on the same datasets (thus they did not fit scaling laws for computing cost C or dataset size D). They varied N \in [10^8, 3.5 \times 10^9] and found three results:
* L is a scaling law function of N_E, N_D, where N_E, N_D are the encoder and decoder parameter counts. It is not simply a function of the total parameter count N = N_E + N_D. The function has the form L\left(N_E, N_D\right) = \alpha\left(\frac{\bar N_E}{N_E}\right)^{p_E}\left(\frac{\bar N_D}{N_D}\right)^{p_D} + L_\infty, where \alpha, p_E, p_D, L_\infty, \bar N_E, \bar N_D are fitted parameters. They found that N_D/N \approx 0.55 minimizes loss if N is held fixed.
* L "saturates" (that is, it reaches L_\infty) for smaller models when the training and testing datasets are "source-natural" rather than "target-natural". A "source-natural" data point means a pair of English-German sentences where the English sentence is written by a natural English writer and the German sentence is translated from the English sentence by a machine translator, and the model is asked to translate the English sentence into German. To construct the two kinds of datasets, the authors collected natural English and German sentences online, then used machine translation to generate their translations.
* As models grow larger, models trained on source-original datasets can achieve low loss but a bad BLEU score. In contrast, models trained on target-original datasets achieve low loss and a good BLEU score in tandem (Figures 10 and 11 of the paper). The authors hypothesize that source-natural datasets have uniform and dull target sentences, and so a model that is trained to predict the target sentences would quickly overfit.

Another study trained Transformers for machine translation with sizes N up to 5.6 \times 10^7 on dataset sizes D up to 6 \times 10^9. They found the Kaplan et al. (2020) scaling law applied to machine translation: L(N, D) = \left[\left(\frac{N_C}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_C}{D}\right]^{\alpha_D}. They also found the BLEU score scaling as \text{BLEU} \approx C e^{-kL}, for fitted constants C and k.


Transfer learning

Hernandez, Danny et al. studied scaling laws for transfer learning in language models. They trained a family of Transformers in three ways:
* pretraining on English, finetuning on Python;
* pretraining on an equal mix of English and Python, finetuning on Python;
* training on Python from scratch.
The idea is that pretraining on English should help the model achieve low loss on a test set of Python text. Suppose a model has parameter count N, and after being finetuned on D_F Python tokens, it achieves some loss L. We say that its "transferred token count" is D_T if another model with the same N achieves the same L after training on D_F + D_T Python tokens. They found D_T = 1.9 \times 10^4 \left(D_F\right)^{0.18}(N)^{0.38} for pretraining on English text, and D_T = 2.1 \times 10^5 \left(D_F\right)^{0.096}(N)^{0.38} for pretraining on English and non-Python code. A small calculator for this fitted form is sketched below.
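
Assuming the English-pretraining coefficients quoted above, the fitted law can be read as a calculator for the effective extra Python tokens contributed by pretraining:

 def transferred_tokens(d_finetune, n_params, k=1.9e4, alpha=0.18, beta=0.38):
     """Effective data transferred, D_T = k * D_F**alpha * N**beta
     (coefficients for English-text pretraining, as quoted above)."""
     return k * d_finetune**alpha * n_params**beta
 
 # E.g. a 1e8-parameter model finetuned on 1e7 Python tokens:
 print(f"D_T ~ {transferred_tokens(1e7, 1e8):.2e} tokens")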


Precision

Kumar et al. study scaling laws for numerical precision in the training of language models. They train a family of language models with weights, activations, and KV cache in varying numerical precision, in both integer and floating-point types, to measure the effects on loss as a function of precision. For training, their scaling law accounts for lower precision by wrapping the effects of precision into an overall "effective parameter count" that governs loss scaling, using the parameterization N \mapsto N_{\text{eff}}(P) = N(1 - e^{-P/\gamma}), where P is the precision in bits and \gamma is a fitted constant. This illustrates how training in lower precision degrades performance by reducing the true capacity of the model in a manner that varies exponentially with bits (see the sketch below). For inference, they find that extreme overtraining of language models past Chinchilla-optimality can lead to models being more sensitive to quantization, a standard technique for efficient deep learning. This is demonstrated by observing that the degradation in loss due to weight quantization increases as an approximate power law in the token/parameter ratio D/N seen during pretraining, so that models pretrained on extreme token budgets can perform worse in terms of validation loss than those trained on more modest token budgets, if post-training quantization is applied. Other work examining the effects of overtraining includes Sardana et al. and Gadre et al.
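
A minimal sketch of the effective-parameter-count idea; gamma below is a placeholder sensitivity constant, not the paper's fitted value:

 import math
 
 def n_effective(n_params, precision_bits, gamma=2.0):
     """Effective capacity under low-precision training:
     N_eff = N * (1 - exp(-P/gamma)). gamma=2.0 is a placeholder."""
     return n_params * (1.0 - math.exp(-precision_bits / gamma))
 
 for bits in (4, 8, 16):
     print(bits, f"{n_effective(1e9, bits):.3e}")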


Densing laws

Xiao et al. considered the parameter efficiency ("density") of models over time. The idea is that, over time, researchers discover models that use their parameters more efficiently, so that models with the same performance can have fewer parameters. A model has an actual parameter count N, defined as the actual number of parameters in the model, and an "effective" parameter count \hat N, defined as how many parameters it would have taken a previous well-known model to reach the same performance on some benchmark, such as MMLU. \hat N is not measured directly, but by measuring the actual model performance S, then plugging it back into a previously fitted scaling law, such as the Chinchilla scaling law, to obtain the \hat N that would be required to reach that performance S according to that previously fitted scaling law. A densing law states that \ln \left(\frac{\hat N}{N}\right)_{\max} = At + B, where t is real-world time, measured in days, and A, B are fitted constants; a sketch of the inversion step is given below.
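
A sketch of the inversion step, assuming (for illustration) that performance is summarized by a loss and that the reference law is the Chinchilla form quoted earlier; the real procedure maps benchmark scores through whichever law was fitted:

 def effective_params_from_loss(loss, D, alpha=0.34, beta=0.28,
                                A=406.4, B=410.7, L0=1.69):
     """Invert L = A/N**alpha + B/D**beta + L0 for N: the parameter count
     the reference law says is needed to reach `loss` at data size D."""
     residual = loss - L0 - B / D**beta
     if residual <= 0:
         raise ValueError("loss is below what this law attributes to finite N")
     return (A / residual) ** (1.0 / alpha)
 
 print(f"N_hat ~ {effective_params_from_loss(2.1, D=1e12):.2e}")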


See also

* Large language model
* Foundation model
* Artificial general intelligence

