In machine learning, normalization is a statistical technique with various applications. There are two main forms of normalization, namely ''data normalization'' and ''activation normalization''.

Data normalization (or feature scaling) includes methods that rescale input data so that the features have the same range, mean, variance, or other statistical properties. For instance, a popular choice of feature scaling method is min-max normalization, where each feature is transformed to have the same range (typically [0, 1] or [-1, 1]). This solves the problem of different features having vastly different scales, for example if one feature is measured in kilometers and another in nanometers.

Activation normalization, on the other hand, is specific to deep learning, and includes methods that rescale the activation of hidden neurons inside neural networks.

Normalization is often used to:
* increase the speed of training convergence,
* reduce sensitivity to variations and feature scales in input data,
* reduce overfitting,
* and produce better model generalization to unseen data.

Normalization techniques are often theoretically justified as reducing internal covariate shift, smoothing optimization landscapes, and increasing regularization, though they are mainly justified by empirical success.


Batch normalization

Batch normalization (BatchNorm) operates on the activations of a layer for each mini-batch.

Consider a simple feedforward network, defined by chaining together modules:

x^{(0)} \mapsto x^{(1)} \mapsto x^{(2)} \mapsto \cdots

where each network module can be a linear transform, a nonlinear activation function, a convolution, etc. Here x^{(0)} is the input vector, x^{(1)} is the output vector from the first module, and so on. BatchNorm is a module that can be inserted at any point in the feedforward network. For example, suppose it is inserted just after x^{(l)}; then the network would operate accordingly:

\cdots \mapsto x^{(l)} \mapsto \mathrm{BN}(x^{(l)}) \mapsto x^{(l+1)} \mapsto \cdots

The BatchNorm module does not operate over individual inputs. Instead, it must operate over one batch of inputs at a time. Concretely, suppose we have a batch of inputs x^{(0)}_{(1)}, x^{(0)}_{(2)}, \dots, x^{(0)}_{(B)}, fed all at once into the network. We would obtain in the middle of the network some vectors:

x^{(l)}_{(1)}, x^{(l)}_{(2)}, \dots, x^{(l)}_{(B)}

The BatchNorm module computes the coordinate-wise mean and variance of these vectors:

\mu^{(l)}_i = \frac{1}{B} \sum_{b=1}^B x^{(l)}_{(b), i}
(\sigma^{(l)}_i)^2 = \frac{1}{B} \sum_{b=1}^B (x^{(l)}_{(b), i} - \mu^{(l)}_i)^2

where i indexes the coordinates of the vectors, and b indexes the elements of the batch. In other words, we are considering the i-th coordinate of each vector in the batch, and computing the mean and variance of these numbers. It then normalizes each coordinate to have zero mean and unit variance:

\hat{x}^{(l)}_{(b), i} = \frac{x^{(l)}_{(b), i} - \mu^{(l)}_i}{\sqrt{(\sigma^{(l)}_i)^2 + \epsilon}}

The \epsilon is a small positive constant such as 10^{-9} added to the variance for numerical stability, to avoid division by zero.

Finally, it applies a linear transformation:

y^{(l)}_{(b), i} = \gamma_i \hat{x}^{(l)}_{(b), i} + \beta_i

Here, \gamma and \beta are parameters inside the BatchNorm module. They are learnable parameters, typically trained by gradient descent.

The following is a Python implementation of BatchNorm:

import numpy as np

def batchnorm(x, gamma, beta, epsilon=1e-9):
    # Mean and variance of each feature
    mu = np.mean(x, axis=0)   # shape (N,)
    var = np.var(x, axis=0)   # shape (N,)

    # Normalize the activations
    x_hat = (x - mu) / np.sqrt(var + epsilon)  # shape (B, N)

    # Apply the linear transform
    y = gamma * x_hat + beta  # shape (B, N)

    return y
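A minimal usage sketch of the function above (the shapes and values here are illustrative and not from the original text): a batch of B = 4 vectors with N = 3 features is normalized so that, before the learned scale and shift, each feature has approximately zero mean and unit variance.

# Hypothetical usage of the batchnorm function defined above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(4, 3))  # batch of 4 samples, 3 features
gamma = np.ones(3)                               # initial scale: identity
beta = np.zeros(3)                               # initial shift: zero

y = batchnorm(x, gamma, beta)
print(y.mean(axis=0))  # approximately 0 for each feature
print(y.std(axis=0))   # approximately 1 for each feature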


Interpretation

\gamma and \beta allow the network to learn to undo the normalization, if this is beneficial. BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus solely on modelling the nonlinear aspects of data, which may be beneficial, as a neural network can always be augmented with a linear transformation layer on top. It is claimed in the original publication that BatchNorm works by reducing internal covariate shift, though the claim has both supporters and detractors.


Special cases

The original paper recommended using BatchNorm only after a linear transform, not after a nonlinear activation. That is, \phi(\mathrm{BN}(Wx + b)), not \mathrm{BN}(\phi(Wx + b)). Also, the bias b does not matter, since it would be canceled by the subsequent mean subtraction, so it is of the form \mathrm{BN}(Wx). That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to zero.

For convolutional neural networks (CNNs), BatchNorm must preserve the translation-invariance of these models, meaning that it must treat all outputs of the same kernel as if they are different data points within a batch. This is sometimes called Spatial BatchNorm, or BatchNorm2D, or per-channel BatchNorm.

Concretely, suppose we have a 2-dimensional convolutional layer defined by:

x^{(l)}_{h, w, c} = \sum_{\Delta h, \Delta w, c'} K^{(l)}_{c, \Delta h, \Delta w, c'} x^{(l-1)}_{h + \Delta h, w + \Delta w, c'} + b^{(l)}_c

where:
* x^{(l)}_{h, w, c} is the activation of the neuron at position (h, w) in the c-th channel of the l-th layer.
* K^{(l)} is a kernel tensor. Each output channel c corresponds to a kernel K^{(l)}_c, with indices \Delta h, \Delta w, c'.
* b^{(l)}_c is the bias term for the c-th channel of the l-th layer.

In order to preserve the translational invariance, BatchNorm treats all outputs from the same kernel within a batch as if they were additional data points in the batch. That is, it is applied once per ''kernel'' c (equivalently, once per channel c), not per ''activation'' x^{(l)}_{h, w, c}:

\mu^{(l)}_c = \frac{1}{BHW} \sum_{b=1}^B \sum_{h=1}^H \sum_{w=1}^W x^{(l)}_{(b), h, w, c}
(\sigma^{(l)}_c)^2 = \frac{1}{BHW} \sum_{b=1}^B \sum_{h=1}^H \sum_{w=1}^W (x^{(l)}_{(b), h, w, c} - \mu^{(l)}_c)^2

where B is the batch size, H is the height of the feature map, and W is the width of the feature map. That is, even though there are only B data points in a batch, all BHW outputs from the kernel in this batch are treated equally. Subsequently, normalization and the linear transform are also done per kernel:

\hat{x}^{(l)}_{(b), h, w, c} = \frac{x^{(l)}_{(b), h, w, c} - \mu^{(l)}_c}{\sqrt{(\sigma^{(l)}_c)^2 + \epsilon}}
y^{(l)}_{(b), h, w, c} = \gamma_c \hat{x}^{(l)}_{(b), h, w, c} + \beta_c

Similar considerations apply for BatchNorm for ''n''-dimensional convolutions. The following is a Python implementation of BatchNorm for 2D convolutions:

import numpy as np

def batchnorm_cnn(x, gamma, beta, epsilon=1e-9):
    # Calculate the mean and variance for each channel.
    mean = np.mean(x, axis=(0, 1, 2), keepdims=True)
    var = np.var(x, axis=(0, 1, 2), keepdims=True)

    # Normalize the input tensor.
    x_hat = (x - mean) / np.sqrt(var + epsilon)

    # Scale and shift the normalized tensor.
    y = gamma * x_hat + beta

    return y

For multilayered recurrent neural networks (RNNs), BatchNorm is usually applied only for the ''input-to-hidden'' part, not the ''hidden-to-hidden'' part. Let the hidden state of the l-th layer at time t be h_t^{(l)}. The standard RNN, without normalization, satisfies

h^{(l)}_t = \phi(W^{(l)} h_t^{(l-1)} + U^{(l)} h_{t-1}^{(l)} + b^{(l)})

where W^{(l)}, U^{(l)}, b^{(l)} are weights and biases, and \phi is the activation function. Applying BatchNorm, this becomes

h^{(l)}_t = \phi(\mathrm{BN}(W^{(l)} h_t^{(l-1)}) + U^{(l)} h_{t-1}^{(l)})

There are two possible ways to define what a "batch" is in BatchNorm for RNNs: ''frame-wise'' and ''sequence-wise''. Concretely, consider applying an RNN to process a batch of sentences. Let h_{t, (b)}^{(l)} be the hidden state of the l-th layer for the t-th token of the b-th input sentence. Then frame-wise BatchNorm means normalizing over b:

\mu_t^{(l)} = \frac{1}{B} \sum_{b=1}^B h_{t, (b)}^{(l)}
(\sigma_t^{(l)})^2 = \frac{1}{B} \sum_{b=1}^B (h_{t, (b)}^{(l)} - \mu_t^{(l)})^2

and sequence-wise means normalizing over (b, t):

\mu^{(l)} = \frac{1}{BT} \sum_{b=1}^B \sum_{t=1}^T h_{t, (b)}^{(l)}
(\sigma^{(l)})^2 = \frac{1}{BT} \sum_{b=1}^B \sum_{t=1}^T (h_{t, (b)}^{(l)} - \mu^{(l)})^2

Frame-wise BatchNorm is suited for causal tasks such as next-character prediction, where future frames are unavailable, forcing normalization per frame. Sequence-wise BatchNorm is suited for tasks such as speech recognition, where the entire sequences are available, but with variable lengths. In a batch, the shorter sequences are padded with zeroes to match the length of the longest sequence in the batch. In such setups, frame-wise normalization is not recommended, because the number of unpadded frames decreases along the time axis, leading to increasingly poorer statistics estimates. It is also possible to apply BatchNorm to LSTMs.
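The difference between the two conventions can be shown in a short sketch (an illustration, not code from the original text), assuming the hidden states of one layer are stored in an array h of shape (B, T, D) with zero-padding already applied; only the statistics are computed, omitting the learned \gamma and \beta.

import numpy as np

def framewise_stats(h):
    # Normalize over the batch dimension only: one (mean, var) per timestep t.
    mu = np.mean(h, axis=0, keepdims=True)       # shape (1, T, D)
    var = np.var(h, axis=0, keepdims=True)       # shape (1, T, D)
    return mu, var

def sequencewise_stats(h):
    # Normalize over batch and time jointly: one (mean, var) per feature.
    mu = np.mean(h, axis=(0, 1), keepdims=True)  # shape (1, 1, D)
    var = np.var(h, axis=(0, 1), keepdims=True)  # shape (1, 1, D)
    return mu, var

# Example: a batch of 8 padded sequences, 20 timesteps, hidden size 32.
h = np.random.randn(8, 20, 32)
mu_f, var_f = framewise_stats(h)
mu_s, var_s = sequencewise_stats(h)
h_framewise = (h - mu_f) / np.sqrt(var_f + 1e-9)
h_sequencewise = (h - mu_s) / np.sqrt(var_s + 1e-9)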


Improvements

BatchNorm has been very popular, and there have been many attempted improvements. Some examples include:
* ghost batching: randomly partition a batch into sub-batches and perform BatchNorm separately on each;
* weight decay on \gamma and \beta;
* and combining BatchNorm with GroupNorm.

A particular problem with BatchNorm is that during training, the mean and variance are calculated on the fly for each batch, while a running estimate (usually an exponential moving average) is accumulated; during inference, the mean and variance are frozen to this running estimate. This train-test disparity degrades performance. The disparity can be decreased by simulating the moving average during inference:

\mu = \alpha E[x] + (1 - \alpha) \mu_{\text{train}}
\sigma^2 = \left( \alpha E[x^2] + (1 - \alpha) (\mu_{\text{train}}^2 + \sigma_{\text{train}}^2) \right) - \mu^2

where E[x] and E[x^2] are the first and second moments of the current inference batch, \mu_{\text{train}} and \sigma^2_{\text{train}} are the frozen training statistics, and \alpha is a hyperparameter to be optimized on a validation set.

Other works attempt to eliminate BatchNorm, such as the Normalizer-Free ResNet.
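As a concrete illustration of ghost batching mentioned in the list above, the following sketch (an assumption-laden illustration, not from the original text) splits a batch into sub-batches of a fixed ghost size and applies the batchnorm function defined earlier to each sub-batch independently; for simplicity it partitions the batch contiguously, so shuffling the batch beforehand would give the random partition described above.

import numpy as np

def ghost_batchnorm(x, gamma, beta, ghost_size, epsilon=1e-9):
    # Split the batch into sub-batches ("ghost batches") and normalize each
    # one separately, using only its own statistics.
    B = x.shape[0]
    outputs = []
    for start in range(0, B, ghost_size):
        sub = x[start:start + ghost_size]
        outputs.append(batchnorm(sub, gamma, beta, epsilon))
    return np.concatenate(outputs, axis=0)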


Layer normalization

Layer normalization (LayerNorm) is a popular alternative to BatchNorm. Unlike BatchNorm, which normalizes activations across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample. Compared to BatchNorm, LayerNorm's performance is not affected by batch size. It is a key component of transformer models.

For a given data input and layer, LayerNorm computes the mean \mu and variance \sigma^2 over all the neurons in the layer. Similar to BatchNorm, learnable parameters \gamma (scale) and \beta (shift) are applied. It is defined by:

\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad y_i = \gamma_i \hat{x}_i + \beta_i

where:

\mu = \frac{1}{D} \sum_{i=1}^D x_i, \quad \sigma^2 = \frac{1}{D} \sum_{i=1}^D (x_i - \mu)^2

and the index i ranges over the neurons in that layer.
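A minimal NumPy sketch of the definition above (an illustration, not from the original text), operating on a single vector x of D activations:

import numpy as np

def layernorm(x, gamma, beta, epsilon=1e-9):
    # Statistics over the features of a single sample, not over the batch.
    mu = np.mean(x)
    var = np.var(x)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta

# Example: a single hidden vector with D = 6 features.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = layernorm(x, gamma=np.ones(6), beta=np.zeros(6))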


Examples

For example, in CNNs, a LayerNorm applies to all activations in a layer. In the previous notation, we have:

\mu^{(l)} = \frac{1}{HWC} \sum_{h=1}^H \sum_{w=1}^W \sum_{c=1}^C x^{(l)}_{h, w, c}
(\sigma^{(l)})^2 = \frac{1}{HWC} \sum_{h=1}^H \sum_{w=1}^W \sum_{c=1}^C (x^{(l)}_{h, w, c} - \mu^{(l)})^2
\hat{x}^{(l)}_{h, w, c} = \frac{x^{(l)}_{h, w, c} - \mu^{(l)}}{\sqrt{(\sigma^{(l)})^2 + \epsilon}}
y^{(l)}_{h, w, c} = \gamma^{(l)} \hat{x}^{(l)}_{h, w, c} + \beta^{(l)}

Notice that the batch index b is removed, while the channel index c is added.

In recurrent neural networks and transformers, LayerNorm is applied individually to each timestep. For example, if the hidden vector in an RNN at timestep t is x^{(t)} \in \mathbb{R}^D, where D is the dimension of the hidden vector, then LayerNorm will be applied with:

\hat{x}^{(t)}_i = \frac{x^{(t)}_i - \mu^{(t)}}{\sqrt{(\sigma^{(t)})^2 + \epsilon}}, \quad y_i^{(t)} = \gamma_i \hat{x}^{(t)}_i + \beta_i

where:

\mu^{(t)} = \frac{1}{D} \sum_{i=1}^D x_i^{(t)}, \quad (\sigma^{(t)})^2 = \frac{1}{D} \sum_{i=1}^D (x_i^{(t)} - \mu^{(t)})^2
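A short sketch of the CNN case above (an illustration, assuming activations are stored as an array of shape (B, H, W, C) and that \gamma^{(l)}, \beta^{(l)} are scalars per layer, as in the formulas); each sample is normalized over all of its H \times W \times C activations:

import numpy as np

def layernorm_cnn(x, gamma, beta, epsilon=1e-9):
    # One mean and variance per sample, computed over (H, W, C).
    mu = np.mean(x, axis=(1, 2, 3), keepdims=True)   # shape (B, 1, 1, 1)
    var = np.var(x, axis=(1, 2, 3), keepdims=True)   # shape (B, 1, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta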


Root mean square layer normalization

Root mean square layer normalization (RMSNorm):

\hat{x}_i = \frac{x_i}{\sqrt{\frac{1}{D} \sum_{i=1}^D x_i^2}}, \quad y_i = \gamma \hat{x}_i + \beta

Essentially, it is LayerNorm where we enforce \mu, \epsilon = 0. It is also called L2 normalization. It is a special case of Lp normalization, or power normalization:

\hat{x}_i = \frac{x_i}{\left( \frac{1}{D} \sum_{i=1}^D |x_i|^p \right)^{1/p}}, \quad y_i = \gamma \hat{x}_i + \beta

where p > 0 is a constant.
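A minimal sketch of RMSNorm (an illustration, not from the original text); note that the definition above enforces \epsilon = 0, but a small constant is kept here for numerical safety:

import numpy as np

def rmsnorm(x, gamma, beta, epsilon=1e-9):
    # Divide by the root mean square of the activations; no mean subtraction.
    rms = np.sqrt(np.mean(x ** 2) + epsilon)
    return gamma * (x / rms) + beta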


Adaptive

Adaptive layer norm (adaLN) computes the \gamma, \beta in a LayerNorm not from the layer activation itself, but from other data. It was first proposed for CNNs, and has been used effectively in diffusion transformers (DiTs). For example, in a DiT, the conditioning information (such as a text encoding vector) is processed by a multilayer perceptron into \gamma, \beta, which is then applied in the LayerNorm module of a transformer.
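The following sketch illustrates the idea (a hedged illustration, not the DiT implementation; the MLP shape, activation, and parameter names are assumptions): a small two-layer MLP maps a conditioning vector to per-feature \gamma and \beta, which are then used in place of the learned LayerNorm parameters.

import numpy as np

def adaptive_layernorm(x, cond, W1, b1, W2, b2, epsilon=1e-9):
    # Hypothetical conditioning MLP: cond -> hidden -> (gamma, beta).
    h = np.tanh(cond @ W1 + b1)
    gamma_beta = h @ W2 + b2          # shape (2 * D,)
    D = x.shape[-1]
    gamma, beta = gamma_beta[:D], gamma_beta[D:]

    # Standard LayerNorm statistics over the features of x.
    mu = np.mean(x)
    var = np.var(x)
    x_hat = (x - mu) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta

# Example shapes: hidden size D = 8, conditioning size 4, MLP width 16.
D, C, H = 8, 4, 16
rng = np.random.default_rng(0)
x = rng.normal(size=D)
cond = rng.normal(size=C)
y = adaptive_layernorm(x, cond,
                       rng.normal(size=(C, H)), np.zeros(H),
                       rng.normal(size=(H, 2 * D)), np.zeros(2 * D))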


Weight normalization

Weight normalization (WeightNorm) is a technique inspired by BatchNorm that normalizes weight matrices in a neural network, rather than its activations.

One example is spectral normalization, which divides weight matrices by their spectral norm. Spectral normalization is used in generative adversarial networks (GANs) such as the Wasserstein GAN. The spectral norm \|W_i\|_s can be efficiently approximated by power iteration (a sketch is given below). By reassigning

W_i \leftarrow \frac{W_i}{\|W_i\|_s}

after each update of the discriminator, we can upper-bound \|W_i\|_s \leq 1, and thus upper-bound the Lipschitz norm of the discriminator \|D\|_L. The algorithm can be further accelerated by memoization: at step t, store the approximate singular vector x^*_i(t). Then, at step t+1, use x^*_i(t) as the initial guess for the algorithm. Since W_i(t+1) is very close to W_i(t), so is x^*_i(t) to x^*_i(t+1), thus allowing rapid convergence.
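A minimal power-iteration sketch for approximating the spectral norm, consistent with the description above (an illustration, not code from the original text):

import numpy as np

def spectral_norm(W, x=None, n_iters=5):
    # Power iteration on W^T W approximates the top singular value of W.
    # Passing the previous iterate as `x` implements the memoization trick.
    if x is None:
        x = np.random.randn(W.shape[1])
    for _ in range(n_iters):
        x = W.T @ (W @ x)
        x = x / np.linalg.norm(x)
    sigma = np.linalg.norm(W @ x)      # approximate spectral norm ||W||_s
    return sigma, x

# Spectral normalization of a weight matrix after a discriminator update:
W = np.random.randn(64, 128)
sigma, x_star = spectral_norm(W)
W = W / sigma                          # now ||W||_s is approximately 1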


CNN-specific normalization

There are some activation normalization techniques that are only used for CNNs.


Response normalization

Local response normalization was used in AlexNet. It was applied in a convolutional layer, just after a nonlinear activation function. It was defined by:

b^i_{x,y} = \frac{a^i_{x,y}}{\left( k + \alpha \sum_{j=\max(0, i-n/2)}^{\min(N-1, i+n/2)} (a^j_{x,y})^2 \right)^\beta}

where a^i_{x,y} is the activation of the neuron at location (x, y) and channel i, and N is the number of channels. That is, each pixel in a channel is suppressed by the activations of the same pixel in its adjacent channels. The hyperparameters k, n, \alpha, \beta are picked by using a validation set.

It was a variant of the earlier local contrast normalization:

b^i_{x,y} = \frac{a^i_{x,y}}{\left( k + \alpha \sum_{j=\max(0, i-n/2)}^{\min(N-1, i+n/2)} (a^j_{x,y} - \bar a^j_{x,y})^2 \right)^\beta}

where \bar a^j_{x,y} is the average activation in a small window centered on location (x, y) and channel j. The hyperparameters k, n, \alpha, \beta, and the size of the small window, are picked by using a validation set.

Similar methods were called divisive normalization, as they divide activations by a number depending on the activations. They were originally inspired by biology, where they were used to explain nonlinear responses of cortical neurons and nonlinear masking in visual perception.

Both kinds of local normalization were obviated by batch normalization, which is a more global form of normalization. Response normalization reappeared in ConvNeXt-2 as global response normalization.
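A small sketch of local response normalization as defined above (an illustration, assuming activations are stored as an array of shape (H, W, N) and using the AlexNet hyperparameter values as defaults):

import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a has shape (H, W, N): spatial positions by N channels.
    N = a.shape[-1]
    b = np.empty_like(a)
    for i in range(N):
        # Sum of squared activations over the adjacent channels of channel i.
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[..., lo:hi + 1] ** 2, axis=-1)) ** beta
        b[..., i] = a[..., i] / denom
    return b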


Group normalization

Group normalization (GroupNorm) is a technique also used solely for CNNs. It can be understood as LayerNorm for CNNs applied once per channel group. Suppose at a layer l there are channels 1, 2, \dots, C; these are partitioned into groups g_1, g_2, \dots, g_G. Then, LayerNorm is applied to each group.
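A minimal sketch (an illustration, assuming activations of shape (B, H, W, C), a channel count divisible by the number of groups, and per-channel \gamma, \beta):

import numpy as np

def groupnorm(x, gamma, beta, num_groups, epsilon=1e-9):
    # x has shape (B, H, W, C); channels are split into num_groups groups,
    # and each (sample, group) pair is normalized with its own statistics.
    B, H, W, C = x.shape
    xg = x.reshape(B, H, W, num_groups, C // num_groups)
    mu = np.mean(xg, axis=(1, 2, 4), keepdims=True)
    var = np.var(xg, axis=(1, 2, 4), keepdims=True)
    x_hat = ((xg - mu) / np.sqrt(var + epsilon)).reshape(B, H, W, C)
    return gamma * x_hat + beta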


Instance normalization

Instance normalization (InstanceNorm), or contrast normalization, is a technique first developed for neural style transfer, and is also only used for CNNs. It can be understood as LayerNorm for CNNs applied once per channel, or equivalently, as group normalization where each group consists of a single channel:

\mu^{(l)}_c = \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^W x^{(l)}_{h, w, c}
(\sigma^{(l)}_c)^2 = \frac{1}{HW} \sum_{h=1}^H \sum_{w=1}^W (x^{(l)}_{h, w, c} - \mu^{(l)}_c)^2
\hat{x}^{(l)}_{h, w, c} = \frac{x^{(l)}_{h, w, c} - \mu^{(l)}_c}{\sqrt{(\sigma^{(l)}_c)^2 + \epsilon}}
y^{(l)}_{h, w, c} = \gamma^{(l)}_c \hat{x}^{(l)}_{h, w, c} + \beta^{(l)}_c
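A minimal sketch (an illustration, assuming activations of shape (B, H, W, C)); it is the GroupNorm sketch above with num_groups = C:

import numpy as np

def instancenorm(x, gamma, beta, epsilon=1e-9):
    # One mean and variance per (sample, channel), computed over (H, W).
    mu = np.mean(x, axis=(1, 2), keepdims=True)   # shape (B, 1, 1, C)
    var = np.var(x, axis=(1, 2), keepdims=True)   # shape (B, 1, 1, C)
    return gamma * (x - mu) / np.sqrt(var + epsilon) + beta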


Adaptive instance normalization

Adaptive instance normalization (AdaIN) is a variant of instance normalization, designed specifically for neural style transfer with CNNs, rather than just CNNs in general.

In the AdaIN method of style transfer, we take a CNN and two input images, one for content and one for style. Each image is processed through the same CNN, and at a certain layer l, AdaIN is applied. Let x^{(l), \text{content}} be the activation in the content image, and x^{(l), \text{style}} be the activation in the style image. Then, AdaIN first computes the per-channel mean and variance of the activations of the style image x^{(l), \text{style}}, then uses those as the \gamma, \beta for InstanceNorm on x^{(l), \text{content}}. Note that x^{(l), \text{style}} itself remains unchanged. Explicitly, we have:

y^{(l)}_{h, w, c} = \sigma^{(l), \text{style}}_c \left( \frac{x^{(l), \text{content}}_{h, w, c} - \mu^{(l), \text{content}}_c}{\sigma^{(l), \text{content}}_c} \right) + \mu^{(l), \text{style}}_c
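A minimal sketch of the formula above (an illustration, assuming both activation maps have shape (H, W, C)):

import numpy as np

def adain(x_content, x_style, epsilon=1e-9):
    # The content activations are instance-normalized per channel, then
    # rescaled and shifted with the per-channel style statistics.
    mu_c = np.mean(x_content, axis=(0, 1), keepdims=True)
    sigma_c = np.std(x_content, axis=(0, 1), keepdims=True)
    mu_s = np.mean(x_style, axis=(0, 1), keepdims=True)
    sigma_s = np.std(x_style, axis=(0, 1), keepdims=True)
    return sigma_s * (x_content - mu_c) / (sigma_c + epsilon) + mu_s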


Transformers

Some normalization methods were designed for use in transformers.

The original 2017 transformer used the "post-LN" configuration for its LayerNorms. It was difficult to train, and required careful hyperparameter tuning and a "warm-up" in learning rate, where it starts small and gradually increases. The pre-LN convention, proposed several times in 2018, was found to be easier to train, requiring no warm-up and leading to faster convergence.

FixNorm and ScaleNorm both normalize activation vectors in a transformer. The FixNorm method divides the ''output'' vectors from a transformer by their L2 norms, then multiplies by a learned parameter g. The ScaleNorm replaces all LayerNorms inside a transformer by division with L2 norm, then multiplying by a learned parameter g' (shared by all ScaleNorm modules of a transformer).

Query-Key normalization (QKNorm) normalizes query and key vectors to have unit L2 norm.

In nGPT, many vectors are normalized to have unit L2 norm: hidden state vectors, input and output embedding vectors, weight matrix columns, and query and key vectors.
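A short sketch of ScaleNorm and QKNorm as described above (an illustration, not from the original text):

import numpy as np

def scalenorm(x, g, epsilon=1e-9):
    # Replace LayerNorm with division by the L2 norm, scaled by a learned g.
    return g * x / (np.linalg.norm(x) + epsilon)

def qknorm(q, k, epsilon=1e-9):
    # Normalize query and key vectors to unit L2 norm before attention.
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + epsilon)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + epsilon)
    return q, k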


Miscellaneous

Gradient normalization (GradNorm) normalizes gradient vectors during backpropagation.


See also

* Data preprocessing
* Feature scaling

