Feature scaling is a method used to normalize the range of independent variables or features of data. In
data processing, it is also known as
data normalization and is generally performed during the
data preprocessing step.
Motivation
Since the range of values of raw data varies widely, in some
machine learning
algorithms, objective functions will not work properly without
normalization. For example, many
classifiers calculate the distance between two points by the
Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
Another reason why feature scaling is applied is that
gradient descent converges much faster with feature scaling than without it.
It's also important to apply feature scaling if
regularization is used as part of the loss function (so that coefficients are penalized appropriately).
Empirically, feature scaling can improve the convergence speed of
stochastic gradient descent. In support vector machines, it can reduce the time to find support vectors. Feature scaling is also often used in applications involving distances and similarities between data points, such as clustering and similarity search. As an example, the
K-means clustering
algorithm is sensitive to feature scales.
Methods
Rescaling (min-max normalization)
Also known as min-max scaling or min-max normalization, rescaling is the simplest method and consists in rescaling the range of features to scale the range in [0, 1] or [-1, 1]. Selecting the target range depends on the nature of the data. The general formula for a min-max of [0, 1] is given as:
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
where x is an original value and x' is the normalized value. For example, suppose that we have the students' weight data, and the students' weights span [160 pounds, 200 pounds]. To rescale this data, we first subtract 160 from each student's weight and divide the result by 40 (the difference between the maximum and minimum weights).
To rescale a range between an arbitrary set of values [a, b], the formula becomes:
x' = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)}
where a, b are the min and max values of the target range.
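For illustration, the two formulas above can be sketched in Python with NumPy; the helper name rescale and the sample weights are hypothetical choices, not part of the original example beyond the 160–200 pound range:

import numpy as np

def rescale(x, a=0.0, b=1.0):
    # Min-max rescale a 1-D array onto the target range [a, b].
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return a + (x - x_min) * (b - a) / (x_max - x_min)

weights = np.array([160.0, 170.0, 185.0, 200.0])  # pounds
print(rescale(weights))          # maps onto [0, 1]: 0.0, 0.25, 0.625, 1.0
print(rescale(weights, -1, 1))   # maps onto [-1, 1]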
Mean normalization
x' = \frac{x - \bar{x}}{\max(x) - \min(x)}
where x is an original value, x' is the normalized value, and \bar{x} is the mean of that feature vector. There is another form of mean normalization which divides by the standard deviation; this is also called standardization.
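A minimal sketch of mean normalization under the same assumptions (NumPy; the helper name mean_normalize is hypothetical):

import numpy as np

def mean_normalize(x):
    # Subtract the mean and divide by the range, per the formula above.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.max() - x.min())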
Standardization (Z-score Normalization)

In machine learning, we can handle various types of data, e.g. audio signals and pixel values for image data, and this data can include multiple
dimensions. Feature standardization makes the values of each feature in the data have zero-mean (when subtracting the mean in the numerator) and unit-variance. This method is widely used for normalization in many machine learning algorithms (e.g.,
support vector machines,
logistic regression, and artificial neural networks).
The general method of calculation is to determine the distribution mean and standard deviation for each feature. Next, we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation.
x' = \frac{x - \bar{x}}{\sigma}
where x is the original feature vector, \bar{x} is the mean of that feature vector, and \sigma is its standard deviation.
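A minimal sketch of feature standardization (the helper name standardize is hypothetical; note that NumPy's std defaults to the population standard deviation, ddof=0, whereas some libraries use the sample standard deviation):

import numpy as np

def standardize(x):
    # Subtract the mean and divide by the standard deviation (z-scores).
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()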
Robust Scaling
Robust scaling, also known as standardization using the median and the interquartile range (IQR), is designed to be robust to outliers. It scales features using the median and IQR as reference points instead of the mean and standard deviation:
x' = \frac{x - Q_2(x)}{Q_3(x) - Q_1(x)}
where Q_1(x), Q_2(x), Q_3(x) are the three quartiles (25th, 50th, and 75th percentiles) of the feature.
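A minimal sketch of robust scaling (the helper name robust_scale is hypothetical):

import numpy as np

def robust_scale(x):
    # Subtract the median (Q2) and divide by the interquartile range (Q3 - Q1).
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return (x - q2) / (q3 - q1)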
Unit vector normalization
Unit vector normalization regards each individual data point as a vector and divides each by its vector norm to obtain
x' = \frac{x}{\|x\|}.
Any vector norm can be used, but the most common ones are the L1 norm and the L2 norm. For example, if x = (v_1, v_2, v_3), then its Lp-normalized version is:
x' = \frac{(v_1, v_2, v_3)}{\left(|v_1|^p + |v_2|^p + |v_3|^p\right)^{1/p}}
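A minimal sketch of unit vector normalization (the helper name unit_normalize is hypothetical; np.linalg.norm computes the Lp norm of a vector):

import numpy as np

def unit_normalize(x, p=2):
    # Divide the data point by its Lp norm; p=2 gives the Euclidean (L2) norm.
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, ord=p)

v = np.array([3.0, 4.0])
print(unit_normalize(v))        # [0.6, 0.8]; its L2 norm is now 1
print(unit_normalize(v, p=1))   # [3/7, 4/7]; its L1 norm is now 1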
See also
* Normalization (machine learning)
* Normalization (statistics)
* Standard score
* fMLLR, Feature space Maximum Likelihood Linear Regression
References
Further reading
* {{cite book , first1=Jiawei , last1=Han , first2=Micheline , last2=Kamber , first3=Jian , last3=Pei , title=Data Mining: Concepts and Techniques , publisher=Elsevier , year=2011 , chapter=Data Transformation and Data Discretization , pages=111–118 , isbn=9780123814807 , chapter-url=https://books.google.com/books?id=pQws07tdpjoC&pg=PA111 }}
External links
Lecture by Andrew Ng on feature scaling
Machine learning
Statistical data transformation