Sum Of Squares (statistics)

The partition of sums of squares is a concept that permeates much of inferential statistics and descriptive statistics. More properly, it is the partitioning of sums of squared deviations or errors. Mathematically, the sum of squared deviations is an unscaled, or unadjusted, measure of dispersion (also called variability). When scaled for the number of degrees of freedom, it estimates the variance, or spread of the observations about their mean value. Partitioning of the sum of squared deviations into various components allows the overall variability in a dataset to be ascribed to different types or sources of variability, with the relative importance of each being quantified by the size of each component of the overall sum of squares.


Background

The distance from any point in a collection of data to the mean of the data is the deviation. This can be written as y_i - \overline{y}, where y_i is the ''i''th data point and \overline{y} is the estimate of the mean. If all such deviations are squared, then summed, as in \sum_{i=1}^n\left(y_i-\overline{y}\,\right)^2, this gives the "sum of squares" for these data.

When more data are added to the collection, the sum of squares will increase, except in unlikely cases such as the new data being equal to the mean. So usually the sum of squares will grow with the size of the data collection. That is a manifestation of the fact that it is unscaled.

In many cases, the number of degrees of freedom is simply the number of data points in the collection, minus one. We write this as ''n'' − 1, where ''n'' is the number of data points.

Scaling (also known as normalizing) means adjusting the sum of squares so that it does not grow as the size of the data collection grows. This is important when we want to compare samples of different sizes, such as a sample of 100 people compared to a sample of 20 people. If the sum of squares were not normalized, its value would always be larger for the sample of 100 people than for the sample of 20 people. To scale the sum of squares, we divide it by the degrees of freedom, i.e., we calculate the sum of squares per degree of freedom, or variance. Standard deviation, in turn, is the square root of the variance.

The above describes how the sum of squares is used in descriptive statistics; see the article on total sum of squares for an application of this broad principle to inferential statistics.
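As a concrete illustration, the following minimal sketch (plain Python with only the standard library; the sample data are invented for the example) computes the unscaled sum of squared deviations, scales it by the ''n'' − 1 degrees of freedom to obtain the sample variance, and takes the square root to obtain the standard deviation.

import math

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(data)
mean = sum(data) / n

# Unscaled measure of dispersion: grows as more data are added.
sum_of_squares = sum((y - mean) ** 2 for y in data)

# Scaling by the degrees of freedom (n - 1) gives the sample variance,
# which is comparable across samples of different sizes.
variance = sum_of_squares / (n - 1)
standard_deviation = math.sqrt(variance)

print(f"sum of squares     = {sum_of_squares:.3f}")
print(f"variance           = {variance:.3f}")
print(f"standard deviation = {standard_deviation:.3f}")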


Partitioning the sum of squares in linear regression

Theorem. Given a linear regression model y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i ''including a constant'' \beta_0, based on a sample (y_i, x_{i1}, \ldots, x_{ip}), \, i = 1, \ldots, n containing ''n'' observations, the total sum of squares \mathrm{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2 can be partitioned as follows into the explained sum of squares (ESS) and the residual sum of squares (RSS):

:\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS},

where this equation is equivalent to each of the following forms:

: \begin{align}
\left\| y - \bar{y} \mathbf{1} \right\|^2 &= \left\| \hat{y} - \bar{y} \mathbf{1} \right\|^2 + \left\| \hat{\varepsilon} \right\|^2, \quad \mathbf{1} = (1, 1, \ldots, 1)^T ,\\
\sum_{i=1}^n (y_i - \bar{y})^2 &= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2 ,\\
\sum_{i=1}^n (y_i - \bar{y})^2 &= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 ,
\end{align}

where \hat{y}_i is the value estimated by the regression line having \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p as the estimated coefficients.
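The theorem can be checked numerically. The sketch below (assuming NumPy is available; the data are synthetic and purely illustrative) fits an ordinary least-squares model with an intercept, i.e. with a column of ones in the design matrix, and confirms that TSS = ESS + RSS up to floating-point error.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # constant + p regressors
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
residuals = y - y_hat

tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
rss = np.sum(residuals ** 2)            # residual sum of squares

print(tss, ess + rss)                   # the two values agree
assert np.isclose(tss, ess + rss)

Note that the partition relies on the intercept being present; without the column of ones the cross term in the proof below does not vanish and the identity generally fails.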


Proof

: \begin{align}
\sum_{i=1}^n (y_i - \overline{y})^2 &= \sum_{i=1}^n (y_i - \overline{y} + \hat{y}_i - \hat{y}_i)^2 = \sum_{i=1}^n \bigl((\hat{y}_i - \bar{y}) + \underbrace{(y_i - \hat{y}_i)}_{\hat{\varepsilon}_i}\bigr)^2 \\
&= \sum_{i=1}^n \bigl((\hat{y}_i - \bar{y})^2 + 2 \hat{\varepsilon}_i (\hat{y}_i - \bar{y}) + \hat{\varepsilon}_i^2\bigr) \\
&= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 + 2 \sum_{i=1}^n \hat{\varepsilon}_i (\hat{y}_i - \bar{y}) \\
&= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 + 2 \sum_{i=1}^n \hat{\varepsilon}_i (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip} - \overline{y}) \\
&= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 + 2 (\hat{\beta}_0 - \overline{y}) \underbrace{\sum_{i=1}^n \hat{\varepsilon}_i}_{0} + 2 \hat{\beta}_1 \underbrace{\sum_{i=1}^n \hat{\varepsilon}_i x_{i1}}_{0} + \cdots + 2 \hat{\beta}_p \underbrace{\sum_{i=1}^n \hat{\varepsilon}_i x_{ip}}_{0} \\
&= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 = \mathrm{ESS} + \mathrm{RSS}
\end{align}

The requirement that the model include a constant, or equivalently that the design matrix contain a column of ones, ensures that \sum_{i=1}^n \hat{\varepsilon}_i = 0, i.e. \hat{\varepsilon}^T\mathbf{1}=0.

The proof can also be expressed in vector form, as follows:

: \begin{align}
SS_\text{total} = \Vert \mathbf{y} - \bar{y}\mathbf{1} \Vert^2 & = \Vert \mathbf{y} - \bar{y}\mathbf{1} + \mathbf{\hat{y}} - \mathbf{\hat{y}} \Vert^2 , \\
& = \Vert \left( \mathbf{\hat{y}} - \bar{y}\mathbf{1} \right) + \left( \mathbf{y} - \mathbf{\hat{y}} \right) \Vert^2, \\
& = \Vert \mathbf{\hat{y}} - \bar{y}\mathbf{1} \Vert^2 + \Vert \hat{\varepsilon} \Vert^2 + 2 \hat{\varepsilon}^T \left( \mathbf{\hat{y}} - \bar{y}\mathbf{1} \right), \\
& = SS_\text{regression} + SS_\text{error} + 2 \hat{\varepsilon}^T \left( X\hat{\beta} - \bar{y}\mathbf{1} \right), \\
& = SS_\text{regression} + SS_\text{error} + 2 \left( \hat{\varepsilon}^T X \right)\hat{\beta} - 2\bar{y}\underbrace{\hat{\varepsilon}^T \mathbf{1}}_{0}, \\
& = SS_\text{regression} + SS_\text{error}.
\end{align}

The elimination of terms in the last line used the fact that

: \hat{\varepsilon}^T X = \left( \mathbf{y} - \mathbf{\hat{y}} \right)^T X = \mathbf{y}^T \left(I - X(X^T X)^{-1} X^T\right)^T X = \mathbf{y}^T \left(X^T - X^T\right)^T = \mathbf{0}.
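The two orthogonality facts used above can also be verified numerically. The following small sketch (assuming NumPy; the data are synthetic) checks that, with a constant term in the model, the residual vector is orthogonal to every column of the design matrix, \hat{\varepsilon}^T X = 0, and in particular that the residuals sum to zero.

import numpy as np

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.2, size=n)

# Least-squares coefficients from the normal equations: (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

print(residuals.sum())   # ~ 0, because the first column of X is all ones
print(residuals @ X)     # ~ [0, 0, 0]: residuals are orthogonal to the columns of X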


Further partitioning

Note that the residual sum of squares can be further partitioned as the lack-of-fit sum of squares plus the sum of squares due to pure error.
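As a minimal illustration of this further partition (assuming NumPy; the data and the simple straight-line model are invented for the example), the pure-error sum of squares is the variation of replicate responses about their group means at repeated x values, and the lack-of-fit sum of squares is the replication-weighted deviation of those group means from the fitted line; together they reproduce the residual sum of squares.

import numpy as np

x = np.array([1., 1., 2., 2., 3., 3., 4., 4.])           # replicated x values
y = np.array([1.1, 0.9, 2.3, 2.1, 2.8, 3.2, 4.4, 3.8])

# Fit a straight line y = b0 + b1*x by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ b) ** 2)          # residual sum of squares

pure_error = 0.0    # replicate variation about each group mean
lack_of_fit = 0.0   # group-mean deviation from the fitted line
for v in np.unique(x):
    group = y[x == v]
    fitted = b[0] + b[1] * v
    pure_error += np.sum((group - group.mean()) ** 2)
    lack_of_fit += len(group) * (group.mean() - fitted) ** 2

print(rss, lack_of_fit + pure_error)    # the partition holds
assert np.isclose(rss, lack_of_fit + pure_error)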


See also

* Inner-product space
** Hilbert space
*** Euclidean space
* Expected mean squares
** Orthogonality
** Orthonormal basis
*** Orthogonal complement, the closed subspace orthogonal to a set (especially a subspace)
*** Orthomodular lattice of the subspaces of an inner-product space
*** Orthogonal projection
** Pythagorean theorem that the sum of the squared norms of orthogonal summands equals the squared norm of the sum.
* Least squares
* Mean squared error
* Squared deviations

