In statistics, a sum of squares due to lack of fit, or more tersely a lack-of-fit sum of squares, is one of the components of a partition of the sum of squares of residuals in an analysis of variance, used in the numerator in an F-test of the null hypothesis that says that a proposed model fits well. The other component is the pure-error sum of squares.
The pure-error sum of squares is the sum of squared deviations of each value of the dependent variable from the average value over all observations sharing its independent variable value(s). These are errors that could never be avoided by any predictive equation that assigned a predicted value for the dependent variable as a function of the value(s) of the independent variable(s). The remainder of the residual sum of squares is attributed to lack of fit of the model, since it would be mathematically possible to eliminate these errors entirely.
Principle
In order for the lack-of-fit sum of squares to differ from the sum of squares of residuals, there must be more than one value of the response variable for at least one of the values of the set of predictor variables. For example, consider fitting a line

: <math>y = \alpha + \beta x \,</math>

by the method of least squares. One takes as estimates of ''α'' and ''β'' the values that minimize the sum of squares of residuals, i.e., the sum of squares of the differences between the observed ''y''-value and the fitted ''y''-value. To have a lack-of-fit sum of squares that differs from the residual sum of squares, one must observe more than one ''y''-value for each of one or more of the ''x''-values. One then partitions the "sum of squares due to error", i.e., the sum of squares of residuals, into two components:
: sum of squares due to error = (sum of squares due to "pure" error) + (sum of squares due to lack of fit).
The sum of squares due to "pure" error is the sum of squares of the differences between each observed ''y''-value and the average of all ''y''-values corresponding to the same ''x''-value.
The sum of squares due to lack of fit is the ''weighted'' sum of squares of differences between each average of ''y''-values corresponding to the same ''x''-value and the corresponding fitted ''y''-value, the weight in each case being simply the number of observed ''y''-values for that ''x''-value.
Because it is a property of least squares regression that the vector whose components are "pure errors" and the vector of lack-of-fit components are orthogonal to each other, the following equality holds:
: <math>\begin{align}
& \sum (\text{observed value} - \text{fitted value})^2 && \text{(error)} \\
={} & \sum (\text{observed value} - \text{local average})^2 && \text{(pure error)} \\
& {}+ \sum \text{weight} \times (\text{local average} - \text{fitted value})^2 && \text{(lack of fit)}
\end{align}</math>
Hence the residual sum of squares has been completely decomposed into two components.
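As an illustrative sketch (the data values below are hypothetical and chosen only for demonstration), the following Python fragment computes the two components for a small data set with replicated ''x''-values and checks that they sum to the residual sum of squares:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical data: several y-observations at some of the x-values.
x = np.array([1, 1, 2, 2, 2, 3, 3, 4])
y = np.array([1.1, 1.3, 1.9, 2.2, 2.0, 3.1, 2.8, 3.9])

# Least-squares fit of the line y = alpha + beta * x.
beta, alpha = np.polyfit(x, y, 1)
y_hat = alpha + beta * x

# Residual sum of squares ("sum of squares due to error").
sse = np.sum((y - y_hat) ** 2)

# Pure-error and lack-of-fit components, accumulated level by level.
ss_pure_error = 0.0
ss_lack_of_fit = 0.0
for xi in np.unique(x):
    group = y[x == xi]
    fitted = alpha + beta * xi
    ss_pure_error += np.sum((group - group.mean()) ** 2)
    ss_lack_of_fit += len(group) * (group.mean() - fitted) ** 2

# The decomposition holds up to floating-point rounding error.
assert np.isclose(sse, ss_pure_error + ss_lack_of_fit)
print(sse, ss_pure_error, ss_lack_of_fit)
</syntaxhighlight>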
Mathematical details
Consider fitting a line with one predictor variable. Define ''i'' as an index of each of the ''n'' distinct ''x'' values, ''j'' as an index of the response variable observations for a given ''x'' value, and ''n''<sub>''i''</sub> as the number of ''y'' values associated with the ''i''th ''x'' value. The value of each response variable observation can be represented by

: <math>Y_{ij} = \alpha + \beta x_i + \varepsilon_{ij}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, n_i.</math>
Let

: <math>\hat\alpha, \hat\beta \,</math>

be the least squares estimates of the unobservable parameters ''α'' and ''β'' based on the observed values of ''x''<sub>''i''</sub> and ''Y''<sub>''ij''</sub>.
Let

: <math>\hat{Y}_i = \hat\alpha + \hat\beta x_i \,</math>

be the fitted values of the response variable. Then

: <math>\hat\varepsilon_{ij} = Y_{ij} - \hat{Y}_i \,</math>

are the residuals, which are observable estimates of the unobservable values of the error term ''ε''<sub>''ij''</sub>. Because of the nature of the method of least squares, the whole vector of residuals, with

: <math>N = \sum_{i=1}^n n_i</math>

scalar components, necessarily satisfies the two constraints

: <math>\sum_{i=1}^n \sum_{j=1}^{n_i} \hat\varepsilon_{ij} = 0 \,</math>

: <math>\sum_{i=1}^n \left( x_i \sum_{j=1}^{n_i} \hat\varepsilon_{ij} \right) = 0. \,</math>
It is thus constrained to lie in an (''N'' − 2)-dimensional subspace of '''R'''<sup>''N''</sup>, i.e. there are ''N'' − 2 "degrees of freedom for error".
Now let

: <math>\overline{Y}_{i\bullet} = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}</math>

be the average of all ''Y''-values associated with the ''i''th ''x''-value.
We partition the sum of squares due to error into two components:
: <math>\sum_{i=1}^n \sum_{j=1}^{n_i} \left(Y_{ij} - \hat{Y}_i\right)^2 = \underbrace{\sum_{i=1}^n \sum_{j=1}^{n_i} \left(Y_{ij} - \overline{Y}_{i\bullet}\right)^2}_{\text{sum of squares due to pure error}} + \underbrace{\sum_{i=1}^n n_i \left(\overline{Y}_{i\bullet} - \hat{Y}_i\right)^2}_{\text{sum of squares due to lack of fit}}.</math>
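(The following intermediate step is added here for clarity and is not spelled out in the original text.) The identity holds because, after writing each residual as <math>(Y_{ij} - \overline{Y}_{i\bullet}) + (\overline{Y}_{i\bullet} - \hat{Y}_i)</math> and expanding the square, the cross term vanishes: the deviations of each group of ''Y''-values from their own mean sum to zero, so

: <math>\sum_{i=1}^n \sum_{j=1}^{n_i} \left(Y_{ij} - \overline{Y}_{i\bullet}\right)\left(\overline{Y}_{i\bullet} - \hat{Y}_i\right) = \sum_{i=1}^n \left(\overline{Y}_{i\bullet} - \hat{Y}_i\right) \sum_{j=1}^{n_i} \left(Y_{ij} - \overline{Y}_{i\bullet}\right) = 0.</math>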
Probability distributions
Sums of squares
Suppose the error terms ''ε''<sub>''ij''</sub> are independent and normally distributed with expected value 0 and variance ''σ''<sup>2</sup>. We treat ''x''<sub>''i''</sub> as constant rather than random. Then the response variables ''Y''<sub>''ij''</sub> are random only because the errors ''ε''<sub>''ij''</sub> are random.
It can be shown to follow that if the straight-line model is correct, then the sum of squares due to error divided by the error variance,

: <math>\frac{1}{\sigma^2} \sum_{i=1}^n \sum_{j=1}^{n_i} \left(Y_{ij} - \hat{Y}_i\right)^2,</math>

has a chi-squared distribution with ''N'' − 2 degrees of freedom.
Moreover, given the total number of observations ''N'', the number of levels of the independent variable ''n'', and the number of parameters in the model ''p'':
* The sum of squares due to pure error, divided by the error variance ''σ''<sup>2</sup>, has a chi-squared distribution with ''N'' − ''n'' degrees of freedom;
* The sum of squares due to lack of fit, divided by the error variance ''σ''<sup>2</sup>, has a chi-squared distribution with ''n'' − ''p'' degrees of freedom (here ''p'' = 2 as there are two parameters in the straight-line model);
* The two sums of squares are probabilistically independent.
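As a consistency check (added here; it follows directly from the statements above), these degrees of freedom account for all of the ''N'' − 2 error degrees of freedom when ''p'' = 2:

: <math>(N - n) + (n - 2) = N - 2.</math>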
The test statistic
It then follows that the statistic

: <math>F = \frac{\text{lack-of-fit sum of squares} / (n - p)}{\text{pure-error sum of squares} / (N - n)} = \frac{\displaystyle \sum_{i=1}^n n_i \left(\overline{Y}_{i\bullet} - \hat{Y}_i\right)^2 \Big/ (n - 2)}{\displaystyle \sum_{i=1}^n \sum_{j=1}^{n_i} \left(Y_{ij} - \overline{Y}_{i\bullet}\right)^2 \Big/ (N - n)}</math>

has an F-distribution with the corresponding number of degrees of freedom in the numerator and the denominator, provided that the model is correct. If the model is wrong, then the probability distribution of the denominator is still as stated above, and the numerator and denominator are still independent. But the numerator then has a noncentral chi-squared distribution, and consequently the quotient as a whole has a non-central F-distribution.
One uses this F-statistic to test the null hypothesis that the linear model is correct. Since the non-central F-distribution is stochastically larger than the (central) F-distribution, one rejects the null hypothesis if the F-statistic is larger than the critical F value. The critical value corresponds to the cumulative distribution function of the F distribution with ''x'' equal to the desired confidence level, and degrees of freedom ''d''<sub>1</sub> = (''n'' − ''p'') and ''d''<sub>2</sub> = (''N'' − ''n'').
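As a sketch of how the test might be carried out in practice (again with hypothetical data and a hypothetical 5% significance level, not taken from any source), the F-statistic, critical value and p-value can be computed with <code>scipy.stats.f</code>:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import f as f_dist

# Hypothetical data with replicated x-values (same as the earlier sketch).
x = np.array([1, 1, 2, 2, 2, 3, 3, 4])
y = np.array([1.1, 1.3, 1.9, 2.2, 2.0, 3.1, 2.8, 3.9])

beta, alpha = np.polyfit(x, y, 1)

N = len(y)               # total number of observations
levels = np.unique(x)
n = len(levels)          # number of distinct x-values
p = 2                    # parameters in the straight-line model

ss_pure_error = sum(np.sum((y[x == xi] - y[x == xi].mean()) ** 2) for xi in levels)
ss_lack_of_fit = sum(np.sum(x == xi) * (y[x == xi].mean() - (alpha + beta * xi)) ** 2
                     for xi in levels)

# F-statistic: lack-of-fit mean square over pure-error mean square.
F = (ss_lack_of_fit / (n - p)) / (ss_pure_error / (N - n))

significance = 0.05      # hypothetical significance level
critical_value = f_dist.ppf(1 - significance, n - p, N - n)
p_value = f_dist.sf(F, n - p, N - n)

# Reject the straight-line model if F exceeds the critical value.
print(F, critical_value, p_value)
</syntaxhighlight>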
The assumptions of normal distribution of errors and independence can be shown to entail that this lack-of-fit test is the likelihood-ratio test of this null hypothesis.
See also
* Fraction of variance unexplained
* Goodness of fit
* Linear regression