In statistics, a sum of squares due to lack of fit, or more tersely a lack-of-fit sum of squares, is one of the components of a partition of the sum of squares of residuals in an analysis of variance, used in the numerator in an F-test of the null hypothesis that says that a proposed model fits well. The other component is the pure-error sum of squares.
The pure-error sum of squares is the sum of squared deviations of each value of the dependent variable from the average value over all observations sharing its independent variable value(s). These are errors that could never be avoided by any predictive equation that assigned a predicted value for the dependent variable as a function of the value(s) of the independent variable(s). The remainder of the residual sum of squares is attributed to lack of fit of the model, since it would be mathematically possible to eliminate these errors entirely.
Principle
In order for the lack-of-fit sum of squares to differ from the sum of squares of residuals, there must be more than one value of the response variable for at least one of the values of the set of predictor variables. For example, consider fitting a line

: <math>y = \alpha + \beta x \,</math>

by the method of least squares. One takes as estimates of ''α'' and ''β'' the values that minimize the sum of squares of residuals, i.e., the sum of squares of the differences between the observed ''y''-value and the fitted ''y''-value. To have a lack-of-fit sum of squares that differs from the residual sum of squares, one must observe more than one ''y''-value for each of one or more of the ''x''-values. One then partitions the "sum of squares due to error", i.e., the sum of squares of residuals, into two components:
: sum of squares due to error = (sum of squares due to "pure" error) + (sum of squares due to lack of fit).
The sum of squares due to "pure" error is the sum of squares of the differences between each observed ''y''-value and the average of all ''y''-values corresponding to the same ''x''-value.
The sum of squares due to lack of fit is the ''weighted'' sum of squares of differences between each average of ''y''-values corresponding to the same ''x''-value and the corresponding fitted ''y''-value, the weight in each case being simply the number of observed ''y''-values for that ''x''-value.
Because it is a property of least squares regression that the vector whose components are "pure errors" and the vector of lack-of-fit components are orthogonal to each other, the following equality holds:
: <math>\begin{align}
& \sum (\text{observed value} - \text{fitted value})^2 && \text{(error)} \\
={} & \sum (\text{observed value} - \text{local average})^2 && \text{(pure error)} \\
& {}+ \sum \text{weight} \times (\text{local average} - \text{fitted value})^2 && \text{(lack of fit)}
\end{align}</math>
Hence the residual sum of squares has been completely decomposed into two components.
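As an illustrative sketch (the data values below are hypothetical and chosen only for demonstration), the following Python fragment computes the two components for a small data set with replicated ''x''-values and checks that they sum to the residual sum of squares:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical data: several y-observations at some of the x-values.
x = np.array([1, 1, 2, 2, 2, 3, 3, 4])
y = np.array([1.1, 1.3, 1.9, 2.2, 2.0, 3.1, 2.8, 3.9])

# Least-squares fit of the line y = alpha + beta * x.
beta, alpha = np.polyfit(x, y, 1)
y_hat = alpha + beta * x

# Residual sum of squares ("sum of squares due to error").
sse = np.sum((y - y_hat) ** 2)

# Pure-error and lack-of-fit components, accumulated level by level.
ss_pure_error = 0.0
ss_lack_of_fit = 0.0
for xi in np.unique(x):
    group = y[x == xi]
    fitted = alpha + beta * xi
    ss_pure_error += np.sum((group - group.mean()) ** 2)
    ss_lack_of_fit += len(group) * (group.mean() - fitted) ** 2

# The decomposition holds up to floating-point rounding error.
assert np.isclose(sse, ss_pure_error + ss_lack_of_fit)
print(sse, ss_pure_error, ss_lack_of_fit)
</syntaxhighlight>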
Mathematical details
Consider fitting a line with one predictor variable. Define ''i'' as an index of each of the ''n'' distinct ''x'' values, ''j'' as an index of the response variable observations for a given ''x'' value, and ''n''<sub>''i''</sub> as the number of ''y'' values associated with the ''i''th ''x'' value. The value of each response variable observation can be represented by

: <math>Y_{ij} = \alpha + \beta x_i + \varepsilon_{ij}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, n_i.</math>
Let

: <math>\hat\alpha, \hat\beta \,</math>

be the least squares estimates of the unobservable parameters ''α'' and ''β'' based on the observed values of ''x''<sub>''i''</sub> and ''Y''<sub>''ij''</sub>.
Let

: <math>\hat{Y}_i = \hat\alpha + \hat\beta x_i \,</math>

be the fitted values of the response variable. Then

: <math>\hat\varepsilon_{ij} = Y_{ij} - \hat{Y}_i \,</math>

are the residuals, which are observable estimates of the unobservable values of the error term ''ε''<sub>''ij''</sub>. Because of the nature of the method of least squares, the whole vector of residuals, with

: <math>N = \sum_{i=1}^n n_i</math>

scalar components, necessarily satisfies the two constraints

: <math>\sum_{i=1}^n \sum_{j=1}^{n_i} \hat\varepsilon_{ij} = 0 \,</math>

: <math>\sum_{i=1}^n \left( x_i \sum_{j=1}^{n_i} \hat\varepsilon_{ij} \right) = 0. \,</math>
It is thus constrained to lie in an (''N'' − 2)-dimensional subspace of '''R'''<sup>''N''</sup>, i.e. there are ''N'' − 2 "degrees of freedom for error".
Now let

: <math>\overline{Y}_{i\bullet} = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}</math>

be the average of all ''Y''-values associated with the ''i''th ''x''-value.
We partition the sum of squares due to error into two components:
: <math>\sum_{i=1}^n \sum_{j=1}^{n_i} \left(Y_{ij} - \hat{Y}_i\right)^2 = \underbrace{\sum_{i=1}^n \sum_{j=1}^{n_i} \left(Y_{ij} - \overline{Y}_{i\bullet}\right)^2}_{\text{sum of squares due to pure error}} + \underbrace{\sum_{i=1}^n n_i \left(\overline{Y}_{i\bullet} - \hat{Y}_i\right)^2}_{\text{sum of squares due to lack of fit}}.</math>
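(The following intermediate step is added here for clarity and is not spelled out in the original text.) The identity holds because, after writing each residual as <math>(Y_{ij} - \overline{Y}_{i\bullet}) + (\overline{Y}_{i\bullet} - \hat{Y}_i)</math> and expanding the square, the cross term vanishes: the deviations of each group of ''Y''-values from their own mean sum to zero, so

: <math>\sum_{i=1}^n \sum_{j=1}^{n_i} \left(Y_{ij} - \overline{Y}_{i\bullet}\right)\left(\overline{Y}_{i\bullet} - \hat{Y}_i\right) = \sum_{i=1}^n \left(\overline{Y}_{i\bullet} - \hat{Y}_i\right) \sum_{j=1}^{n_i} \left(Y_{ij} - \overline{Y}_{i\bullet}\right) = 0.</math>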
Probability distributions
Sums of squares
Suppose the error terms ''ε''<sub>''ij''</sub> are independent and normally distributed with expected value 0 and variance ''σ''<sup>2</sup>. We treat ''x''<sub>''i''</sub> as constant rather than random. Then the response variables ''Y''<sub>''ij''</sub> are random only because the errors ''ε''<sub>''ij''</sub> are random.
It can be shown to follow that if the straight-line model is correct, then the sum of squares due to error divided by the error variance,

: <math>\frac{1}{\sigma^2} \sum_{i=1}^n \sum_{j=1}^{n_i} \left(Y_{ij} - \hat{Y}_i\right)^2,</math>

has a chi-squared distribution with ''N'' − 2 degrees of freedom.
Moreover, given the total number of observations ''N'', the number of levels of the independent variable ''n'', and the number of parameters in the model ''p'':
* The sum of squares due to pure error, divided by the error variance ''σ''<sup>2</sup>, has a chi-squared distribution with ''N'' − ''n'' degrees of freedom;
* The sum of squares due to lack of fit, divided by the error variance ''σ''<sup>2</sup>, has a chi-squared distribution with ''n'' − ''p'' degrees of freedom (here ''p'' = 2 as there are two parameters in the straight-line model);
* The two sums of squares are probabilistically independent.
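As a consistency check (added here; it follows directly from the statements above), these degrees of freedom account for all of the ''N'' − 2 error degrees of freedom when ''p'' = 2:

: <math>(N - n) + (n - 2) = N - 2.</math>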
The test statistic
It then follows that the statistic

: <math>F = \frac{\text{lack-of-fit sum of squares} / (n - p)}{\text{pure-error sum of squares} / (N - n)} = \frac{\displaystyle \sum_{i=1}^n n_i \left(\overline{Y}_{i\bullet} - \hat{Y}_i\right)^2 \Big/ (n - 2)}{\displaystyle \sum_{i=1}^n \sum_{j=1}^{n_i} \left(Y_{ij} - \overline{Y}_{i\bullet}\right)^2 \Big/ (N - n)}</math>

has an F-distribution with the corresponding number of degrees of freedom in the numerator and the denominator, provided that the model is correct. If the model is wrong, then the probability distribution of the denominator is still as stated above, and the numerator and denominator are still independent. But the numerator then has a noncentral chi-squared distribution, and consequently the quotient as a whole has a non-central F-distribution.
One uses this F-statistic to test the null hypothesis that the linear model is correct. Since the non-central F-distribution is stochastically larger than the (central) F-distribution, one rejects the null hypothesis if the F-statistic is larger than the critical F value. The critical value corresponds to the cumulative distribution function of the F distribution with ''x'' equal to the desired confidence level, and degrees of freedom ''d''<sub>1</sub> = (''n'' − ''p'') and ''d''<sub>2</sub> = (''N'' − ''n'').
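As a sketch of how the test might be carried out in practice (again with hypothetical data and a hypothetical 5% significance level, not taken from any source), the F-statistic, critical value and p-value can be computed with <code>scipy.stats.f</code>:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import f as f_dist

# Hypothetical data with replicated x-values (same as the earlier sketch).
x = np.array([1, 1, 2, 2, 2, 3, 3, 4])
y = np.array([1.1, 1.3, 1.9, 2.2, 2.0, 3.1, 2.8, 3.9])

beta, alpha = np.polyfit(x, y, 1)

N = len(y)               # total number of observations
levels = np.unique(x)
n = len(levels)          # number of distinct x-values
p = 2                    # parameters in the straight-line model

ss_pure_error = sum(np.sum((y[x == xi] - y[x == xi].mean()) ** 2) for xi in levels)
ss_lack_of_fit = sum(np.sum(x == xi) * (y[x == xi].mean() - (alpha + beta * xi)) ** 2
                     for xi in levels)

# F-statistic: lack-of-fit mean square over pure-error mean square.
F = (ss_lack_of_fit / (n - p)) / (ss_pure_error / (N - n))

significance = 0.05      # hypothetical significance level
critical_value = f_dist.ppf(1 - significance, n - p, N - n)
p_value = f_dist.sf(F, n - p, N - n)

# Reject the straight-line model if F exceeds the critical value.
print(F, critical_value, p_value)
</syntaxhighlight>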
The assumptions of normal distribution of errors and independence can be shown to entail that this lack-of-fit test is the likelihood-ratio test of this null hypothesis.
See also
* Fraction of variance unexplained
* Goodness of fit
* Linear regression