In statistics, a regression diagnostic is one of a set of procedures available for regression analysis that seek to assess the validity of a model in any of a number of different ways. This assessment may be an exploration of the model's underlying statistical assumptions, an examination of the structure of the model by considering formulations that have fewer, more or different explanatory variables, or a study of subgroups of observations, looking for those that are either poorly represented by the model (outliers) or that have a relatively large effect on the regression model's predictions. A regression diagnostic may take the form of a graphical result, informal quantitative results or a formal statistical hypothesis test, each of which provides guidance for further stages of a regression analysis.

Introduction

Regression diagnostics have often been developed or were initially proposed in the context of linear regression or, more particularly, ordinary least squares. This means that many formally defined diagnostics are only available for these contexts.

Assessing assumptions

;Distribution of model errors
* Normal probability plot
;Homoscedasticity
* Goldfeld–Quandt test
* Breusch–Pagan test
* Park test
* White test
;Correlation of model errors
* Breusch–Godfrey test
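As an illustration of a formal homoscedasticity check, the following is a minimal sketch of the studentized (Koenker) variant of the Breusch–Pagan statistic for a model with a single explanatory variable. It is not taken from the source: the function names and the synthetic data are assumptions made for the example. The statistic is the sample size times the R² of an auxiliary regression of the squared residuals on the explanatory variable; under the null hypothesis of constant error variance it is approximately chi-squared with 1 degree of freedom here, so values above about 3.841 suggest heteroscedasticity at the 5% level.

```python
# Sketch only: one-predictor OLS and the studentized (Koenker) Breusch-Pagan
# statistic. All names and data below are illustrative assumptions.

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(x, y):
    """Coefficient of determination of the least-squares line of y on x."""
    a, b = ols_fit(x, y)
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def breusch_pagan_lm(x, y):
    """n * R^2 from regressing the squared OLS residuals on x."""
    a, b = ols_fit(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    return len(x) * r_squared(x, sq_resid)


# Synthetic data: the error spread grows with x in the first response
# and stays constant in the second.
x = list(range(1, 21))
y_hetero = [2 + 0.5 * xi + (-1) ** xi * 0.2 * xi for xi in x]
y_homo = [2 + 0.5 * xi + (-1) ** xi * 1.0 for xi in x]

lm_hetero = breusch_pagan_lm(x, y_hetero)
lm_homo = breusch_pagan_lm(x, y_homo)
print("LM statistic, growing spread:", lm_hetero)
print("LM statistic, constant spread:", lm_homo)
```

With several explanatory variables the auxiliary regression would include all of them, and the degrees of freedom would equal their number.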

Assessing model structure

;Adequacy of existing explanatory variables
* Partial residual plot
* Ramsey RESET test
* F test for use when there are replicated observations, so that a comparison can be made between the lack-of-fit sum of squares and the pure error sum of squares, under the assumption that model errors are homoscedastic and have a normal distribution.
;Adding or dropping explanatory variables
* Partial regression plot
* Student's t test for testing inclusion of a single explanatory variable, or the F test for testing inclusion of a group of variables, both under the assumption that model errors are homoscedastic and have a normal distribution.
;Change of model structure between groups of observations
* Structural break test
* Chow test
;Comparing model structures
* PRESS statistic
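The lack-of-fit F test listed above under adequacy of existing explanatory variables can be sketched as follows for a straight-line model with replicated observations. This is a hedged illustration, not a method given in the source: the function name and the data are assumptions. The residual sum of squares is split into a pure error part (variation of replicates around their own group mean) and a lack-of-fit part (the remainder), and the ratio of their mean squares is compared with an F distribution.

```python
# Sketch only: lack-of-fit F statistic for a one-predictor linear model
# with replicated x values. Names and data are illustrative assumptions.
from collections import defaultdict


def lack_of_fit_f(x, y):
    """Return (F, df_lack_of_fit, df_pure_error) for the line y = a + b*x.

    Requires more than two distinct x values and at least one replicated x.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual sum of squares of the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

    # Pure error: spread of replicates around their own group mean.
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    ss_pure = 0.0
    for ys in groups.values():
        mean = sum(ys) / len(ys)
        ss_pure += sum((yi - mean) ** 2 for yi in ys)

    c = len(groups)              # number of distinct x values
    df_lof = c - 2               # two parameters: intercept and slope
    df_pure = n - c
    ss_lof = sse - ss_pure       # lack-of-fit sum of squares
    f_stat = (ss_lof / df_lof) / (ss_pure / df_pure)
    return f_stat, df_lof, df_pure


# Two replicates at each of five x values; the response is curved
# (close to x squared), so a straight line should fit poorly.
x = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y = [0.9, 1.1, 3.9, 4.1, 8.9, 9.1, 15.9, 16.1, 24.9, 25.1]

f_stat, df_lof, df_pure = lack_of_fit_f(x, y)
print(f_stat, df_lof, df_pure)  # F is about 466.7 with (3, 5) degrees of freedom
```

A large F value relative to the F distribution with the reported degrees of freedom indicates that the straight-line form is inadequate, subject to the homoscedasticity and normality assumptions stated in the list.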

Important groups of observations

;Outliers
;Influential observations
* Leverage, partial leverage
* DFFITS
* Cook's distance
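The influence measures listed above can be computed in closed form for a straight-line ordinary least squares fit. The sketch below is an illustration under stated assumptions rather than a reference implementation: the function name and the data are invented for the example, and the formulas are the standard ones for a model with p = 2 parameters (intercept and slope).

```python
# Sketch only: leverage, Cook's distance and DFFITS for a one-predictor
# OLS fit. Names and data are illustrative assumptions.
import math


def influence_measures(x, y):
    """Return (leverage, cooks_distance, dffits) lists for y = a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Leverage: diagonal of the hat matrix for simple linear regression.
    lev = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # Residual variance estimate from the full fit.
    s2 = sum(e * e for e in resid) / (n - p)

    cooks = [e * e * h / (p * s2 * (1.0 - h) ** 2)
             for e, h in zip(resid, lev)]

    dffits = []
    for e, h in zip(resid, lev):
        # Variance estimate with the observation left out.
        s2_loo = ((n - p) * s2 - e * e / (1.0 - h)) / (n - p - 1)
        t_ext = e / math.sqrt(s2_loo * (1.0 - h))  # externally studentized
        dffits.append(t_ext * math.sqrt(h / (1.0 - h)))
    return lev, cooks, dffits


# Seven points close to y = 2x + 1 and one distant point that is off the line.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.0, 30.0]

lev, cooks, dffits = influence_measures(x, y)
for i in range(len(x)):
    print(i, round(lev[i], 3), round(cooks[i], 3), round(dffits[i], 3))
```

In this example the last observation combines high leverage with a response far from the trend of the others, so it dominates both Cook's distance and DFFITS. The leverages always sum to the number of fitted parameters, which gives a quick sanity check.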
