In statistics, and in particular in regression analysis, leverage is a measure of how far the independent variable values of an observation are from those of the other observations. ''High-leverage points'', if any, are outliers with respect to the independent variables. That is, high-leverage points have no neighboring points in

\mathbb{R}^{p}

space, where ''p'' is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high-leverage observation. Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted, i.e., to be influential points. Although an influential point will typically have high leverage, a high-leverage point is not necessarily an influential point. Leverages are typically defined as the diagonal elements of the hat matrix.

Definition and interpretations

Consider the linear regression model

y_i = \boldsymbol{x}_i^{\top}\boldsymbol{\beta}+\varepsilon_i, \quad i=1,\, 2,\ldots,\, n

. That is,

\boldsymbol{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}

, where

\mathbf{X}

is the

n\times p

design matrix whose rows correspond to the observations and whose columns correspond to the independent or explanatory variables. The ''leverage score'' for the

i^{th}

independent observation

\boldsymbol{x}_i

is given as:

h_{ii} = \boldsymbol{x}_i^{\top} \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1}\boldsymbol{x}_i

, the

i^{th}

diagonal element of the ortho-projection matrix (''a.k.a.'' hat matrix)

\mathbf{H} = \mathbf{X} \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{X}^{\top}

. Thus the

i^{th}

leverage score can be viewed as the 'weighted' distance from

\boldsymbol{x}_i

to the mean of the

\boldsymbol{x}_i

's (see its relation with Mahalanobis distance). It can also be interpreted as the degree by which the

i^{th}

measured (dependent) value (i.e.,

y_i

) influences the

i^{th}

fitted (predicted) value (i.e.,

\widehat{y}_i

): mathematically,

h_{ii} = \frac{\partial\widehat{y}_i}{\partial y_i}

. Hence, the leverage score is also known as the observation self-sensitivity or self-influence. Using the fact that

\widehat{\boldsymbol{y}} = \mathbf{H}\boldsymbol{y}

(i.e., the prediction

\widehat{\boldsymbol{y}}

is the ortho-projection of

\boldsymbol{y}

onto the range space of

\mathbf{X}

) in the above expression, we get

h_{ii} = \left[ \mathbf{H} \right]_{ii}

. Note that this leverage depends on the values of the explanatory variables

(\mathbf{X})

of all observations but not on any of the values of the dependent variables

(y_i)

.
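The definition above can be illustrated numerically. The following is a minimal sketch, assuming NumPy and a small made-up data set with an intercept column; the helper name `leverage_scores` and all data values are hypothetical. It computes the leverages as the diagonal of the hat matrix and checks the self-sensitivity interpretation by perturbing one response value.

```python
import numpy as np

def leverage_scores(X):
    """Diagonal of H = X (X^T X)^{-1} X^T, assuming X has full column rank."""
    # Solve (X^T X) Z = X^T instead of forming the inverse explicitly.
    Z = np.linalg.solve(X.T @ X, X.T)          # shape (p, n)
    # h_ii = sum_k X[i, k] * Z[k, i]
    return np.einsum("ik,ki->i", X, Z)

# Hypothetical data: an intercept plus one predictor with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])
X = np.column_stack([np.ones_like(x), x])
h = leverage_scores(X)

# Full hat matrix, used here only to check the shortcut above.
H = X @ np.linalg.solve(X.T @ X, X.T)

# Self-sensitivity: nudging y_i by delta moves the i-th fitted value by h_ii * delta.
y = np.array([1.0, 1.5, 3.2, 3.9, 10.0])      # made-up responses
delta = 1e-3
y_perturbed = y.copy()
y_perturbed[4] += delta
sensitivity = ((H @ y_perturbed)[4] - (H @ y)[4]) / delta

print(h)              # the last observation has by far the largest leverage
print(sensitivity)    # agrees with h[4] up to rounding
```

Note that `y` enters only the sensitivity check; the leverages themselves are computed from `X` alone.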

Properties

# The leverage

h_{ii}

is a number between 0 and 1,

0 \leq h_{ii} \leq 1 .

Proof: Note that

\mathbf{H}

is an idempotent matrix (

\mathbf{H}^2=\mathbf{H}

) and symmetric (

h_{ij}=h_{ji}

). Thus, by using the fact that

\left[ \mathbf{H}^2 \right]_{ii} = \left[ \mathbf{H} \right]_{ii}

, we have

h_{ii}=h_{ii}^2+\sum_{j\neq i}h_{ij}^2

. Since we know that

\sum_{j\neq i}h_{ij}^2 \geq 0

, we have

h_{ii} \geq h_{ii}^2 \implies 0 \leq h_{ii} \leq 1

.

# The sum of the leverages is equal to the number of parameters

(p)

in

\boldsymbol{\beta}

(including the intercept). Proof:

\sum_{i=1}^n h_{ii} =\operatorname{tr}(\mathbf{H})
=\operatorname{tr}\left(\mathbf{X} \left( \mathbf{X}^{\top} \mathbf{X} \right)^{-1} \mathbf{X}^{\top}\right)
=\operatorname{tr}\left(\mathbf{X}^{\top} \mathbf{X} \left(\mathbf{X}^{\top} \mathbf{X} \right)^{-1} \right)
=\operatorname{tr}(\mathbf{I}_p)=p
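Both properties can be checked numerically. The sketch below assumes NumPy and a randomly generated full-rank design matrix; the sizes, seed and variable names are arbitrary illustrative choices. It verifies idempotence and symmetry of the hat matrix, the row identity used in the first proof, and the trace identity used in the second.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
n, p = 30, 4                                   # hypothetical sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
h = np.diag(H)                                 # leverages

is_idempotent = np.allclose(H @ H, H)          # H^2 = H
is_symmetric = np.allclose(H, H.T)             # h_ij = h_ji

# Property 1: h_ii = h_ii^2 + sum_{j != i} h_ij^2, i.e. the sum of squares of row i.
row_sum_of_squares = (H ** 2).sum(axis=1)

# Property 2: the leverages sum to the number of parameters p.
trace_H = h.sum()

print(is_idempotent, is_symmetric)
print(h.min(), h.max())                        # both lie in [0, 1]
print(trace_H)                                 # equals p up to rounding
```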

Determination of outliers in X using leverages

A large leverage

h_{ii}

corresponds to an

\boldsymbol{x}_i

that is extreme. A common rule is to flag any

\boldsymbol{x}_i

whose leverage value

h_{ii}

is more than twice the mean leverage

\bar{h}=\dfrac{1}{n}\sum_{i=1}^{n}h_{ii}=\dfrac{p}{n}

(see property 2 above). That is, if

h_{ii}>2\dfrac{p}{n}

, then

\boldsymbol{x}_i

is considered an outlier. Some statisticians prefer the threshold

3p/n

instead of

2p/n

.
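The rule of thumb above can be sketched as follows, assuming NumPy and a made-up predictor with one extreme value. The function name `high_leverage_mask` and its `multiple` parameter are hypothetical; `multiple=2.0` gives the 2p/n rule and `multiple=3.0` the stricter 3p/n variant.

```python
import numpy as np

def high_leverage_mask(X, multiple=2.0):
    """Return (leverages, boolean mask) where the mask marks h_ii > multiple * p / n."""
    n, p = X.shape
    h = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
    return h, h > multiple * p / n

# Hypothetical predictor: nine evenly spaced values and one far-away value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
X = np.column_stack([np.ones_like(x), x])      # intercept + predictor, so p = 2

h, flagged = high_leverage_mask(X)                     # threshold 2p/n = 0.4
_, flagged_strict = high_leverage_mask(X, multiple=3.0)  # threshold 3p/n = 0.6

print(np.round(h, 3))
print(np.flatnonzero(flagged))                 # index of the flagged observation
```

In this example only the last observation exceeds either threshold. Such cutoffs are conventions rather than formal tests, so flagged points are candidates for inspection, not automatic deletions.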

Relation to Mahalanobis distance

Leverage is closely related to the Mahalanobis distance. Specifically, for some

n\times p

matrix

\mathbf{X}

, the squared Mahalanobis distance of

\boldsymbol{x}_i

(where

\boldsymbol{x}_i^{\top}

is the

i^{th}

row of

\mathbf{X}

) from the mean vector

\widehat{\boldsymbol{\mu}}=\frac{1}{n}\sum_{k=1}^n \boldsymbol{x}_k

of length

p

is

D^2(\boldsymbol{x}_i) = (\boldsymbol{x}_i - \widehat{\boldsymbol{\mu}})^{\top} \mathbf{S}^{-1} (\boldsymbol{x}_i-\widehat{\boldsymbol{\mu}})

, where

\mathbf{S} = \frac{1}{n-1}\sum_{k=1}^n (\boldsymbol{x}_k - \widehat{\boldsymbol{\mu}})(\boldsymbol{x}_k - \widehat{\boldsymbol{\mu}})^{\top}

is the estimated covariance matrix of the

\boldsymbol{x}_i

's. This is related to the leverage

h_{ii}

of the hat matrix of

\mathbf{X}

after appending a column vector of 1's to it. The relationship between the two is:

D^2(\boldsymbol{x}_i) = (n - 1)(h_{ii} - \tfrac{1}{n})

This relationship enables us to decompose leverage into meaningful components so that some sources of high leverage can be investigated analytically.
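The relationship can be checked numerically. The sketch below assumes NumPy, randomly generated predictors, and that the covariance estimate is the usual sample covariance with divisor n − 1 (the default of `np.cov`); under that assumption the identity holds exactly. Variable names and sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed
n, p = 25, 3                                   # hypothetical sizes
X = rng.normal(size=(n, p))                    # predictors only, no intercept column

# Squared Mahalanobis distance of each row from the sample mean.
mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)                    # sample covariance, divisor n - 1
diff = X - mu
D2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(S), diff)

# Leverages of the same data after appending a column of ones.
X1 = np.column_stack([np.ones(n), X])
h = np.einsum("ij,ji->i", X1, np.linalg.solve(X1.T @ X1, X1.T))

max_gap = np.max(np.abs(D2 - (n - 1) * (h - 1.0 / n)))
print(max_gap)                                 # zero up to rounding error
```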

Relation to influence functions

In a regression context, we combine leverage and influence functions to compute the degree to which estimated coefficients would change if we removed a single data point. Denoting the regression residuals as

\widehat{e}_i = y_i- \boldsymbol{x}_i^{\top}\widehat{\boldsymbol{\beta}}

, one can compare the estimated coefficient

\widehat{\boldsymbol{\beta}}

to the leave-one-out estimated coefficient

\widehat{\boldsymbol{\beta}}^{(-i)}

using the formula

\widehat{\boldsymbol{\beta}} - \widehat{\boldsymbol{\beta}}^{(-i)} = \frac{(\mathbf{X}^{\top}\mathbf{X})^{-1}\boldsymbol{x}_i\widehat{e}_i}{1-h_{ii}}

Young (2019) uses a version of this formula after residualizing controls. To gain intuition for this formula, note that

\frac{\partial\widehat{\boldsymbol{\beta}}}{\partial y_i} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\boldsymbol{x}_i

captures the potential for an observation to affect the regression parameters, and therefore

(\mathbf{X}^{\top}\mathbf{X})^{-1}\boldsymbol{x}_i\widehat{e}_i

captures the actual influence of that observation's deviation from its fitted value on the regression parameters. The formula then divides by

(1-h_{ii})

to account for the fact that we remove the observation rather than adjusting its value, reflecting the fact that removal changes the distribution of covariates more when applied to high-leverage observations (i.e. those with outlying covariate values). Similar formulas arise when applying general formulas for statistical influence functions in the regression context.
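The leave-one-out formula can be compared against an explicit refit. The following is a minimal sketch assuming NumPy and simulated data; the coefficients, noise level, seed and the choice of which observation to drop are all arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)                 # arbitrary seed
n, p = 20, 3                                   # hypothetical sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                       # full-sample OLS estimate
resid = y - X @ beta                           # residuals e_i
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)    # leverages h_ii

i = 0                                          # observation to drop (arbitrary)

# Change predicted by the formula: (X^T X)^{-1} x_i e_i / (1 - h_ii).
predicted_change = (XtX_inv @ X[i]) * resid[i] / (1.0 - h[i])

# Change obtained by actually refitting without observation i.
keep = np.arange(n) != i
beta_loo, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
actual_change = beta - beta_loo

print(predicted_change)
print(actual_change)                           # matches up to rounding
```

The formula gives every leave-one-out estimate from a single fit, which avoids n separate refits.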

Effect on residual variance

If we are in an ordinary least squares setting with fixed

\mathbf{X}

and homoscedastic regression errors

\varepsilon_i,

\boldsymbol{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon};  \ \ \operatorname{Var}(\boldsymbol{\varepsilon})=\sigma^2\mathbf{I}

, then the

i^{th}

regression residual,

e_i=y_i-\widehat{y}_i

, has variance

\operatorname{Var}(e_i)=(1-h_{ii})\sigma^2

. In other words, an observation's leverage score determines the degree of noise in the model's misprediction of that observation, with higher leverage leading to less noise. This follows from the fact that

\mathbf{I}-\mathbf{H}

is idempotent and symmetric and

\widehat{\boldsymbol{y}}=\mathbf{H}\boldsymbol{y}

, hence,

\operatorname{Var}(\boldsymbol{e})=\operatorname{Var}((\mathbf{I}-\mathbf{H})\boldsymbol{y})
=(\mathbf{I}-\mathbf{H})\operatorname{Var}(\boldsymbol{y})(\mathbf{I}-\mathbf{H})^\top
= \sigma^2 (\mathbf{I}-\mathbf{H})^2=\sigma^2(\mathbf{I}-\mathbf{H})

. The corresponding studentized residual (the residual adjusted for its observation-specific estimated residual variance) is then

t_i = \frac{e_i}{\widehat{\sigma}\sqrt{1-h_{ii}}}

where

\widehat{\sigma}

is an appropriate estimate of

\sigma

.
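As an illustration, the sketch below computes studentized residuals with NumPy on simulated data. It assumes one particular choice for the estimate of σ, namely the square root of the residual sum of squares divided by n − p (the "internally studentized" form); other estimates, such as a leave-one-out estimate, are also in use. All data values and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)                 # arbitrary seed
n, p = 15, 2                                   # hypothetical sizes
x = rng.uniform(0.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
h = np.diag(H)                                 # leverages
e = y - H @ y                                  # residuals e = (I - H) y

# I - H is idempotent and symmetric, so Var(e) = sigma^2 (I - H)
# and Var(e_i) = sigma^2 (1 - h_ii).
M = np.eye(n) - H

# Assumed estimate of sigma: residual sum of squares over n - p degrees of freedom.
sigma_hat = np.sqrt(e @ e / (n - p))

# Studentized residuals: each residual divided by its own estimated standard deviation.
t = e / (sigma_hat * np.sqrt(1.0 - h))

print(np.round(t, 2))
```

Because high-leverage observations have smaller residual variance, dividing by the square root of 1 − h_ii inflates their raw residuals, which puts all observations on a comparable scale.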

Partial leverage

Partial leverage (PL) is a measure of the contribution of the individual independent variables to the total leverage of each observation. That is, PL is a measure of how

h_{ii}

changes as a variable is added to the regression model. It is computed as:

\left(\mathrm{PL}_j\right)_i = \frac{\left(\mathbf{X}_{j\bullet[j]}\right)_i^2}{\sum_{k=1}^n\left(\mathbf{X}_{j\bullet[j]}\right)_k^2}

where

j

is the index of the independent variable,

i

is the index of the observation and

\mathbf{X}_{j\bullet[j]}

are the residuals from regressing

\mathbf{X}_j

against the remaining independent variables. Note that the partial leverage is the leverage of the

i^{th}

point in the partial regression plot for the

j^{th}

variable. Data points with large partial leverage for an independent variable can exert undue influence on the selection of that variable in automatic regression model building procedures.
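The description above can be sketched in NumPy as follows. The sketch assumes that partial leverage is the squared residual of one predictor regressed on the remaining columns, normalized by the sum of those squared residuals, and checks that it equals the increase in leverage when that predictor is added to the model. The data, the seed, the chosen column index and the helper name `hat_diag` are hypothetical.

```python
import numpy as np

def hat_diag(X):
    """Leverages (diagonal of the hat matrix) for a full-column-rank X."""
    return np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))

rng = np.random.default_rng(4)                 # arbitrary seed
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])

j = 2                                          # column whose partial leverage we want
others = np.delete(X, j, axis=1)               # remaining columns, including the intercept

# Residuals from regressing column j on the remaining columns.
coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
r = X[:, j] - others @ coef

partial_leverage = r ** 2 / np.sum(r ** 2)

# Increase in h_ii when column j is added to a model containing the others.
gain = hat_diag(X) - hat_diag(others)

print(np.max(np.abs(partial_leverage - gain)))   # zero up to rounding
```

The partial leverages for a given variable sum to one across observations, since they are leverages in a one-variable regression through the origin on the residualized predictor.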

Software implementations

Many statistical software environments, including R and libraries for Python, provide functions for computing leverage.
