In statistics, a linear probability model (LPM) is a special case of a binary regression model. Here the dependent variable for each observation takes values which are either 0 or 1. The probability of observing a 0 or 1 in any one case is treated as depending on one or more explanatory variables. For the "linear probability model", this relationship is a particularly simple one, and allows the model to be fitted by linear regression.

The model assumes that, for a binary outcome (Bernoulli trial) Y and its associated vector of explanatory variables X,

\Pr(Y=1 \mid X=x) = x'\beta .

For this model,

E[Y \mid X=x] = 0\cdot \Pr(Y=0 \mid X=x) + 1\cdot \Pr(Y=1 \mid X=x) = \Pr(Y=1 \mid X=x) = x'\beta,

and hence the vector of parameters β can be estimated using least squares. This method of fitting is inefficient because the conditional variance of Y differs across observations. It can be improved by adopting an iterative scheme based on weighted least squares, in which the model from the previous iteration is used to supply estimates of the conditional variances, \operatorname{Var}(Y \mid X=x), which vary between observations. This approach can be related to fitting the model by maximum likelihood.

A drawback of this model is that, unless restrictions are placed on β, the estimated coefficients can imply probabilities outside the unit interval [0,1]. For this reason, models such as the logit model or the probit model are more commonly used.
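The fitting steps described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are simulated, the helper names `fit_lpm_ols` and `fit_lpm_wls` are hypothetical, and the weights use the Bernoulli variance x'β(1 − x'β) with fitted probabilities clipped away from 0 and 1, a practical choice that the text above does not prescribe.

```python
import numpy as np

def fit_lpm_ols(X, y):
    """Ordinary least squares fit; X is assumed to already contain a constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def fit_lpm_wls(X, y, n_iter=5, eps=1e-3):
    """Iterated weighted least squares, starting from the OLS fit.

    Each pass estimates Var(Y | X=x) = p(1 - p) from the previous fit.
    Clipping p to [eps, 1 - eps] is an assumption made here so that the
    estimated variances stay strictly positive.
    """
    beta = fit_lpm_ols(X, y)
    for _ in range(n_iter):
        p = np.clip(X @ beta, eps, 1.0 - eps)
        sqrt_w = np.sqrt(1.0 / (p * (1.0 - p)))
        # Rescaling rows by sqrt(weight) turns WLS into an OLS problem.
        beta, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return beta

# Simulated data (hypothetical): Pr(Y=1 | x) = 0.5 + 0.4 x with x in [-1, 1],
# so the true probabilities lie in [0.1, 0.9].
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([0.5, 0.4])
y = (rng.uniform(size=n) < X @ true_beta).astype(float)

beta_ols = fit_lpm_ols(X, y)
beta_wls = fit_lpm_wls(X, y)

# Extrapolating to x = 2 illustrates the drawback noted above:
# the fitted "probability" is not constrained to the unit interval.
p_new = np.array([[1.0, 2.0]]) @ beta_ols
print(beta_ols, beta_wls, p_new)
```

In this simulated setting both estimators should land near the true coefficients, while the prediction at x = 2 exceeds 1, which is the motivation for the logit and probit alternatives.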

Latent-variable formulation

More formally, the LPM can arise from a latent-variable formulation (usually found in the econometrics literature), as follows. Assume the following regression model with a latent (unobservable) dependent variable:

y^* = b_0+ \mathbf x'\mathbf b + \varepsilon,\;\; \varepsilon\mid \mathbf x\sim U(-a,a).

The critical assumption here is that the error term of this regression is a uniform random variable that is symmetric around zero, and hence has mean zero. The cumulative distribution function of \varepsilon here is

F_{\varepsilon\mid \mathbf x}(\varepsilon\mid \mathbf x) = \frac{\varepsilon + a}{2a}, \quad -a \le \varepsilon \le a .

Define the indicator variable y = 1 if y^* > 0, and zero otherwise, and consider the conditional probability

P(y =1\mid \mathbf x ) = P(y^* > 0\mid \mathbf x) = P(b_0+ \mathbf x'\mathbf b + \varepsilon>0\mid \mathbf x)

= P(\varepsilon >- b_0- \mathbf x'\mathbf b\mid \mathbf x) = 1- P(\varepsilon \leq - b_0- \mathbf x'\mathbf b\mid \mathbf x)

= 1- F_{\varepsilon\mid \mathbf x}(- b_0- \mathbf x'\mathbf b\mid \mathbf x) = 1- \frac{- b_0- \mathbf x'\mathbf b + a}{2a} = \frac{b_0+a}{2a}+\frac{\mathbf x'\mathbf b}{2a} .

But this is the linear probability model,

P(y =1\mid \mathbf x )= \beta_0 + \mathbf x'\beta

with the mapping

\beta_0 = \frac{b_0+a}{2a},\;\; \beta=\frac{\mathbf b}{2a}.
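The mapping can be checked numerically with a small simulation. The sketch below assumes a single regressor and hypothetical values for a, b_0 and b, chosen so that |b_0 + b x| ≤ a on the grid of x values (outside that range the uniform distribution function is no longer linear). The function names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameter values; |b0 + b * x| <= a holds for every x used below.
a, b0, b = 2.0, 0.2, 0.5
n = 200_000

def simulate_share(x):
    """Share of draws with y = 1, where y = 1 if y* = b0 + b*x + eps > 0."""
    eps = rng.uniform(-a, a, size=n)
    y_star = b0 + b * x + eps
    return np.mean(y_star > 0)

def lpm_probability(x):
    """Probability implied by beta_0 = (b0 + a) / (2a) and beta = b / (2a)."""
    beta0 = (b0 + a) / (2 * a)
    beta = b / (2 * a)
    return beta0 + beta * x

xs = np.array([-2.0, 0.0, 1.0, 3.0])
simulated = np.array([simulate_share(x) for x in xs])
implied = lpm_probability(xs)
print(np.column_stack([xs, simulated, implied]))
```

With these values the implied probabilities are 0.3, 0.55, 0.675 and 0.925, and the simulated shares should agree with them up to sampling noise.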

This method is a general device to obtain a conditional probability model of a binary variable: if we assume that the distribution of the error term is logistic, we obtain the logit model, while if we assume that it is the normal, we obtain the probit model and, if we assume that it is the logarithm of a Weibull distribution, the complementary log-log model.
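As a rough illustration of this device, the sketch below evaluates P(y = 1 | x) = 1 − F(−index) for three error distributions, where index stands for b_0 + x'b. For error distributions symmetric around zero this equals F(index). The function names and the default a = 2 are assumptions made for the example, and the complementary log-log case is left out to avoid committing to a particular sign convention.

```python
import math

def p_uniform(index, a=2.0):
    """Uniform(-a, a) errors: linear in the index, truncated to [0, 1]."""
    return min(1.0, max(0.0, (index + a) / (2.0 * a)))

def p_logistic(index):
    """Standard logistic errors: the logit model."""
    return 1.0 / (1.0 + math.exp(-index))

def p_normal(index):
    """Standard normal errors: the probit model."""
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))

for index in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(index, p_uniform(index), p_logistic(index), p_normal(index))
```

Only the uniform case is linear in the index, and only while the index stays within [−a, a]. The logistic and normal cases stay strictly inside the unit interval for every finite index.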

