In regression analysis, a dummy variable (also known as an indicator variable or just a dummy) is one that takes a binary value (0 or 1) to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. For example, if we were studying the relationship between biological sex and income, we could use a dummy variable to represent the sex of each individual in the study. The variable could take on a value of 1 for males and 0 for females (or vice versa). In machine learning, this is known as one-hot encoding.
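As a minimal sketch of the sex-and-income example, the following fits an ordinary least squares regression of income on an intercept and a single 0/1 dummy. The income figures and variable names are invented for illustration. With only an intercept and one dummy, the intercept equals the mean of the group coded 0 and the dummy coefficient equals the difference between the two group means.

```python
import numpy as np

# Hypothetical incomes (in thousands) and a 0/1 dummy: 1 = male, 0 = female.
income = np.array([30.0, 42.0, 36.0, 50.0, 44.0, 38.0])
male = np.array([0, 1, 0, 1, 1, 0])

# Design matrix: a column of ones (the constant term) and the dummy.
X = np.column_stack([np.ones_like(income), male])

# Ordinary least squares fit of income on the constant and the dummy.
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
intercept, dummy_coef = beta

# Group means, for comparison with the fitted coefficients.
mean_female = income[male == 0].mean()
mean_male = income[male == 1].mean()

print(intercept, mean_female)             # intercept matches the group coded 0
print(dummy_coef, mean_male - mean_female)  # dummy coefficient matches the gap
```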
Dummy variables are also commonly used in regression analysis to represent categorical variables that have more than two levels, such as education level or occupation. In this case, multiple dummy variables are created, one for each level of the variable, and only one of them takes on a value of 1 for each observation. Dummy variables are useful because they allow us to include categorical variables in our analysis, which would otherwise be difficult to include due to their non-numeric nature. They can also help us to control for confounding factors and improve the validity of our results.
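The multi-level case can be sketched as follows, using a made-up education variable with three levels. The data and names are illustrative assumptions. Each level gets its own 0/1 column, and every row contains exactly one 1.

```python
import numpy as np

# Hypothetical categorical variable with three levels.
education = np.array(["secondary", "primary", "tertiary", "secondary", "primary"])

# Map each observation to an integer code for its level (levels come back sorted).
levels, codes = np.unique(education, return_inverse=True)

# One column per level: row i has a 1 in the column of observation i's level.
dummies = np.eye(len(levels), dtype=int)[codes]

print(levels)
print(dummies)
```

When these columns enter a regression that also has a constant term, one of them is normally left out to serve as the base category.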
As with any addition of variables to a model, the addition of dummy variables will not decrease, and will typically increase, the within-sample model fit (coefficient of determination), but at a cost of fewer degrees of freedom and loss of generality of the model (out-of-sample model fit). Too many dummy variables result in a model that does not provide any general conclusions.
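A small simulation can illustrate this trade-off. The sketch below uses synthetic data and a hypothetical helper `r_squared`. It adds a dummy that is unrelated to the outcome by construction and compares the coefficient of determination and the residual degrees of freedom before and after.

```python
import numpy as np

def r_squared(X, y):
    """Within-sample R^2 of an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    total = np.sum((y - y.mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / total

rng = np.random.default_rng(0)
n = 40
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# A dummy drawn independently of y, so it carries no real information.
noise_dummy = rng.integers(0, 2, size=n)

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, noise_dummy])

r2_small = r_squared(X_small, y)
r2_big = r_squared(X_big, y)

# Residual degrees of freedom: observations minus estimated coefficients.
df_small = n - X_small.shape[1]
df_big = n - X_big.shape[1]

print(r2_small, r2_big)   # the larger model never fits worse within sample
print(df_small, df_big)   # but it leaves one fewer degree of freedom
```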
Dummy variables are useful in various cases. For example, in econometric time series analysis, dummy variables may be used to indicate the occurrence of wars or major strikes. A dummy variable could thus be thought of as a Boolean, i.e., a truth value represented as the numerical value 0 or 1 (as is sometimes done in computer programming).
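As an illustration of the Boolean reading, an event dummy for an annual series can be built by evaluating a true/false condition for each period and converting the result to 0 or 1. The year range and the war period below are assumptions chosen only for the example.

```python
import numpy as np

# Hypothetical annual time index.
years = np.arange(1936, 1951)

# Boolean condition per year: True while the (assumed) war period lasts.
is_war_year = (years >= 1939) & (years <= 1945)

# The dummy is the same truth value stored as 0 or 1.
war = is_war_year.astype(int)

print(dict(zip(years.tolist(), war.tolist())))
```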
Dummy variables may be extended to more complex cases. For example, seasonal effects may be captured by creating dummy variables for each of the seasons: D1=1 if the observation is for summer, and equals zero otherwise; D2=1 if and only if autumn, otherwise equals zero; D3=1 if and only if winter, otherwise equals zero; and D4=1 if and only if spring, otherwise equals zero. In the panel data fixed effects estimator, dummies are created for each of the units in cross-sectional data (e.g. firms or countries) or periods in a pooled time series. However, in such regressions either the constant term has to be removed or one of the dummies has to be removed, making this the base category against which the others are assessed, for the following reason:
If dummy variables for all categories were included, their sum would equal 1 for all observations, which is identical to and hence perfectly correlated with the vector-of-ones variable whose coefficient is the constant term; if the vector-of-ones variable were also present, this would result in perfect multicollinearity, so that the matrix inversion in the estimation algorithm would be impossible. This is referred to as the dummy variable trap.
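The dummy variable trap can be checked numerically. The sketch below builds the four seasonal dummies for a made-up series of twelve observations, then compares the rank of the design matrix with and without a base category. The data layout and the choice of summer as the base category are assumptions for illustration.

```python
import numpy as np

# Hypothetical series: three years of observations, one per season.
names = ["summer", "autumn", "winter", "spring"]
seasons = np.array(names * 3)

# D1..D4: one 0/1 column per season, in the order given in `names`.
D = np.column_stack([(seasons == s).astype(float) for s in names])
ones = np.ones(len(seasons))

# Constant term plus all four dummies: the dummies sum to the ones column.
X_trap = np.column_stack([ones, D])

# Constant term plus three dummies, with summer dropped as the base category.
X_ok = np.column_stack([ones, D[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print(X_trap.shape[1], rank_trap)  # 5 columns but rank 4: X'X is singular
print(X_ok.shape[1], rank_ok)      # 4 columns and rank 4: X'X is invertible
```

Dropping the column of ones instead and keeping all four dummies would also restore full column rank; in that case each coefficient is the level for its own season rather than a difference from a base category.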