In statistics, model specification is part of the process of building a statistical model: specification consists of selecting an appropriate functional form for the model and choosing which variables to include. For example, given personal income together with years of schooling and on-the-job experience
, we might specify a functional relationship as follows:
:<math>\ln y = \ln y_0 + \rho s + \beta_1 x + \beta_2 x^2 + \varepsilon</math>
where ''y'' is personal income, ''s'' is years of schooling, ''x'' is on-the-job experience, and ''ε'' is the unexplained error term, which is assumed to comprise independent and identically distributed Gaussian variables.
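As an illustration, a specification of this kind can be estimated by ordinary least squares. The following is a minimal sketch, assuming a log-linear form with a quadratic experience term (ln ''y'' = ln ''y''<sub>0</sub> + ''ρs'' + ''β''<sub>1</sub>''x'' + ''β''<sub>2</sub>''x''<sup>2</sup> + ''ε''). The simulated data, the coefficient values, and the function name <code>fit_log_income</code> are hypothetical and chosen only for demonstration.

```python
import numpy as np

def fit_log_income(y, s, x):
    """OLS fit of ln y = ln y0 + rho*s + b1*x + b2*x**2 + eps (assumed form)."""
    # Design matrix: intercept, schooling, experience, experience squared.
    X = np.column_stack([np.ones(len(s)), s, x, x**2])
    coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    return coef  # order: [ln y0, rho, b1, b2]

# Simulated data; every number below is an illustrative assumption.
rng = np.random.default_rng(0)
n = 5000
s = rng.integers(8, 21, size=n).astype(float)   # years of schooling
x = rng.uniform(0.0, 40.0, size=n)              # years of experience
true_coef = np.array([9.0, 0.08, 0.04, -0.0007])
eps = rng.normal(0.0, 0.3, size=n)              # i.i.d. Gaussian errors
y = np.exp(true_coef[0] + true_coef[1] * s
           + true_coef[2] * x + true_coef[3] * x**2 + eps)

coef = fit_log_income(y, s, x)
for name, est, tru in zip(["ln_y0", "rho", "beta1", "beta2"], coef, true_coef):
    print(f"{name}: estimate={est:.4f}, simulated truth={tru:.4f}")
```

Because the simulated data are generated from the same functional form that is fitted, the estimates should lie close to the simulated coefficients; this is the correctly specified case.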
The statistician Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".
Specification error and bias
Specification error occurs when the functional form or the choice of independent variables poorly represents relevant aspects of the true data-generating process. In particular, bias (the expected value of the difference between an estimated parameter and the true underlying value) occurs if an independent variable is correlated with the errors inherent in the underlying process. There are several different possible causes of specification error; some are listed below.
*An inappropriate functional form could be employed.
*A variable omitted from the model may have a relationship with both the dependent variable and one or more of the independent variables (causing omitted-variable bias).
*An irrelevant variable may be included in the model (although this does not create bias, it involves overfitting and so can lead to poor predictive performance).
*The dependent variable may be part of a system of simultaneous equations (giving simultaneity bias).
Additionally, measurement errors may affect the independent variables: while this is not a specification error, it can create statistical bias.
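The omitted-variable case listed above can be illustrated with a small simulation. The sketch below assumes a linear data-generating process with one included regressor and one omitted regressor that is correlated with it; all coefficient values and variable names are hypothetical. Under these assumptions the slope of the short regression converges to the true slope plus the omitted coefficient times cov(''x'', ''z'')/var(''x'').

```python
import numpy as np

def ols(X, y):
    """Return least squares coefficients for design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 20000
z = rng.normal(size=n)                 # variable that will be omitted
x = 0.8 * z + rng.normal(size=n)       # included regressor, correlated with z
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)

ones = np.ones(n)
full = ols(np.column_stack([ones, x, z]), y)   # correctly specified
short = ols(np.column_stack([ones, x]), y)     # z omitted

# Large-sample bias of the short-regression slope under these assumptions:
# 3.0 * cov(x, z) / var(x) = 3.0 * 0.8 / (0.8**2 + 1.0)
expected_bias = 3.0 * 0.8 / (0.8**2 + 1.0)

print(f"slope with z included: {full[1]:.3f}")
print(f"slope with z omitted:  {short[1]:.3f}")
print(f"theoretical bias:      {expected_bias:.3f}")
```

The full regression recovers a slope near the simulated value of 2, while the short regression attributes part of the effect of ''z'' to ''x''.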
Note that all models will have some specification error. Indeed, in statistics there is a common aphorism that "all models are wrong". In the words of Burnham & Anderson, "Modeling is an art as well as a science and is directed toward finding a good approximating model ... as the basis for statistical inference".
Detection of misspecification
The Ramsey RESET test can help test for specification error in regression analysis.
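The idea behind RESET is to check whether powers of the fitted values help explain the response once the original regressors are included. The following is a minimal sketch of the F-statistic form of the test, written directly with NumPy rather than taken from any particular library; the function names and the simulated data are hypothetical. The statistic would be compared with an ''F'' distribution with ''q'' and ''n'' − ''k'' − ''q'' degrees of freedom, where ''q'' is the number of added powers and ''k'' the number of original columns.

```python
import numpy as np

def fit_rss(X, y):
    """Return (residual sum of squares, fitted values) of an OLS fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    resid = y - fitted
    return float(resid @ resid), fitted

def reset_f_statistic(X, y, max_power=3):
    """F statistic for adding fitted**2 .. fitted**max_power to the model."""
    rss_restricted, fitted = fit_rss(X, y)
    powers = np.column_stack([fitted**p for p in range(2, max_power + 1)])
    X_aug = np.column_stack([X, powers])
    rss_unrestricted, _ = fit_rss(X_aug, y)
    q = powers.shape[1]                  # number of added terms
    df = len(y) - X_aug.shape[1]         # residual degrees of freedom
    return ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df)

# Simulated example: the same linear model is fitted to two data sets.
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])
y_linear = 1.0 + 2.0 * x + rng.normal(size=n)                  # linear truth
y_curved = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(size=n)     # quadratic truth

f_linear = reset_f_statistic(X, y_linear)
f_curved = reset_f_statistic(X, y_curved)
print(f"F (linear truth):    {f_linear:.2f}")
print(f"F (quadratic truth): {f_curved:.2f}")
```

A large statistic for the second data set signals that the linear functional form is inadequate there.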
In the example given above relating personal income to schooling and job experience, if the assumptions of the model are correct, then the least squares estimates of the parameters will be efficient and unbiased. Hence specification diagnostics usually involve testing the first to fourth moments of the residuals.
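The first four sample moments of the residuals can be computed directly. The sketch below assumes a simple linear model fitted to simulated data with Gaussian errors; the helper name <code>residual_moments</code> and the data are hypothetical. For Gaussian errors the skewness should be near 0 and the (non-excess) kurtosis near 3, so marked departures hint at misspecification.

```python
import numpy as np

def residual_moments(resid):
    """Sample mean, variance, skewness and kurtosis of a residual vector."""
    resid = np.asarray(resid, dtype=float)
    mean = resid.mean()
    centered = resid - mean
    variance = np.mean(centered**2)
    sd = np.sqrt(variance)
    return {
        "mean": float(mean),
        "variance": float(variance),
        "skewness": float(np.mean(centered**3) / sd**3),
        "kurtosis": float(np.mean(centered**4) / sd**4),  # 3 for a Gaussian
    }

# Simulated, correctly specified linear model (illustrative assumption).
rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
moments = residual_moments(y - X @ coef)
for name, value in moments.items():
    print(f"{name}: {value:.4f}")
```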
Model building
Building a model involves finding a set of relationships to represent the process that is generating the data. This requires avoiding all the sources of misspecification mentioned above.
One approach is to start with a model in general form that relies on a theoretical understanding of the data-generating process. Then the model can be fit to the data and checked for the various sources of misspecification, in a task called ''statistical model validation''. Theoretical understanding can then guide the modification of the model in such a way as to retain theoretical validity while removing the sources of misspecification. But if it proves impossible to find a theoretically acceptable specification that fits the data, the theoretical model may have to be rejected and replaced with another one.
A quotation from Karl Popper is apposite here: "Whenever a theory appears to you as the only possible one, take this as a sign that you have neither understood the theory nor the problem which it was intended to solve".
Another approach to model building is to specify several different models as candidates, and then compare those candidate models to each other. The purpose of the comparison is to determine which candidate model is most appropriate for statistical inference. Common criteria for comparing models include the following:
''R''<sup>2</sup>, the Bayes factor, and the likelihood-ratio test together with its generalization, relative likelihood. For more on this topic, see ''statistical model selection''.
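Two of these criteria can be sketched for a pair of nested linear candidate models. The example below assumes Gaussian errors, under which the likelihood-ratio statistic for nested least squares fits reduces to ''n'' ln(RSS<sub>small</sub>/RSS<sub>large</sub>); the candidate models, data, and function name are hypothetical. With one added parameter the statistic would be compared with a chi-squared distribution with one degree of freedom (5% critical value about 3.84).

```python
import numpy as np

def rss_and_r2(X, y):
    """Return (residual sum of squares, R^2) for an OLS fit with intercept."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    rss = float(resid @ resid)
    tss = float(np.sum((y - y.mean())**2))
    return rss, 1.0 - rss / tss

# Simulated data with a mild quadratic effect (illustrative assumption).
rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(0.0, 40.0, size=n)
y = 0.04 * x - 0.0007 * x**2 + rng.normal(0.0, 0.3, size=n)

ones = np.ones(n)
X_small = np.column_stack([ones, x])           # candidate A: linear in x
X_large = np.column_stack([ones, x, x**2])     # candidate B: adds x squared

rss_small, r2_small = rss_and_r2(X_small, y)
rss_large, r2_large = rss_and_r2(X_large, y)

# Likelihood-ratio statistic for nested Gaussian linear models.
lr_stat = n * np.log(rss_small / rss_large)

print(f"R^2, candidate A: {r2_small:.4f}")
print(f"R^2, candidate B: {r2_large:.4f}")
print(f"LR statistic:     {lr_stat:.2f}")
```

Note that ''R''<sup>2</sup> never decreases when a regressor is added to a nested model, so on its own it favours the larger candidate; the likelihood-ratio statistic accounts for the number of added parameters through its reference distribution.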
See also
* Abductive reasoning
* Conceptual model
* Data analysis
* Data transformation (statistics)
* Design of experiments
* Durbin–Wu–Hausman test
* Exploratory data analysis
* Feature selection
* Heteroscedasticity, second-order statistical misspecification
* Information matrix test
* Model identification
* Principle of Parsimony
* Spurious relationship
* Statistical conclusion validity
* Statistical inference
* Statistical learning theory
Further reading
* Sapra, Sunil (2005). "A regression error specification test (RESET) for generalized linear models". ''Economics Bulletin''. 3 (1): 1–6. http://economicsbulletin.vanderbilt.edu/2005/volume3/EB-04C50033A.pdf