A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process. When referring specifically to probabilities, the corresponding term is probabilistic model. All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference. A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).


Introduction

Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions) with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.

The first statistical assumption is this: for each of the dice, the probability of each face (1, 2, 3, 4, 5, and 6) coming up is $\frac{1}{6}$. From that assumption, we can calculate the probability of both dice coming up 5: $\frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$. More generally, we can calculate the probability of any event: e.g. (1 and 2) or (3 and 3) or (5 and 6).

The alternative statistical assumption is this: for each of the dice, the probability of the face 5 coming up is $\frac{1}{8}$ (because the dice are weighted). From that assumption, we can calculate the probability of both dice coming up 5: $\frac{1}{8} \times \frac{1}{8} = \frac{1}{64}$. We cannot, however, calculate the probability of any other nontrivial event, as the probabilities of the other faces are unknown.

The first statistical assumption constitutes a statistical model: with the assumption alone, we can calculate the probability of any event. The alternative statistical assumption does ''not'' constitute a statistical model: with the assumption alone, we cannot calculate the probability of every event.

In the example above, with the first assumption, calculating the probability of an event is easy. With some other examples, though, the calculation can be difficult, or even impractical (e.g. it might require millions of years of computation). For an assumption to constitute a statistical model, such difficulty is acceptable: doing the calculation does not need to be practicable, just theoretically possible.
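The claim that the first assumption lets us compute the probability of any event can be checked by brute-force enumeration. The following is a minimal sketch, not part of the original text: it assumes the two dice are independent and treats an outcome as an ordered pair (first die, second die). The names `fair` and `event_probability` are illustrative.

```python
from fractions import Fraction
from itertools import product

# First assumption: every face of each die has probability 1/6.
# Independence of the two dice is an assumption of this sketch.
FACES = range(1, 7)
fair = {face: Fraction(1, 6) for face in FACES}

def event_probability(event, die_a, die_b):
    """Sum the probabilities of all ordered outcomes (a, b) that belong to `event`."""
    return sum(
        die_a[a] * die_b[b]
        for a, b in product(FACES, FACES)
        if (a, b) in event
    )

both_five = {(5, 5)}
print(event_probability(both_five, fair, fair))  # 1/36

# The event "(1 and 2) or (3 and 3) or (5 and 6)", read as ordered pairs.
mixed = {(1, 2), (3, 3), (5, 6)}
print(event_probability(mixed, fair, fair))      # 1/12
```

Under the alternative assumption only the probability of the face 5 is known, so a dictionary like `fair` could not be filled in for the other faces, and the sum above could not be evaluated for most events.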


Formal definition

In mathematical terms, a statistical model is a pair $(S, \mathcal{P})$, where $S$ is the set of possible observations, i.e. the sample space, and $\mathcal{P}$ is a set of probability distributions on $S$. The set $\mathcal{P}$ represents all of the models that are considered possible. This set is typically parameterized: $\mathcal{P} = \{F_{\theta} : \theta \in \Theta\}$. The set $\Theta$ defines the parameters of the model. If a parameterization is such that distinct parameter values give rise to distinct distributions, i.e. $F_{\theta_1} = F_{\theta_2} \Rightarrow \theta_1 = \theta_2$ (in other words, the mapping $\theta \mapsto F_{\theta}$ is injective), it is said to be ''identifiable''.

In some cases, the model can be more complex.
* In Bayesian statistics, the model is extended by adding a probability distribution over the parameter space $\Theta$.
* A statistical model can sometimes distinguish two sets of probability distributions. The first set $\mathcal{Q} = \{F_{\theta} : \theta \in \Theta\}$ is the set of models considered for inference. The second set $\mathcal{P} = \{F_{\lambda} : \lambda \in \Lambda\}$ is the set of models that could have generated the data, which is much larger than $\mathcal{Q}$. Such statistical models are key in checking that a given procedure is robust, i.e. that it does not produce catastrophic errors when its assumptions about the data are incorrect.
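Identifiability is easier to see with a parameterization that fails it. The sketch below is an illustration that does not come from the original text: the hypothetical function `f_sum` maps a parameter (a, b) to the normal distribution with mean a + b and unit variance, so different parameter values can produce the same distribution. It is contrasted with the usual (mean, standard deviation) parameterization, `f_mean_sd`.

```python
from statistics import NormalDist

def f_sum(theta):
    """Hypothetical parameterization: theta = (a, b) maps to N(a + b, 1)."""
    a, b = theta
    return NormalDist(mu=a + b, sigma=1.0)

def f_mean_sd(theta):
    """Usual parameterization: theta = (mu, sigma) maps to N(mu, sigma^2)."""
    mu, sigma = theta
    return NormalDist(mu=mu, sigma=sigma)

# Two distinct parameter values give the same distribution,
# so the mapping theta -> F_theta is not injective.
same_under_sum = f_sum((1.0, 2.0)) == f_sum((2.0, 1.0))

# The same two parameter values give different distributions here.
same_under_mean_sd = f_mean_sd((1.0, 2.0)) == f_mean_sd((2.0, 1.0))

print(same_under_sum)      # True
print(same_under_mean_sd)  # False
```

A single collision is enough to show that a parameterization is not identifiable. Checking one pair of values does not prove that a parameterization is identifiable; that requires an argument covering all of the parameter set.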


An example

Suppose that we have a population of children, with the ages of the children distributed uniformly in the population. The height of a child will be stochastically related to the age: e.g. when we know that a child is of age 7, this influences the chance of the child being 1.5 meters tall. We could formalize that relationship in a linear regression model, like this: $\text{height}_i = b_0 + b_1 \text{age}_i + \varepsilon_i$, where $b_0$ is the intercept, $b_1$ is a parameter that age is multiplied by to obtain a prediction of height, $\varepsilon_i$ is the error term, and $i$ identifies the child. This implies that height is predicted by age, with some error.

An admissible model must be consistent with all the data points. Thus, a straight line ($\text{height}_i = b_0 + b_1 \text{age}_i$) cannot be admissible for a model of the data, unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line. The error term, $\varepsilon_i$, must be included in the equation, so that the model is consistent with all the data points.

To do statistical inference, we would first need to assume some probability distributions for the $\varepsilon_i$. For instance, we might assume that the $\varepsilon_i$ distributions are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: $b_0$, $b_1$, and the variance of the Gaussian distribution.

We can formally specify the model in the form $(S, \mathcal{P})$ as follows. The sample space, $S$, of our model comprises the set of all possible pairs (age, height). Each possible value of $\theta = (b_0, b_1, \sigma^2)$ determines a distribution on $S$; denote that distribution by $F_{\theta}$. If $\Theta$ is the set of all possible values of $\theta$, then $\mathcal{P} = \{F_{\theta} : \theta \in \Theta\}$. (The parameterization is identifiable, and this is easy to check.)

In this example, the model is determined by (1) specifying $S$ and (2) making some assumptions relevant to $\mathcal{P}$. There are two assumptions: that height can be approximated by a linear function of age; that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify $\mathcal{P}$, as they are required to do.
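To make the three parameters concrete, the sketch below draws data from one member of this model family and then recovers the parameters by ordinary least squares. It is an illustration under stated assumptions, not part of the original text: the "true" values `B0`, `B1`, `SIGMA`, the age range of 2 to 12 years, and the sample size are all arbitrary choices.

```python
import random

random.seed(0)

# Hypothetical "true" parameter values, chosen only for illustration
# (metres, metres per year, metres).
B0, B1, SIGMA = 0.75, 0.06, 0.05

def simulate(n):
    """Draw n (age, height) pairs with height = B0 + B1*age + eps, eps ~ N(0, SIGMA^2)."""
    ages = [random.uniform(2.0, 12.0) for _ in range(n)]
    heights = [B0 + B1 * a + random.gauss(0.0, SIGMA) for a in ages]
    return ages, heights

def fit_ols(ages, heights):
    """Closed-form least-squares estimates of b0 and b1, plus the residual variance."""
    n = len(ages)
    mean_a = sum(ages) / n
    mean_h = sum(heights) / n
    sxx = sum((a - mean_a) ** 2 for a in ages)
    sxy = sum((a - mean_a) * (h - mean_h) for a, h in zip(ages, heights))
    b1 = sxy / sxx
    b0 = mean_h - b1 * mean_a
    rss = sum((h - b0 - b1 * a) ** 2 for a, h in zip(ages, heights))
    return b0, b1, rss / (n - 2)  # n - 2 because two coefficients were estimated

ages, heights = simulate(2000)
b0_hat, b1_hat, var_hat = fit_ols(ages, heights)
print(b0_hat, b1_hat, var_hat)
```

With a sample of this size the three estimates should land close to `B0`, `B1`, and `SIGMA ** 2`. The triple `(b0_hat, b1_hat, var_hat)` is an estimate of the single point $\theta$ in the parameter set.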


General remarks

A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the above example with children's heights, ε is a stochastic variable; without that stochastic variable, the model would be deterministic.

Statistical models are often used even when the data-generating process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via a Bernoulli process).

Choosing an appropriate statistical model to represent a given data-generating process is sometimes extremely difficult, and may require knowledge of both the process and relevant statistical analyses. Relatedly, the statistician Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".

There are three purposes for a statistical model, according to Konishi & Kitagawa:
1. Predictions
2. Extraction of information
3. Description of stochastic structures

Those three purposes are essentially the same as the three purposes indicated by Friendly & Meyer: prediction, estimation, description.


Dimension of a model

Suppose that we have a statistical model $(S, \mathcal{P})$ with $\mathcal{P} = \{F_{\theta} : \theta \in \Theta\}$. In notation, we write that $\Theta \subseteq \mathbb{R}^k$ where $k$ is a positive integer ($\mathbb{R}$ denotes the real numbers; other sets can be used, in principle). Here, $k$ is called the dimension of the model. The model is said to be ''parametric'' if $\Theta$ has finite dimension.

As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that

$$\mathcal{P} = \left\{ F_{\mu,\sigma}(x) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) : \mu \in \mathbb{R},\ \sigma > 0 \right\}.$$

In this example, the dimension, $k$, equals 2. As another example, suppose that the data consists of points $(x, y)$ that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean): this leads to the same statistical model as was used in the example with children's heights. The dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note that the set of all possible lines has dimension 2, even though geometrically, a line has dimension 1.)

Although formally $\theta \in \Theta$ is a single parameter that has dimension $k$, it is sometimes regarded as comprising $k$ separate parameters. For example, with the univariate Gaussian distribution, $\theta$ is formally a single parameter with dimension 2, but it is often regarded as comprising 2 separate parameters: the mean and the standard deviation.

A statistical model is ''nonparametric'' if the parameter set $\Theta$ is infinite dimensional. A statistical model is ''semiparametric'' if it has both finite-dimensional and infinite-dimensional parameters. Formally, if $k$ is the dimension of $\Theta$ and $n$ is the number of samples, both semiparametric and nonparametric models have $k \rightarrow \infty$ as $n \rightarrow \infty$. If $k/n \rightarrow 0$ as $n \rightarrow \infty$, then the model is semiparametric; otherwise, the model is nonparametric.

Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".
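For the univariate Gaussian example, fitting the model to data amounts to choosing one point in a two-dimensional parameter set. The sketch below illustrates this with Python's standard library; the simulated sample (mean 10, standard deviation 2, 5000 draws) is a hypothetical choice and not part of the original text.

```python
import random
from statistics import NormalDist

random.seed(1)

# Hypothetical sample drawn from a Gaussian with mean 10 and standard deviation 2.
data = [random.gauss(10.0, 2.0) for _ in range(5000)]

# Estimate the mean and standard deviation from the sample.
fitted = NormalDist.from_samples(data)

# theta is formally one parameter of dimension 2: a point (mu, sigma) with sigma > 0.
theta_hat = (fitted.mean, fitted.stdev)
k = len(theta_hat)

print(theta_hat, k)
```

Here `k` is 2 whatever the sample size, which is what makes the model parametric. In a nonparametric model the number of quantities to be estimated would grow with the sample size.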


Nested models

Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model

$$y = b_0 + b_1 x + b_2 x^2 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

has, nested within it, the linear model

$$y = b_0 + b_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

where we constrain the parameter $b_2$ to equal 0.

In both those examples, the first model has a higher dimension than the second model (for the first example, the zero-mean model has dimension 1). Such is often, but not always, the case. As an example where they have the same dimension, the set of positive-mean Gaussian distributions is nested within the set of all Gaussian distributions; they both have dimension 2.
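The quadratic and linear example can be written out directly. The sketch below is only an illustration: the function names, coefficient values, and inputs are hypothetical, and it compares only the mean functions of the two models, since the error term is the same in both.

```python
def linear_mean(x, b0, b1):
    """Mean of y under the linear model: b0 + b1*x."""
    return b0 + b1 * x

def quadratic_mean(x, b0, b1, b2):
    """Mean of y under the quadratic model: b0 + b1*x + b2*x^2."""
    return b0 + b1 * x + b2 * x ** 2

# Hypothetical coefficient values and inputs, chosen only for illustration.
b0, b1 = 1.5, -0.25
xs = [-2.0, 0.0, 0.5, 3.0]

# Imposing the constraint b2 = 0 on the quadratic model recovers the linear model.
constrained = [quadratic_mean(x, b0, b1, 0.0) for x in xs]
linear = [linear_mean(x, b0, b1) for x in xs]

print(constrained == linear)  # True
```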


Comparing models

Comparing statistical models is fundamental for much of statistical inference. Konishi & Kitagawa state: "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling. They are typically formulated as comparisons of several statistical models." Common criteria for comparing models include the following: $R^2$, Bayes factor, Akaike information criterion, and the likelihood-ratio test together with its generalization, the relative likelihood.

Another way of comparing two statistical models is through the notion of deficiency introduced by Lucien Le Cam.
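Two of the criteria listed above can be sketched for a pair of nested Gaussian models: the family of all Gaussians (dimension 2) and the zero-mean Gaussians (dimension 1). The example is an illustration under stated assumptions and not part of the original text: the sample is simulated with an arbitrary true mean of 0.5, and the Akaike information criterion is taken in its standard form, 2k minus twice the maximized log-likelihood.

```python
import math
import random

random.seed(2)

# Hypothetical sample; the true mean 0.5 is an arbitrary choice for illustration.
data = [random.gauss(0.5, 1.0) for _ in range(500)]
n = len(data)

def max_loglik(variance_mle):
    """Gaussian log-likelihood at its maximum, given the maximum-likelihood variance."""
    return -0.5 * n * (math.log(2 * math.pi * variance_mle) + 1)

# Full model: all Gaussians, parameter (mu, sigma^2), dimension 2.
mu_hat = sum(data) / n
var_full = sum((x - mu_hat) ** 2 for x in data) / n
ll_full = max_loglik(var_full)

# Nested model: mean constrained to 0, parameter sigma^2, dimension 1.
var_zero = sum(x ** 2 for x in data) / n
ll_zero = max_loglik(var_zero)

# Likelihood-ratio statistic; it cannot be negative because the models are nested.
lr_stat = 2 * (ll_full - ll_zero)

def aic(k, loglik):
    """Akaike information criterion: 2k - 2 * (maximized log-likelihood)."""
    return 2 * k - 2 * loglik

aic_full = aic(2, ll_full)
aic_zero = aic(1, ll_zero)

print(lr_stat, aic_full, aic_zero)
```

The model with the lower AIC is preferred. Because the simulated mean is well away from 0, the full model should come out ahead here despite the penalty for its extra parameter.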


See also

* All models are wrong
* Blockmodel
* Conceptual model
* Design of experiments
* Deterministic model
* Effective theory
* Predictive model
* Response modeling methodology
* SackSEER
* Scientific model
* Statistical inference
* Statistical model specification
* Statistical model validation
* Statistical theory
* Stochastic process


Further reading

* Davison, A. C. (2008), ''Statistical Models'', Cambridge University Press
* Freedman, D. A. (2009), ''Statistical Models'', Cambridge University Press
* Helland, I. S. (2010), ''Steps Towards a Unified Basis for Scientific Models and Methods'', World Scientific
* Kroese, D. P.; Chan, J. C. C. (2014), ''Statistical Modeling and Computation'', Springer