In statistics, overdispersion is the presence of greater variability (statistical dispersion) in a data set than would be expected based on a given statistical model.

A common task in applied statistics is choosing a parametric model to fit a given set of empirical observations. This necessitates an assessment of the fit of the chosen model. It is usually possible to choose the model parameters in such a way that the theoretical population mean of the model is approximately equal to the sample mean. However, especially for simple models with few parameters, theoretical predictions may not match empirical observations for higher moments. When the observed variance is higher than the variance of a theoretical model, overdispersion has occurred. Conversely, underdispersion means that there was less variation in the data than predicted. Overdispersion is a very common feature in applied data analysis because, in practice, populations are frequently heterogeneous (non-uniform), contrary to the assumptions implicit within widely used simple parametric models.

Examples

Poisson

Overdispersion is often encountered when fitting very simple parametric models, such as those based on the Poisson distribution. The Poisson distribution has one free parameter and does not allow for the variance to be adjusted independently of the mean. The choice of a distribution from the Poisson family is often dictated by the nature of the empirical data. For example, Poisson regression analysis is commonly used to model count data. If overdispersion is a feature, an alternative model with additional free parameters may provide a better fit. In the case of count data, a Poisson mixture model like the negative binomial distribution can be proposed instead, in which the mean of the Poisson distribution can itself be thought of as a random variable drawn (in this case) from the gamma distribution, thereby introducing an additional free parameter (note that the resulting negative binomial distribution is completely characterized by two parameters).
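The effect of mixing the Poisson mean over a gamma distribution can be checked with a small simulation. The following is a minimal sketch using only the Python standard library; the function names, the mean of 4, the gamma shape of 2, and the sample size are illustrative assumptions, not values from the text. It compares the variance-to-mean ratio of plain Poisson counts (expected to be close to 1) with that of gamma-mixed Poisson counts (expected to be close to 1 + mean/shape, here 3).

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def dispersion_index(counts):
    """Sample variance divided by sample mean (about 1 for Poisson data)."""
    return statistics.variance(counts) / statistics.mean(counts)


def simulate(n=20000, mean=4.0, shape=2.0, seed=0):
    """Return the dispersion index of plain and gamma-mixed Poisson samples."""
    rng = random.Random(seed)
    # Plain Poisson: every observation shares the same rate.
    plain = [sample_poisson(mean, rng) for _ in range(n)]
    # Mixture: each observation gets its own rate, drawn from a gamma
    # distribution with the given shape and scale mean/shape (so its mean is `mean`).
    mixed = [
        sample_poisson(rng.gammavariate(shape, mean / shape), rng)
        for _ in range(n)
    ]
    return dispersion_index(plain), dispersion_index(mixed)


plain_index, mixed_index = simulate()
print(f"Poisson variance/mean:       {plain_index:.2f}")
print(f"gamma-Poisson variance/mean: {mixed_index:.2f}")
```

Both samples have roughly the same mean, so a model fitted only to the mean cannot tell them apart; the difference shows up in the variance, which is what the extra parameter of the negative binomial distribution absorbs.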

Binomial

As a more concrete example, it has been observed that the number of boys born to families does not conform faithfully to a binomial distribution as might be expected. Instead, the sex ratios of families seem to skew toward either boys or girls (see, for example, the Trivers–Willard hypothesis for one possible explanation); that is, there are more all-boy families, more all-girl families, and fewer families close to the population 51:49 boy-to-girl mean ratio than expected from a binomial distribution, and the resulting empirical variance is larger than specified by a binomial model.

In this case, the beta-binomial distribution is a popular and analytically tractable alternative to the binomial distribution, since it provides a better fit to the observed data. To capture the heterogeneity of the families, one can think of the probability parameter of the binomial model (say, the probability of being a boy) as itself a random variable (i.e. a random effects model) drawn for each family from a beta distribution as the mixing distribution. The resulting compound distribution (beta-binomial) has an additional free parameter.

Another common model for overdispersion, when some of the observations are not Bernoulli, arises from introducing a normal random variable into a logistic model. Software is widely available for fitting this type of multilevel model. In this case, if the variance of the normal variable is zero, the model reduces to the standard (undispersed) logistic regression. This model has an additional free parameter, namely the variance of the normal variable.

With respect to binomial random variables, the concept of overdispersion makes sense only if n>1 (i.e. overdispersion is nonsensical for Bernoulli random variables).
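The role of n can be made explicit through the variance of the beta-binomial distribution. As a sketch, assume the success probability p is drawn from a beta distribution with parameters α and β (symbols introduced here for illustration), and write π for its mean and ρ for the resulting within-group correlation. The law of total variance then gives:

```latex
\begin{aligned}
X \mid p &\sim \operatorname{Binomial}(n, p), \qquad p \sim \operatorname{Beta}(\alpha, \beta), \\
\pi &= \frac{\alpha}{\alpha + \beta}, \qquad \rho = \frac{1}{\alpha + \beta + 1}, \\
\operatorname{E}[X] &= n\pi, \\
\operatorname{Var}(X) &= \operatorname{E}[\operatorname{Var}(X \mid p)] + \operatorname{Var}(\operatorname{E}[X \mid p]) \\
&= n\pi(1-\pi)(1-\rho) + n^{2}\pi(1-\pi)\rho \\
&= n\pi(1-\pi)\,\bigl[1 + (n-1)\rho\bigr].
\end{aligned}
```

The bracketed factor exceeds 1 whenever n>1 and ρ>0, so the variance is larger than the binomial value nπ(1−π). For n=1 the factor equals 1 and the variance is exactly that of a Bernoulli variable with probability π, which is why overdispersion cannot be detected in Bernoulli data.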

Normal distribution

As the normal distribution (Gaussian) has variance as a parameter, any data with finite variance (including any finite data) can be modeled with a normal distribution with the exact variance: the normal distribution is a two-parameter model, with mean and variance. Thus, in the absence of an underlying model, there is no notion of data being overdispersed relative to the normal model, though the fit may be poor in other respects (such as the higher moments of skew, kurtosis, etc.). However, when the data are modeled by a normal distribution with an expected variation, they can be over- or under-dispersed relative to that prediction.

For example, in a statistical survey, the margin of error (determined by sample size) predicts the sampling error and hence the dispersion of results on repeated surveys. If one performs a meta-analysis of repeated surveys of a fixed population (say with a given sample size, so the margin of error is the same), one expects the results to fall on a normal distribution with the standard deviation implied by the margin of error. However, in the presence of study heterogeneity, where studies have different sampling bias, the distribution is instead a compound distribution and will be overdispersed relative to the predicted distribution. For example, given repeated opinion polls all with a margin of error of 3%, if they are conducted by different polling organizations, one expects the results to be more dispersed than the 3% margin of error alone predicts, due to pollster bias from different methodologies.
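The polling example can be illustrated with a short simulation. This is a minimal sketch under stated assumptions: the function name, the sample size of 400, the true share of 0.5, and the pollster-bias standard deviation of 0.03 are all hypothetical choices, and each pollster's bias is modeled as a normal shift of the true share. It compares the standard deviation of poll results with the value predicted from sampling error alone.

```python
import random
import statistics


def poll_result_sd(n_polls=1000, sample_size=400, true_share=0.5,
                   bias_sd=0.0, seed=1):
    """Standard deviation of simulated poll results across many polls."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_polls):
        # Each poll measures a share shifted by its own systematic bias.
        share = true_share + rng.gauss(0.0, bias_sd)
        share = min(max(share, 0.0), 1.0)
        # Sampling error: count "yes" answers among the respondents.
        yes = sum(rng.random() < share for _ in range(sample_size))
        results.append(yes / sample_size)
    return statistics.stdev(results)


# Dispersion predicted by sampling error alone: sqrt(p(1-p)/n) = 0.025.
predicted_sd = (0.5 * 0.5 / 400) ** 0.5
unbiased_sd = poll_result_sd(bias_sd=0.0)
biased_sd = poll_result_sd(bias_sd=0.03)

print(f"predicted from sample size: {predicted_sd:.4f}")
print(f"identical methodology:      {unbiased_sd:.4f}")
print(f"differing pollster biases:  {biased_sd:.4f}")
```

With identical methodology the observed spread should sit close to the predicted value, while the polls with differing biases are overdispersed relative to it, since the bias variance adds to the sampling variance.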

Differences in terminology among disciplines

Over- and underdispersion are terms which have been adopted in branches of the biological sciences. In parasitology, the term 'overdispersion' is generally used as defined here, meaning a distribution with a higher than expected variance.

In some areas of ecology, however, meanings have been transposed, so that overdispersion is actually taken to mean more even (lower variance) than expected. This confusion has caused some ecologists to suggest that the terms 'aggregated', or 'contagious', would be better used in ecology for 'overdispersed'. Such preferences are creeping into parasitology too. Generally this suggestion has not been heeded, and confusion persists in the literature.

Furthermore, in demography, overdispersion is often evident in the analysis of death count data, but demographers prefer the term 'unobserved heterogeneity'.

References