In statistics, the variance function is a smooth function that depicts the variance of a random quantity as a function of its mean. The variance function is a measure of heteroscedasticity and plays a large role in many settings of statistical modelling. It is a main ingredient in the generalized linear model framework and a tool used in non-parametric regression, semiparametric regression and functional data analysis. In parametric modeling, variance functions take on a parametric form and explicitly describe the relationship between the variance and the mean of a random quantity. In a non-parametric setting, the variance function is assumed to be a smooth function.
Intuition
In a regression model setting, the goal is to establish whether or not a relationship exists between a response variable and a set of predictor variables. Further, if a relationship does exist, the goal is then to describe this relationship as well as possible. A main assumption in linear regression is constant variance (homoscedasticity), meaning that the errors of the response have the same variance at every predictor level. This assumption works well when the response variable and the predictor variable are jointly normal. As we will see later, the variance function in the Normal setting is constant; however, we must find a way to quantify heteroscedasticity (non-constant variance) in the absence of joint Normality.
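To make the contrast concrete, the following is a minimal simulation sketch, not part of the original discussion. It assumes NumPy is available; the predictor levels, the log-linear mean and the error standard deviation of 2 are hypothetical values chosen only for illustration. Normal responses with a common error variance show the same spread at every predictor level, while Poisson responses show a variance that tracks the mean.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, used only for reproducibility

# Hypothetical predictor levels and a log-linear mean (illustrative values).
x_levels = np.array([0.0, 1.0, 2.0, 3.0])
means = np.exp(0.5 + 0.8 * x_levels)

n = 50_000  # number of simulated responses per predictor level

# Normal responses with a common error variance of 4: homoscedastic.
normal_draws = rng.normal(loc=means[:, None], scale=2.0, size=(len(means), n))
# Poisson responses: the variance equals the mean, so it changes with x.
poisson_draws = rng.poisson(lam=means[:, None], size=(len(means), n))

# Sample variance at each predictor level.
normal_var = normal_draws.var(axis=1, ddof=1)
poisson_var = poisson_draws.var(axis=1, ddof=1)

for m, nv, pv in zip(means, normal_var, poisson_var):
    print(f"mean={m:7.2f}  normal var={nv:6.2f}  poisson var={pv:7.2f}")
```

In this sketch the Normal variances stay near 4 regardless of the mean, whereas the Poisson variances grow with it.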
When the response is likely to follow a distribution that is a member of the exponential family, a generalized linear model may be more appropriate; when we do not wish to force a parametric model onto the data, a non-parametric regression approach can be useful. Modeling the variance as a function of the mean improves inference in a parametric setting and, in any setting, improves estimation of the regression function.
Variance functions play a very important role in parameter estimation and inference. In general, maximum likelihood estimation requires that a likelihood function be defined, which in turn requires specifying the distribution of the observed response variables. To define a quasi-likelihood, however, one need only specify a relationship between the mean and the variance of the observations; the quasi-likelihood function can then be used for estimation. Quasi-likelihood estimation is particularly useful when there is overdispersion. Overdispersion occurs when there is more variability in the data than would be expected under the assumed distribution of the data.
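As an illustration of overdispersion, the sketch below compares Poisson counts with negative binomial counts that share the same mean. It assumes NumPy; the mean `mu`, the size parameter `r` and the helper `dispersion_ratio` are hypothetical names and values introduced here. For observations with a common mean, the ratio of sample variance to sample mean is one simple estimate of the dispersion factor in a mean-variance specification of the form Var(y) = φμ.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, used only for reproducibility

mu = 4.0      # common mean (hypothetical value)
n = 100_000   # number of simulated counts

# Counts that follow the assumed Poisson model: variance equals the mean.
poisson_counts = rng.poisson(lam=mu, size=n)

# Negative binomial counts with the same mean but extra variability.
# With size r and success probability p = r / (r + mu), the mean is mu
# and the variance is mu + mu**2 / r.
r = 2.0
nb_counts = rng.negative_binomial(n=r, p=r / (r + mu), size=n)

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    return counts.var(ddof=1) / counts.mean()

poisson_ratio = dispersion_ratio(poisson_counts)
nb_ratio = dispersion_ratio(nb_counts)
print(f"Poisson ratio: {poisson_ratio:.3f}")
print(f"Negative binomial ratio: {nb_ratio:.3f}")
```

A ratio well above 1 for the second sample signals more variability than a Poisson model allows, which is the situation in which specifying only the mean-variance relationship becomes attractive.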
In summary, to ensure efficient inference of the regression parameters and the regression function, the heteroscedasticity must be accounted for. Variance functions quantify the relationship between the variance and the mean of the observed data and hence play a significant role in regression estimation and inference.
Types
The variance function and its applications come up in many areas of statistical analysis. A very important use of this function is in the framework of generalized linear models and non-parametric regression.
Generalized linear model
When a member of the exponential family has been specified, the variance function can easily be derived. The general form of the variance function is presented under the exponential family context, as well as specific forms for Normal, Bernoulli, Poisson, and Gamma. In addition, we describe the applications and use of variance functions in maximum likelihood estimation and quasi-likelihood estimation.
Derivation
The generalized linear model (GLM) is a generalization of ordinary regression analysis that extends to any member of the exponential family. It is particularly useful when the response variable is categorical, binary or subject to a constraint (e.g. only positive responses make sense). The components of a GLM are summarized briefly on this page; for more details, see the page on generalized linear models.
A GLM consists of three main ingredients:
:1. Random component: a distribution of $y$ from the exponential family, with $E[y \mid X] = \mu$
:2. Linear predictor: $\eta = X\beta$
:3. Link function: $\eta = g(\mu)$, equivalently $\mu = g^{-1}(\eta)$
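The three ingredients can be sketched in a few lines. This is an illustrative example rather than a fitting routine: it assumes NumPy, a Poisson random component and a log link, and the design matrix `X`, the coefficients `beta` and the helper names `link` and `inverse_link` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed, used only for reproducibility

# Hypothetical design matrix (intercept plus one predictor) and coefficients.
X = np.column_stack([np.ones(5), np.linspace(-1.0, 1.0, 5)])
beta = np.array([0.3, 1.2])

# Linear predictor: eta = X beta.
eta = X @ beta

# Link function g and its inverse; a log link is assumed for this sketch.
def link(mu):
    return np.log(mu)

def inverse_link(eta):
    return np.exp(eta)

# The mean of the response is recovered from the linear predictor.
mu = inverse_link(eta)

# Random component: responses drawn from an exponential-family member
# with mean mu (Poisson is used here as the example).
y = rng.poisson(lam=mu)

print("eta:", eta)
print("mu: ", mu)
print("y:  ", y)
```

The log link keeps the mean positive for any value of the linear predictor, which is one reason it is commonly paired with count responses.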
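First it is important to derive a couple of key properties of the exponential family.
Any random variable $y$ in the exponential family has a probability density function of the form,
:$f(y, \theta, \phi) = \exp\left(\frac{y\theta - b(\theta)}{\phi} - c(y, \phi)\right)$
with loglikelihood,
:$\ell(\theta, y, \phi) = \log f(y, \theta, \phi) = \frac{y\theta - b(\theta)}{\phi} - c(y, \phi)$
Here, $\theta$ is the canonical parameter and the parameter of interest, and $\phi$ is a nuisance parameter which plays a role in the variance.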
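As a concrete check, the Poisson distribution can be written in this form. The sketch below assumes the loglikelihood is parameterized as (yθ − b(θ))/φ − c(y, φ), with θ = log μ, b(θ) = exp(θ), φ = 1 and c(y, φ) = log(y!); the function names are hypothetical. It compares this expression with the usual Poisson log-probability.

```python
import math

def poisson_log_pmf(y, mu):
    """Poisson log-probability in its usual form: y*log(mu) - mu - log(y!)."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def exp_family_loglik(y, theta, phi, b, c):
    """Loglikelihood in the assumed form (y*theta - b(theta)) / phi - c(y, phi)."""
    return (y * theta - b(theta)) / phi - c(y, phi)

mu = 3.5                 # hypothetical mean
theta = math.log(mu)     # canonical parameter for the Poisson case
phi = 1.0                # dispersion is fixed at 1 for the Poisson case

def b(theta):
    return math.exp(theta)

def c(y, phi):
    return math.lgamma(y + 1)   # log(y!)

# Largest discrepancy between the two expressions over y = 0, ..., 19.
max_diff = max(
    abs(poisson_log_pmf(y, mu) - exp_family_loglik(y, theta, phi, b, c))
    for y in range(20)
)
print(max_diff)
```

The two expressions agree up to floating-point rounding, so under these assumptions the Poisson distribution is a member of the family with the stated θ, b and c.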
We use Bartlett's identities to derive a general expression for the variance function.
The first and second Bartlett results ensure that, under suitable conditions (see Leibniz integral rule), for a density function $f_\theta(\cdot)$ dependent on $\theta$,
:$\operatorname{E}_\theta\left[\frac{\partial}{\partial\theta} \log f_\theta(y)\right] = 0$
:$\operatorname{Var}_\theta\left[\frac{\partial}{\partial\theta} \log f_\theta(y)\right] + \operatorname{E}_\theta\left[\frac{\partial^2}{\partial\theta^2} \log f_\theta(y)\right] = 0$
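Both identities can be checked numerically for a specific case. The sketch below uses a Poisson distribution in its canonical parameterization, where the log-density is yθ − exp(θ) − log(y!), so the first derivative in θ (the score) is y − exp(θ) and the second derivative is −exp(θ). The value of `mu` and the truncation of the support at 100 are assumptions made for illustration.

```python
import math

mu = 2.5               # hypothetical mean
theta = math.log(mu)   # canonical parameter

def pmf(y):
    """Poisson probability written through the canonical parameter theta."""
    return math.exp(y * theta - math.exp(theta) - math.lgamma(y + 1))

def score(y):
    """First derivative of the log-density with respect to theta."""
    return y - math.exp(theta)

def second_derivative(y):
    """Second derivative of the log-density with respect to theta."""
    return -math.exp(theta)

# Truncated support; the probability mass beyond 100 is negligible for mu = 2.5.
support = range(100)
probs = [pmf(y) for y in support]

mean_score = sum(p * score(y) for p, y in zip(probs, support))
var_score = sum(p * (score(y) - mean_score) ** 2 for p, y in zip(probs, support))
mean_second = sum(p * second_derivative(y) for p, y in zip(probs, support))

print("E[score]              =", mean_score)
print("Var[score] + E[second] =", var_score + mean_second)
```

Both printed quantities are zero up to rounding. In this example the variance of the score also equals the mean, which is the Poisson variance.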
These identities lead to simple calculations of the expected value and variance of any random variable $y$ in the exponential family, $\operatorname{E}_\theta[y]$ and $\operatorname{Var}_\theta[y]$.