In statistics, ancillarity is a property of a statistic computed on a sample dataset in relation to a parametric model of the dataset. An ancillary statistic has the same distribution regardless of the value of the parameters and thus provides no information about them. It is opposed to the concept of a complete statistic, which contains no ancillary information. It is closely related to the concept of a sufficient statistic, which contains all of the information that the dataset provides about the parameters.

An ancillary statistic is a specific case of a pivotal quantity that is computed only from the data and not from the parameters. Ancillary statistics can be used to construct prediction intervals. They are also used in connection with Basu's theorem to prove independence between statistics.

The concept was first introduced by Ronald Fisher in the 1920s, but its formal definition was only provided in 1964 by Debabrata Basu.
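
As an illustration of the connection to Basu's theorem, the following simulation is a minimal sketch (assuming NumPy; the sample size ''n'' = 10, the replication count, and ''μ'' = 2 are arbitrary illustrative choices): for i.i.d. ''N''(''μ'', 1) data the sample mean is complete sufficient for ''μ'' and the sample range is ancillary, so Basu's theorem says they are independent; the near-zero correlation below is consistent with that.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 100_000

# i.i.d. N(mu, 1) draws; mu = 2 is an arbitrary illustrative choice.
x = rng.normal(2.0, 1.0, size=(reps, n))

sample_mean = x.mean(axis=1)                  # complete sufficient for mu
sample_range = x.max(axis=1) - x.min(axis=1)  # ancillary for mu

# Basu's theorem: complete sufficient and ancillary statistics are
# independent, so the sample correlation should be near zero.
print(f"corr(mean, range) = {np.corrcoef(sample_mean, sample_range)[0, 1]:.4f}")
```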


Examples

Suppose ''X''1, ..., ''X''''n'' are independent and identically distributed, and are normally distributed with unknown expected value ''μ'' and known variance 1. Let

:\overline{X}_n = \frac{X_1 + \cdots + X_n}{n}

be the sample mean.

The following statistical measures of dispersion of the sample

* Range: max(''X''1, ..., ''X''''n'') − min(''X''1, ..., ''X''''n'')
* Interquartile range: ''Q''3 − ''Q''1
* Sample variance:

:: \hat{\sigma}^2 := \frac{\sum_{i=1}^n \left(X_i - \overline{X}\right)^2}{n - 1}

are all ''ancillary statistics'', because their sampling distributions do not change as ''μ'' changes. Computationally, this is because in the formulas, the ''μ'' terms cancel – adding a constant number to a distribution (and to all samples) changes its sample maximum and minimum by the same amount, so it does not change their difference, and likewise for the others: these measures of dispersion do not depend on location.

Conversely, given i.i.d. normal variables with known mean 1 and unknown variance ''σ''2, the sample mean \overline{X} is ''not'' an ancillary statistic of the variance, as the sampling distribution of the sample mean is ''N''(1, ''σ''2/''n''), which does depend on ''σ''2 – this measure of location (specifically, its standard error) depends on dispersion.
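
These claims can be checked by simulation. The sketch below (assuming NumPy; the sample size, replication count, and parameter values are arbitrary illustrative choices) shows that the empirical distribution of the range is unchanged when ''μ'' shifts, while the spread of the sample mean does change with ''σ''.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 10, 100_000

# The sample range of N(mu, 1) data: its distribution should not
# depend on mu, since the mu terms cancel in max - min.
for mu in (0.0, 5.0):
    x = rng.normal(mu, 1.0, size=(reps, n))
    r = x.max(axis=1) - x.min(axis=1)
    print(f"mu={mu}: range mean={r.mean():.3f}, sd={r.std():.3f}")

# By contrast, the sample mean of N(1, sigma^2) data is not ancillary
# for sigma: its standard error sigma/sqrt(n) changes with sigma.
for sigma in (1.0, 3.0):
    x = rng.normal(1.0, sigma, size=(reps, n))
    print(f"sigma={sigma}: sd of sample mean={x.mean(axis=1).std():.3f}")
```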


In location-scale families

In a location family of distributions, (X_1 - X_n, X_2 - X_n, \dots, X_{n-1} - X_n) is an ancillary statistic.

In a scale family of distributions, \left( \frac{X_1}{X_n}, \frac{X_2}{X_n}, \dots, \frac{X_{n-1}}{X_n} \right) is an ancillary statistic.

In a location-scale family of distributions, \left( \frac{X_1 - \overline{X}}{S}, \frac{X_2 - \overline{X}}{S}, \dots, \frac{X_n - \overline{X}}{S} \right), where S^2 is the sample variance, is an ancillary statistic.
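
As an empirical check of the location-family case, the sketch below (assuming NumPy; the Cauchy family, ''n'' = 5, and the two values of the location ''θ'' are arbitrary illustrative choices) verifies that the differences X_i - X_n have the same distribution for both locations.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 5, 100_000

# Location family: Cauchy with unknown location theta and fixed scale 1.
# The differences (X_1 - X_n, ..., X_{n-1} - X_n) are ancillary: theta
# cancels, so their distribution is the same for every theta.
for theta in (0.0, 10.0):
    x = theta + rng.standard_cauchy(size=(reps, n))
    d = x[:, :-1] - x[:, [-1]]               # differences against X_n
    q = np.quantile(d[:, 0], [0.25, 0.5, 0.75])
    print(f"theta={theta}: quartiles of X_1 - X_n = {np.round(q, 2)}")
```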


In recovery of information

It turns out that, if T_1 is a non-sufficient statistic and T_2 is ancillary, one can sometimes recover all the information about the unknown parameter contained in the entire data by reporting T_1 while conditioning on the observed value of T_2. This is known as ''conditional inference''.

For example, suppose that X_1, X_2 follow the N(\theta, 1) distribution where \theta is unknown. Note that, even though X_1 is not sufficient for \theta (since its Fisher information is 1, whereas the Fisher information of the complete statistic \overline{X} is 2), by additionally reporting the ancillary statistic X_1 - X_2, one obtains a joint distribution with Fisher information 2: the pair (X_1, X_1 - X_2) is a one-to-one transformation of the full data (X_1, X_2), so it carries all of the information.
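
A numerical check of this accounting (a sketch assuming NumPy; ''θ'' = 3 and the replication count are arbitrary illustrative choices): for a normal location model the variance of the natural unbiased estimator equals the inverse Fisher information, so X_1 alone, with variance 1, carries information 1, while the estimator X_1 - (X_1 - X_2)/2 = \overline{X} built from (X_1, X_1 - X_2), with variance 1/2, carries the full information 2.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, reps = 3.0, 200_000

x1 = rng.normal(theta, 1.0, size=reps)
x2 = rng.normal(theta, 1.0, size=reps)
d = x1 - x2                 # ancillary: N(0, 2) whatever theta is

# Estimator from X_1 alone: variance 1   -> Fisher information 1.
# Estimator from (X_1, D): x1 - d/2 = (x1 + x2)/2,
# variance 1/2 -> Fisher information 2, the full-data information.
print(f"var of X_1:       {x1.var():.3f}")            # approx 1
print(f"var of X_1 - D/2: {(x1 - d / 2).var():.3f}")  # approx 0.5
```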


Ancillary complement

Given a statistic ''T'' that is not sufficient, an ancillary complement is a statistic ''U'' that is ancillary and such that (''T'', ''U'') is sufficient. Intuitively, an ancillary complement "adds the missing information" (without duplicating any).

The notion is particularly useful when one takes ''T'' to be a maximum likelihood estimator, which in general will not be sufficient; then one can ask for an ancillary complement. In this case, Fisher argues that one must condition on an ancillary complement to determine information content: one should consider the Fisher information content of ''T'' to be based not on the marginal distribution of ''T'', but on the conditional distribution of ''T'' given ''U'': how much information does ''T'' ''add''? This is not possible in general, as no ancillary complement need exist, and if one exists, it need not be unique, nor need a maximal ancillary complement exist.


Example

In baseball, suppose a scout observes a batter in ''N'' at-bats. Suppose (unrealistically) that the number ''N'' is chosen by some random process that is independent of the batter's ability – say, a coin is tossed after each at-bat and the result determines whether the scout will stay to watch the batter's next at-bat. The eventual data are the number ''N'' of at-bats and the number ''X'' of hits: the data (''X'', ''N'') are a sufficient statistic. The observed batting average ''X''/''N'' fails to convey all of the information available in the data because it fails to report the number ''N'' of at-bats (e.g., a batting average of 0.400, which is very high, based on only five at-bats does not inspire anywhere near as much confidence in the player's ability as a 0.400 average based on 100 at-bats).

The number ''N'' of at-bats is an ancillary statistic because

* it is a part of the observable data (it is a ''statistic''), and
* its probability distribution does not depend on the batter's ability, since it was chosen by a random process independent of the batter's ability.

This ancillary statistic is an ancillary complement to the observed batting average ''X''/''N'': the batting average ''X''/''N'' is not a sufficient statistic, in that it conveys less than all of the relevant information in the data, but conjoined with ''N'', it becomes sufficient.
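
The stopping rule can be simulated directly. In the sketch below (assuming NumPy; the fair coin, the two ability values ''p'', and the helper ''observe'' are illustrative choices, not from the original text), the distribution of ''N'' comes out the same for both abilities, while the batting average ''X''/''N'' tracks the ability.

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 100_000

def observe(p_hit, rng):
    """One scouting session: after each at-bat a fair coin decides
    whether the scout stays; returns (N, X) = (at-bats, hits)."""
    n = x = 0
    while True:
        n += 1
        x += rng.random() < p_hit   # hit with probability p_hit
        if rng.random() < 0.5:      # tails: the scout leaves
            return n, x

for p in (0.250, 0.400):
    sims = [observe(p, rng) for _ in range(reps)]
    ns = np.array([n for n, _ in sims])
    avg = np.array([x / n for n, x in sims])
    # The distribution of N is the same for both p (ancillary); the
    # batting average X/N reflects the ability p.
    print(f"p={p}: mean N={ns.mean():.2f}, P(N=1)={(ns == 1).mean():.3f}, "
          f"mean X/N={avg.mean():.3f}")
```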


See also

* Basu's theorem
* Prediction interval
* Group family
* Conditionality principle

