In
statistics
Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...
, the Horvitz–Thompson estimator, named after
Daniel G. Horvitz and Donovan J. Thompson, is a method for estimating the total and mean of a
pseudo-population in a
stratified sample by applying
inverse probability weighting to account for the difference in the sampling distribution between the collected data and the target population. The Horvitz–Thompson estimator is frequently applied in
survey analyses and can be used to account for
missing data
In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.
Mi ...
, as well as many
sources of unequal selection probabilities.
The method
Formally, let
be an
independent sample from
of
distinct
strata
In geology and related fields, a stratum (: strata) is a layer of Rock (geology), rock or sediment characterized by certain Lithology, lithologic properties or attributes that distinguish it from adjacent layers from which it is separated by v ...
with an overall mean
. Suppose further that
is the
inclusion probability
In statistics, in the theory relating to sampling from finite populations, the sampling probability (also known as inclusion probability) of an element or member of the population, is its probability of becoming part of the sample during the dra ...
that a randomly sampled individual in a superpopulation belongs to the
th stratum. The Horvitz–Thompson estimator of the total is given by:
:
and the Horvitz–Thompson estimate of the mean is given by:
:
In a
Bayesian probabilistic framework
is considered the proportion of individuals in a target population belonging to the
th stratum. Hence,
could be thought of as an estimate of the complete sample of persons within the
th stratum. The Horvitz–Thompson estimator can also be expressed as the limit of a weighted
bootstrap resampling estimate of the mean. It can also be viewed as a special case of multiple
imputation approaches.
For
post-stratified study designs, estimation of
and
are done in distinct steps. In such cases, computating the variance of
is not straightforward. Resampling techniques such as the bootstrap or the jackknife can be applied to gain consistent estimates of the variance of the Horvitz–Thompson estimator. The "survey" package for
R conducts analyses for post-stratified data using the Horvitz–Thompson estimator.
Proof of Horvitz–Thompson unbiased estimation of the mean
For this proof it will be useful to represent the sample as a random subset
of size
. We can then define indicator random variables