Bootstrapping populations in statistics and mathematics starts with a sample <math>\{x_1, \ldots, x_m\}</math> observed from a random variable ''X''.
When ''X'' has a given distribution law with a set of non-fixed parameters, which we denote with a vector <math>\boldsymbol\theta</math>, a parametric inference problem consists of computing suitable values, called estimates, of these parameters on the basis of the sample alone. An estimate is suitable if substituting it for the unknown parameter does not cause major damage in subsequent computations. In algorithmic inference, the suitability of an estimate is read in terms of compatibility with the observed sample.
In this framework, resampling methods aim at generating a set of candidate values to replace the unknown parameters, which we read as compatible replicas of them. They represent a population of specifications of a random vector <math>\boldsymbol\Theta</math> compatible with an observed sample, where the compatibility of its values has the properties of a probability distribution. By plugging the parameters into the expression of the distribution law in question, we bootstrap entire populations of random variables compatible with the observed sample.
The rationale of the algorithms computing the replicas, which we call ''population bootstrap'' procedures, is to identify a set of statistics <math>\{s_1, \ldots, s_k\}</math> exhibiting specific properties, denoting well behavior, with respect to the unknown parameters. The statistics are, by definition, expressed as functions of the observed values <math>\{x_1, \ldots, x_m\}</math>. The <math>x_i</math> may in turn be expressed as functions of the unknown parameters and a random seed specification <math>z_i</math> through the sampling mechanism <math>(g_{\boldsymbol\theta}, Z)</math>. Then, by plugging the second expression into the first, we obtain the <math>s_j</math> as functions of seeds and parameters, the master equations, which we invert to find values of the parameters as a function of: i) the statistics, whose values are fixed at the observed ones; and ii) the seeds, which are random according to their own distribution. Hence from a set of seed samples we obtain a set of parameter replicas.
Method
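Given a sample <math>\boldsymbol x = \{x_1, \ldots, x_m\}</math> of a random variable ''X'' and a sampling mechanism <math>(g_{\boldsymbol\theta}, Z)</math> for ''X'', the realization <math>\boldsymbol x</math> is given by <math>\boldsymbol x = \{g_{\boldsymbol\theta}(z_1), \ldots, g_{\boldsymbol\theta}(z_m)\}</math>, with <math>\boldsymbol\theta = (\theta_1, \ldots, \theta_k)</math>. Focusing on well-behaved statistics,
:<math>s_1 = h_1(x_1, \ldots, x_m), \quad \ldots, \quad s_k = h_k(x_1, \ldots, x_m)</math>
for their parameters, the master equations read
:<math>\begin{align} s_1 &= h_1(g_{\boldsymbol\theta}(z_1), \ldots, g_{\boldsymbol\theta}(z_m)) = \rho_1(\boldsymbol\theta; z_1, \ldots, z_m) \\ &\;\;\vdots \\ s_k &= h_k(g_{\boldsymbol\theta}(z_1), \ldots, g_{\boldsymbol\theta}(z_m)) = \rho_k(\boldsymbol\theta; z_1, \ldots, z_m) \end{align} \qquad (1)</math>
For each sample seed <math>\{z_1, \ldots, z_m\}</math> a vector of parameters <math>\boldsymbol\theta</math> is obtained from the solution of the above system with <math>s_1, \ldots, s_k</math> fixed to the observed values.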
Having computed a large set of compatible vectors, say ''N'', the empirical marginal distribution of <math>\Theta_j</math> is obtained by:
:<math>\widehat F_{\Theta_j}(\theta) = \sum_{i=1}^N \frac{1}{N} I_{(-\infty,\theta]}(\breve\theta_{j,i}) \qquad (2)</math>
where <math>\breve\theta_{j,i}</math> is the ''j''-th component of the generic solution of (1) and where <math>I_{(-\infty,\theta]}(\breve\theta_{j,i})</math> is the indicator function of <math>\breve\theta_{j,i}</math> in the interval <math>(-\infty,\theta]</math>. Some indeterminacies remain if ''X'' is discrete; this will be considered shortly.
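The empirical distribution (2) is simply the fraction of replicas that do not exceed a given value. A minimal Python sketch follows; the function name `empirical_cdf` and the replica values are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right

def empirical_cdf(replicas):
    """Return a function theta -> fraction of replicas <= theta."""
    ordered = sorted(replicas)
    n = len(ordered)

    def cdf(theta):
        # bisect_right counts the replicas falling in (-inf, theta]
        return bisect_right(ordered, theta) / n

    return cdf

# Hypothetical replicas of a single parameter component
cdf = empirical_cdf([0.8, 1.1, 1.3, 2.0])
print(cdf(1.1))  # 0.5
print(cdf(0.5))  # 0.0
```

Sorting once and using a binary search keeps each evaluation at logarithmic cost, which matters when ''N'' is large.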
The whole procedure may be summed up in the form of the following Algorithm, where the index <math>\boldsymbol\Theta</math> of <math>\boldsymbol s_{\boldsymbol\Theta}</math> denotes the parameter vector from which the statistics vector is derived.
Algorithm
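The steps described above (compute the statistics on the observed sample, draw seed samples, solve the master equations for each of them) can be sketched in Python as follows. This is only an illustration under stated assumptions: the seeds are taken to be uniform on [0, 1), and `invert` is a hypothetical user-supplied function that solves the master equations for the distribution at hand. The usage lines assume an exponential variable with the sample sum as statistic.

```python
import math
import random

def population_bootstrap(s_observed, m, invert, n_replicas, rng=None):
    """Generate parameter replicas compatible with the observed statistic.

    s_observed : statistic computed on the observed sample
    m          : sample size (number of seeds drawn per replica)
    invert     : hypothetical solver of the master equations,
                 called as invert(s_observed, seeds)
    n_replicas : number N of seed samples to draw
    """
    rng = rng or random.Random()
    replicas = []
    for _ in range(n_replicas):
        # Assumption: uniform seeds; 1 - random() lies in (0, 1]
        seeds = [1.0 - rng.random() for _ in range(m)]
        replicas.append(invert(s_observed, seeds))
    return replicas

# Usage sketch: exponential variable, statistic = sum of the sample
sample = [0.4, 1.9, 0.7, 1.2, 0.3]
s_sum = sum(sample)
rate_replicas = population_bootstrap(
    s_sum,
    len(sample),
    lambda s, u: sum(-math.log(v) for v in u) / s,
    1000,
    random.Random(0),
)
```

Passing the solver as a function keeps the loop independent of the distribution family; only `invert` changes from one model to another.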

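From a table of sufficient statistics it is easy to see that we obtain the curve in the picture on the left by computing the empirical distribution (2) on the population obtained through the above algorithm when: i) ''X'' is an exponential random variable, ii) <math>s_\Lambda = \sum_{j=1}^m x_j</math>, and
:<math>\text{solution}(s_\Lambda, u_1, \ldots, u_m) = \sum_{j=1}^m (-\log u_j) / s_\Lambda,</math>
and the curve in the picture on the right when: i) ''X'' is a uniform random variable in <math>[0, a]</math>, ii) <math>s_A = \max_{j=1,\ldots,m} x_j</math>, and
:<math>\text{solution}(s_A, u_1, \ldots, u_m) = s_A / \max_{j=1,\ldots,m} \{u_j\}.</math>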
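For these two cases the master equations invert in closed form. The sketch below assumes the usual sufficient statistics (the sample sum for the exponential case, the sample maximum for the uniform case) and uniform seeds; the function names are this sketch's own. With an exponential variable of rate λ, x_j = -log(u_j)/λ, so the sum s gives λ = Σ(-log u_j)/s. With a uniform variable on [0, a], x_j = a·u_j, so the maximum s gives a = s / max u_j.

```python
import math

def exponential_rate_replica(s_sum, seeds):
    # x_j = -log(u_j) / rate  =>  rate = sum(-log u_j) / s_sum
    return sum(-math.log(u) for u in seeds) / s_sum

def uniform_bound_replica(s_max, seeds):
    # x_j = a * u_j  =>  a = s_max / max(u_j)
    return s_max / max(seeds)

# Worked arithmetic on a fixed, hypothetical seed sample
seeds = [0.5, 0.25]
rate = exponential_rate_replica(2.0, seeds)   # (log 2 + log 4) / 2 = log(8) / 2
bound = uniform_bound_replica(3.0, seeds)     # 3.0 / 0.5
print(rate, bound)
```

Each new seed sample yields one more replica of the parameter, while the observed statistic stays fixed.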
Remark
Note that the accuracy with which the parameter distribution law of populations compatible with a sample is obtained is not a function of the sample size. Instead, it is a function of the number of seeds we draw. This number is purely a matter of computational time and does not require any extension of the observed data. With other bootstrapping methods that focus on generating sample replicas, the accuracy of the estimated distributions depends on the sample size.
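This can be checked numerically in a case where the replica distribution is known exactly. The sketch below assumes an exponential variable with the sample sum as statistic and uniform seeds. The sum of ''m'' independent values -log(u) then follows a Gamma distribution with shape ''m'' and unit rate, so the replicas follow a Gamma distribution with shape ''m'' and rate equal to the observed sum. The function names and the values of the statistic and sample size are hypothetical.

```python
import math
import random

def gamma_cdf_integer_shape(t, m, rate):
    """CDF of a Gamma(m, rate) variable for integer m (Erlang distribution)."""
    if t <= 0:
        return 0.0
    x = rate * t
    term, total = 1.0, 1.0
    for i in range(1, m):
        term *= x / i
        total += term
    return 1.0 - math.exp(-x) * total

def sup_distance(n_replicas, s_obs, m, rng):
    """Largest gap between the empirical CDF of N replicas and the exact CDF."""
    replicas = sorted(
        sum(-math.log(1.0 - rng.random()) for _ in range(m)) / s_obs
        for _ in range(n_replicas)
    )
    worst = 0.0
    for i, r in enumerate(replicas):
        exact = gamma_cdf_integer_shape(r, m, s_obs)
        # Compare the exact CDF with the step function just before and at r
        worst = max(worst,
                    abs((i + 1) / n_replicas - exact),
                    abs(i / n_replicas - exact))
    return worst

# Same observed statistic and sample size, different numbers of seed samples
d_small = sup_distance(100, 5.0, 10, random.Random(1))
d_large = sup_distance(20000, 5.0, 10, random.Random(1))
print(d_small, d_large)
```

The sample size and the observed statistic are identical in both runs; only the number of seed samples changes, and the gap to the exact distribution is expected to shrink as that number grows.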
Example
For <math>\boldsymbol x</math> expected to represent a Pareto distribution, whose specification requires values for the parameters ''a'' and ''k'',[We denote here with symbols ''a'' and ''k'' the Pareto parameters elsewhere indicated through ''k'' and <math>x_{\min}</math>.] the cumulative distribution function reads:
:<math>F_X(x) = 1 - \left(\frac{k}{x}\right)^a.</math>
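Assuming the standard Pareto form F(x) = 1 - (k/x)^a for x ≥ k, with shape ''a'' and scale ''k'', setting F(x) = u and solving for x gives x = k(1 - u)^(-1/a) for a uniform seed u. The Python sketch below illustrates this inversion; the function names are hypothetical.

```python
def pareto_explaining_function(u, a, k):
    """Map a uniform seed u in [0, 1) to a Pareto value x = k * (1 - u) ** (-1 / a)."""
    return k * (1.0 - u) ** (-1.0 / a)

def pareto_cdf(x, a, k):
    """Assumed Pareto CDF: 1 - (k / x) ** a for x >= k, and 0 below k."""
    return 1.0 - (k / x) ** a if x >= k else 0.0

# Round trip on one seed: the CDF of the generated value recovers the seed
x = pareto_explaining_function(0.75, a=2.0, k=3.0)
print(x, pareto_cdf(x, a=2.0, k=3.0))
```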
A
sampling mechanism has