Bootstrapping is any test or metric that uses random sampling with replacement (e.g. mimicking the sampling process), and falls under the broader class of resampling methods. Bootstrapping assigns measures of accuracy (bias, variance, confidence intervals, prediction error, etc.) to sample estimates.
This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods.
Bootstrapping estimates the properties of an estimand (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution function of the observed data. In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples with replacement of the observed data set (each of equal size to the observed data set).
It may also be used for constructing hypothesis tests. It is often used as an alternative to statistical inference based on the assumption of a parametric model when that assumption is in doubt, or where parametric inference is impossible or requires complicated formulas for the calculation of standard errors.
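As a concrete illustration of the resampling scheme just described, the following is a minimal sketch in Python using NumPy; the data array and the choice of statistic are hypothetical placeholders, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_estimates(data, statistic, n_resamples=10_000):
    """Draw resamples with replacement (same size as the data) and
    return the statistic evaluated on each resample."""
    data = np.asarray(data)
    n = len(data)
    estimates = np.empty(n_resamples)
    for b in range(n_resamples):
        resample = rng.choice(data, size=n, replace=True)
        estimates[b] = statistic(resample)
    return estimates

# Hypothetical data; any i.i.d. sample of a numeric quantity would do.
data = rng.normal(loc=10.0, scale=2.0, size=50)
boot = bootstrap_estimates(data, np.median)
print("bootstrap estimate of the standard error of the median:", boot.std(ddof=1))
```

The spread of the returned bootstrap estimates serves as an estimate of the standard error of the chosen statistic.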
History
The bootstrap was published by Bradley Efron in "Bootstrap methods: another look at the jackknife" (1979), inspired by earlier work on the jackknife.[Quenouille M (1949) Approximate tests of correlation in time-series. J Roy Statist Soc Ser B 11 68–84][Tukey J (1958) Bias and confidence in not-quite large samples (abstract). Ann Math Statist 29 614][Jaeckel L (1972) The infinitesimal jackknife. Memorandum MM72-1215-11, Bell Lab] Improved estimates of the variance were developed later.[Bickel P, Freedman D (1981) Some asymptotic theory for the bootstrap. Ann Statist 9 1196–1217][Singh K (1981) On the asymptotic accuracy of Efron's bootstrap. Ann Statist 9 1187–1195] A Bayesian extension was developed in 1981.[Rubin D (1981) The Bayesian bootstrap. Ann Statist 9 130–134] The bias-corrected and accelerated (BCa) bootstrap was developed by Efron in 1987, and the ABC procedure in 1992.[DiCiccio T, Efron B (1992) More accurate confidence intervals in exponential families. Biometrika 79 231–245]
Approach
The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by ''resampling'' the sample data and performing inference about a sample from resampled data (resampled → sample). As the population is unknown, the true error in a sample statistic against its population value is unknown. In bootstrap-resamples, the 'population' is in fact the sample, and this is known; hence the quality of inference of the 'true' sample from resampled data (resampled → sample) is measurable.
More formally, the bootstrap works by treating inference of the true probability distribution ''J'', given the original data, as being analogous to an inference of the empirical distribution ''Ĵ'', given the resampled data. The accuracy of inferences regarding ''Ĵ'' using the resampled data can be assessed because we know ''Ĵ''. If ''Ĵ'' is a reasonable approximation to ''J'', then the quality of inference on ''J'' can in turn be inferred.
As an example, assume we are interested in the average (or mean) height of people worldwide. We cannot measure all the people in the global population, so instead, we sample only a tiny part of it, and measure that. Assume the sample is of size ''N''; that is, we measure the heights of ''N'' individuals. From that single sample, only one estimate of the mean can be obtained. In order to reason about the population, we need some sense of the variability of the mean that we have computed. The simplest bootstrap method involves taking the original data set of heights, and, using a computer, sampling from it to form a new sample (called a 'resample' or bootstrap sample) that is also of size ''N''. The bootstrap sample is taken from the original by using sampling with replacement (e.g. we might 'resample' 5 times from [1,2,3,4,5] and get [2,5,4,4,1]), so, assuming ''N'' is sufficiently large, for all practical purposes there is virtually zero probability that it will be identical to the original "real" sample. This process is repeated a large number of times (typically 1,000 or 10,000 times), and for each of these bootstrap samples, we compute its mean (each of these is called a "bootstrap estimate"). We now can create a histogram of bootstrap means. This histogram provides an estimate of the shape of the distribution of the sample mean, from which we can answer questions about how much the mean varies across samples. (The method here, described for the mean, can be applied to almost any other statistic or estimator.)
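A sketch of the procedure just described, assuming a NumPy array holding the ''N'' measured heights; the heights here are simulated placeholders rather than real measurements.

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder for the N observed heights (in cm); in practice this is the real sample.
heights = rng.normal(loc=170.0, scale=10.0, size=200)
N = len(heights)

n_resamples = 10_000
boot_means = np.empty(n_resamples)
for b in range(n_resamples):
    resample = rng.choice(heights, size=N, replace=True)  # sample with replacement
    boot_means[b] = resample.mean()                       # one "bootstrap estimate"

# The histogram of boot_means approximates the sampling distribution of the mean.
counts, bin_edges = np.histogram(boot_means, bins=30)
print("spread of the bootstrap means (estimated standard error):", boot_means.std(ddof=1))
```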
Discussion
Advantages
A great advantage of bootstrap is its simplicity. It is a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of the distribution, such as percentile points, proportions, odds ratios, and correlation coefficients. Moreover, despite its simplicity, bootstrapping can be applied to complex sampling designs (e.g. for a population divided into ''s'' strata with ''n''s observations per stratum, bootstrapping can be applied within each stratum). Bootstrap is also an appropriate way to control and check the stability of the results. Although for most problems it is impossible to know the true confidence interval, bootstrap is asymptotically more accurate than the standard intervals obtained using sample variance and assumptions of normality.[DiCiccio TJ, Efron B (1996) Bootstrap confidence intervals (with Discussion). Statistical Science 11: 189–228] Bootstrapping is also a convenient method that avoids the cost of repeating the experiment to get other groups of sample data.
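For instance, a bootstrap percentile interval for a correlation coefficient (one of the "complex estimators" mentioned above) can be sketched in a few lines; the paired data below are simulated placeholders, and the percentile method is only one of several ways to form the interval.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical paired observations (x_i, y_i).
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.8, size=100)
n = len(x)

n_resamples = 10_000
boot_corr = np.empty(n_resamples)
for b in range(n_resamples):
    idx = rng.integers(0, n, size=n)           # resample pairs with replacement
    boot_corr[b] = np.corrcoef(x[idx], y[idx])[0, 1]

# 95% bootstrap percentile confidence interval for the correlation coefficient.
lo, hi = np.percentile(boot_corr, [2.5, 97.5])
print(f"95% percentile interval for the correlation: [{lo:.3f}, {hi:.3f}]")
```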
Disadvantages
Bootstrapping depends heavily on the estimator used and, though simple, ignorant use of bootstrapping will not always yield asymptotically valid results and can lead to inconsistency. Although bootstrapping is (under some conditions) asymptotically consistent, it does not provide general finite-sample guarantees. Furthermore, the result may depend on how representative the sample is. The apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis (e.g. independence of samples, or a large enough sample size) where these would be more formally stated in other approaches. Also, bootstrapping can be time-consuming, and there is not much software readily available for bootstrapping, as it can be difficult to automate using traditional statistical computer packages.
Recommendations
Scholars have recommended more bootstrap samples as available computing power has increased. If the results may have substantial real-world consequences, then one should use as many samples as is reasonable, given available computing power and time. Increasing the number of samples cannot increase the amount of information in the original data; it can only reduce the effects of random sampling errors which can arise from a bootstrap procedure itself. Moreover, there is evidence that numbers of samples greater than 100 lead to negligible improvements in the estimation of standard errors. In fact, according to the original developer of the bootstrapping method, even setting the number of samples at 50 is likely to lead to fairly good standard error estimates.
Adèr et al. recommend the bootstrap procedure for the following situations:
:*When the theoretical distribution of a statistic of interest is complicated or unknown. Since the bootstrapping procedure is distribution-independent, it provides an indirect method to assess the properties of the distribution underlying the sample and the parameters of interest that are derived from this distribution.
:*When the sample size is insufficient for straightforward statistical inference. If the underlying distribution is well-known, bootstrapping provides a way to account for the distortions caused by the specific sample that may not be fully representative of the population.
:*When power calculations have to be performed, and a small pilot sample is available. Most power and sample size calculations are heavily dependent on the standard deviation of the statistic of interest. If the estimate used is incorrect, the required sample size will also be wrong. One method to get an impression of the variation of the statistic is to use a small pilot sample and perform bootstrapping on it to get an impression of the variance (a sketch of this idea follows the list).
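As an illustration of the last point, one might bootstrap a small pilot sample to estimate the standard deviation of the statistic of interest before plugging it into a conventional sample-size formula. The pilot data, the effect size delta, and the z-quantiles below are hypothetical choices made only for the sketch.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical small pilot sample.
pilot = rng.normal(loc=5.0, scale=2.5, size=15)

# Bootstrap the pilot sample to gauge the variability of the sample mean.
n_resamples = 10_000
boot_means = np.array([rng.choice(pilot, size=len(pilot), replace=True).mean()
                       for _ in range(n_resamples)])
sd_estimate = boot_means.std(ddof=1) * np.sqrt(len(pilot))  # rough per-observation SD

# Plug the estimated SD into a standard sample-size formula
# n = ((z_{1-alpha/2} + z_{1-beta}) * sd / delta)^2, with hypothetical effect size delta.
z_alpha, z_beta, delta = 1.96, 0.84, 1.0
n_required = ((z_alpha + z_beta) * sd_estimate / delta) ** 2
print("rough required sample size:", int(np.ceil(n_required)))
```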
However, Athreya has shown that if one performs a naive bootstrap on the sample mean when the underlying population lacks a finite variance (for example, a power law distribution), then the bootstrap distribution will not converge to the same limit as the sample mean. As a result, confidence intervals on the basis of a Monte Carlo simulation of the bootstrap could be misleading. Athreya states that "Unless one is reasonably sure that the underlying distribution is not heavy tailed, one should hesitate to use the naive bootstrap".
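The following small experiment is one way to see the kind of instability Athreya warns about: with a heavy-tailed (infinite-variance) population, bootstrap estimates computed from different original samples tend to disagree badly. The Pareto tail index of 1.5 and the sample size are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

def boot_se_of_mean(sample, n_resamples=5_000):
    """Bootstrap estimate of the standard error of the sample mean."""
    n = len(sample)
    means = np.array([rng.choice(sample, size=n, replace=True).mean()
                      for _ in range(n_resamples)])
    return means.std(ddof=1)

# Pareto-type population with tail index 1.5: finite mean, infinite variance.
for run in range(3):
    sample = rng.pareto(1.5, size=500) + 1.0
    print(f"sample {run}: bootstrap SE of the mean = {boot_se_of_mean(sample):.3f}")
# The three estimates will often differ substantially, unlike in the finite-variance case.
```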
Types of bootstrap scheme
In univariate problems, it is usually acceptable to resample the individual observations with replacement ("case resampling" below), unlike subsampling, in which resampling is without replacement and which is valid under much weaker conditions than the bootstrap. In small samples, a parametric bootstrap approach might be preferred. For other problems, a ''smooth bootstrap'' will likely be preferred.
For regression problems, various other alternatives are available.
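For illustration, two of these variants can be sketched as follows; the normal model in the parametric branch and the rule-of-thumb kernel bandwidth in the smooth branch are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(loc=0.0, scale=1.0, size=30)   # hypothetical small sample
n = len(data)

# Parametric bootstrap: fit a parametric model (here, a normal distribution)
# and resample from the fitted model rather than from the data themselves.
mu_hat, sigma_hat = data.mean(), data.std(ddof=1)
parametric_resample = rng.normal(mu_hat, sigma_hat, size=n)

# Smooth bootstrap: resample the data with replacement and add a small amount of
# random noise (a kernel with bandwidth h), smoothing the discrete empirical distribution.
h = 1.06 * sigma_hat * n ** (-1 / 5)             # a common rule-of-thumb bandwidth
smooth_resample = rng.choice(data, size=n, replace=True) + rng.normal(0.0, h, size=n)

print(parametric_resample.mean(), smooth_resample.mean())
```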
Case resampling
The bootstrap is generally useful for estimating the distribution of a statistic (e.g. mean, variance) without using normality assumptions (as required, e.g., for a z-statistic or a t-statistic). In particular, the bootstrap is useful when there is no analytical form or an asymptotic theory (e.g., an applicable central limit theorem) to help estimate the distribution of the statistics of interest. This is because bootstrap methods can apply to most random quantities, e.g., the ratio of variance and mean. There are at least two ways of performing case resampling.
# The Monte Carlo algorithm for case resampling is quite simple. First, we resample the data with replacement, and the size of the resample must be equal to the size of the original data set. Then the statistic of interest is computed from the resample from the first step. We repeat this routine many times to get a more precise estimate of the bootstrap distribution of the statistic.
# The 'exact' version for case resampling is similar, but we exhaustively enumerate every possible resample of the data set. This can be computationally expensive, as there are a total of C(2''n'' − 1, ''n'') different resamples (the number of multisets of size ''n'' drawn from ''n'' items), where ''n'' is the size of the data set. Thus for ''n'' = 5, 10, 20, 30 there are 126, 92378, 6.89 × 10¹⁰ and 5.91 × 10¹⁶ different resamples respectively. (A sketch of the exact enumeration follows this list.)
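A sketch of the exact version for a tiny, hypothetical data set, enumerating every multiset-resample and weighting each by its probability under sampling with replacement; this is feasible only for very small ''n''.

```python
from itertools import combinations_with_replacement
from math import comb, factorial
from collections import Counter
import numpy as np

data = [1.2, 3.4, 2.2, 5.0, 4.1]      # hypothetical tiny data set
n = len(data)
print("number of distinct resamples:", comb(2 * n - 1, n))   # 126 for n = 5

exact_means, probs = [], []
for resample in combinations_with_replacement(range(n), n):
    counts = Counter(resample)
    # Probability of this multiset under i.i.d. sampling with replacement:
    # multinomial coefficient divided by n^n.
    weight = factorial(n)
    for c in counts.values():
        weight //= factorial(c)
    probs.append(weight / n ** n)
    exact_means.append(np.mean([data[i] for i in resample]))

# Exact bootstrap mean and standard error of the sample mean.
exact_means, probs = np.array(exact_means), np.array(probs)
mu = np.sum(probs * exact_means)
se = np.sqrt(np.sum(probs * (exact_means - mu) ** 2))
print("exact bootstrap SE of the mean:", se)
```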
Estimating the distribution of the sample mean
Consider a coin-flipping experiment. We flip the coin and record whether it lands heads or tails. Let ''X'' = ''x''1, ''x''2, …, ''x''10 be 10 observations from the experiment, with ''x''i = 1 if the ''i''-th flip lands heads, and 0 otherwise. By invoking the assumption that the average of the coin flips is normally distributed, we can use the t-statistic to estimate the distribution of the sample mean,
: ''x̄'' = (1/10)(''x''1 + ''x''2 + … + ''x''10).
Such a normality assumption can be justified either as an approximation of the distribution of each ''individual'' coin flip or as an approximation of the distribution of the ''average'' of a large number of coin flips. The former is a poor approximation because the true distribution of the coin flips is Bernoulli instead of normal. The latter is a valid approximation in ''infinitely large'' samples due to the central limit theorem.
However, if we are not ready to make such a justification, then we can use the bootstrap instead. Using case resampling, we can derive the distribution of ''x̄''. We first resample the data to obtain a ''bootstrap resample'' ''X''1*. Since a bootstrap resample comes from sampling with replacement from the data, it will typically contain some of the original observations more than once and omit others, while the number of data points in a bootstrap resample is equal to the number of data points in our original observations. Then we compute the mean of this resample and obtain the first ''bootstrap mean'': ''μ''1*. We repeat this process to obtain the second resample ''X''2* and compute the second bootstrap mean ''μ''2*. If we repeat this 100 times, then we have ''μ''1*, ''μ''2*, ..., ''μ''100*. This represents an ''empirical bootstrap distribution'' of the sample mean. From this empirical distribution, one can derive a ''bootstrap confidence interval'' for the purpose of hypothesis testing.
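A sketch of this comparison for ten coin flips, contrasting the t-based interval with a bootstrap percentile interval; the particular 0/1 values and the hard-coded t quantile are placeholders chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(11)

flips = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])   # hypothetical observations
n = len(flips)

# t-based interval, relying on the normality assumption discussed above.
xbar, s = flips.mean(), flips.std(ddof=1)
t_975 = 2.262                                       # t quantile for 9 degrees of freedom
t_interval = (xbar - t_975 * s / np.sqrt(n), xbar + t_975 * s / np.sqrt(n))

# Bootstrap percentile interval via case resampling.
boot_means = np.array([rng.choice(flips, size=n, replace=True).mean()
                       for _ in range(10_000)])
boot_interval = tuple(np.percentile(boot_means, [2.5, 97.5]))

print("t interval:        ", t_interval)
print("bootstrap interval:", boot_interval)
```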
Regression
In regression problems, ''case resampling'' refers to the simple scheme of resampling individual cases – often rows of a data set. For regression problems, as long as the data set is fairly large, this simple scheme is often acceptable. However, the method is open to criticism.
In regression problems, the explanatory variables are often fixed, or at least observed with more control than the response variable. Also, the range of the explanatory variables defines the information available from them. Therefore, to resample cases means that each bootstrap sample will lose some information. As such, alternative bootstrap procedures should be considered.
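A sketch of case resampling for a simple linear regression, bootstrapping whole (x, y) rows to gauge the variability of the fitted slope; the data are simulated placeholders and the least-squares fit is only one possible choice of estimator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical regression data: y = 2 + 3x + noise.
x = rng.uniform(0, 10, size=80)
y = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=80)
n = len(x)

n_resamples = 5_000
boot_slopes = np.empty(n_resamples)
for b in range(n_resamples):
    idx = rng.integers(0, n, size=n)          # resample whole cases (rows)
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    boot_slopes[b] = slope

print("bootstrap SE of the slope:", boot_slopes.std(ddof=1))
```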
Bayesian bootstrap
Bootstrapping can be interpreted in a Bayesian framework using a scheme that creates new data sets through reweighting the initial data. Given a set of ''N'' data points, the weight assigned to data point ''i'' in a new data set is ''w''i = ''u''(i) − ''u''(i−1), where ''u''(1), …, ''u''(N−1) is a low-to-high ordered list of ''N'' − 1 uniformly distributed random numbers on [0, 1], preceded by ''u''(0) = 0 and followed by ''u''(N) = 1.
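A sketch of this weighting scheme applied to the mean; the data are placeholders, and the gaps between ordered uniforms used here are equivalent to drawing the weight vector from a flat Dirichlet distribution.

```python
import numpy as np

rng = np.random.default_rng(9)

data = rng.normal(loc=0.0, scale=1.0, size=25)   # hypothetical data points
N = len(data)

n_resamples = 10_000
boot_means = np.empty(n_resamples)
for b in range(n_resamples):
    # Gaps between ordered uniforms on [0, 1] (with 0 and 1 appended) give the weights;
    # this is the same as drawing the weight vector from Dirichlet(1, ..., 1).
    u = np.sort(rng.uniform(size=N - 1))
    weights = np.diff(np.concatenate(([0.0], u, [1.0])))
    boot_means[b] = np.sum(weights * data)       # weighted mean for this replicate

print("Bayesian bootstrap SE of the mean:", boot_means.std(ddof=1))
```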