In statistics, cumulative distribution function (CDF)-based nonparametric confidence intervals are a general class of confidence intervals around statistical functionals of a distribution. To calculate these confidence intervals, all that is required is an independent and identically distributed (i.i.d.) sample from the distribution and known bounds on the support of the distribution. The latter requirement simply means that all the nonzero probability mass of the distribution must be contained in some known interval [a,b].


Intuition

The intuition behind the CDF-based approach is that bounds on the CDF of a distribution can be translated into bounds on statistical functionals of that distribution. Given an upper and lower bound on the CDF, the approach involves finding the CDFs within the bounds that maximize and minimize the statistical functional of interest.


Properties of the bounds

Unlike approaches that make asymptotic assumptions, including bootstrap approaches and those that rely on the central limit theorem, CDF-based bounds are valid for finite sample sizes. And unlike bounds based on inequalities such as Hoeffding's and McDiarmid's inequalities, CDF-based bounds use properties of the entire sample and thus often produce significantly tighter bounds.


CDF bounds

When producing bounds on the CDF, we must differentiate between pointwise and simultaneous bands.


Pointwise band

A pointwise CDF bound is one that guarantees a coverage probability of 1-\alpha only at each individual point of the CDF, not at all points simultaneously. Because of this relaxed guarantee, these intervals can be much narrower. One method of generating them is based on the binomial distribution. Consider a single point x_i at which the CDF takes the value F(x_i). The number of sample points less than or equal to x_i follows a binomial distribution with p=F(x_i) and n equal to the sample size, and the empirical CDF at x_i is this count divided by n. Thus, any of the methods available for generating a binomial proportion confidence interval can be used to generate a CDF bound as well.
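
As a rough illustration of the binomial construction, the sketch below computes a Clopper-Pearson interval for F(t) at a single point t using only the Python standard library. The function names (`binom_cdf`, `pointwise_cdf_interval`) and the choice of Clopper-Pearson are assumptions made for this example; any other binomial proportion interval could be substituted.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); intended for modest n only."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_decreasing(f, lo=0.0, hi=1.0, iters=60):
    """Approximate root of a decreasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def pointwise_cdf_interval(sample, t, alpha=0.05):
    """Clopper-Pearson interval for F(t), valid at the single point t only."""
    n = len(sample)
    k = sum(1 for x in sample if x <= t)  # binomial count of observations <= t
    # Lower end: the p at which P(X >= k | p) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    # Upper end: the p at which P(X <= k | p) equals alpha / 2.
    upper = 1.0 if k == n else _bisect_decreasing(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


if __name__ == "__main__":
    data = [0.12, 0.35, 0.41, 0.58, 0.77, 0.81, 0.93]  # made-up sample
    print(pointwise_cdf_interval(data, 0.5))
```

Repeating this at every t yields a pointwise band; it does not guarantee that the whole CDF lies inside the band with probability 1-\alpha.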


Simultaneous band

CDF-based confidence intervals require a probabilistic bound on the CDF of the distribution from which the sample was generated. A variety of methods exist for generating confidence intervals for the CDF of a distribution, F, given an i.i.d. sample drawn from the distribution. These methods are all based on the empirical distribution function (empirical CDF). Given an i.i.d. sample of size ''n'', x_1,\ldots,x_n\sim F, the empirical CDF is defined to be
: \hat{F}_n(t) = \frac{1}{n}\sum_{i=1}^n 1\{x_i\le t\},
where 1\{A\} is the indicator of event A.

The Dvoretzky–Kiefer–Wolfowitz inequality, whose tight constant was determined by Massart, places a confidence interval around the Kolmogorov–Smirnov statistic between the CDF and the empirical CDF. Given an i.i.d. sample of size ''n'' from F, the bound states
: P(\sup_x|F(x)-\hat{F}_n(x)|>\varepsilon)\le 2e^{-2n\varepsilon^2}.
This can be viewed as a confidence envelope that runs parallel to, and is equally far above and below, the empirical CDF. The equally spaced confidence interval around the empirical CDF allows for different rates of violation across the support of the distribution. In particular, it is more common for a CDF to be outside of the CDF bound estimated using the Dvoretzky–Kiefer–Wolfowitz inequality near the median of the distribution than near the endpoints of the distribution.

In contrast, the order statistics-based bound introduced by Learned-Miller and DeStefano allows for an equal rate of violation across all of the order statistics. This in turn results in a bound that is tighter near the ends of the support of the distribution and looser in the middle of the support. Other types of bounds can be generated by varying the rate of violation for the order statistics. For example, if a tighter bound on the distribution is desired on the upper portion of the support, a higher rate of violation can be allowed at the upper portion of the support at the expense of having a lower rate of violation, and thus a looser bound, for the lower portion of the support.
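
The Dvoretzky–Kiefer–Wolfowitz envelope can be sketched in a few lines. Setting the right-hand side of the inequality equal to \alpha and solving gives \varepsilon = \sqrt{\ln(2/\alpha)/(2n)}; the band is then the empirical CDF shifted by \pm\varepsilon and clipped to [0,1]. The function name `dkw_band` and its return layout are assumptions chosen for this illustration.

```python
import math


def dkw_band(sample, alpha=0.05):
    """Simultaneous 1 - alpha band for the CDF from the DKW inequality.

    Returns (xs, lower, upper, eps), where lower[i] and upper[i] are the
    envelope values on the interval [xs[i], xs[i + 1]) of the sorted sample.
    Below xs[0] the empirical CDF is 0, so the envelope there is [0, eps].
    """
    xs = sorted(sample)
    n = len(xs)
    # eps solves 2 * exp(-2 * n * eps**2) = alpha.
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    ecdf = [(i + 1) / n for i in range(n)]  # empirical CDF from xs[i] onwards
    lower = [max(0.0, f - eps) for f in ecdf]
    upper = [min(1.0, f + eps) for f in ecdf]
    return xs, lower, upper, eps


if __name__ == "__main__":
    data = [0.12, 0.35, 0.41, 0.58, 0.77, 0.81, 0.93]  # made-up sample
    xs, lower, upper, eps = dkw_band(data)
    for x, lo, hi in zip(xs, lower, upper):
        print(f"x >= {x:.2f}: [{lo:.3f}, {hi:.3f}]")
```

Because \varepsilon depends only on n and \alpha, the band has the same half-width everywhere, which is the equal spacing discussed above.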


A nonparametric bound on the mean

Assume without loss of generality that the support of the distribution is contained in [0,1]. Given a confidence envelope for the CDF of F, it is easy to derive a corresponding confidence interval for the mean of F. It can be shown that the CDF that maximizes the mean is the one that runs along the lower confidence envelope, L(x), and the CDF that minimizes the mean is the one that runs along the upper envelope, U(x). Using the identity
: E(X) = \int_0^1(1-F(x))\,dx,
the confidence interval for the mean can be computed as
: \left[\int_0^1(1-U(x))\,dx, \int_0^1(1-L(x))\,dx \right].
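
The two integrals can be evaluated exactly when the envelope is a step function. The sketch below assumes the envelope comes from the Dvoretzky–Kiefer–Wolfowitz inequality, which is only one possible choice, and that the sample lies in [0,1]; the function name `dkw_mean_interval` is hypothetical.

```python
import math


def dkw_mean_interval(sample, alpha=0.05):
    """1 - alpha confidence interval for the mean of a distribution on [0, 1].

    Integrates 1 - U(x) and 1 - L(x) over [0, 1], where U and L are the
    empirical CDF shifted by +eps and -eps and clipped to [0, 1].
    """
    xs = sorted(sample)
    n = len(xs)
    if xs[0] < 0.0 or xs[-1] > 1.0:
        raise ValueError("sample must lie in [0, 1]")
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    grid = [0.0] + xs + [1.0]
    lo = hi = 0.0
    for i in range(n + 1):
        width = grid[i + 1] - grid[i]
        f = i / n  # empirical CDF on [grid[i], grid[i + 1])
        lo += width * (1.0 - min(1.0, f + eps))  # integral of 1 - U
        hi += width * (1.0 - max(0.0, f - eps))  # integral of 1 - L
    return lo, hi


if __name__ == "__main__":
    data = [0.12, 0.35, 0.41, 0.58, 0.77, 0.81, 0.93]  # made-up sample
    print(dkw_mean_interval(data))
```

With \varepsilon set to zero both sums reduce to the sample mean, which is a quick sanity check on the identity.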


A nonparametric bound on the variance

Assume without loss of generality that the support of the distribution of interest, F, is contained in [0,1]. Given a confidence envelope for F, it can be shown that the CDF within the envelope that minimizes the variance begins on the lower envelope, has a jump discontinuity to the upper envelope, and then continues along the upper envelope. Further, it can be shown that this variance-minimizing CDF, F', must satisfy the constraint that the jump discontinuity occurs at E[F']. The variance-maximizing CDF begins on the upper envelope, horizontally transitions to the lower envelope, then continues along the lower envelope. Explicit algorithms for calculating these variance-maximizing and variance-minimizing CDFs are given by Romano and Wolf.


Bounds on other statistical functionals

The CDF-based framework for generating confidence intervals is very general and can be applied to a variety of other statistical functionals, including
* Entropy
* Mutual information
* Arbitrary percentiles
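
For the percentile case, one way to obtain an interval is to invert the envelope: if L(x) \le F(x) \le U(x) for all x, then the q-quantile \inf\{x : F(x)\ge q\} lies between the first point where U reaches q and the first point where L reaches q. The sketch below assumes a Dvoretzky–Kiefer–Wolfowitz envelope and known support bounds a and b; the function name `dkw_quantile_interval` is hypothetical.

```python
import math


def dkw_quantile_interval(sample, q, a, b, alpha=0.05):
    """1 - alpha interval for the q-quantile of a distribution on [a, b]."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    xs = sorted(sample)
    n = len(xs)
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    # Lower end: first x where the upper envelope min(1, F_n(x) + eps) reaches q.
    # If eps >= q the upper envelope already reaches q at the support bound a.
    lower = a
    if eps < q:
        lower = next(x for i, x in enumerate(xs, start=1) if i / n + eps >= q)
    # Upper end: first x where the lower envelope max(0, F_n(x) - eps) reaches q,
    # falling back to b, where every CDF on [a, b] equals 1.
    upper = next((x for i, x in enumerate(xs, start=1) if i / n - eps >= q), b)
    return lower, upper


if __name__ == "__main__":
    data = [i / 200 for i in range(1, 201)]  # made-up sample on (0, 1]
    print(dkw_quantile_interval(data, 0.5, a=0.0, b=1.0))
```

For small samples \varepsilon can exceed q or 1-q, in which case the interval falls back to a support bound and is uninformative on that side.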


See also

* Bootstrapping (statistics)
* Non-parametric statistics
* Confidence interval

