Nonparametric statistics is a type of statistical analysis that makes minimal assumptions about the underlying distribution of the data being studied. Often these models are infinite-dimensional, rather than finite-dimensional as in parametric statistics. Nonparametric statistics can be used for descriptive statistics or statistical inference. Nonparametric tests are often used when the assumptions of parametric tests are evidently violated.

Definitions

The term "nonparametric statistics" has been defined imprecisely in the following two ways, among others.

The first meaning of ''nonparametric'' involves techniques that do not rely on data belonging to any particular parametric family of probability distributions. These include, among others:

* Methods which are ''distribution-free'', which do not rely on assumptions that the data are drawn from a given parametric family of probability distributions.
* Statistics defined to be a function on a sample, without dependency on a parameter. An example is order statistics, which are based on ordinal ranking of observations.

The discussion following is taken from ''Kendall's Advanced Theory of Statistics''.

Statistical hypotheses concern the behavior of observable random variables.... For example, the hypothesis (a) that a normal distribution has a specified mean and variance is statistical; so is the hypothesis (b) that it has a given mean but unspecified variance; so is the hypothesis (c) that a distribution is of normal form with both mean and variance unspecified; finally, so is the hypothesis (d) that two unspecified continuous distributions are identical. It will have been noticed that in the examples (a) and (b) the distribution underlying the observations was taken to be of a certain form (the normal) and the hypothesis was concerned entirely with the value of one or both of its parameters. Such a hypothesis, for obvious reasons, is called ''parametric''. Hypothesis (c) was of a different nature, as no parameter values are specified in the statement of the hypothesis; we might reasonably call such a hypothesis ''non-parametric''. Hypothesis (d) is also non-parametric but, in addition, it does not even specify the underlying form of the distribution and may now be reasonably termed ''distribution-free''. Notwithstanding these distinctions, the statistical literature now commonly applies the label "non-parametric" to test procedures that we have just termed "distribution-free", thereby losing a useful classification.
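One way to write the four hypotheses from the quoted passage in symbols is sketched below. The notation is introduced here for illustration and is not from the source: $X$ is the observed variable, a subscript 0 marks a value fixed by the hypothesis, and $F$, $G$ are the two distribution functions in (d).

```latex
\begin{align*}
\text{(a)}\quad & X \sim \mathcal{N}(\mu_0, \sigma_0^2), && \mu_0,\ \sigma_0^2 \text{ both specified} \\
\text{(b)}\quad & X \sim \mathcal{N}(\mu_0, \sigma^2), && \mu_0 \text{ specified},\ \sigma^2 \text{ unspecified} \\
\text{(c)}\quad & X \sim \mathcal{N}(\mu, \sigma^2), && \mu,\ \sigma^2 \text{ both unspecified} \\
\text{(d)}\quad & F = G, && F,\ G \text{ unspecified continuous distributions}
\end{align*}
```

Written this way, (a) and (b) constrain parameter values, (c) constrains only the form, and (d) constrains neither.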

The second meaning of ''non-parametric'' involves techniques that do not assume that the ''structure'' of a model is fixed. Typically, the model grows in size to accommodate the complexity of the data. In these techniques, individual variables ''are'' typically assumed to belong to parametric distributions, and assumptions about the types of associations among variables are also made. These techniques include, among others:

* ''Non-parametric regression'', which is modeling whereby the structure of the relationship between variables is treated non-parametrically, but where nevertheless there may be parametric assumptions about the distribution of model residuals.
* ''Non-parametric hierarchical Bayesian models'', such as models based on the Dirichlet process, which allow the number of latent variables to grow as necessary to fit the data, but where individual variables still follow parametric distributions and even the process controlling the rate of growth of latent variables follows a parametric distribution.
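The idea of a model whose size grows with the data can be sketched with the Chinese restaurant process, a standard constructive view of the clustering induced by a Dirichlet process. This is a minimal illustration, not a full Bayesian model: the function name, the `seed` argument, and the parameter name `alpha` (the concentration parameter that governs how quickly new clusters appear) are choices made for this example.

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample table sizes from a Chinese restaurant process.

    Customer number i (counting from 0) joins an occupied table with
    probability proportional to its size, or opens a new table with
    probability alpha / (i + alpha).
    """
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for i in range(n_customers):
        # Draw a point in [0, i + alpha); the first i units belong to
        # the existing tables, the last alpha units to a new table.
        u = rng.random() * (i + alpha)
        cumulative = 0.0
        for k, size in enumerate(tables):
            cumulative += size
            if u < cumulative:
                tables[k] += 1
                break
        else:
            tables.append(1)  # u fell in the final alpha-sized slice
    return tables

# Illustrative call: the number of tables is not fixed in advance.
tables = chinese_restaurant_process(200, alpha=2.0)
```

The number of occupied tables plays the role of the number of latent components: it is not chosen beforehand, yet its growth is controlled by the single parameter `alpha`, which matches the remark that the growth process itself is parametric.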

Applications and purpose

Non-parametric methods are widely used for studying populations that have a ranked order (such as movie reviews receiving one to five "stars"). The use of non-parametric methods may be necessary when data have a ranking but no clear numerical interpretation, such as when assessing preferences. In terms of levels of measurement, non-parametric methods result in ordinal data.

As non-parametric methods make fewer assumptions, their applicability is much more general than that of the corresponding parametric methods. In particular, they may be applied in situations where less is known about the application in question. Also, due to the reliance on fewer assumptions, non-parametric methods are more robust.

Non-parametric methods are sometimes considered simpler to use and more robust than parametric methods, even when the assumptions of parametric methods are justified. This is due to their more general nature, which may make them less susceptible to misuse and misunderstanding. Non-parametric methods can be considered a conservative choice, as they remain valid even when the assumptions of parametric methods are not met, whereas parametric methods can produce misleading results when their assumptions are violated.

The wider applicability and increased robustness of non-parametric tests come at a cost: in cases where a parametric test's assumptions are met, non-parametric tests have less statistical power. In other words, a larger sample size can be required to draw conclusions with the same degree of confidence.
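The robustness point can be illustrated with a small sketch. The measurements below are made up for illustration, and the helper `ranks` is a hypothetical name introduced here. The sketch compares how a single gross recording error affects the sample mean versus two rank-based summaries (the median and the ranks themselves).

```python
import statistics

# Hypothetical measurements (made up for illustration).
clean = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7]
# The same data with one gross recording error (5.2 entered as 520).
corrupted = [4.8, 5.1, 5.0, 4.9, 520.0, 5.0, 4.7]

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

# How far each summary moves when the error is introduced.
mean_shift = abs(statistics.mean(corrupted) - statistics.mean(clean))
median_shift = abs(statistics.median(corrupted) - statistics.median(clean))
same_ranks = ranks(clean) == ranks(corrupted)
```

In this example the mean moves by more than 70 units while the median and the ranks are unchanged, because the corrupted value was already the largest observation. Any procedure that depends on the data only through ranks inherits this insensitivity.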

Non-parametric models

''Non-parametric models'' differ from parametric models in that the model structure is not specified ''a priori'' but is instead determined from data. The term ''non-parametric'' is not meant to imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed in advance.

* A histogram is a simple nonparametric estimate of a probability distribution.
* Kernel density estimation is another method to estimate a probability distribution.
* Nonparametric regression and semiparametric regression methods have been developed based on kernels, splines, and wavelets.
* Data envelopment analysis provides efficiency coefficients similar to those obtained by multivariate analysis without any distributional assumption.
* ''k''-nearest neighbors (KNN) classifies an unseen instance based on the ''k'' points in the training set which are nearest to it.
* A support vector machine (with a Gaussian kernel) is a nonparametric large-margin classifier.
* The method of moments with polynomial probability distributions.
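Two of the models listed above, kernel density estimation and ''k''-nearest neighbors, can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions: the function names, the fixed `bandwidth` argument, the Euclidean distance, and the sample data are all choices made for this example, and no bandwidth selection or tie-breaking rule is implied.

```python
import math
from collections import Counter

def gaussian_kde(sample, bandwidth):
    """Return a density estimate f(x) built from Gaussian kernels.

    One kernel is centred on every observation, so the effective number
    of "parameters" grows with the sample instead of being fixed.
    """
    n = len(sample)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
        return total / norm

    return density

def knn_classify(train, query, k):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; points are tuples of floats.
    """
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative data (made up): two clumps of observations on the line.
f = gaussian_kde([1.0, 1.2, 0.8, 3.9, 4.1], bandwidth=0.5)

# Illustrative labelled points in the plane.
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.2, 0.8), "b")]
label = knn_classify(train, (0.2, 0.1), k=3)
```

In both functions the fitted "model" is the stored sample itself, which is the sense in which the structure is determined from data rather than specified in advance.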

Methods

Non-parametric (or distribution-free) inferential statistical methods are mathematical procedures for statistical hypothesis testing which, unlike parametric statistics, make no assumptions about the probability distributions of the variables being assessed. The most frequently used tests include:

* Analysis of similarities
* Anderson–Darling test: tests whether a sample is drawn from a given distribution.
* Statistical bootstrap methods: estimate the accuracy/sampling distribution of a statistic.
* Cochran's Q: tests whether ''k'' treatments in randomized block designs with 0/1 outcomes have identical effects.
* Cohen's kappa: measures inter-rater agreement for categorical items.
* Friedman two-way analysis of variance (repeated measures) by ranks: tests whether ''k'' treatments in randomized block designs have identical effects.
* Empirical likelihood
* Kaplan–Meier: estimates the survival function from lifetime data, modeling censoring.
* Kendall's tau: measures statistical dependence between two variables.
* Kendall's W: a measure between 0 and 1 of inter-rater agreement.
* Kolmogorov–Smirnov test: tests whether a sample is drawn from a given distribution, or whether two samples are drawn from the same distribution.
* Kruskal–Wallis one-way analysis of variance by ranks: tests whether more than two independent samples are drawn from the same distribution.
* Kuiper's test: tests whether a sample is drawn from a given distribution, sensitive to cyclic variations such as day of the week.
* Logrank test: compares survival distributions of two right-skewed, censored samples.
* Mann–Whitney U or Wilcoxon rank sum test: tests whether two samples are drawn from the same distribution, as compared to a given alternative hypothesis.
* McNemar's test: tests whether, in 2 × 2 contingency tables with a dichotomous trait and matched pairs of subjects, row and column marginal frequencies are equal.
* Median test: tests whether two samples are drawn from distributions with equal medians.
* Pitman's permutation test: a statistical significance test that yields exact ''p'' values by examining all possible rearrangements of labels.
* Rank products: detects differentially expressed genes in replicated microarray experiments.
* Siegel–Tukey test: tests for differences in scale between two groups.
* Sign test: tests whether matched pair samples are drawn from distributions with equal medians.
* Spearman's rank correlation coefficient: measures statistical dependence between two variables using a monotonic function.
* Squared ranks test: tests equality of variances in two or more samples.
* Tukey–Duckworth test: tests equality of two distributions by using ranks.
* Wald–Wolfowitz runs test: tests whether the elements of a sequence are mutually independent/random.
* Wilcoxon signed-rank test: tests whether matched pair samples are drawn from populations with different mean ranks.
* Universal Linear Fit Identification: a method independent of data, outliers and noise distribution model and free of missing or removed data imputation.
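Two entries from the list above, the sign test and the permutation test, are simple enough to sketch with the standard library. The code is illustrative only: the function names are introduced here, the sign test drops zero differences (one common convention), and the permutation test uses the absolute difference in group means as its statistic, which is one choice among several.

```python
from itertools import combinations
from math import comb

def sign_test_p_value(before, after):
    """Two-sided exact sign test for paired samples.

    Under the null hypothesis of equal medians, each non-zero difference
    is positive with probability 1/2, so the number of positive signs
    follows a Binomial(n, 1/2) distribution.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    positives = sum(d > 0 for d in diffs)
    tail = min(positives, n - positives)
    # Probability of a split at least this lopsided, doubled for two sides.
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)

def permutation_test_p_value(x, y):
    """Exact two-sided permutation test on the difference in means.

    Every relabelling of the pooled observations into groups of the
    original sizes is enumerated, so this suits small samples only.
    """
    pooled = list(x) + list(y)
    total = sum(pooled)
    n_x, n_y = len(x), len(y)

    def mean_gap(indices):
        sum_x = sum(pooled[i] for i in indices)
        return abs(sum_x / n_x - (total - sum_x) / n_y)

    observed = mean_gap(range(n_x))
    extreme = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_x):
        splits += 1
        # Small tolerance guards against floating-point near-ties.
        if mean_gap(indices) >= observed - 1e-12:
            extreme += 1
    return extreme / splits

# Illustrative data (made up).
p_sign = sign_test_p_value([10, 12, 9, 11, 13, 10, 12, 11],
                           [9, 10, 8, 10, 11, 9, 10, 10])
p_perm = permutation_test_p_value([1, 2, 3], [10, 11, 12])
```

Neither function refers to a normal distribution or to any other parametric family: the null distributions come from counting signs or relabellings. The permutation test enumerates every split, so its cost grows combinatorially with sample size, and larger problems are usually handled by sampling relabellings at random instead.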

History

Early nonparametric statistics include the median (13th century or earlier; used in estimation by Edward Wright, 1599) and the sign test by John Arbuthnot (1710), used in analyzing the human sex ratio at birth.
