Nonparametric statistics is a type of statistical analysis that makes minimal assumptions about the underlying distribution of the data being studied. Often these models are infinite-dimensional, rather than finite-dimensional as in parametric statistics. Nonparametric statistics can be used for descriptive statistics or statistical inference. Nonparametric tests are often used when the assumptions of parametric tests are evidently violated.
Definitions
The term "nonparametric statistics" has been defined imprecisely in the following two ways, among others:
The first meaning of ''nonparametric'' involves techniques that do not rely on data belonging to any particular parametric family of probability distributions.
These include, among others:
* Methods which are ''distribution-free'', which do not rely on assumptions that the data are drawn from a given parametric family of probability distributions.
* Statistics defined to be a function on a sample, without dependency on a parameter. An example is order statistics, which are based on ordinal ranking of observations.
The discussion following is taken from ''Kendall's Advanced Theory of Statistics''.
Statistical hypotheses concern the behavior of observable random variables.... For example, the hypothesis (a) that a normal distribution has a specified mean and variance is statistical; so is the hypothesis (b) that it has a given mean but unspecified variance; so is the hypothesis (c) that a distribution is of normal form with both mean and variance unspecified; finally, so is the hypothesis (d) that two unspecified continuous distributions are identical.
It will have been noticed that in the examples (a) and (b) the distribution underlying the observations was taken to be of a certain form (the normal) and the hypothesis was concerned entirely with the value of one or both of its parameters. Such a hypothesis, for obvious reasons, is called ''parametric''.
Hypothesis (c) was of a different nature, as no parameter values are specified in the statement of the hypothesis; we might reasonably call such a hypothesis ''non-parametric''. Hypothesis (d) is also non-parametric but, in addition, it does not even specify the underlying form of the distribution and may now be reasonably termed ''distribution-free''. Notwithstanding these distinctions, the statistical literature now commonly applies the label "non-parametric" to test procedures that we have just termed "distribution-free", thereby losing a useful classification.
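The four hypotheses in the passage can be written symbolically. The notation below is an assumption added for illustration and does not appear in the quoted text: X is an observation, a subscript 0 marks a value fixed by the hypothesis, and F and G are the two unspecified continuous distribution functions.

```latex
\begin{align*}
\text{(a)}\quad & X \sim \mathcal{N}(\mu_0, \sigma_0^2), && \mu_0,\ \sigma_0^2 \text{ both specified} \\
\text{(b)}\quad & X \sim \mathcal{N}(\mu_0, \sigma^2),   && \mu_0 \text{ specified},\ \sigma^2 \text{ unspecified} \\
\text{(c)}\quad & X \sim \mathcal{N}(\mu, \sigma^2),     && \mu,\ \sigma^2 \text{ both unspecified} \\
\text{(d)}\quad & F = G,                                 && F,\ G \text{ unspecified continuous distributions}
\end{align*}
```

In this notation, (a) and (b) fix at least one parameter value, (c) fixes only the normal form, and (d) fixes neither a form nor a parameter.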
The second meaning of ''non-parametric'' involves techniques that do not assume that the ''structure'' of a model is fixed. Typically, the model grows in size to accommodate the complexity of the data. In these techniques, individual variables ''are'' typically assumed to belong to parametric distributions, and assumptions about the types of associations among variables are also made. These techniques include, among others:
* ''non-parametric regression'', which is modeling whereby the structure of the relationship between variables is treated non-parametrically, but where nevertheless there may be parametric assumptions about the distribution of model residuals.
* ''non-parametric hierarchical Bayesian models'', such as models based on the Dirichlet process, which allow the number of latent variables to grow as necessary to fit the data, but where individual variables still follow parametric distributions and even the process controlling the rate of growth of latent variables follows a parametric distribution.
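The idea of a model that grows with the data can be sketched with the Chinese restaurant process, the clustering scheme induced by a Dirichlet process. This is a minimal illustration rather than a full Bayesian model: the function name, the concentration parameter `alpha`, and the seed are all assumptions chosen for the example.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Seat customers one at a time and return the list of table sizes.

    Customer i (0-based) joins an existing table with probability
    proportional to its current size, or opens a new table with
    probability proportional to alpha.
    """
    rng = random.Random(seed)
    table_sizes = []
    for i in range(n_customers):
        # Total weight is i (customers already seated) plus alpha.
        draw = rng.uniform(0, i + alpha)
        cumulative = 0.0
        for t, size in enumerate(table_sizes):
            cumulative += size
            if draw < cumulative:
                table_sizes[t] += 1
                break
        else:
            # The draw fell in the final interval of width alpha: new table.
            table_sizes.append(1)
    return table_sizes


if __name__ == "__main__":
    for n in (10, 100, 1000):
        sizes = chinese_restaurant_process(n, alpha=1.0)
        print(n, "observations ->", len(sizes), "clusters")
```

The number of clusters (latent components) is not fixed in advance; it tends to increase slowly as more observations arrive, while the growth rate itself is governed by the single parameter `alpha`.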
Applications and purpose
Non-parametric methods are widely used for studying populations that have a ranked order (such as movie reviews receiving one to five "stars"). The use of non-parametric methods may be necessary when data have a ranking but no clear numerical interpretation, such as when assessing preferences. In terms of levels of measurement, non-parametric methods result in ordinal data.
As non-parametric methods make fewer assumptions, their applicability is much more general than the corresponding parametric methods. In particular, they may be applied in situations where less is known about the application in question. Also, due to the reliance on fewer assumptions, non-parametric methods are more robust.
Non-parametric methods are sometimes considered simpler to use and more robust than parametric methods, even when the assumptions of parametric methods are justified. This is due to their more general nature, which may make them less susceptible to misuse and misunderstanding. Non-parametric methods can be considered a conservative choice: they remain valid when the assumptions of a parametric method are not met, whereas parametric methods can produce misleading results when their assumptions are violated.
The wider applicability and increased robustness of non-parametric tests come at a cost: in cases where a parametric test's assumptions are met, non-parametric tests have less statistical power. In other words, a larger sample size can be required to draw conclusions with the same degree of confidence.
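The cost described above has a simple estimation analogue that can be simulated. When the data really are normal, the sample median (a rank-based estimator) has a larger sampling variance than the sample mean, and for large samples the ratio approaches π/2. The sketch below is illustrative only; the sample size, repetition count, and seed are assumptions.

```python
import random
import statistics


def sampling_variances(n=25, reps=4000, seed=0):
    """Return (variance of sample mean, variance of sample median)
    across `reps` simulated standard-normal samples of size `n`."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.variance(means), statistics.variance(medians)


if __name__ == "__main__":
    var_mean, var_median = sampling_variances()
    print("variance of sample mean:  ", round(var_mean, 4))
    print("variance of sample median:", round(var_median, 4))
    print("ratio (median / mean):    ", round(var_median / var_mean, 2))
```

A ratio above 1 means the median needs a correspondingly larger sample to match the precision of the mean under normality, which mirrors the loss of power of rank-based tests when a parametric test's assumptions hold.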
Non-parametric models
''Non-parametric models'' differ from parametric models in that the model structure is not specified ''a priori'' but is instead determined from data. The term ''non-parametric'' is not meant to imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed in advance.
* A histogram is a simple nonparametric estimate of a probability distribution.
* Kernel density estimation is another method to estimate a probability distribution.
* Nonparametric regression and semiparametric regression methods have been developed based on kernels, splines, and wavelets.
* Data envelopment analysis provides efficiency coefficients similar to those obtained by multivariate analysis without any distributional assumption.
* The ''k''-nearest neighbors (KNN) algorithm classifies an unseen instance based on the ''K'' points in the training set which are nearest to it.
* A support vector machine (with a Gaussian kernel) is a nonparametric large-margin classifier.
* The method of moments with polynomial probability distributions.
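The first two items in the list, the histogram and kernel density estimation, can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the bin count, the bandwidth, and the sample values are all assumptions made for the example.

```python
import math


def histogram_density(data, n_bins):
    """Return (bin_edges, densities); densities integrate to 1.

    Assumes the data are not all identical, so the bin width is positive.
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp so the maximum value falls in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    densities = [c / (len(data) * width) for c in counts]
    return edges, densities


def gaussian_kernel_density(data, bandwidth):
    """Return a function x -> Gaussian kernel density estimate at x."""
    norm = len(data) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm

    return density


if __name__ == "__main__":
    sample = [1.1, 1.9, 2.0, 2.3, 2.4, 3.0, 3.1, 4.8, 5.2, 5.5]  # made-up values
    edges, dens = histogram_density(sample, n_bins=4)
    kde = gaussian_kernel_density(sample, bandwidth=0.5)
    print("histogram densities:", [round(d, 3) for d in dens])
    print("KDE at x = 2.0:", round(kde(2.0), 3))
```

Neither estimator assumes a distributional family. Both still carry a tuning parameter (the bin count or the bandwidth), and the effective complexity of the fitted density is driven by the data rather than fixed in advance.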
Methods
Non-parametric (or distribution-free) inferential statistical methods are mathematical procedures for statistical hypothesis testing which, unlike parametric statistics, make no assumptions about the probability distributions of the variables being assessed. The most frequently used tests include:
* Analysis of similarities
* Anderson–Darling test: tests whether a sample is drawn from a given distribution
* Statistical bootstrap methods: estimate the accuracy/sampling distribution of a statistic
* Cochran's Q: tests whether ''k'' treatments in randomized block designs with 0/1 outcomes have identical effects
* Cohen's kappa: measures inter-rater agreement for categorical items
* Friedman two-way analysis of variance (Repeated Measures) by ranks: tests whether ''k'' treatments in randomized block designs have identical effects
* Empirical likelihood
* Kaplan–Meier: estimates the survival function from lifetime data, modeling censoring
* Kendall's tau: measures statistical dependence between two variables
* Kendall's W: a measure between 0 and 1 of inter-rater agreement
* Kolmogorov–Smirnov test: tests whether a sample is drawn from a given distribution, or whether two samples are drawn from the same distribution
* Kruskal–Wallis one-way analysis of variance by ranks: tests whether more than two independent samples are drawn from the same distribution
* Kuiper's test: tests whether a sample is drawn from a given distribution, sensitive to cyclic variations such as day of the week
* Logrank test: compares survival distributions of two right-skewed, censored samples
* Mann–Whitney U or Wilcoxon rank sum test: tests whether two samples are drawn from the same distribution, as compared to a given alternative hypothesis
* McNemar's test: tests whether, in 2 × 2 contingency tables with a dichotomous trait and matched pairs of subjects, row and column marginal frequencies are equal
* Median test: tests whether two samples are drawn from distributions with equal medians
* Pitman's permutation test: a statistical significance test that yields exact ''p'' values by examining all possible rearrangements of labels
* Rank products: detects differentially expressed genes in replicated microarray experiments
* Siegel–Tukey test: tests for differences in scale between two groups
* Sign test: tests whether matched pair samples are drawn from distributions with equal medians
* Spearman's rank correlation coefficient: measures statistical dependence between two variables using a monotonic function
* Squared ranks test: tests equality of variances in two or more samples
* Tukey–Duckworth test: tests equality of two distributions by using ranks
* Wald–Wolfowitz runs test: tests whether the elements of a sequence are mutually independent/random
* Wilcoxon signed-rank test: tests whether matched pair samples are drawn from populations with different mean ranks
* Universal Linear Fit Identification: A Method Independent of Data, Outliers and Noise Distribution Model and Free of Missing or Removed Data Imputation
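Three of the tests in the list are simple enough to sketch directly: the sign test, the Mann–Whitney U statistic, and an exact permutation test. The code below is a minimal illustration using only the standard library; the function names and all data values are made up for the example, and the permutation test enumerates every split, so it is only practical for small samples.

```python
import math
from itertools import combinations


def sign_test_p_value(pairs):
    """Two-sided exact sign test for matched pairs; ties are dropped."""
    diffs = [a - b for a, b in pairs if a != b]
    n = len(diffs)
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # Under the null, the number of positive differences is Binomial(n, 1/2).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def mann_whitney_u(x, y):
    """U statistic for x: count of (xi, yj) pairs with xi > yj; ties count 1/2."""
    return sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)


def permutation_p_value(x, y):
    """Exact two-sided permutation test on the difference in group means."""
    pooled = list(x) + list(y)
    total = sum(pooled)
    n_x, n_y = len(x), len(y)

    def mean_diff(idx):
        sum_x = sum(pooled[i] for i in idx)
        return sum_x / n_x - (total - sum_x) / n_y

    observed = abs(mean_diff(range(n_x)))
    splits = list(combinations(range(len(pooled)), n_x))
    # A small tolerance guards against floating-point near-ties.
    extreme = sum(abs(mean_diff(idx)) >= observed - 1e-12 for idx in splits)
    return extreme / len(splits)


if __name__ == "__main__":
    treated = [12.1, 14.3, 15.0, 16.2, 17.8]  # made-up values
    control = [9.4, 10.2, 11.5, 11.9, 13.0]   # made-up values
    before_after = [(5.1, 4.8), (6.0, 5.2), (5.5, 5.5), (7.2, 6.1),
                    (6.8, 6.9), (5.9, 5.0), (6.3, 5.7), (7.0, 6.2)]
    print("Mann-Whitney U:", mann_whitney_u(treated, control))
    print("permutation p-value:", permutation_p_value(treated, control))
    print("sign test p-value:", sign_test_p_value(before_after))
```

None of these procedures refers to a parametric family: the sign test uses only the signs of the paired differences, the U statistic uses only pairwise orderings, and the permutation test derives its reference distribution from relabelings of the observed data.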
History
Early nonparametric statistics include the median (13th century or earlier, use in estimation by Edward Wright, 1599) and the sign test by John Arbuthnot (1710) in analyzing the human sex ratio at birth.
See also
* CDF-based nonparametric confidence interval
* Parametric statistics
* Resampling (statistics)
* Semiparametric model
General references
* Bagdonavicius, V., Kruopis, J., Nikulin, M.S. (2011). ''Non-parametric tests for complete data'', ISTE & Wiley: London & Hoboken.
* Gibbons, Jean Dickinson; Chakraborti, Subhabrata (2003). ''Nonparametric Statistical Inference'', 4th Ed. CRC Press.
* Hollander M., Wolfe D.A., Chicken E. (2014). ''Nonparametric Statistical Methods'', John Wiley & Sons.
* Sheskin, David J. (2003). ''Handbook of Parametric and Nonparametric Statistical Procedures''. CRC Press.
* Wasserman, Larry (2007). ''All of Nonparametric Statistics'', Springer.