Winsorizing

Winsorizing or winsorization is the transformation of statistics by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers. It is named after the engineer-turned-biostatistician Charles P. Winsor (1895–1951). The effect is the same as clipping in signal processing. The distribution of many statistics can be heavily influenced by outliers, values that lie far outside the bulk of the data. A typical strategy for accounting for these outliers without eliminating them altogether is to reset them to a specified percentile (or to an upper and a lower percentile) of the data. For example, a 90% winsorization sets all data below the 5th percentile to the 5th percentile, and all data above the 95th percentile to the 95th percentile. Winsorized estimators are usually more robust to outliers than their standard forms, although alternatives such as trimming (see below) achieve a similar effect.
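The percentile rule described above can be sketched with NumPy's `percentile` and `clip` functions. The helper name `winsorize_by_percentile` and the sample values are illustrative assumptions. Note that `np.percentile` interpolates between data points by default, so the limits it returns may differ slightly from those of rank-based implementations.

```python
import numpy as np

def winsorize_by_percentile(values, lower_pct=5.0, upper_pct=95.0):
    """Clamp values to the [lower_pct, upper_pct] percentile range.

    Hypothetical helper for illustration; not a library function.
    """
    values = np.asarray(values, dtype=float)
    # Percentile limits are computed from the data itself
    # (linear interpolation between data points by default).
    low, high = np.percentile(values, [lower_pct, upper_pct])
    # Values outside the limits are reset to the limits; nothing is removed.
    return np.clip(values, low, high)

# Made-up sample with one obvious outlier (250.0).
sample = np.array([3.0, 4.0, 5.0, 5.0, 6.0, 6.0, 7.0, 8.0, 9.0, 250.0])
clipped = winsorize_by_percentile(sample)
print(clipped)
```

The output has the same length as the input: only the values in the two tails change, which is what distinguishes this from dropping the outliers.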


Example

Consider a simple data set consisting of:

92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, −40, 101, 86, 85, 15, 89, 89, 28, −5, 41 (N = 20, mean = 101.5)

The data below the 5th percentile lie between −40 and −5 inclusive, while the data above the 95th percentile lie between 101 and 1053 inclusive (the values affected are −40 and 1053). Winsorization effectively resets the outlier values to the values of the data at the 5th and 95th percentiles. Accordingly, a 90% winsorization would result in the following data set:

92, 19, 101, 58, 101, 91, 26, 78, 10, 13, −5, 101, 86, 85, 15, 89, 89, 28, −5, 41 (N = 20, mean = 55.65)

After winsorization the mean has dropped to nearly half its previous value, and is consequently more in line with the data set from which it is calculated.
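The example can be reproduced with a short rank-based sketch. It assumes the data set is the twenty values used in the code listings under "Coding methods", and that a 90% winsorization of N = 20 values replaces exactly one value in each tail (5% of 20) with its nearest retained neighbour. The name `winsorize_by_rank` is illustrative, not a library function.

```python
import numpy as np

def winsorize_by_rank(values, lower_frac=0.05, upper_frac=0.05):
    """Replace the k smallest and k largest values with their nearest kept neighbours.

    Hypothetical helper for illustration; k is the tail fraction times N, rounded.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    order = np.argsort(values)             # indices that sort the data
    k_low = int(round(lower_frac * n))     # values to reset in the lower tail
    k_high = int(round(upper_frac * n))    # values to reset in the upper tail
    result = values.copy()
    if k_low > 0:
        result[order[:k_low]] = values[order[k_low]]
    if k_high > 0:
        result[order[n - k_high:]] = values[order[n - k_high - 1]]
    return result

data = [92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
        -40, 101, 86, 85, 15, 89, 89, 28, -5, 41]
winsorized = winsorize_by_rank(data)

print(np.mean(data))        # mean of the original data (101.5 in the example)
print(winsorized.mean())    # mean after 90% winsorization (55.65 in the example)
```

Here −40 is reset to −5 and 1053 is reset to 101, while the other eighteen values are left untouched.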


Explanation, and distinction from trimming/truncation

Winsorizing is not equivalent to simply excluding data, which is a simpler procedure called trimming or truncation; it is instead a method of censoring data. In a trimmed estimator, the extreme values are discarded; in a winsorized estimator, the extreme values are instead replaced by certain percentiles (the trimmed minimum and maximum). Thus a winsorized mean is not the same as a truncated or trimmed mean. For instance, the 10% trimmed mean is the average of the 5th to 95th percentile of the data, while the 90% winsorized mean sets the bottom 5% to the 5th percentile, the top 5% to the 95th percentile, and then averages the data. Winsorizing thus does not change the total number of values in the data set, N. In the example given above, the trimmed mean would be obtained from the smaller (truncated) set:

92, 19, 101, 58, 91, 26, 78, 10, 13, 101, 86, 85, 15, 89, 89, 28, −5, 41 (N = 18, trimmed mean = 56.5)

In this case, the winsorized mean can equivalently be expressed as a weighted average of the 5th percentile, the trimmed mean, and the 95th percentile (for this 90% winsorized mean: 0.05 times the 5th percentile, 0.9 times the 10% trimmed mean, and 0.05 times the 95th percentile). However, in general, winsorized statistics need not be expressible in terms of the corresponding trimmed statistic. More formally, they are distinct because the order statistics are not independent.
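The relationship between the two estimators can be checked numerically. The sketch below assumes the same twenty values as in the example and that one value is affected in each tail (5% of N = 20); all variable names are illustrative.

```python
import numpy as np

data = np.sort(np.array([92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
                         -40, 101, 86, 85, 15, 89, 89, 28, -5, 41], dtype=float))
k = 1  # assumed: 5% of N = 20 values in each tail

# Trimming discards the k smallest and k largest values, so N shrinks to 18.
trimmed = data[k:-k]
trimmed_mean = trimmed.mean()

# Winsorizing replaces them with the nearest retained values, so N stays at 20.
winsorized = data.copy()
winsorized[:k] = data[k]
winsorized[-k:] = data[-k - 1]
winsorized_mean = winsorized.mean()

# Weighted-average form: 5% weight on each limit, 90% on the trimmed mean.
weighted = 0.05 * data[k] + 0.9 * trimmed_mean + 0.05 * data[-k - 1]

print(trimmed.size, trimmed_mean)        # size and mean of the trimmed set
print(winsorized.size, winsorized_mean)  # size and mean of the winsorized set
print(weighted)                          # agrees with the winsorized mean up to rounding
```

The weighted-average identity holds here because the replaced values are exactly the limits themselves; as the text notes, this need not carry over to other winsorized statistics.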


Uses

Winsorization is used in survey methodology to "trim" extreme survey non-response weights. It is also used in the construction of some stock indexes when looking at the range of certain factors (for example growth and value) for particular stocks.


Coding methods

Python can winsorize data using the SciPy library:

    import numpy as np
    from scipy.stats.mstats import winsorize

    # limits gives the fraction of values to reset in the lower and upper tails
    winsorize(np.array([92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41]), limits=[0.05, 0.05])

R can winsorize data using the DescTools package (Andri Signorell et al. (2021). DescTools: Tools for descriptive statistics. R package version 0.99.41.):

    library(DescTools)
    a <- c(92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41)
    DescTools::Winsorize(a, probs = c(0.05, 0.95))


See also

* Trimmed estimator
* Huber loss
* Robust regression




External links

* "Winsorization". R-bloggers. June 30, 2011. https://www.r-bloggers.com/winsorization/