
In non-parametric statistics, the Theil–Sen estimator is a method for robustly fitting a line to sample points in the plane (simple linear regression) by choosing the median of the slopes of all lines through pairs of points. It has also been called Sen's slope estimator, slope selection, the single median method, the Kendall robust line-fit method, and the Kendall–Theil robust line. It is named after Henri Theil and Pranab K. Sen, who published papers on this method in 1950 and 1968 respectively, and after Maurice Kendall because of its relation to the Kendall tau rank correlation coefficient. This estimator can be computed efficiently and is insensitive to outliers. It can be significantly more accurate than non-robust simple linear regression (least squares) for skewed and heteroskedastic data, and competes well against least squares even for normally distributed data in terms of statistical power. It has been called "the most popular nonparametric technique for estimating a linear trend".


Definition

As defined by Theil (1950), the Theil–Sen estimator of a set of two-dimensional points $(x_i, y_i)$ is the median $m$ of the slopes $(y_j - y_i)/(x_j - x_i)$ determined by all pairs of sample points. Sen (1968) extended this definition to handle the case in which two data points have the same $x$-coordinate. In Sen's definition, one takes the median of the slopes defined only from pairs of points having distinct $x$-coordinates.

Once the slope $m$ has been determined, one may determine a line from the sample points by setting the $y$-intercept $b$ to be the median of the values $y_i - m x_i$. The fit line is then the line $y = mx + b$ with coefficients $m$ and $b$ in slope–intercept form. As Sen observed, this choice of slope makes the Kendall tau rank correlation coefficient become approximately zero when it is used to compare the values $x_i$ with their associated residuals $y_i - m x_i - b$. Intuitively, this suggests that how far the fit line passes above or below a data point is not correlated with whether that point is on the left or right side of the data set. The choice of $b$ does not affect the Kendall coefficient, but causes the median residual to become approximately zero; that is, the fit line passes above and below equal numbers of points.

A confidence interval for the slope estimate may be determined as the interval containing the middle 95% of the slopes of lines determined by pairs of points. It may be estimated quickly by sampling pairs of points and determining the 95% interval of the sampled slopes. According to simulations, approximately 600 sample pairs are sufficient to determine an accurate confidence interval.
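The definition above can be sketched directly in code. The following is a minimal brute-force illustration, not a reference implementation: the function name `theil_sen` and the sample data are made up for this example, and it examines all pairs of points, so it takes quadratic time.

```python
from itertools import combinations
from statistics import median


def theil_sen(points):
    """Return (slope, intercept) of the Theil-Sen line through `points`.

    `points` is a sequence of (x, y) pairs. Pairs of points that share an
    x-coordinate are skipped, following Sen's definition.
    """
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x1 != x2
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x-coordinates")
    m = median(slopes)
    # The intercept is the median of the values y_i - m * x_i.
    b = median(y - m * x for x, y in points)
    return m, b


# Hypothetical data: four points lie exactly on y = 2x + 1,
# and the last point is a gross outlier.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100)]
slope, intercept = theil_sen(data)
print(slope, intercept)  # 2.0 1.0
```

Six of the ten pairwise slopes equal 2, so the median slope ignores the outlier, whereas a least-squares fit to the same data would be pulled strongly toward it.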


Variations

A variation of the Theil–Sen estimator, the repeated median regression of Siegel, determines for each sample point $(x_i, y_i)$ the median $m_i$ of the slopes $(y_j - y_i)/(x_j - x_i)$ of lines through that point, and then determines the overall estimator as the median of these medians. It can tolerate a greater number of outliers than the Theil–Sen estimator, but known algorithms for computing it efficiently are more complicated and less practical.

A different variant pairs up sample points by the rank of their $x$-coordinates: the point with the smallest coordinate is paired with the first point above the median coordinate, the second-smallest point is paired with the next point above the median, and so on. It then computes the median of the slopes of the lines determined by these pairs of points, gaining speed by examining significantly fewer pairs than the Theil–Sen estimator.

Variations of the Theil–Sen estimator based on weighted medians have also been studied, based on the principle that pairs of samples whose $x$-coordinates differ more greatly are more likely to have an accurate slope and therefore should receive a higher weight.

For seasonal data, it may be appropriate to smooth out seasonal variations in the data by considering only pairs of sample points that both belong to the same month or the same season of the year, and finding the median of the slopes of the lines determined by this more restrictive set of pairs.
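As an illustration of the first variation, the repeated median slope can be sketched in a few lines. This is a naive quadratic-time version written for clarity rather than efficiency; the function name `repeated_median_slope` and the data are assumptions made for this example.

```python
from statistics import median


def repeated_median_slope(points):
    """Median over i of the median over j != i of the slope from point i to point j."""
    per_point_medians = []
    for i, (xi, yi) in enumerate(points):
        slopes = [
            (yj - yi) / (xj - xi)
            for j, (xj, yj) in enumerate(points)
            if j != i and xj != xi  # skip the point itself and vertical pairs
        ]
        if slopes:
            per_point_medians.append(median(slopes))
    if not per_point_medians:
        raise ValueError("need at least two points with distinct x-coordinates")
    return median(per_point_medians)


# Hypothetical data: five points on y = 2x + 1 followed by three outliers.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 100), (6, -50), (7, 80)]
rm_slope = repeated_median_slope(data)
print(rm_slope)  # 2.0
```

For each of the five points on the line, four of its seven slopes equal 2, so its inner median is 2. The outer median is then taken over five values equal to 2 and three arbitrary values, which again gives 2.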


Statistical properties

The Theil–Sen estimator is an unbiased estimator of the true slope in simple linear regression. For many distributions of the response error, this estimator has high asymptotic efficiency relative to least-squares estimation. Estimators with low efficiency require more independent observations to attain the same sample variance as efficient unbiased estimators.

The Theil–Sen estimator is more robust than the least-squares estimator because it is much less sensitive to outliers. It has a breakdown point of $1 - \frac{1}{\sqrt{2}} \approx 29.3\%$, meaning that it can tolerate arbitrary corruption of up to 29.3% of the input data points without degradation of its accuracy. However, the breakdown point decreases for higher-dimensional generalizations of the method. A higher breakdown point, 50%, holds for a different robust line-fitting algorithm, the repeated median estimator of Siegel.

The Theil–Sen estimator is equivariant under every linear transformation of its response variable, meaning that transforming the data first and then fitting a line, or fitting a line first and then transforming it in the same way, both produce the same result. However, it is not equivariant under affine transformations of both the predictor and response variables.
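The 29.3% figure can be motivated by a short heuristic calculation for large samples. This is an informal sketch rather than a proof, and the symbol $\varepsilon$ for the corrupted fraction of points is notation introduced here. If a fraction $\varepsilon$ of the points is corrupted, then roughly a fraction $(1-\varepsilon)^2$ of the pairs consists of two uncorrupted points. The median of the pairwise slopes stays bracketed by slopes from uncorrupted pairs as long as those pairs make up more than half of all pairs:

```latex
(1-\varepsilon)^2 > \tfrac{1}{2}
\quad\Longleftrightarrow\quad
\varepsilon < 1 - \frac{1}{\sqrt{2}} \approx 0.293
```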


Algorithms and implementation

The median slope of a set of $n$ sample points may be computed exactly by computing all $O(n^2)$ lines through pairs of points, and then applying a linear-time median-finding algorithm. Alternatively, it may be estimated by sampling pairs of points. This problem is equivalent, under projective duality, to the problem of finding the crossing point in an arrangement of lines that has the median $x$-coordinate among all such crossing points.

The problem of performing slope selection exactly but more efficiently than the brute-force quadratic-time algorithm has been extensively studied in computational geometry. Several different methods are known for computing the Theil–Sen estimator exactly in $O(n \log n)$ time, either deterministically or using randomized algorithms. Siegel's repeated median estimator can also be constructed in the same time bound. In models of computation in which the input coordinates are integers and in which bitwise operations on integers take constant time, the Theil–Sen estimator can be constructed even more quickly, in randomized expected time $O(n\sqrt{\log n})$.

An estimator for the slope with approximately median rank, having the same breakdown point as the Theil–Sen estimator, may be maintained in the data stream model (in which the sample points are processed one by one by an algorithm that does not have enough persistent storage to represent the entire data set) using an algorithm based on ε-nets.

In the R statistics package, both the Theil–Sen estimator and Siegel's repeated median estimator are available through the mblm package. A free standalone Visual Basic application for Theil–Sen estimation, KTRLine, has been made available by the US Geological Survey. The Theil–Sen estimator has also been implemented in Python as part of the SciPy and scikit-learn libraries.
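The sampling approach mentioned above can be sketched with the standard library alone. The function name `sampled_theil_sen_slope`, its parameters, and the simple index-based percentile rule are all assumptions made for this illustration. The same pair of points may be drawn more than once, and the default of 600 pairs follows the simulation figure quoted earlier in the article.

```python
import random
from statistics import median


def sampled_theil_sen_slope(points, n_pairs=600, seed=0):
    """Estimate the median slope and the middle-95% slope interval from random pairs.

    Returns (estimate, (low, high)). Pairs that share an x-coordinate are redrawn.
    """
    if len({x for x, _ in points}) < 2:
        raise ValueError("need at least two distinct x-coordinates")
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    slopes = []
    while len(slopes) < n_pairs:
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 != x2:
            slopes.append((y2 - y1) / (x2 - x1))
    slopes.sort()
    # Crude percentile rule: take order statistics near the 2.5% and 97.5% ranks.
    low = slopes[int(0.025 * (n_pairs - 1))]
    high = slopes[int(0.975 * (n_pairs - 1))]
    return median(slopes), (low, high)


# Hypothetical noise-free data on y = 3x + 1: every sampled slope is exactly 3.
points = [(x, 3 * x + 1) for x in range(20)]
estimate, (low, high) = sampled_theil_sen_slope(points)
print(estimate, low, high)  # 3.0 3.0 3.0
```

With real, noisy data the interval would have nonzero width, and the result would depend on the seed and on the number of sampled pairs.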


Applications

Theil–Sen estimation has been applied to astronomy due to its ability to handle censored regression models. In biophysics, it has been suggested for remote sensing applications such as the estimation of leaf area from reflectance data due to its "simplicity in computation, analytical estimates of confidence intervals, robustness to outliers, testable assumptions regarding residuals and ... limited a priori information regarding measurement errors". For measuring seasonal environmental data such as water quality, a seasonally adjusted variant of the Theil–Sen estimator has been proposed as preferable to least-squares estimation due to its high precision in the presence of skewed data. In computer science, the Theil–Sen method has been used to estimate trends in software aging. In meteorology and climatology, it has been used to estimate the long-term trends of wind occurrence and speed.


See also

* Regression dilution, for another problem affecting estimated trend slopes

