Theil-Sen regression has several advantages over
Ordinary least squares regression. It is insensitive to
outliers. It can be used for significance tests even when residuals are not normally distributed.[10] It can be significantly more accurate than
non-robust simple linear regression (least squares) for
skewed and
heteroskedastic data, and competes well against least squares even for
normally distributed data in terms of
statistical power.[11] It has been called "the most popular nonparametric technique for estimating a linear trend".[2] There are fast algorithms for efficiently computing the parameters.
Definition
As defined by
Theil (1950), the TheilâSen estimator of a set of two-dimensional points (xi, yi) is the median m of the slopes (yj â yi)/(xj â xi) determined by all pairs of sample points.
Sen (1968) extended this definition to handle the case in which two data points have the same x coordinate. In Sen's definition, one takes the median of the slopes defined only from pairs of points having distinct x coordinates.[8]
Once the slope m has been determined, one may determine a line from the sample points by setting the
y-interceptb to be the median of the values yi â mxi. The fit line is then the line y = mx + b with coefficients m and b in
slopeâintercept form.[12] As Sen observed, this choice of slope makes the
Kendall tau rank correlation coefficient become approximately zero, when it is used to compare the values xi with their associated
residualsyi â mxi â b. Intuitively, this suggests that how far the fit line passes above or below a data point is not correlated with whether that point is on the left or right side of the data set. The choice of b does not affect the Kendall coefficient, but causes the median residual to become approximately zero; that is, the fit line passes above and below equal numbers of points.[9]
A
confidence interval for the slope estimate may be determined as the interval containing the middle 95% of the slopes of lines determined by pairs of points[13] and may be estimated quickly by sampling pairs of points and determining the 95% interval of the sampled slopes. According to simulations, approximately 600 sample pairs are sufficient to determine an accurate confidence interval.[11]
Variations
A variation of the TheilâSen estimator, the
repeated median regression of
Siegel (1982), determines for each sample point (xi, yi), the median mi of the slopes (yj â yi)/(xj â xi) of lines through that point, and then determines the overall estimator as the median of these medians. It can tolerate a greater number of outliers than the TheilâSen estimator, but known algorithms for computing it efficiently are more complicated and less practical.[14]
A different variant pairs up sample points by the rank of their x-coordinates: the point with the smallest coordinate is paired with the first point above the median coordinate, the second-smallest point is paired with the next point above the median, and so on. It then computes the median of the slopes of the lines determined by these pairs of points, gaining speed by examining significantly fewer pairs than the TheilâSen estimator.[15]
Variations of the TheilâSen estimator based on
weighted medians have also been studied, based on the principle that pairs of samples whose x-coordinates differ more greatly are more likely to have an accurate slope and therefore should receive a higher weight.[16]
For seasonal data, it may be appropriate to smooth out seasonal variations in the data by considering only pairs of sample points that both belong to the same month or the same season of the year, and finding the median of the slopes of the lines determined by this more restrictive set of pairs.[17]
The TheilâSen estimator is more
robust than the least-squares estimator because it is much less sensitive to
outliers. It has a
breakdown point of
meaning that it can tolerate arbitrary corruption of up to 29.3% of the input data-points without degradation of its accuracy.[12] However, the breakdown point decreases for higher-dimensional generalizations of the method.[20] A higher breakdown point, 50%, holds for a different robust line-fitting algorithm, the
repeated median estimator of Siegel.[12]
The TheilâSen estimator is
equivariant under every
linear transformation of its response variable, meaning that transforming the data first and then fitting a line, or fitting a line first and then transforming it in the same way, both produce the same result.[21] However, it is not equivariant under
affine transformations of both the predictor and response variables.[20]
Algorithms
The median slope of a set of n sample points may be computed exactly by computing all O(n2) lines through pairs of points, and then applying a linear time
median finding algorithm. Alternatively, it may be estimated by sampling pairs of points. This problem is equivalent, under
projective duality, to the problem of finding the crossing point in an
arrangement of lines that has the median x-coordinate among all such crossing points.[22]
The problem of performing slope selection exactly but more efficiently than the brute force quadratic time algorithm has been extensively studied in
computational geometry. Several different methods are known for computing the TheilâSen estimator exactly in O(n log n) time, either deterministically[3] or using
randomized algorithms.[4] Siegel's repeated median estimator can also be constructed in the same time bound.[23] In models of computation in which the input coordinates are integers and in which
bitwise operations on integers take constant time, the TheilâSen estimator can be constructed even more quickly, in randomized expected time .[24]
An estimator for the slope with approximately median rank, having the same breakdown point as the TheilâSen estimator, may be maintained in the
data stream model (in which the sample points are processed one by one by an algorithm that does not have enough persistent storage to represent the entire data set) using an algorithm based on
Δ-nets.[25]
Implementations
In the
R statistics package, both the TheilâSen estimator and Siegel's repeated median estimator are available through the mblm library.[26]
A free standalone
Visual Basic application for TheilâSen estimation, KTRLine, has been made available by the
US Geological Survey.[27]
The TheilâSen estimator has also been implemented in
Python as part of the
SciPy and
scikit-learn libraries.[28]
Applications
TheilâSen estimation has been applied to
astronomy due to its ability to handle
censored regression models.[29] In
biophysics,
Fernandes & Leblanc (2005) suggest its use for remote sensing applications such as the estimation of leaf area from reflectance data due to its "simplicity in computation, analytical estimates of confidence intervals, robustness to outliers, testable assumptions regarding residuals and ... limited a priori information regarding measurement errors".[30] For measuring seasonal environmental data such as
water quality, a seasonally adjusted variant of the TheilâSen estimator has been proposed as preferable to least squares estimation due to its high precision in the presence of skewed data.[17] In
computer science, the TheilâSen method has been used to estimate trends in
software aging.[31] In
meteorology and
climatology, it has been used to estimate the long-term trends of wind occurrence and speed.[32]
^Helsel, Dennis R.; Hirsch, Robert M.; Ryberg, Karen R.; Archfield, Stacey A.; Gilroy, Edward J. (2020).
Statistical methods in water resources. Techniques and Methods. Reston, VA: U.S. Geological Survey. p. 484. Retrieved 2020-05-22.
^For determining confidence intervals, pairs of points must be sampled
with replacement; this means that the set of pairs used in this calculation includes pairs in which both points are the same as each other. These pairs are always outside the confidence interval, because they do not determine a well-defined slope value, but using them as part of the calculation causes the confidence interval to be wider than it would be without them.
Akritas, Michael G.;
Murphy, Susan A.; LaValley, Michael P. (1995), "The Theil-Sen estimator with doubly censored data and applications to astronomy", Journal of the American Statistical Association, 90 (429): 170â177,
doi:
10.1080/01621459.1995.10476499,
JSTOR2291140,
MR1325124.
Birkes, David; Dodge, Yadolah (1993), "6.3 Estimating the Regression Line",
Alternative Methods of Regression, Wiley Series in Probability and Statistics, vol. 282, Wiley-Interscience, pp. 113â118,
ISBN978-0-471-56881-0.
Fernandes, Richard; Leblanc, Sylvain G. (2005), "Parametric (modified least squares) and non-parametric (TheilâSen) linear regressions for predicting biophysical parameters in the presence of measurement errors", Remote Sensing of Environment, 95 (3): 303â316,
Bibcode:
2005RSEnv..95..303F,
doi:
10.1016/j.rse.2005.01.005.
Jaeckel, Louis A. (1972), "Estimating regression coefficients by minimizing the dispersion of the residuals", Annals of Mathematical Statistics, 43 (5): 1449â1458,
doi:10.1214/aoms/1177692377,
MR0348930.
Logan, Murray (2010), Biostatistical Design and Analysis Using R: A Practical Guide, John Wiley & Sons,
ISBN9781444362473
Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; De Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. (1997), "12.1.5.1 Single median method",
Handbook of Chemometrics and Qualimetrics: Part A, Data Handling in Science and Technology, vol. 20A, Elsevier, pp. 355â356,
ISBN978-0-444-89724-4.
RomaniÄ, Djordje; ÄuriÄ, Mladjen; JoviÄiÄ, Ilija; Lompar, MiloĆĄ (2014), "Long-term trends of the 'Koshava' wind during the period 1949â2010", International Journal of Climatology, 35 (2): 288â302,
Bibcode:
2015IJCli..35..288R,
doi:
10.1002/joc.3981,
S2CID129402302.
Siegel, Andrew F. (1982), "Robust regression using repeated medians", Biometrika, 69 (1): 242â244,
doi:10.1093/biomet/69.1.242.
Sievers, Gerald L. (1978), "Weighted rank statistics for simple linear regression", Journal of the American Statistical Association, 73 (363): 628â631,
doi:
10.1080/01621459.1978.10480067,
JSTOR2286613.
Vaidyanathan, Kalyanaraman; Trivedi, Kishor S. (2005), "A Comprehensive Model for Software Rejuvenation", IEEE Transactions on Dependable and Secure Computing, 2 (2): 124â137,
doi:
10.1109/TDSC.2005.15,
S2CID15105513.
Wilcox, Rand R. (2005), "10.2 TheilâSen Estimator", Introduction to Robust Estimation and Hypothesis Testing, Academic Press, pp. 423â427,
ISBN978-0-12-751542-7.