I am trying to understand how tsoutliers is implemented. The source is easy to get from a CRAN mirror, and the package is mostly based on the paper “Joint Estimation of Model Parameters and Outlier Effects in Time Series”. Let’s start with the function `locate.outliers`.
It takes a few parameters:
- resid: the residuals from the ARIMA model fitted to the observed data
- pars: the AR (auto-regressive) and MA (moving average) parameters of the ARIMA model
- types: the types of outliers to look for, such as AO (additive outlier), LS (level shift) and TC (temporary change)
Before computing the test statistics for the outliers, we first have to estimate the residual standard deviation in a robust way, because the usual estimate is easily contaminated when outliers are present, as discussed in Section 1.4 of the paper (Estimation of Residual Standard Deviation). The paper mentions three approaches to get a better estimate:
- MAD: the median absolute deviation (scaled by a factor of 1.483)
- the α% trimmed method
- the omit-one method
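To make the first approach concrete, here is a minimal Python sketch of the scaled MAD estimate. It is not the package's R code: the function name `mad_sigma` and the sample residuals are made up for illustration, and only the 1.483 scale factor comes from the text above.

```python
from statistics import median, pstdev

def mad_sigma(resid, scale=1.483):
    """Robust residual standard deviation: scale * median(|e_t - median(e)|)."""
    center = median(resid)
    abs_dev = [abs(r - center) for r in resid]
    return scale * median(abs_dev)

# Hypothetical residuals with one obvious outlier (8.0).
resid = [0.1, -0.2, 0.05, 8.0, -0.1, 0.15, -0.05]

robust = mad_sigma(resid)   # barely affected by the outlier
naive = pstdev(resid)       # inflated by the outlier
print(robust, naive)
```

The point of the comparison is that a single large residual inflates the ordinary standard deviation, which would in turn shrink every standardized statistic and hide the outlier, while the MAD-based value stays close to the spread of the clean residuals.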
In the code, they first calculate sigma and then call the function `outliers.tstatistics` to compute the test statistics. `outliers.tstatistics` will be explained in another post, but let’s assume that we already have the metrics for every single data point in the time series: “type”, “indices”, “coefhat” (the least squares estimate of the effect of a single outlier) and “tstat” (the maximum value of the standardized statistics of the outlier effects).
Then they remove the rows whose abs(tstat) does not exceed cval (the critical value used as a threshold, 3.5 by default).
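The thresholding step can be sketched in Python as follows. This is an illustration under assumptions, not the package's R code: candidates are represented as plain dictionaries whose keys mirror the metric names mentioned above, and `filter_by_cval` is a hypothetical helper name.

```python
def filter_by_cval(candidates, cval=3.5):
    """Keep only the candidates whose |tstat| exceeds the critical value."""
    return [c for c in candidates if abs(c["tstat"]) > cval]

# Hypothetical candidate rows.
candidates = [
    {"type": "AO", "ind": 5,  "coefhat": 2.1,  "tstat": 4.2},
    {"type": "LS", "ind": 9,  "coefhat": -0.4, "tstat": -1.1},
    {"type": "TC", "ind": 12, "coefhat": -3.0, "tstat": -3.9},
]

kept = filter_by_cval(candidates)
print([(c["type"], c["ind"]) for c in kept])
```

Note that the comparison uses the absolute value, so a strongly negative statistic (like the TC row here) survives the filter just as a strongly positive one does.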
They also handle the scenario where consecutive LS outliers are found: only the one with the highest abs(tstat) is kept.
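One way to picture this rule is the Python sketch below, which groups LS candidates into runs of consecutive indices and keeps the strongest one from each run. It is an assumption-laden illustration, not the package's R code: the dictionary layout and the name `collapse_consecutive_ls` are hypothetical.

```python
def collapse_consecutive_ls(candidates):
    """For each run of LS candidates at consecutive indices, keep the max |tstat|."""
    ls = sorted((c for c in candidates if c["type"] == "LS"), key=lambda c: c["ind"])
    others = [c for c in candidates if c["type"] != "LS"]

    kept, run = [], []
    for c in ls:
        # A gap in the indices closes the current run.
        if run and c["ind"] != run[-1]["ind"] + 1:
            kept.append(max(run, key=lambda x: abs(x["tstat"])))
            run = []
        run.append(c)
    if run:
        kept.append(max(run, key=lambda x: abs(x["tstat"])))

    return sorted(others + kept, key=lambda c: c["ind"])

# Hypothetical candidates: LS at 10, 11, 12 form one run; LS at 20 stands alone.
candidates = [
    {"type": "LS", "ind": 10, "tstat": 4.0},
    {"type": "LS", "ind": 11, "tstat": -6.0},
    {"type": "LS", "ind": 12, "tstat": 5.0},
    {"type": "LS", "ind": 20, "tstat": 3.8},
    {"type": "AO", "ind": 11, "tstat": 4.5},
]

result = collapse_consecutive_ls(candidates)
print([(c["type"], c["ind"]) for c in result])
```

Non-LS candidates pass through untouched in this sketch; only runs of adjacent level shifts are collapsed.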
A point may also qualify as several types of outlier at once, in which case they keep only the type that exceeds cval and has the highest abs(tstat).
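The "one type per point" rule can be sketched in Python like this. Again, it is a hedged illustration rather than the package's R code: the dictionary keys and the helper name `keep_strongest_type` are assumptions, and the input is presumed to have already been filtered by cval.

```python
def keep_strongest_type(candidates):
    """When one index has several outlier types, keep the one with the max |tstat|."""
    best = {}
    for c in candidates:
        current = best.get(c["ind"])
        if current is None or abs(c["tstat"]) > abs(current["tstat"]):
            best[c["ind"]] = c
    return [best[i] for i in sorted(best)]

# Hypothetical candidates: index 7 is flagged as both AO and TC.
candidates = [
    {"type": "AO", "ind": 7,  "tstat": 4.1},
    {"type": "TC", "ind": 7,  "tstat": -5.3},
    {"type": "LS", "ind": 15, "tstat": 3.9},
]

result = keep_strongest_type(candidates)
print([(c["type"], c["ind"]) for c in result])
```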
After that come two big for loops, iloop (the inner loop) and oloop (the outer loop).