STL decomposition of time series with missing values for anomaly detection

巧了我就是萌 提交于 2019-11-30 01:16:34

In the beginning of stl we find

x <- na.action(as.ts(x))

and soon after that

period <- frequency(x)
if (period < 2 || n <= 2 * period) 
    stop("series is not periodic or has less than two periods")

That is, stl expects x to be ts object after na.action(as.ts(x)) (otherwise period == 1). Let us check na.omit and na.exclude first.

Clearly, at the end of getAnywhere("na.omit.ts") we find

if (any(is.na(object))) 
    stop("time series contains internal NAs")

which is straightforward and nothing can be done (na.omit does not exclude NAs from ts objects). Now getAnywhere("na.exclude.default") excludes NA observations, but returns an object of class exclude:

    attr(omit, "class") <- "exclude"

and this is a different situation. As mentioned above, stl expects na.action(as.ts(x)) to be ts, but na.exclude(as.ts(x)) is of class exclude.

Hence if one is satisfied with NAs exclusion then e.g.

nottem[3] <- NA
frequency(nottem)
# [1] 12
na.new <- function(x) ts(na.exclude(x), frequency = 12)
stl(nottem, na.action = na.new, s.window = "per")

works. In general, stl does not work with NA values (i.e. with na.action = na.pass), it crashes deeper in Fortran (see full source code here):

z <- .Fortran(C_stl, ...

Alternatives to na.new are not delightful:

  • na.contaguous - finds the longest consecutive stretch of non-missing values in a time series object.
  • na.approx, na.locf from zoo or some other interpolation function.
  • Not sure about this one, but another one Fortran implementation can be found for Python here. One could use Python of possibly install R from source after some modifications, in case this module really allows missing values.

As we can see in the paper, there is no some simple procedure for missing values (like approximating them in the very beginning) which could be applied to the time series before calling stl. So considering the fact that original implementation is quite lengthy I would think about some other alternatives than whole new implementation.

Update: a quite optimal in many aspects choice when having NAs could be na.approx from zoo, so let us check its performance, i.e. compare results of stl with full data set and results when having some number of NAs, using na.approx. I am using MAPE as a measure of accuracy, but only for trend, because seasonal component and remainder crosses zero and it would distort the result. Positions for NAs are chosen at random.

library(zoo)
library(plyr)
library(reshape)
library(ggplot2)
mape <- function(f, x) colMeans(abs(1 - f / x) * 100)

stlCheck <- function(data, p = 3, ...){
  set.seed(20130201)
  pos <- lapply(3^(0:p), function(x) sample(1:length(data), x))
  datasetsNA <- lapply(pos, function(x) {data[x] <- NA; data})
  original <- data.frame(stl(data, ...)$time.series, stringsAsFactors = FALSE)
  original$id <- "Original"
  datasetsNA <- lapply(datasetsNA, function(x) 
    data.frame(stl(x, na.action = na.approx, ...)$time.series, 
               id = paste(sum(is.na(x)), "NAs"), 
               stringsAsFactors = FALSE))
  stlAll <- rbind.fill(c(list(original), datasetsNA))
  stlAll$Date <- time(data)
  stlAll <- melt(stlAll, id.var = c("id", "Date"))
  results <- data.frame(trend = sapply(lapply(datasetsNA, '[', i = "trend"), mape, original[, "trend"]))
  results$id <- paste(3^(0:p), "NAs")
  results <- melt(results, id.var = "id")
  results$x <- min(stlAll$Date) + diff(range(stlAll$Date)) / 4
  results$y <- min(original[, "trend"]) + diff(range(original[, "trend"])) / (4 * p) * (0:p)
  results$value <- round(results$value, 2)
  ggplot(stlAll, aes(x = Date, y = value, colour = id, group = id)) + geom_line() + 
    facet_wrap(~ variable, scales = "free_y") + theme_bw() +
    theme(legend.title = element_blank(), strip.background = element_rect(fill = "white")) + 
    labs(x = NULL, y = NULL) + scale_colour_brewer(palette = "Set1") +
    lapply(unique(results$id), function(z)
      geom_text(data = results, colour = "black", size = 3,
                aes(x = x, y = y, label = paste0("MAPE (", id, "): ", value, "%"))))
}

nottem, 240 observations

stlCheck(nottem, s.window = 4, t.window = 50, t.jump = 1)

co2, 468 observations

stlCheck(log(co2), s.window = 21)

mdeaths, 72 observations

stlCheck(mdeaths, s.window = "per")

Visually we do see some differences in trend in cases 1 and 3. But these differences are pretty small in 1 and also satisfactory in 3 considering sample size (72).

Realize this is an old question, but thought I'd update since there is a newer stl package available in R called stlplus. Here is its homepage on github. You can install it from CRAN with install.packages("stlplus") or directly from github with devtools::install_github("hafen/stlplus").

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!