Is there a way to fill NAs forward in a zoo or xts object, but only up to a limited number of consecutive NAs? In other words like fill
This does not use na.locf. The idea is to split your xts into groups, each starting at a non-missing value, and then within each group replace only the first 3 values after the non-missing one with that value. It is still a loop, but since it runs once per group rather than once per element, it should be faster than a simple loop over all the values.
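zz <-
  # lapply rather than sapply: sapply would return a matrix if all groups had equal length
  unlist(lapply(split(coredata(x), cumsum(!is.na(x))),
                function(sx) {
                  # sx = one non-missing value (or leading NAs) plus the NAs that follow it
                  if (length(sx) > 3)
                    sx[2:4] <- rep(sx[1], 3)
                  else sx <- rep(sx[1], length(sx))
                  sx
                }),
         use.names = FALSE)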
## rebuild the zoo object, since the algorithm above is applied only to the values
zoo(zz, index(x))
2014-09-20 2014-09-21 2014-09-22 2014-09-23 2014-09-24 2014-09-25 2014-09-26 2014-09-27 2014-09-28 2014-09-29 2014-09-30 2014-10-01 2014-10-02
         1          1          1          1          5          5          5          5         NA         NA         11         12         12
2014-10-03 2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
        12         12         NA         NA         NA         19         20
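For comparison, the same per-group idea can be sketched in base R with ave(), with the fill limit as a parameter. The helper name fill_forward_limited and its argument n are made up for this illustration and do not come from any package. It works on a plain vector, so for a zoo object it would presumably be applied to coredata(x) and the result wrapped in zoo() again.

```r
# Hypothetical helper (illustration only): carry each value forward over at most n NAs.
fill_forward_limited <- function(v, n = 3) {
  run_id <- cumsum(!is.na(v))                          # a new run starts at every non-NA value
  pos    <- ave(seq_along(v), run_id, FUN = seq_along) # position of each element within its run
  first  <- ave(v, run_id, FUN = function(s) s[1])     # first element of the run, recycled over it
  ifelse(pos <= n + 1, first, NA)                      # keep the value and the next n slots only
}

v <- c(1, NA, NA, NA, 5, NA, NA, NA, NA, NA, 11, 12, NA, NA, NA, NA, NA, NA, 19, 20)
fill_forward_limited(v)
# [1]  1  1  1  1  5  5  5  5 NA NA 11 12 12 12 12 NA NA NA 19 20
```

Leading NAs form their own run whose first element is NA, so they are left untouched.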
Here's another way:
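# l: run id; each non-NA value and the NAs that follow it share one id
l <- cumsum(!is.na(x))
# positions past the 4th in a run (value + 3 fills) get index 0, which the +1 shift maps to the leading NA
c(NA, x[!is.na(x)])[replace(l, ave(l, l, FUN = seq_along) > 4, 0) + 1]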
# [1] 1 1 1 1 5 5 5 5 NA NA 11 12 12 12 12 NA NA NA 19 20
edit: my previous answer required that x have no duplicates. The current answer does not.
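To see why the indexing works, and why duplicate values are not a problem, it may help to print the intermediate vectors on a small input. This is only an illustrative trace on a made-up plain vector with a repeated value; v, pos, idx and lookup are names introduced here.

```r
v <- c(7, NA, NA, NA, NA, 7, NA)         # the value 7 appears twice

l      <- cumsum(!is.na(v))              # run ids:             1 1 1 1 1 2 2
pos    <- ave(l, l, FUN = seq_along)     # position within run: 1 2 3 4 5 1 2
idx    <- replace(l, pos > 4, 0) + 1     # lookup index:        2 2 2 2 1 3 3
lookup <- c(NA, v[!is.na(v)])            # NA first, then the non-missing values: NA 7 7

lookup[idx]
# [1]  7  7  7  7 NA  7  7
```

Because the lookup goes through the run id rather than the value itself, two runs holding the same value still get separate slots.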
benchmarks
library(microbenchmark)
x <- rep(x, length.out = 1e4)
plourde <- function(x) {
  l <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[replace(l, ave(l, l, FUN = seq_along) > 4, 0) + 1]
}
agstudy <- function(x) {
  unlist(sapply(split(coredata(x), cumsum(!is.na(x))),
                function(sx) {
                  if (length(sx) > 3)
                    sx[2:4] <- rep(sx[1], 3)
                  else sx <- rep(sx[1], length(sx))
                  sx
                }))
}
microbenchmark(plourde(x), agstudy(x))
# Unit: milliseconds
#        expr   min     lq median     uq   max neval
#  plourde(x)  5.30  5.591  6.409  6.774 57.13   100
#  agstudy(x) 16.04 16.249 16.454 17.516 20.64   100
And another idea that, unless I've missed something, seems valid:
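na_locf_until = function(x, n = 3)
{
  wnn = which(!is.na(x))  # positions of the non-missing values
  # n+1 positions after each value is where NAs resume, if that comes before the next value;
  # length(x) + 1 bounds the last run, so an NA due at the very last position is kept
  inds = sort(c(wnn, (wnn + n+1)[which((wnn + n+1) < c(wnn[-1], length(x) + 1))]))
  c(rep(NA, wnn[1] - 1),
    as.vector(x)[rep(inds, c(diff(inds), length(x) - inds[length(inds)] + 1))])
}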
na_locf_until(x)
#[1] 1 1 1 1 5 5 5 5 NA NA 11 12 12 12 12 NA NA NA 19 20
From playing around in data.table comes this hacky solution:
np1 <- 3 + 1  # the non-missing value itself plus 3 filled NAs
dt[,
   x_filled := x[c(rep(1, min(np1, .N)), rep(NA, max(0, .N - np1)))],
   by = cumsum(!is.na(x))]
# Or slightly simplified:
dt[,
   x_filled := ifelse(rowid(x) < np1, x[1], x[NA]),
   by = cumsum(!is.na(x))]
> dt
date x x_filled
1: 2019-02-14 1 1
2: 2019-02-15 NA 1
3: 2019-02-16 NA 1
4: 2019-02-17 NA 1
5: 2019-02-18 5 5
6: 2019-02-19 NA 5
7: 2019-02-20 NA 5
8: 2019-02-21 NA 5
9: 2019-02-22 NA NA
10: 2019-02-23 NA NA
11: 2019-02-24 11 11
12: 2019-02-25 12 12
13: 2019-02-26 NA 12
14: 2019-02-27 NA 12
15: 2019-02-28 NA 12
16: 2019-03-01 NA NA
17: 2019-03-02 NA NA
18: 2019-03-03 NA NA
19: 2019-03-04 19 19
20: 2019-03-05 20 20
We build on the fact that subsetting vectors with NA returns NA.
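A quick base R illustration of that subsetting behaviour, on made-up vectors (v, grp and the local np1 are just example names). The last expression mirrors the index vector built inside the first data.table call, with length(grp) standing in for .N.

```r
v <- c(10L, 20L, 30L)

v[c(1, NA, 3)]    # an NA index yields an NA element: 10 NA 30
v[NA]             # logical NA is recycled to the length of v: NA NA NA
v[NA_integer_]    # an integer NA yields a single NA

# One group as seen inside `by`: a value followed by five NAs.
grp <- c(5L, NA, NA, NA, NA, NA)
np1 <- 4
grp[c(rep(1, min(np1, length(grp))), rep(NA, max(0, length(grp) - np1)))]
# [1]  5  5  5  5 NA NA
```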
Data/Packages
library(zoo)
library(data.table)
x <- zoo(1:20, Sys.Date() + 1:20)
x[c(2:4, 6:10, 13:18)] <- NA
dt <- data.table(date = index(x), x = as.integer(x))
The cleanest way to implement this in data.table is probably using the join syntax:
na.omit(dt)[dt, on = .(date), roll = +3, .(date, x_filled = x, x = i.x)]
date x_filled x
1: 2019-02-14 1 1
2: 2019-02-15 1 NA
3: 2019-02-16 1 NA
4: 2019-02-17 1 NA
5: 2019-02-18 5 5
6: 2019-02-19 5 NA
7: 2019-02-20 5 NA
8: 2019-02-21 5 NA
9: 2019-02-22 NA NA
10: 2019-02-23 NA NA
11: 2019-02-24 11 11
12: 2019-02-25 12 12
13: 2019-02-26 12 NA
14: 2019-02-27 12 NA
15: 2019-02-28 12 NA
16: 2019-03-01 NA NA
17: 2019-03-02 NA NA
18: 2019-03-03 NA NA
19: 2019-03-04 19 19
20: 2019-03-05 20 20
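*This solution depends on the date column, and on its values being contiguous (no missing days), because roll = +3 limits the fill by distance in the date column rather than by number of rows.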
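To illustrate the caveat, here is a minimal sketch on a made-up table with a gap in the dates, assuming data.table is installed. The names gap, idx, by_date and by_row are introduced for this example only. Joining on a row counter instead of the date is one possible workaround when the dates are not contiguous, not something the answer above proposes.

```r
library(data.table)

# Three rows, but the third is 8 days after the second.
gap <- data.table(date = as.Date("2019-02-14") + c(0, 1, 9),
                  x    = c(1L, NA, NA))

# Rolling on date: the last row is 9 days from the last observation, so it stays NA.
by_date <- na.omit(gap)[gap, on = .(date), roll = 3, .(date, x_filled = x)]

# Rolling on a row counter: the limit now counts rows, so both NAs are filled.
gap[, idx := .I]
by_row <- na.omit(gap)[gap, on = .(idx), roll = 3, .(date = i.date, x_filled = x)]

by_date$x_filled  # 1 1 NA
by_row$x_filled   # 1 1 1
```

In the second join the date has to be taken from i (i.date), because date is no longer the join column.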