Question
Is there a way to forward-fill NAs in a zoo or xts object, but only for a limited number of consecutive NAs? In other words, fill up to 3 consecutive NAs and keep the NAs from the 4th one onward, until the next valid number.
Something like this:
library(zoo)
x <- zoo(1:20, Sys.Date() + 1:20)
x[c(2:4, 6:10, 13:18)] <- NA
x
2014-09-20 2014-09-21 2014-09-22 2014-09-23 2014-09-24 2014-09-25 2014-09-26
1 NA NA NA 5 NA NA
2014-09-27 2014-09-28 2014-09-29 2014-09-30 2014-10-01 2014-10-02 2014-10-03
NA NA NA 11 12 NA NA
2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
NA NA NA NA 19 20
The desired output, with a variable n = 3, is:
2014-09-20 2014-09-21 2014-09-22 2014-09-23 2014-09-24 2014-09-25 2014-09-26
1 1 1 1 5 5 5
2014-09-27 2014-09-28 2014-09-29 2014-09-30 2014-10-01 2014-10-02 2014-10-03
5 NA NA 11 12 12 12
2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
12 NA NA NA 19 20
I have tried a lot of combinations with na.locf(x, maxgap = 3) etc. without much success. I can write a loop to get the desired output, but I was wondering whether there is a vectorized way of achieving this.
fillInTheBlanks <- function(v, n=3) {
  result <- v
  counter0 <- 1
  for(i in 2:length(v)) {
    value <- v[i]
    if (is.na(value)) {
      if (counter0 > n) {
        result[i] <- v[i]
      } else {
        result[i] <- result[i-1]
        counter0 <- counter0 + 1
      }
    } else {
      result[i] <- v[i]
      counter0 <- 1
    }
  }
  return(result)
}
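For context, a minimal sketch of why maxgap does not produce the output above: as the zoo documentation describes it, na.locf() with maxgap leaves any run of more than maxgap NAs completely untouched, rather than filling the first maxgap entries of that run. The vector v below is made up for illustration.

```r
library(zoo)

# Made-up vector: one run of 2 NAs and one run of 4 NAs
v <- c(1, NA, NA, 4, NA, NA, NA, NA, 9)

# The run of 2 is filled; the run of 4 exceeds maxgap and stays entirely NA
gap_filled <- na.locf(v, na.rm = FALSE, maxgap = 3)
gap_filled
# [1]  1  1  1  4 NA NA NA NA  9
```

What is wanted here instead is 1 1 1 4 4 4 4 NA 9, i.e. a partial fill of the long run.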
Thanks
Answer 1:
Here's another way:
l <- cumsum(! is.na(x))
c(NA, x[! is.na(x)])[replace(l, ave(l, l, FUN=seq_along) > 4, 0) + 1]
# [1] 1 1 1 1 5 5 5 5 NA NA 11 12 12 12 12 NA NA NA 19 20
Edit: my previous answer required that x have no duplicates. The current answer does not.
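The one-liner above hard-codes the limit (the 4 is n + 1). A sketch of the same lookup idea with n as a parameter and the intermediate steps named might look like this; fill_na_limited and its local variable names are made up here, and it works on a plain vector rather than on the zoo object.

```r
# Hypothetical helper: same indexing trick, with the limit n exposed
fill_na_limited <- function(v, n = 3) {
  ok     <- !is.na(v)
  grp    <- cumsum(ok)                      # index of the last non-NA seen (0 = none yet)
  pos    <- ave(grp, grp, FUN = seq_along)  # position inside each run, 1 = the value itself
  lookup <- c(NA, v[ok])                    # slot 1 is NA, slot k + 1 is the k-th non-NA value
  # positions more than n past their value are redirected to slot 1 (NA)
  lookup[replace(grp, pos > n + 1, 0) + 1]
}

v <- c(1, NA, NA, NA, 5, NA, NA, NA, NA, NA, 11, 12,
       NA, NA, NA, NA, NA, NA, 19, 20)
res <- fill_na_limited(v, n = 3)
res
# [1]  1  1  1  1  5  5  5  5 NA NA 11 12 12 12 12 NA NA NA 19 20
```

For a zoo series, one option is to apply it to the values and reattach the index, e.g. zoo(fill_na_limited(coredata(x)), index(x)).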
Benchmarks
library(microbenchmark)
x <- rep(x, length.out=1e4)
plourde <- function(x) {
  l <- cumsum(! is.na(x))
  c(NA, x[! is.na(x)])[replace(l, ave(l, l, FUN=seq_along) > 4, 0) + 1]
}
agstudy <- function(x) {
  unlist(sapply(split(coredata(x),cumsum(!is.na(x))),
                function(sx){
                  if(length(sx)>3)
                    sx[2:4] <- rep(sx[1],3)
                  else sx <- rep(sx[1],length(sx))
                  sx
                }))
}
microbenchmark(plourde(x), agstudy(x))
# Unit: milliseconds
# expr min lq median uq max neval
# plourde(x) 5.30 5.591 6.409 6.774 57.13 100
# agstudy(x) 16.04 16.249 16.454 17.516 20.64 100
Answer 2:
And another idea that, unless I've missed something, seems valid:
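na_locf_until = function(x, n = 3)
{
  # positions of the non-NA values
  wnn = which(!is.na(x))
  # add the positions where NA must resume (n + 1 past a value, if still inside the gap);
  # length(x) + 1 so that a trailing run of exactly n + 1 NAs is not filled completely
  inds = sort(c(wnn, (wnn + n+1)[which((wnn + n+1) < c(wnn[-1], length(x) + 1))]))
  c(rep(NA, wnn[1] - 1),
    as.vector(x)[rep(inds, c(diff(inds), length(x) - inds[length(inds)] + 1))])
}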
na_locf_until(x)
#[1] 1 1 1 1 5 5 5 5 NA NA 11 12 12 12 12 NA NA NA 19 20
Answer 3:
This does not use na.locf. The idea is to split your xts into groups that each start at a non-missing value, then within each group replace only the first 3 values after the non-missing one with that value. It is still a loop, but since it runs once per group rather than once per value, it should be faster than a simple loop over all the values.
zz <-
  unlist(sapply(split(coredata(x),cumsum(!is.na(x))),
                function(sx){
                  if(length(sx)>3)
                    sx[2:4] <- rep(sx[1],3)
                  else sx <- rep(sx[1],length(sx))
                  sx
                }))
## rebuild the zoo object, since the code above works only on the values
zoo(zz,index(x))
2014-09-20 2014-09-21 2014-09-22 2014-09-23 2014-09-24 2014-09-25 2014-09-26 2014-09-27 2014-09-28 2014-09-29 2014-09-30 2014-10-01 2014-10-02
1 1 1 1 5 5 5 5 NA NA 11 12 12
2014-10-03 2014-10-04 2014-10-05 2014-10-06 2014-10-07 2014-10-08 2014-10-09
12 12 NA NA NA 19 20
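The 3 is hard-coded in the snippet above. A sketch of the same split-by-group idea with the limit as a parameter could look like the following; fill_by_group is a made-up name, and the function works on a plain vector of values.

```r
# Hypothetical generalisation of the split-by-group approach
fill_by_group <- function(v, n = 3) {
  # each group starts at a non-NA value; group "0" holds any leading NAs
  groups <- split(v, cumsum(!is.na(v)))
  filled <- lapply(groups, function(s) {
    keep <- min(length(s), n + 1)   # the value itself plus at most n filled NAs
    c(rep(s[1], keep), rep(NA, length(s) - keep))
  })
  unlist(filled, use.names = FALSE)
}

out <- fill_by_group(c(NA, 1, NA, NA, NA, NA, 7), n = 3)
out
# [1] NA  1  1  1  1 NA  7
```

As in the answer, the zoo object can be rebuilt afterwards with zoo(fill_by_group(coredata(x)), index(x)).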
Answer 4:
The cleanest way to implement this in data.table is probably using the join syntax:
na.omit(dt)[dt, on = .(date), roll = +3, .(date, x_filled = x, x = i.x)]
date x_filled x
1: 2019-02-14 1 1
2: 2019-02-15 1 NA
3: 2019-02-16 1 NA
4: 2019-02-17 1 NA
5: 2019-02-18 5 5
6: 2019-02-19 5 NA
7: 2019-02-20 5 NA
8: 2019-02-21 5 NA
9: 2019-02-22 NA NA
10: 2019-02-23 NA NA
11: 2019-02-24 11 11
12: 2019-02-25 12 12
13: 2019-02-26 12 NA
14: 2019-02-27 12 NA
15: 2019-02-28 12 NA
16: 2019-03-01 NA NA
17: 2019-03-02 NA NA
18: 2019-03-03 NA NA
19: 2019-03-04 19 19
20: 2019-03-05 20 20
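* This solution depends on the date column being contiguous (no missing days), because roll = +3 is measured in the units of the join column (days here), not in rows.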
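If the dates are not contiguous, one possible workaround is to roll on a row counter instead of the date. This is only a sketch: the table dt2 and the column idx are made up for illustration, and the limit then counts rows rather than days.

```r
library(data.table)

# Made-up data with a hypothetical row-number column idx as the join key
dt2 <- data.table(x = c(1, NA, NA, NA, NA, 6))
dt2[, idx := .I]

# Roll each non-NA value forward by at most 3 rows
filled <- na.omit(dt2, cols = "x")[dt2, on = .(idx), roll = 3,
                                   .(idx, x_filled = x, x = i.x)]
filled$x_filled
# [1]  1  1  1  1 NA  6
```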
Answer 5:
From playing around in data.table comes this hacky solution:
np1 <- 3 + 1
dt[,
x_filled := x[c(rep(1, min(np1, .N)), rep(NA, max(0, .N - np1)))],
by = cumsum(!is.na(x))]
# Or slightly simplified:
dt[,
x_filled := ifelse(rowid(x) < 4, x[1], x[NA]),
by = cumsum(!is.na(x))]
> dt
date x x_filled
1: 2019-02-14 1 1
2: 2019-02-15 NA 1
3: 2019-02-16 NA 1
4: 2019-02-17 NA 1
5: 2019-02-18 5 5
6: 2019-02-19 NA 5
7: 2019-02-20 NA 5
8: 2019-02-21 NA 5
9: 2019-02-22 NA NA
10: 2019-02-23 NA NA
11: 2019-02-24 11 11
12: 2019-02-25 12 12
13: 2019-02-26 NA 12
14: 2019-02-27 NA 12
15: 2019-02-28 NA 12
16: 2019-03-01 NA NA
17: 2019-03-02 NA NA
18: 2019-03-03 NA NA
19: 2019-03-04 19 19
20: 2019-03-05 20 20
We build on the fact that subsetting vectors with NA returns NA.
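A small illustration of that subsetting behaviour in base R, using a made-up vector vals:

```r
vals <- c(10L, 20L, 30L)

# An NA inside a numeric index yields NA at that position
by_index <- vals[c(1, 1, NA)]
by_index
# [1] 10 10 NA

# A bare NA is a logical index, so it is recycled to the full length
by_logical <- vals[NA]
by_logical
# [1] NA NA NA

# An integer NA picks a single (missing) element
by_int_na <- vals[NA_integer_]
by_int_na
# [1] NA
```

The first form is what the rep(1, ...) / rep(NA, ...) index relies on, and the second is why x[NA] has the same length as x inside ifelse().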
Data/Packages
library(zoo)
library(data.table)
x <- zoo(1:20, Sys.Date() + 1:20)
x[c(2:4, 6:10, 13:18)] <- NA
dt <- data.table(date = index(x), x = as.integer(x))
Source: https://stackoverflow.com/questions/25940241/fill-na-in-a-time-series-only-to-a-limited-number