Question
Possible Duplicate:
Populate NAs in a vector using prior non-NA values?
Is there an idiomatic way to copy cell values "down" in an R vector? By "copying down", I mean replacing NAs with the closest previous non-NA value.
While I can do this very simply with a for loop, it runs very slowly. Any advice on how to vectorise this would be appreciated.
# Test code
# Set up test data
len <- 1000000
data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
head(data, n=25)
tail(data, n=25)
# Time naive method
system.time({
data.clean <- data;
for (i in 2:length(data.clean)){
if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
}
})
# Print results
head(data.clean, n=25)
tail(data.clean, n=25)
Result of test run:
> # Set up test data
> len <- 1000000
> data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
> head(data, n=25)
[1] 1 NA NA NA NA NA NA NA NA NA 2 NA NA NA NA NA NA NA NA NA 3 NA NA NA NA
> tail(data, n=25)
[1] NA NA NA NA NA 99999 NA NA NA NA
[11] NA NA NA NA NA 100000 NA NA NA NA
[21] NA NA NA NA NA
>
> # Time naive method
> system.time({
+ data.clean <- data;
+ for (i in 2:length(data.clean)){
+ if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
+ }
+ })
user system elapsed
3.09 0.00 3.09
>
> # Print results
> head(data.clean, n=25)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
> tail(data.clean, n=25)
[1] 99998 99998 99998 99998 99998 99999 99999 99999 99999 99999
[11] 99999 99999 99999 99999 99999 100000 100000 100000 100000 100000
[21] 100000 100000 100000 100000 100000
>
Answer 1:
Use zoo::na.locf
Wrapping your code in a function f (including returning data.clean at the end):
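The answer never shows f itself; a minimal sketch, assuming it simply wraps the question's loop and returns the filled vector:

```r
# Sketch of f: the question's naive loop, wrapped so it returns the result
f <- function(x) {
  for (i in 2:length(x)) {
    # Copy the previous value down into any NA slot
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}
```

With f defined this way, the identical() check and benchmark below run as shown.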
library(rbenchmark)
library(zoo)
identical(f(data), na.locf(data))
## [1] TRUE
benchmark(f(data), na.locf(data), replications=10, columns=c("test", "elapsed", "relative"))
## test elapsed relative
## 1 f(data) 21.460 14.471
## 2 na.locf(data) 1.483 1.000
Answer 2:
I don't know about idiomatic, but here we identify the non-NA values (idx) and the index of the last non-NA value (cumsum(idx)):
f1 <- function(x) {
idx <- !is.na(x)
x[idx][cumsum(idx)]
}
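To see the mechanics on a small vector (an illustrative example, not from the answer):

```r
x <- c(10, NA, NA, 20, NA)
idx <- !is.na(x)        # TRUE FALSE FALSE TRUE FALSE
cumsum(idx)             # 1 1 1 2 2  -- index of the last non-NA seen so far
x[idx]                  # 10 20      -- the non-NA values, in order
x[idx][cumsum(idx)]     # 10 10 10 20 20
```

Each position in cumsum(idx) counts how many non-NA values have appeared up to that point, so indexing the compacted vector x[idx] with it repeats each value until the next one arrives.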
which seems to be about 6 times faster than na.locf for the example data. It drops leading NAs like na.locf does by default, so
f2 <- function(x, na.rm=TRUE) {
idx <- !is.na(x)
cidx <- cumsum(idx)
if (!na.rm)
cidx[cidx==0] <- NA_integer_
x[idx][cidx]
}
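A small usage example showing the na.rm behaviour on a vector with a leading NA (f2 repeated here so the snippet is self-contained):

```r
f2 <- function(x, na.rm=TRUE) {
  idx <- !is.na(x)
  cidx <- cumsum(idx)
  if (!na.rm)
    cidx[cidx == 0] <- NA_integer_   # keep leading NAs instead of dropping them
  x[idx][cidx]
}

x <- c(NA, 1, NA, NA)
f2(x)                 # 1 1 1   -- leading NA dropped (zero index drops the element)
f2(x, na.rm=FALSE)    # NA 1 1 1 -- leading NA preserved (NA index yields NA)
```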
which seems to add about 30% to the run time when na.rm=FALSE. Presumably na.locf has other merits, capturing more of the corner cases and allowing filling up instead of down (which is an interesting exercise in the cumsum world, anyway). It's also clear that we're making at least five allocations of possibly large data -- idx (actually, we calculate is.na() and its complement), cumsum(idx), x[idx], and x[idx][cumsum(idx)] -- so there's room for further improvement, e.g., in C.
Source: https://stackoverflow.com/questions/14449717/idiomatic-way-to-copy-cell-values-down-in-an-r-vector