Efficiently removing missing values from the start and end of multiple time series in 1 data frame

问题

Using R, I'm trying to trim NA values from the start and end of a data frame that contains multiple time series. I have achieved my goal using a for loop and the zoo package, but as expected it is extremely inefficient on large data frames.

My data frame look like this and contains 3 columns with each time series identified by it's unique id. In this case AAA, B and CCC.

id   date          value
AAA  2010/01/01    NA
AAA  2010/02/01    34
AAA  2010/03/01    35
AAA  2010/04/01    30
AAA  2010/05/01    NA
AAA  2010/06/01    28
B    2010/01/01    NA
B    2010/02/01    0
B    2010/03/01    1
B    2010/04/01    2
B    2010/05/01    3
B    2010/06/01    NA
B    2010/07/01    NA
B    2010/07/01    NA
CCC  2010/01/01    0
CCC  2010/02/01    400
CCC  2010/03/01    300
CCC  2010/04/01    200
CCC  2010/05/01    NA

I would like to know, how can I efficiently remove the NA values from the start and end of each time series, in this case AAA, B and CCC. So it should look like this.

id   date          value
AAA  2010/02/01    34
AAA  2010/03/01    35
AAA  2010/04/01    30
AAA  2010/05/01    NA
AAA  2010/06/01    28
B    2010/02/01    0
B    2010/03/01    1
B    2010/04/01    2
B    2010/05/01    3
CCC  2010/01/01    0
CCC  2010/02/01    400
CCC  2010/03/01    300
CCC  2010/04/01    200

回答1:

I would do it like this, which should be very fast :

require(data.table)
DT = as.data.table(your data)   # please provide something pastable

DT2 = DT[!is.na(value)]
setkey(DT,id,date)
setkey(DT2,id,date)
tokeep = DT2[DT,!is.na(value),rolltolast=TRUE,mult="last"]
DT = DT[tokeep]

This works by rolling forward the prevailing non-NA, but not past the last one, within each group.

The mult="last" is optional. It should speed it up if v1.8.0 (on CRAN) is used. Interested in timings with and without it. By default data.table joins to groups (mult="all"), but in this case we're joining to all columns of the key, and, we know the key is unique; i.e., no dups in key. In v1.8.1 (in dev) there isn't a need to know about this and it looks after you more.

回答2:

If your data is in data frame data

fun <- function(x)
{
    x$value[is.na(x$value)] <- "NA"
    tmp <- rle(x$value)
    values <- tmp$values
    lengths <- tmp$lengths
    n <- length(values)

    nr <- nrow(x)
    id <- c()
    if(values[1] == "NA") id <- c(id, 1:lengths[1])
    if(values[n] == "NA") id <- c(id, (nr-lengths[n]+1):nr)
    if(length(id) == 0)return(x)
    x[-id,]
}

do.call(rbind,
        by(data, INDICES=data$id,
           FUN=fun))

Not the most elegant solution I guess. In the mood of this post.

来源：https://stackoverflow.com/questions/10811357/efficiently-removing-missing-values-from-the-start-and-end-of-multiple-time-seri

标签

dataframe

data.table

time-series

zoo