Question
Let's say I have a dataframe that contains a time series, as below:
Date value
2000-01-01 00:00:00 4.6
2000-01-01 01:00:00 N/A
2000-01-01 02:00:00 5.3
2000-01-01 03:00:00 6.0
2000-01-01 04:00:00 N/A
2000-01-01 05:00:00 N/A
2000-01-01 06:00:00 N/A
2000-01-01 07:00:00 6.0
I want to find an efficient way to calculate the size of the gap (number of consecutive N/As) and add it to a new column of my dataframe to get the following:
Date value gap_size
2000-01-01 00:00:00 4.6 0
2000-01-01 01:00:00 N/A 1
2000-01-01 02:00:00 5.3 0
2000-01-01 03:00:00 6.0 0
2000-01-01 04:00:00 N/A 3
2000-01-01 05:00:00 N/A 3
2000-01-01 06:00:00 N/A 3
2000-01-01 07:00:00 6.0 0
My real dataframe has more than 6 million rows, so I am looking for the cheapest approach in terms of computation. Note that my time series is equally spaced over the whole dataset (1 hour).
Answer 1:
You could use rle to generate run lengths here. First, convert your value column to a logical vector with is.na, then apply rle, which returns the lengths of consecutive runs of equal values in the input vector. In this case the two values are TRUE and FALSE, so you are counting how long each run of NA or non-NA lasts. Multiplying the run values by the run lengths zeroes out the non-NA runs, and rep then replicates each result by its run length to give the output you're looking for.
x = c(1,2,4,NA,NA,6,NA,19,NA,NA)
res = rle(is.na(x))
rep(res$values*res$lengths,res$lengths)
#> [1] 0 0 0 2 2 0 1 0 2 2
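Applied to the data frame from the question, the same idea might look like the sketch below. It assumes the missing entries are real NA values rather than the string "N/A", and df is a hypothetical name for the data frame; the column names follow the question.

```r
# Rebuild the example from the question; `df` is a hypothetical name,
# and missing readings are assumed to be real NA values.
df <- data.frame(
  Date = seq(as.POSIXct("2000-01-01 00:00:00", tz = "UTC"),
             by = "hour", length.out = 8),
  value = c(4.6, NA, 5.3, 6.0, NA, NA, NA, 6.0)
)

# Run-length encode the missingness indicator.
runs <- rle(is.na(df$value))

# Keep the length for NA runs, zero it for non-NA runs,
# then expand each run back to one entry per row.
df$gap_size <- rep(runs$values * runs$lengths, runs$lengths)

df$gap_size
#> [1] 0 1 0 0 3 3 3 0
```

Everything here is vectorised, so it should scale roughly linearly with the number of rows.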
Answer 2:
Convert the data to a data.table with setDT() and then:
dt[, gap := rep(rle(value)$lengths, rle(value)$lengths) * (value == "N/A")]
Date value gap
1: 2000-01-01 00:00:00 4.6 0
2: 2000-01-01 01:00:00 N/A 1
3: 2000-01-01 02:00:00 5.3 0
4: 2000-01-01 03:00:00 6.0 0
5: 2000-01-01 04:00:00 N/A 3
6: 2000-01-01 05:00:00 N/A 3
7: 2000-01-01 06:00:00 N/A 3
8: 2000-01-01 07:00:00 6.0 0
Data:
dt <- structure(list(Date = c("2000-01-01 00:00:00", "2000-01-01 01:00:00",
"2000-01-01 02:00:00", "2000-01-01 03:00:00", "2000-01-01 04:00:00",
"2000-01-01 05:00:00", "2000-01-01 06:00:00", "2000-01-01 07:00:00"
), value = c("4.6", "N/A", "5.3", "6.0", "N/A", "N/A", "N/A",
"6.0")), row.names = c(NA, -8L), class = c("data.table", "data.frame"
))
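One caveat before reusing this expression on other data: it relies on the missing entries being the literal string "N/A". The rle documentation notes that missing values are regarded as unequal to the previous value, even if that is also missing, so each real NA becomes its own run of length 1. On a numeric column with true NAs you would probably want to encode is.na(value) instead. A minimal base R sketch of the difference, using a made-up vector v:

```r
# Hypothetical numeric vector with real NA values.
v <- c(4.6, NA, 5.3, 6.0, NA, NA, NA, 6.0)

# rle() on the raw values: consecutive NAs are not merged into one run.
raw_runs <- rle(v)
raw_runs$lengths
#> [1] 1 1 1 1 1 1 1 1

# rle() on the missingness indicator: the three NAs form a single run.
na_runs <- rle(is.na(v))

# Expand run lengths to one per row, then zero out the non-NA rows.
gap <- rep(na_runs$lengths, na_runs$lengths) * is.na(v)
gap
#> [1] 0 1 0 0 3 3 3 0
```

Storing the rle result once, as above, also avoids computing it twice on a large column.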
Source: https://stackoverflow.com/questions/51029171/gap-size-calculation-in-time-series-with-r