Gap size calculation in time series with R

Submitted by 五迷三道 on 2020-12-13 04:55:44

Question


Let's say I have a dataframe that contains a time series as below:

Date                value
2000-01-01 00:00:00  4.6
2000-01-01 01:00:00  N/A
2000-01-01 02:00:00  5.3
2000-01-01 03:00:00  6.0
2000-01-01 04:00:00  N/A
2000-01-01 05:00:00  N/A
2000-01-01 06:00:00  N/A
2000-01-01 07:00:00  6.0

I want to find an efficient way to calculate the size of the gap (number of consecutive N/As) and add it to a new column of my dataframe to get the following:

Date                value  gap_size
2000-01-01 00:00:00  4.6      0
2000-01-01 01:00:00  N/A      1
2000-01-01 02:00:00  5.3      0
2000-01-01 03:00:00  6.0      0
2000-01-01 04:00:00  N/A      3
2000-01-01 05:00:00  N/A      3
2000-01-01 06:00:00  N/A      3
2000-01-01 07:00:00  6.0      0

In reality my dataframe has more than 6 million rows, so I am looking for the cheapest approach in terms of computation. Note that my time series is equally spaced (1 hour) over the whole dataset.


Answer 1:


You could use rle here to compute run lengths. First convert your value column to a logical vector with is.na, then apply rle, which returns the lengths of consecutive runs of identical values in its input. Here the only two values are TRUE and FALSE, so you are counting how long each run of missing (or non-missing) entries lasts. Multiplying the run values by the run lengths zeroes out the non-missing runs, and repeating each result by its run length gives one gap size per element.

x = c(1,2,4,NA,NA,6,NA,19,NA,NA)
res = rle(is.na(x))
rep(res$values*res$lengths,res$lengths)
#> [1] 0 0 0 2 2 0 1 0 2 2
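To connect this back to the question, here is a minimal sketch that applies the same rle idea to a data frame column. The data frame df is a hypothetical reconstruction of the question's sample, and it assumes the missing readings are stored as real NA values rather than as "N/A" strings.

```r
# Hypothetical data frame mirroring the question's example
# (assumption: missing readings are real NA values, not "N/A" strings).
df <- data.frame(
  Date = seq(as.POSIXct("2000-01-01 00:00:00", tz = "UTC"),
             by = "hour", length.out = 8),
  value = c(4.6, NA, 5.3, 6.0, NA, NA, NA, 6.0)
)

# Run-length encode the missingness indicator once and reuse the result.
runs <- rle(is.na(df$value))

# TRUE * length keeps the gap length, FALSE * length gives 0;
# rep() then expands each run back to one entry per row.
df$gap_size <- rep(runs$values * runs$lengths, runs$lengths)

df$gap_size
#> [1] 0 1 0 0 3 3 3 0
```

Both rle and rep are vectorised and make a single pass over the column, so this should stay cheap on a series of several million rows, although it is worth timing on the real data.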



Answer 2:


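Convert the data frame to a data.table with setDT(), then compute the run lengths of value and zero them out wherever the value is not "N/A". This assumes value is a character column in which missing entries are the literal string "N/A", as in the data below: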

dt[, gap := rep(rle(value)$lengths, rle(value)$lengths) * (value == "N/A")]
                  Date value gap
1: 2000-01-01 00:00:00   4.6   0
2: 2000-01-01 01:00:00   N/A   1
3: 2000-01-01 02:00:00   5.3   0
4: 2000-01-01 03:00:00   6.0   0
5: 2000-01-01 04:00:00   N/A   3
6: 2000-01-01 05:00:00   N/A   3
7: 2000-01-01 06:00:00   N/A   3
8: 2000-01-01 07:00:00   6.0   0

Data:

dt <- structure(list(Date = c("2000-01-01 00:00:00", "2000-01-01 01:00:00", 
"2000-01-01 02:00:00", "2000-01-01 03:00:00", "2000-01-01 04:00:00", 
"2000-01-01 05:00:00", "2000-01-01 06:00:00", "2000-01-01 07:00:00"
), value = c("4.6", "N/A", "5.3", "6.0", "N/A", "N/A", "N/A", 
"6.0")), row.names = c(NA, -8L), class = c("data.table", "data.frame"
))
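If the column really does hold "N/A" strings as in the data above, one option is to convert them to real NA values first, so that the column becomes numeric and the is.na-based approach applies. The following is a base R sketch under that assumption; the names value_chr, value_num and gap_size are illustrative only.

```r
# Sample strings copied from the data above; "N/A" marks a missing reading.
value_chr <- c("4.6", "N/A", "5.3", "6.0", "N/A", "N/A", "N/A", "6.0")

# Swap the "N/A" strings for NA before converting, which avoids the
# coercion warning that as.numeric("N/A") would otherwise raise.
value_num <- as.numeric(replace(value_chr, value_chr == "N/A", NA))

# Same run-length idea as in the first answer, with rle() called once.
runs <- rle(is.na(value_num))
gap_size <- rep(runs$values * runs$lengths, runs$lengths)

gap_size
#> [1] 0 1 0 0 3 3 3 0
```

Working on the logical is.na vector rather than on the raw strings also means rle only has to compare TRUE and FALSE values.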


Source: https://stackoverflow.com/questions/51029171/gap-size-calculation-in-time-series-with-r
