efficiently locf by groups in a single R data.table

我只是一个虾纸丫 提交于 2019-12-17 18:55:28

问题


I have a large, wide data.table (20m rows) keyed by a person ID but with lots of columns (~150) that have lots of null values. Each column is a recorded state / attribute that I wish to carry forward for each person. Each person may have anywhere from 10 to 10,000 observations and there are about 500,000 people in the set. Values from one person can not 'bleed' into the following person, so my solution must respect the person ID column and group appropriately.

For demonstration purposes - here's a very small sample input:

DT = data.table(
  id=c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  aa=c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
  bb=c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  cc=c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
)

It looks like this:

    id aa bb cc
 1:  1  A NA  1
 2:  1 NA NA NA
 3:  1  B NA NA
 4:  1  C NA NA
 5:  2 NA NA NA
 6:  2 NA NA  4
 7:  2  D NA NA
 8:  2  E NA  5
 9:  3  F NA  6
10:  3 NA NA NA
11:  3 NA NA  7
12:  3 NA NA NA

My expected output looks like this:

    id aa bb cc
 1:  1  A NA  1
 2:  1  A NA  1
 3:  1  B NA  1
 4:  1  C NA  1
 5:  2 NA NA NA
 6:  2 NA NA  4
 7:  2  D NA  4
 8:  2  E NA  5
 9:  3  F NA  6
10:  3  F NA  6
11:  3  F NA  7
12:  3  F NA  7

I've found a data.table solution that works, but it's terribly slow on my large data sets:

DT[, na.locf(.SD, na.rm=FALSE), by=id]

I've found equivalent solutions using dplyr that are equally slow.

GRP = DT %>% group_by(id)
data.table(GRP %>% mutate_each(funs(blah=na.locf(., na.rm=FALSE))))

I was hopeful that I could come up with a rolling 'self' join using the data.table functionality, but I just can't seem to get it right (I suspect I would need to use .N but I just haven't figured it out).

At this point I'm thinking I'll have to write something in Rcpp to efficiently apply the grouped locf.

I'm new to R, but I'm not new to C++ - so I'm confident I can do it. I just feel like there should be an efficient way to do this in R using data.table.


回答1:


A very simple na.locf can be built by forwarding (cummax) the non-NA indices ((!is.na(x)) * seq_along(x)) and subsetting accordingly:

x = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)
x[cummax((!is.na(x)) * seq_along(x))]
# [1] 1 1 1 6 4 5 4 4 4 2

This replicates na.locf with an na.rm = TRUE argument, to get na.rm = FALSE behavior we simply need to make sure the first element in the cummax is TRUE:

x = c(NA, NA, 1, NA, 2)
x[cummax(c(TRUE, tail((!is.na(x)) * seq_along(x), -1)))]
#[1] NA NA  1  1  2

In this case, we need to take into account not only the non-NA indices but, also, of the indices where the (ordered, or to be ordered) "id" column changes value:

id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
c(TRUE, id[-1] != id[-length(id)])
# [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

Combining the above:

id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
x =  c(1,  NA, NA, 6,  4,  5,  4,  NA, NA, 2)

x[cummax(((!is.na(x)) | c(TRUE, id[-1] != id[-length(id)])) * seq_along(x))]
# [1]  1  1 NA  6  4  5  4  4 NA  2

Note, that here we OR the first element with TRUE, i.e. make it equal to TRUE, thus getting the na.rm = FALSE behavior.

And for this example:

id_change = DT[, c(TRUE, id[-1] != id[-.N])]
DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)])]
#    id aa bb cc
# 1:  1  A NA  1
# 2:  1  A NA  1
# 3:  1  B NA  1
# 4:  1  C NA  1
# 5:  2 NA NA NA
# 6:  2 NA NA  4
# 7:  2  D NA  4
# 8:  2  E NA  5
# 9:  3  F NA  6
#10:  3  F NA  6
#11:  3  F NA  7
#12:  3  F NA  7


来源:https://stackoverflow.com/questions/37060211/efficiently-locf-by-groups-in-a-single-r-data-table

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!