Is there a way to efficiently count column values in A falling within ranges in B using data.table?

Question

I have created some code to handle the following task:

ref = read.table(header=TRUE, text="
user    event
1441    120120102
1441    120120888
1443    120122122
1445    120124452
1445    120123525
1446    120123463", stringsAsFactors=FALSE)

data = read.table(header=TRUE, text="
user    event1        event2
1440    120123432     120156756
1441    120128523     120156545
1441    120123333     120146444
1441    120122344     120122355", stringsAsFactors=FALSE)

I have a function (credit to user Carlos Cinelli) that goes row by row through the table data and counts, for each row, how many events in ref fall between event1 and event2 for the same user id.

Is there a faster way to do what the code below does?

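library(data.table)
setDT(ref); setDT(data)  # the [, j] and := syntax below only works on data.tables
count <- function(x, y, z) ref[, sum(event >= x & event <= y & user == z)]
data[, count := mapply(count, x = event1, y = event2, z = user)]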

I haven't made much progress on my own and was wondering whether the data.table package has anything that could make the above faster. Thank you so much!

Answer 1:

Have a look at the examples from ?foverlaps. They clearly show how you can join based on overlapping intervals within other identifiers.

require(data.table) ## 1.9.3+
setDT(ref)
setDT(data)

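ref[, event2 := event]            # treat each event as a zero-length interval [event, event2]
setkey(ref, user, event, event2)  # foverlaps needs y keyed; the last two key columns are the interval
ans = foverlaps(data, ref, by.x=c("user", "event1", "event2"), which=TRUE, nomatch=0L)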

Your example is not ideal because it contains no overlaps, so I can't really demonstrate the next few steps. But ans should give you the row indices of ref (yid) that overlap each row of data (xid). The overlaps are computed within user, since user was set as a key column as well.
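
A minimal sketch of those next steps, assuming made-up data in which some events do fall inside the intervals. The tables ref2 and data2 and their values are hypothetical, not the poster's data:

```r
library(data.table)

# Hypothetical tables (assumption): same layout as the question, values altered to overlap
ref2 <- data.table(user  = c(1441L, 1441L, 1443L),
                   event = c(120123400L, 120130000L, 120122122L))
data2 <- data.table(user   = c(1440L, 1441L, 1441L),
                    event1 = c(120123432L, 120128523L, 120123333L),
                    event2 = c(120156756L, 120156545L, 120146444L))

# Treat each event as a zero-length interval and key ref2 on user + interval
ref2[, event2 := event]
setkey(ref2, user, event, event2)

# One row per (row of data2, overlapping row of ref2); non-matching rows are dropped
ov <- foverlaps(data2, ref2, by.x = c("user", "event1", "event2"),
                type = "any", which = TRUE, nomatch = 0L)

# Count the matches per row of data2; rows without any match keep a count of 0
counts <- ov[, .N, by = xid]
data2[, count := 0L]
data2[counts$xid, count := counts$N]
print(data2)
```

Starting from a zero count and filling in only the matched rows keeps the rows of data2 that have no events at all, which nomatch=0L would otherwise drop from the result.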

I hope you can take it from here... If this doesn't resolve it, please post an example that I can run to reproduce the issue you're running into.

HTH

Answer 2:

Non-equi joins were recently implemented and are available in the current development version of data.table, v1.9.7. With this feature, the task can be done quite straightforwardly:

require(data.table) # v1.9.7+
setDT(ref); setDT(data)
data[ref, .N, by=.EACHI, nomatch=0L, on=.(user, event1 <= event, event2 >= event)]
# returns an empty data.table here since no overlaps are found
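
Note that the join above is grouped by each row of ref. A sketch of the same idea with the roles swapped, so that there is one count per row of data as in the question, might look like the following. The tables ref3 and data3 are hypothetical (values altered so that overlaps exist), and the sketch assumes data.table 1.9.8 or later, the release that includes non-equi joins:

```r
library(data.table)  # assumes v1.9.8 or later for non-equi joins

# Hypothetical tables (assumption): same layout as the question, values altered to overlap
ref3 <- data.table(user  = c(1441L, 1441L, 1443L),
                   event = c(120123400L, 120130000L, 120122122L))
data3 <- data.table(user   = c(1440L, 1441L, 1441L),
                    event1 = c(120123432L, 120128523L, 120123333L),
                    event2 = c(120156756L, 120156545L, 120146444L))

# For each row of data3 (the i table), count the rows of ref3 with the same user
# whose event lies in [event1, event2]. The default nomatch keeps unmatched rows.
res <- ref3[data3, on = .(user, event >= event1, event <= event2),
            .N, by = .EACHI]

# res has one row per row of data3, in the same order, so N can be attached directly
data3[, count := res$N]
print(data3)
```

Leaving nomatch at its default, rather than 0L, is a deliberate choice in this sketch: rows of data3 without any matching event stay in the result with a count of 0 instead of disappearing.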

Source: https://stackoverflow.com/questions/26134707/is-there-a-way-to-efficiently-count-column-values-in-a-falling-within-ranges-in

Tags

data.table