Question
I have created some code to handle the following task:
ref = read.table(header=TRUE, text="
user event
1441 120120102
1441 120120888
1443 120122122
1445 120124452
1445 120123525
1446 120123463", stringsAsFactors=FALSE)
data = read.table(header=TRUE, text="
user event1 event2
1440 120123432 120156756
1441 120128523 120156545
1441 120123333 120146444
1441 120122344 120122355", stringsAsFactors=FALSE)
What I have here is a function (credit to user Carlos Cinelli) that goes through the table data row by row and records, for each user id, how many events in ref fall between event1 and event2.
Now, I am wondering if there is a faster way to do what the function below does:
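library(data.table)
setDT(ref); setDT(data)  # the data.table syntax below does not work on plain data.frames
count <- function(x, y, z) ref[, sum(event >= x & event <= y & user == z)]
data[, count := mapply(count, x = event1, y = event2, z = user)]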
I haven't been able to make much progress and was wondering whether the data.table package has anything that could make the above faster. Thank you so much!
Answer 1:
Have a look at the examples from ?foverlaps. They clearly show how you can join based on overlapping intervals within other identifiers.
require(data.table) ## 1.9.3+
setDT(ref)
setDT(data)
setkey(ref[, event2 := event])
ans = foverlaps(data, ref, by.x=c("user", "event1", "event2"), which=TRUE, nomatch=0L)
Your example is not ideal because it contains no overlaps, so I can't really demonstrate the next few steps. But ans should give you the overlapping row indices of ref (yid) for each row in data (xid). The overlaps are computed within user, since it was set as a key column as well.
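For illustration, here is a minimal sketch of what those next steps might look like. The data below is hypothetical (small made-up numbers chosen so that some events do fall inside the intervals), and the count column name is simply taken from the question. It assumes a data.table version that ships foverlaps().

```r
library(data.table)

# Hypothetical data: same layout as the question, but with overlaps.
ref <- data.table(user  = c(1441L, 1441L, 1441L, 1443L),
                  event = c(100L, 150L, 300L, 120L))
data <- data.table(user   = c(1440L, 1441L, 1441L),
                   event1 = c(90L, 90L, 160L),
                   event2 = c(200L, 200L, 310L))

# Turn every event into a zero-width interval and key ref on all columns.
ref[, event2 := event]
setkey(ref, user, event, event2)

# One row per (data row, ref row) pair where the event lies in the interval.
ans <- foverlaps(data, ref, by.x = c("user", "event1", "event2"),
                 which = TRUE, nomatch = 0L)

# Count the matches per row of data; rows without any match stay at 0.
counts <- ans[, .N, by = xid]
data[, count := 0L]
data[counts$xid, count := counts$N]
```

With this made-up input, the user 1440 row has no matching events, the first 1441 row contains the events 100 and 150, and the second 1441 row contains only 300.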
I hope you can take it from here... If this doesn't resolve it, please post an example that I can run to reproduce the issue you're running into.
HTH
Answer 2:
Non-equi joins were recently implemented and are available in the current development version of data.table, v1.9.7. With this feature, the task can be done in a quite straightforward manner:
require(data.table) # v1.9.7+
setDT(ref); setDT(data)
data[ref, .N, by=.EACHI, nomatch=0L, on=.(user, event1 <= event, event2 >= event)]
# returns an empty data.table here, since no overlaps are found
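Note that with by=.EACHI the call above is grouped by the rows of ref, since ref is the table passed in i. If the goal is one count per row of data, as in the question, one possible variant puts data in i instead. The sketch below uses hypothetical data (made-up numbers chosen so that some events fall inside the intervals) and assumes a data.table release that includes non-equi joins (1.9.8 or later on CRAN).

```r
library(data.table)

# Hypothetical data: same layout as the question, but with overlaps.
ref <- data.table(user  = c(1441L, 1441L, 1441L, 1443L),
                  event = c(100L, 150L, 300L, 120L))
data <- data.table(user   = c(1440L, 1441L, 1441L),
                   event1 = c(90L, 90L, 160L),
                   event2 = c(200L, 200L, 310L))

# For each row of data (the i table), count the rows of ref with the same
# user whose event lies in [event1, event2]. The default nomatch keeps
# rows of data that match nothing, and .N is 0 for them.
counts <- ref[data, .N, by = .EACHI,
              on = .(user, event >= event1, event <= event2)]

# by = .EACHI returns one row per row of data, in the same order.
data[, count := counts$N]
```

Here nomatch=0L is deliberately left out, so that intervals containing no events still show up with a count of 0 rather than being dropped.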
Source: https://stackoverflow.com/questions/26134707/is-there-a-way-to-efficiently-count-column-values-in-a-falling-within-ranges-in