Summarize the self-join index while avoiding cartesian product in R data.table

眉间皱痕 submitted on 2019-12-01 09:47:36

If you can split your Y's into groups that don't have a large intersection of X's, you could do the computation by those groups first, resulting in a smaller intermediate table:

d[, grp := Y <= 3] # this particular split works best for OP data
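# self-join within each group, count rows per (X, i.X), then sum the counts across groups
d[, .SD[.SD, on = "Y", allow.cartesian = TRUE][, .N, by = .(X, i.X)], by = grp][,
    .(N = sum(N)), by = .(X, i.X)]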

The intermediate table above has only 16 rows, as opposed to 26. Unfortunately I can't think of an easy way to create such grouping automatically.
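As a rough, self-contained check of the row counts quoted above, the sketch below rebuilds the sample data that appears further down this page (assumed here to match the question's data) and compares the grouped computation with the direct cartesian self-join. The names full, direct, inter and grouped are illustrative only, and on = "Y" is spelled out so that no key is required.

```r
library(data.table)

# Sample data, assumed to be the same as the question's
d <- data.table(X = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
                Y = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 5L, 6L, 4L, 5L))

# Direct approach: materialise the whole cartesian self-join, then count
full   <- d[d, on = "Y", allow.cartesian = TRUE]
direct <- full[, .N, by = .(X, i.X)]

# Grouped approach: join and count within each group, then sum across groups
d[, grp := Y <= 3L]
inter   <- d[, .SD[.SD, on = "Y", allow.cartesian = TRUE][, .N, by = .(X, i.X)],
             by = grp]
grouped <- inter[, .(N = sum(N)), by = .(X, i.X)]

# Put both results in the same order so they can be compared row by row
setorder(direct, X, i.X)
setorder(grouped, X, i.X)

nrow(full)   # rows in the full cartesian join
nrow(inter)  # rows in the per-group intermediate table
identical(direct$N, grouped$N)
```

The grouping only pays off when the groups share few (X, i.X) pairs, because every pair that shows up in more than one group is stored once per group in the intermediate table.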

How about this one, using foverlaps()? The more consecutive values of Y you have for each X, the fewer rows this produces compared to a cartesian join.

library(data.table)

d = data.table(X=c(1,1,1,2,2,2,2,3,3,3,4,4), Y=c(1,2,3,1,2,3,4,1,5,6,4,5))
setorder(d, X, Y)  # Y must be sorted within each X for the diff() below
d[, id := cumsum(c(0L, diff(Y)) != 1L), by=X]
dd = d[, .(start=Y[1L], end=Y[.N]), by=.(X,id)][, id := NULL][]

ans <- foverlaps(dd, setkey(dd, start, end))
ans[, count := pmin(abs(i.end-start+1L), abs(end-i.start+1L), 
                    abs(i.end-i.start+1L), abs(end-start+1L))]
ans[, .(count = sum(count)), by=.(X, i.X)][order(i.X, X)]
#     X i.X count
#  1: 1   1     3
#  2: 2   1     3
#  3: 3   1     1
#  4: 1   2     3
#  5: 2   2     4
#  6: 3   2     1
#  7: 4   2     1
#  8: 1   3     1
#  9: 2   3     1
# 10: 3   3     3
# 11: 4   3     1
# 12: 2   4     1
# 13: 3   4     1
# 14: 4   4     2

Note: make sure X and Y are integers for faster results. This is because joins on integer types are faster than on double types (foverlaps performs binary joins internally).
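One possible way to do that conversion in place is sketched below. It assumes the columns hold whole numbers stored as doubles, as in the sample data; the helper name cols is illustrative.

```r
library(data.table)

d <- data.table(X = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
                Y = c(1, 2, 3, 1, 2, 3, 4, 1, 5, 6, 4, 5))

# Convert both columns from double to integer by reference
cols <- c("X", "Y")
d[, (cols) := lapply(.SD, as.integer), .SDcols = cols]

sapply(d, class)
```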

You can make this more memory efficient by using which=TRUE in foverlaps() and using the indices to generate count in the next step.
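A minimal sketch of that which=TRUE variant follows. It is one reading of the suggestion rather than the answerer's own code: the names idx and res are illustrative, and the overlap length is computed as min(end) - max(start) + 1, which should agree with the pmin() expression above for intervals that overlap.

```r
library(data.table)

d <- data.table(X = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
                Y = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 5L, 6L, 4L, 5L))
setorder(d, X, Y)

# Collapse runs of consecutive Y values within each X into [start, end] intervals
d[, id := cumsum(c(0L, diff(Y)) != 1L), by = X]
dd <- d[, .(start = Y[1L], end = Y[.N]), by = .(X, id)][, id := NULL][]
setkey(dd, start, end)

# which = TRUE returns only the matching row numbers (columns xid and yid)
idx <- foverlaps(dd, dd, which = TRUE)

# Look up the interval bounds through the indices and compute the overlap length
res <- idx[, .(X     = dd$X[yid],
               i.X   = dd$X[xid],
               count = pmin(dd$end[xid], dd$end[yid]) -
                       pmax(dd$start[xid], dd$start[yid]) + 1L)]
res <- res[, .(count = sum(count)), by = .(X, i.X)][order(i.X, X)]
res
```

Only the two index columns are materialised by the join itself, so the start, end and X columns of both sides are never copied into one wide intermediate table.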

You already have a solution written in SQL, so I suggest the R package sqldf.

Here's the code:

library(sqldf)

# `test` is the data frame from the question, with columns X and Y
result <- sqldf("SELECT A.X AS X, B.X AS i_X, COUNT(A.Y) AS N
                 FROM test AS A JOIN test AS B ON A.Y = B.Y
                 GROUP BY A.X, B.X")