Question
I have a data frame df in the following format and I am trying to get a data frame with all pairwise combinations of id per group:
df <- structure(list(
  id = c(209044052, 209044061, 209044061, 209044061, 209044062,
         209044062, 209044062, 209044182, 209044183, 209044295),
  group = c(2365686, 387969, 388978, 2365686, 387969, 388978,
            2365686, 2278460, 2278460, 654238)),
  .Names = c("id", "group"), row.names = c(NA, -10L), class = "data.frame")
While

do.call(rbind, lapply(split(df, df$group), function(i) expand.grid(i$id, i$id)))

works for a small data frame, I run into time problems on my large data (~12 million observations and ~1.5 million groups). After some testing I found that the split() call seems to be the bottleneck, and expand.grid() might not be the fastest solution either. I found some improvements for expand.grid() (Use outer instead of expand.grid) and some faster split() alternatives (Improving performance of split() function in R?), but I am struggling to put it all together with the grouping.
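For context, the per-group cross join can also be written without split() at all, for example with data.table's CJ() (a minimal sketch of the idea, not necessarily the fastest variant):

library(data.table)
# cross join of id with itself within each group;
# note that CJ() sorts its output by default (sorted = TRUE)
setDT(df)[, CJ(Var1 = id, Var2 = id), by = group]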
The output should look something like this:
Var1 Var2
209044061 209044061
209044062 209044061
209044061 209044062
209044062 209044062
209044061 209044061
209044062 209044061
209044061 209044062
209044062 209044062
209044295 209044295
209044182 209044182
209044183 209044182
....
As an extra, I would like to exclude repetitions of the same pair and self-references (e.g. 209044061 209044061 above), and keep only one combination when the same pair occurs in both orders (e.g. 209044061 209044062 and 209044062 209044061 above), i.e. combinations without repetition. I tried library(gtools) with combinations() but could not figure out whether this would slow down the calculation even more.
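For reference, "combinations without repetition" of size two correspond to base R's combn() with m = 2. A minimal, unoptimized sketch on top of the original split() approach (the guard is needed because combn() cannot draw pairs from fewer than two elements):

pairs <- do.call(rbind, lapply(split(df$id, df$group), function(ids) {
  ids <- unique(ids)
  if (length(ids) < 2L) return(NULL)  # single-id groups contribute no pairs
  t(combn(ids, 2L))                   # one row per unordered pair
}))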
Answer 1:
One possible solution which avoids repetitions of the same pair as well as reversed orderings uses the data.table and combinat packages:
library(data.table)
setDT(df)[order(id), data.table(combinat::combn2(unique(id))), by = group]
     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183
order(id) is used here just for convenience, to make the results easier to check, and can be skipped in production code.
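For clarity, combinat::combn2(x) returns a two-column matrix with one unordered pair of elements of x per row, which is why data.table() turns it directly into the V1/V2 columns shown above:

combinat::combn2(c(10, 20, 30))
# expected pairs, one per row: (10, 20), (10, 30), (20, 30)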
Replace combn2() by a non-equi join
There is another approach in which the call to combn2() is replaced by a non-equi join:
mdf <- setDT(df)[order(id), unique(id), by = group]
mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
allow.cartesian = TRUE]
     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183
Note that the non-equi join requires the data to be ordered.
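To see what the join operates on: for the example data, the intermediate table mdf holds one row per distinct group/id combination, ordered by id within each group (hand-derived here):

mdf
#       group        V1
#  1: 2365686 209044052
#  2: 2365686 209044061
#  3: 2365686 209044062
#  4:  387969 209044061
#  5:  387969 209044062
#  6:  388978 209044061
#  7:  388978 209044062
#  8: 2278460 209044182
#  9: 2278460 209044183
# 10:  654238 209044295

The join then pairs each id with every strictly smaller id in the same group, which yields each unordered pair exactly once.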
Benchmark
The second method seems to be much faster:
# create benchmark data
nr <- 1.2e5L # number of rows
rg <- 8L # number of ids within each group
ng <- nr / rg # number of groups
set.seed(1L)
df2 <- data.table(
id = sample.int(rg, nr, TRUE),
group = sample.int(ng, nr, TRUE)
)
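With these settings df2 has 120,000 rows; the number of distinct groups actually drawn can be checked with uniqueN(), which for the seed and R version used in the original post gives the 14,994 groups quoted in the timings below:

df2[, uniqueN(group)]
# expected: 14994 distinct groups for set.seed(1L)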
# benchmark code
microbenchmark::microbenchmark(
  combn2 = df2[order(group, id), data.table(combinat::combn2(unique(id))), by = group],
  nej = {
    mdf <- df2[order(group, id), unique(id), by = group]
    mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
        allow.cartesian = TRUE]
  },
  times = 1L)
For 120000 rows and 14994 groups the timings are:
Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 combn2 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115     1
    nej   137.3228   137.3228   137.3228   137.3228   137.3228   137.3228     1
Caveat
As pointed out by the OP, the number of ids per group is crucial in terms of memory consumption and speed. The number of combinations grows as O(n^2); it is exactly n * (n - 1) / 2, or choose(n, 2L), where n is the number of distinct ids per group. For example, a single group with 1,000 distinct ids alone contributes choose(1000, 2L) = 499,500 pairs.
The size of the largest group can be found by
df2[, uniqueN(id), by = group][, max(V1)]
The total number of rows in the final result can be computed in advance by
df2[, uniqueN(id), by = group][, sum(choose(V1, 2L))]
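Both checks can share one pass over the data, so the per-group counts are not computed twice (a small usage sketch; sizes is a helper name introduced here):

sizes <- df2[, uniqueN(id), by = group]   # distinct ids per group
sizes[, max(V1)]                          # size of the largest group
sizes[, sum(choose(V1, 2L))]              # total rows in the pairwise result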
Source: https://stackoverflow.com/questions/46937866/split-and-expand-grid-by-group-on-large-data-set