Question
I have a data frame df in the following format and I am trying to get a data frame with all pairwise combinations of id per group:
df <- structure(list(
  id = c(209044052, 209044061, 209044061, 209044061, 209044062,
         209044062, 209044062, 209044182, 209044183, 209044295),
  group = c(2365686, 387969, 388978, 2365686, 387969, 388978,
            2365686, 2278460, 2278460, 654238)),
  .Names = c("id", "group"), row.names = c(NA, -10L), class = "data.frame")
While

do.call(rbind, lapply(split(df, df$group), function(i) expand.grid(i$id, i$id)))

works for a small data frame, I run into time problems on my large data (~12 million observations and ~1.5 million groups). After some testing I found that the split() call seems to be the bottleneck, and expand.grid() might not be the fastest solution either. I found some improvements for expand.grid() (Use outer instead of expand.grid) and some faster split() alternatives (Improving performance of split() function in R?), but I am struggling to put it all together with the grouping.
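For context, the per-group cross join can also be written without split() at all, for example with data.table's CJ() (a minimal sketch of the idea, not necessarily the fastest variant):

library(data.table)
# cross join of id with itself within each group;
# note that CJ() sorts its output by default (sorted = TRUE)
setDT(df)[, CJ(Var1 = id, Var2 = id), by = group]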
The output should look something like this:
Var1 Var2
209044061 209044061
209044062 209044061
209044061 209044062
209044062 209044062
209044061 209044061
209044062 209044061
209044061 209044062
209044062 209044062
209044295 209044295
209044182 209044182
209044183 209044182
....
As an extra, I would like to exclude repetitions of the same pair and self-references (e.g. 209044061 209044061 above), and keep only one combination when the same pair occurs in both orders (e.g. 209044061 209044062 and 209044062 209044061 above), i.e. combinations without repetition. I tried library(gtools) with combinations() but could not figure out whether this would slow down the calculation even more.
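For reference, "combinations without repetition" of size two correspond to base R's combn() with m = 2. A minimal, unoptimized sketch on top of the original split() approach (the guard is needed because combn() cannot draw pairs from fewer than two elements):

pairs <- do.call(rbind, lapply(split(df$id, df$group), function(ids) {
  ids <- unique(ids)
  if (length(ids) < 2L) return(NULL)  # single-id groups contribute no pairs
  t(combn(ids, 2L))                   # one row per unordered pair
}))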
Answer 1:
One possible solution which avoids repetitions of the same pair as well as reversed orderings uses the data.table and combinat packages:
library(data.table)
setDT(df)[order(id), data.table(combinat::combn2(unique(id))), by = group]
     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183
order(id) is used here just for convenience, to make the results easier to check, and can be skipped in production code.
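For clarity, combinat::combn2(x) returns a two-column matrix with one unordered pair of elements of x per row, which is why data.table() turns it directly into the V1/V2 columns shown above:

combinat::combn2(c(10, 20, 30))
# expected pairs, one per row: (10, 20), (10, 30), (20, 30)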
Replace combn2() by a non-equi join
There is another approach in which the call to combn2() is replaced by a non-equi join:
mdf <- setDT(df)[order(id), unique(id), by = group]
mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
allow.cartesian = TRUE]
     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183
Note that the non-equi join requires the data to be ordered.
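To see what the join operates on: for the example data, the intermediate table mdf holds one row per distinct group/id combination, ordered by id within each group (hand-derived here):

mdf
#       group        V1
#  1: 2365686 209044052
#  2: 2365686 209044061
#  3: 2365686 209044062
#  4:  387969 209044061
#  5:  387969 209044062
#  6:  388978 209044061
#  7:  388978 209044062
#  8: 2278460 209044182
#  9: 2278460 209044183
# 10:  654238 209044295

The join then pairs each id with every strictly smaller id in the same group, which yields each unordered pair exactly once.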
Benchmark
The second method seems to be much faster:
# create benchmark data
nr <- 1.2e5L # number of rows
rg <- 8L # number of ids within each group
ng <- nr / rg # number of groups
set.seed(1L)
df2 <- data.table(
id = sample.int(rg, nr, TRUE),
group = sample.int(ng, nr, TRUE)
)
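With these settings df2 has 120,000 rows; the number of distinct groups actually drawn can be checked with uniqueN(), which for the seed and R version used in the original post gives the 14,994 groups quoted in the timings below:

df2[, uniqueN(group)]
# expected: 14994 distinct groups for set.seed(1L)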
# benchmark code
microbenchmark::microbenchmark(
  combn2 = df2[order(group, id), data.table(combinat::combn2(unique(id))), by = group],
  nej = {
    mdf <- df2[order(group, id), unique(id), by = group]
    mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
        allow.cartesian = TRUE]
  },
  times = 1L)
For 120000 rows and 14994 groups the timings are:
Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 combn2 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115     1
    nej   137.3228   137.3228   137.3228   137.3228   137.3228   137.3228     1
Caveat
As pointed out by the OP, the number of ids per group is crucial in terms of memory consumption and speed. The number of combinations grows as O(n^2); it is exactly n * (n - 1) / 2, or choose(n, 2L), where n is the number of distinct ids per group. For example, a single group with 1,000 distinct ids alone contributes choose(1000, 2L) = 499,500 pairs.
The size of the largest group can be found by
df2[, uniqueN(id), by = group][, max(V1)]
The total number of rows in the final result can be computed in advance by
df2[, uniqueN(id), by = group][, sum(choose(V1, 2L))]
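Both checks can share one pass over the data, so the per-group counts are not computed twice (a small usage sketch; sizes is a helper name introduced here):

sizes <- df2[, uniqueN(id), by = group]   # distinct ids per group
sizes[, max(V1)]                          # size of the largest group
sizes[, sum(choose(V1, 2L))]              # total rows in the pairwise result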
Source: https://stackoverflow.com/questions/46937866/split-and-expand-grid-by-group-on-large-data-set