I\'m partitioning a data frame with split() in order to use parLapply() to call a function on each partition in parallel. The data frame has 1.3 m
Split(x,f) is slow if x is a factor AND f contains a lot of different elements
So, this code if fast:
system.time(split(seq_len(1300000), sample(250000, 1300000, TRUE)))
But, this is very slow:
system.time(split(factor(seq_len(1300000)), sample(250000, 1300000, TRUE)))
And this is fast again because there are only 25 groups
system.time(split(factor(seq_len(1300000)), sample(25, 1300000, TRUE)))