I'm trying to move from a serial to a parallel approach to accomplish some multivariate time series analysis tasks on a large data.table. The table contains data for many separate series, identified by a grouping factor column, and each group needs its own model fit.
The answer requires the iterators package and its isplit function, which is similar to split in that it breaks the main data object into chunks based on one or more factor columns. The foreach loop then iterates through those chunks, passing only the current subset out to each worker process rather than the whole table.
So the differences in the code are as follows:
library(iterators)  # for isplit

# num.series, num.periods and f_lm are defined as in the full listing below
dt.all = data.table(
  grp = factor(rep(1:num.series, each = num.periods)),  # grp column is now a factor
  pd  = rep(1:num.periods, num.series),
  y   = rnorm(num.series * num.periods),
  x1  = rnorm(num.series * num.periods),
  x2  = rnorm(num.series * num.periods)
)

# note: %dopar% must stay on the same expression as foreach(...); starting a
# new line with %dopar% is a syntax error in R
results =
  foreach(dt.sub = isplit(dt.all, dt.all$grp),
          .packages = "data.table", .combine = "rbind") %dopar% {
    f_lm(dt.sub$value, dt.sub$key[[1]])
  }
The result of isplit is that dt.sub is now a list with two elements: key is itself a list of the factor values that define the current chunk, and value contains the corresponding subset as a data.table.
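To make the key/value structure concrete, here is a minimal standalone sketch (using a tiny throwaway table, not the dt.all above) that advances the iterator one step by hand:

library(data.table)
library(iterators)

dt <- data.table(grp = factor(c("a", "a", "b")), y = 1:3)
it <- isplit(dt, dt$grp)

chunk <- nextElem(it)  # pull the first chunk off the iterator manually
chunk$key              # the splitting value(s) for this chunk (here the level "a")
chunk$value            # the subset of rows where grp == "a"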
Credit for this solution goes to an SO answer by David and to a response by Russell to my question on an excellent blog post about iterators.
------------------------------------ EDIT ------------------------------------
To test the performance of isplitDT vs isplit, and rbindlist vs rbind, the following code was used:
rm(list = ls())
library(data.table); library(iterators); library(doParallel)

num.series  = 400
num.periods = 2000

dt.all = data.table(
  grp = factor(rep(1:num.series, each = num.periods)),
  pd  = rep(1:num.periods, num.series),
  y   = rnorm(num.series * num.periods),
  x1  = rnorm(num.series * num.periods),
  x2  = rnorm(num.series * num.periods)
)
dt.all[, y_lag := c(NA, head(y, -1)), by = grp]  # one-period lag of y within each group
# fit one group's regression and return its coefficient table as a data.table
f_lm = function(dt.sub, grp) {
  my.model = lm(y ~ y_lag + x1 + x2, data = dt.sub)
  coef = summary(my.model)$coefficients
  data.table(grp, variable = rownames(coef), coef)
}
registerDoParallel(8)
# isplitDT: an iterator that subsets the data.table directly, yielding one
# group's rows at a time with the same value/key structure that isplit returns
isplitDT <- function(x, colname, vals) {
  colname <- as.name(colname)
  ival <- iter(vals)
  nextEl <- function() {
    val <- nextElem(ival)
    # build and evaluate x[colname == val] for the current value
    list(value = eval(bquote(x[.(colname) == .(val)])), key = val)
  }
  obj <- list(nextElem = nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}
dtcomb <- function(...) {
  rbindlist(list(...))
}
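# With .multicombine=TRUE, foreach hands dtcomb many task results in a single
# call, so rbindlist can stack them in one pass rather than the repeated
# pairwise copies made by .combine="rbind".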
# isplit/rbind
st1 = system.time(results <- foreach(dt.sub = isplit(dt.all, dt.all$grp),
                                     .combine = "rbind",
                                     .packages = "data.table") %dopar% {
  f_lm(dt.sub$value, dt.sub$key[[1]])
})

# isplit/rbindlist
st2 = system.time(results <- foreach(dt.sub = isplit(dt.all, dt.all$grp),
                                     .combine = "dtcomb", .multicombine = TRUE,
                                     .packages = "data.table") %dopar% {
  f_lm(dt.sub$value, dt.sub$key[[1]])
})

# isplitDT/rbind
st3 = system.time(results <- foreach(dt.sub = isplitDT(dt.all, "grp", unique(dt.all$grp)),
                                     .combine = "rbind",
                                     .packages = "data.table") %dopar% {
  f_lm(dt.sub$value, dt.sub$key)
})

# isplitDT/rbindlist
st4 = system.time(results <- foreach(dt.sub = isplitDT(dt.all, "grp", unique(dt.all$grp)),
                                     .combine = "dtcomb", .multicombine = TRUE,
                                     .packages = "data.table") %dopar% {
  f_lm(dt.sub$value, dt.sub$key)
})
rbind(st1, st2, st3, st4)
This gives the following timings:
user.self sys.self elapsed user.child sys.child
st1 12.08 1.53 14.66 NA NA
st2 12.05 1.41 14.08 NA NA
st3 45.33 2.40 48.14 NA NA
st4 45.00 3.30 48.70 NA NA
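As a sanity check (not part of the timings above), the same stacked coefficient table can be produced serially with split and rbindlist; any of the foreach variants should reproduce this result:

# serial baseline (hypothetical check, not from the original benchmark)
serial <- rbindlist(
  lapply(split(dt.all, dt.all$grp),
         function(d) f_lm(d, d$grp[1]))
)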
------------------------------------ EDIT 2 ------------------------------------
Thanks to Steve's updated answer and his isplitDT2 function, which makes use of the key set on the data.table, we have a clear new winner in terms of speed. Running microbenchmark against my original solution (in this answer) shows around a 7-fold improvement for isplitDT2 with rbindlist. Memory usage has not yet been compared directly, but the performance gain leads me to accept the answer at last.
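Steve's isplitDT2 is not reproduced here, so the sketch below is only my reconstruction of the keyed-subset idea (treat the exact body as an assumption and see his answer for the original). The key point: once the table is keyed on the split column, each subset is a binary search on the key rather than the full vector scan performed by isplitDT's x[grp == val]:

setkey(dt.all, grp)  # key the table once, up front

isplitDT2 <- function(x, vals) {  # hypothetical reconstruction, not Steve's exact code
  ival <- iter(vals)
  nextEl <- function() {
    val <- nextElem(ival)
    # keyed subset: binary search on the key column
    list(value = x[.(val)], key = val)
  }
  obj <- list(nextElem = nextEl)
  class(obj) <- c("abstractiter", "iter")
  obj
}

# used in the loop as: foreach(dt.sub = isplitDT2(dt.all, unique(dt.all$grp)), ...)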