Executing glmnet in parallel in R

春和景丽 2020-12-28 09:06

My training dataset has about 200,000 records and 500 features (sales data from a retail organization). Most of the features are 0/1 and are stored as a sparse matrix…

2 answers
  •  醉话见心
    2020-12-28 09:58

    In order to execute "cv.glmnet" in parallel, you have to specify the parallel=TRUE option, and register a foreach parallel backend. This allows you to choose the parallel backend that works best for your computing environment.

    Here's the documentation for the "parallel" argument from the cv.glmnet man page:

    parallel: If 'TRUE', use parallel 'foreach' to fit each fold. Must register parallel before hand, such as 'doMC' or others. See the example below.

    Here's an example using the doParallel package which works on Windows, Mac OS X, and Linux:

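    library(glmnet)
    library(doParallel)
    # Register a foreach backend with four workers before calling cv.glmnet
    registerDoParallel(4)
    # x and target are the feature matrix and response from the question
    m <- cv.glmnet(x, target[,1], family="binomial", alpha=0, type.measure="auc",
                   grouped=FALSE, standardize=FALSE, parallel=TRUE)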

    This call to cv.glmnet will execute in parallel using four workers. On Linux and Mac OS X, it will execute the tasks using "mclapply", while on Windows it will use "clusterApplyLB".
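To check which backend is actually in use before fitting, the foreach package provides a few query functions. The snippet below is only a quick sanity check, assuming doParallel is installed; the variable names are illustrative.

```r
library(doParallel)  # also attaches foreach and parallel

registerDoParallel(4)

# Query the currently registered foreach backend.
is_registered <- getDoParRegistered()  # TRUE once a backend is registered
n_workers     <- getDoParWorkers()     # number of workers foreach will use
backend_name  <- getDoParName()        # name differs between Unix-likes and Windows

print(c(registered = is_registered, workers = n_workers))
print(backend_name)

# Release the implicit cluster, if one was created (as on Windows).
stopImplicitCluster()
```

If the worker count comes back as 1, no parallel backend was picked up and cv.glmnet would run its folds sequentially.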

    Nested parallelism gets tricky, and may not help a lot with only 4 workers. I would try using a normal for loop around cv.glmnet (as in your second example) with a parallel backend registered and see what the performance is before adding another level of parallelism.
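As a rough sketch of that suggestion, the loop below stays sequential and only the folds inside each cv.glmnet call run in parallel. The question's own loop is not shown here, so the simulated data, the number of repetitions (n_reps) and the choice to collect lambda.min are all assumptions made for illustration.

```r
library(Matrix)
library(glmnet)
library(doParallel)

set.seed(1)
n <- 400
p <- 20
# Simulated stand-in for the sparse 0/1 feature matrix (assumption).
x <- Matrix(rbinom(n * p, 1, 0.1), nrow = n, sparse = TRUE)
y <- rbinom(n, 1, plogis(2 * x[, 1] - 0.5))

cl <- makeCluster(2)
registerDoParallel(cl)

n_reps <- 3                      # hypothetical number of repeated CV runs
lambda_min <- numeric(n_reps)

# Plain sequential for loop; parallelism comes only from the CV folds.
for (r in seq_len(n_reps)) {
  fit <- cv.glmnet(x, y, family = "binomial", alpha = 0,
                   type.measure = "auc", nfolds = 5, parallel = TRUE)
  lambda_min[r] <- fit$lambda.min  # assigned in the main session, so it is kept
}

stopCluster(cl)
print(lambda_min)
```

Because the outer loop runs in the main R session, assignments such as lambda_min[r] behave normally; timing this version first gives a baseline before trying to parallelize the outer loop as well.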

    Also note that the assignment to "model" in your first example isn't going to work when you register a parallel backend. When running in parallel, side-effects generally get thrown away, as with most parallel programming packages.
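The question's first example is not reproduced here, so the following is a minimal illustration with a made-up variable named model rather than the original code. It assumes a cluster backend from doParallel, where each worker is a separate R process: an assignment inside the loop body changes only the worker's copy, and the usual fix is to return the value and let foreach collect it.

```r
library(doParallel)  # also attaches foreach and parallel

cl <- makeCluster(2)
registerDoParallel(cl)

model <- NULL

# The assignment runs on a worker process, so `model` in this session
# is not modified.
ignored <- foreach(i = 1:3) %dopar% {
  model <- i^2
  NULL
}
print(is.null(model))  # TRUE: the side effect was discarded

# Return the value from the loop body instead; foreach gathers the results.
results <- foreach(i = 1:3, .combine = c) %dopar% {
  i^2
}
print(results)  # 1 4 9

stopCluster(cl)
```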
