Question
Through searching and asking, I've found many packages I can use to make use of all the cores of my server, and many packages that can do random forest.
I'm quite new at this, and I'm getting lost among all the ways to parallelize the training of my random forest. Could you give some advice on reasons to use or avoid each of them, or on specific combinations of them (with or without caret) that have proven themselves?
Packages for parallelization:
doParallel,
doSNOW,
doSMP (discontinued?),
doMC
(and what about mclapply?)
Packages for random forest:
[caret + some of the following]
rf,
parRF,
randomForest,
ranger,
Rborist,
parallelRandomForest (crashes my RStudio session...)
Thanks
Answer 1:
There are a few answers on SO, such as "parallel execution of random forest in R" and "Suggestions for speeding up Random Forests", that I would take a look at.
Those posts are helpful, but a bit older. The ranger package is an especially fast implementation of random forest, so if you are new to this it might be the easiest way to speed up your model training. Its authors' paper discusses the tradeoffs between some of the available packages: which one gives you the best performance will vary with your data size and number of features.
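As a rough sketch of what the ranger route could look like (not a benchmark, and not necessarily the best setup for your server): ranger runs its own threads, so you don't need to register a foreach backend such as doParallel for it. The built-in iris data, the 500 trees, the seed, and the "all cores but one" thread count are placeholder assumptions of mine, so swap in your own data and settings.

```r
library(ranger)

# Assumption: leave one core free for the rest of the system.
# detectCores() can return NA on some platforms, hence the fallback to 1.
n_threads <- max(1L, parallel::detectCores() - 1L, na.rm = TRUE)

# Placeholder data: the built-in iris set stands in for your own data frame.
fit <- ranger(
  Species ~ .,
  data        = iris,
  num.trees   = 500,        # assumed value, tune for your problem
  num.threads = n_threads,  # ranger grows trees in parallel across threads
  seed        = 42          # assumed seed, only for reproducibility
)

# Out-of-bag error estimate computed during training
print(fit$prediction.error)

# Prediction can also use several threads
pred <- predict(fit, data = iris, num.threads = n_threads)
head(pred$predictions)
```

If you want to stay within caret, I believe `method = "ranger"` is available in `train()`, but then be careful not to combine caret's resampling-level parallelism with ranger's own threads, or you may oversubscribe your cores.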
Source: https://stackoverflow.com/questions/37213279/parallelizing-random-forests