Question
I've been trying to get sklearn to use more CPU cores during grid search (on a Windows machine). The code is:
import numpy
from sklearn import grid_search
from sklearn.ensemble import RandomForestClassifier

parameters = {'n_estimators': numpy.arange(1, 10), 'max_depth': numpy.arange(1, 10)}
estimator = RandomForestClassifier(verbose=1)
clf = grid_search.GridSearchCV(estimator, parameters, n_jobs=-1)
clf.fit(features_train, labels_train)
I'm testing this on a small dataset of only 100 samples.
When n_jobs is set to 1 (the default), everything proceeds normally and finishes quickly, but only one CPU core is used.
In the code above I set n_jobs to -1 to use all CPU cores. When I do that (or use any value > 1), I can see the expected number of cores being utilized on my machine, but training becomes extremely slow: with n_jobs = 1 it finishes in about 10 seconds, while with anything > 1 it can take 5-10 minutes.
What is the correct way to increase the number of cores being used by gridsearch?
Answer 1:
My suspicion is that this is related to the fact that you're only testing with a small dataset of 100 samples: each fit is so cheap that it doesn't justify the overhead of parallelization (starting worker processes and shipping the data to them).
For a significantly larger dataset, the parallel mode should outperform n_jobs = 1. Have you tried testing against a much larger sample?
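One way to check is to time the same search at both settings on a synthetic dataset. The sketch below is an illustration of that idea, not code from the question: it assumes a current sklearn where GridSearchCV lives in sklearn.model_selection (the grid_search module used above was deprecated and later removed), and it wraps the work in an if __name__ == '__main__' guard, which Windows needs so that joblib's worker processes can start correctly:

import time
import numpy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

if __name__ == '__main__':  # required on Windows for multiprocessing to spawn workers
    # A dataset large enough that each fit outweighs the process start-up cost
    features, labels = make_classification(n_samples=20000, n_features=40, random_state=0)
    parameters = {'n_estimators': numpy.arange(1, 10), 'max_depth': numpy.arange(1, 10)}

    for n_jobs in (1, -1):
        clf = GridSearchCV(RandomForestClassifier(), parameters, n_jobs=n_jobs)
        start = time.time()
        clf.fit(features, labels)
        print('n_jobs=%2d: %.1f s' % (n_jobs, time.time() - start))

On a dataset of this size the n_jobs = -1 run should come out ahead; with only 100 samples, the per-process start-up cost dominates and the parallel run loses.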
Source: https://stackoverflow.com/questions/32219172/sklearn-increasing-number-of-jobs-leads-to-slow-training