Question
I've been trying to get sklearn to use more CPU cores during grid search (on a Windows machine). The code is:
import numpy
from sklearn import grid_search
from sklearn.ensemble import RandomForestClassifier

parameters = {'n_estimators': numpy.arange(1, 10), 'max_depth': numpy.arange(1, 10)}
estimator = RandomForestClassifier(verbose=1)
clf = grid_search.GridSearchCV(estimator, parameters, n_jobs=-1)
clf.fit(features_train, labels_train)
I'm testing this on a small dataset of only 100 samples.
When n_jobs is set to 1 (the default), everything proceeds normally and finishes quickly, but only one CPU core is used.
In the code above I set n_jobs to -1 to use all CPU cores. When I do that (or use any value > 1), I can see the expected number of cores being utilized on my machine, but training becomes extremely slow: with n_jobs = 1 it finishes in about 10 seconds, while with anything > 1 it can take 5-10 minutes.
What is the correct way to increase the number of cores being used by gridsearch?
Answer 1:
My suspicion is that this is related to the fact that you're only testing with a small dataset of 100 samples: each fit is so cheap that it doesn't justify the overhead of parallelization (starting worker processes and shipping the data to them).
For a significantly larger dataset, the parallel mode should outperform n_jobs = 1. Have you tried testing against a much larger sample?
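One way to check is to time the same search at both settings on a synthetic dataset. The sketch below is an illustration of that idea, not code from the question: it assumes a current sklearn where GridSearchCV lives in sklearn.model_selection (the grid_search module used above was deprecated and later removed), and it wraps the work in an if __name__ == '__main__' guard, which Windows needs so that joblib's worker processes can start correctly:

import time
import numpy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

if __name__ == '__main__':  # required on Windows for multiprocessing to spawn workers
    # A dataset large enough that each fit outweighs the process start-up cost
    features, labels = make_classification(n_samples=20000, n_features=40, random_state=0)
    parameters = {'n_estimators': numpy.arange(1, 10), 'max_depth': numpy.arange(1, 10)}

    for n_jobs in (1, -1):
        clf = GridSearchCV(RandomForestClassifier(), parameters, n_jobs=n_jobs)
        start = time.time()
        clf.fit(features, labels)
        print('n_jobs=%2d: %.1f s' % (n_jobs, time.time() - start))

On a dataset of this size the n_jobs = -1 run should come out ahead; with only 100 samples, the per-process start-up cost dominates and the parallel run loses.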
Source: https://stackoverflow.com/questions/32219172/sklearn-increasing-number-of-jobs-leads-to-slow-training