Question
I have set up a simple experiment to check the importance of a multi-core CPU while running sklearn's GridSearchCV with KNeighborsClassifier. The results surprised me, and I wonder if I misunderstood the benefits of multiple cores or simply didn't set it up right.
There is no difference in time to completion between 2 and 8 jobs. How come? I did notice a difference on the CPU performance tab: while the first cell was running, CPU usage was ~13%, and it gradually increased to 100% for the last cell. I was expecting it to finish faster. Maybe not linearly faster (i.e., 8 jobs being twice as fast as 4 jobs), but at least somewhat faster.
This is how I set it up:
I am using a Jupyter notebook; "cell" below refers to a notebook cell.
I loaded MNIST and used a test size of 0.05, which gives 3000 digits in X_play.
from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split
mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist['target']
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
_, X_play, _, y_play = train_test_split(X_train, y_train, test_size=0.05, random_state=42, stratify=y_train, shuffle=True)
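(Note: fetch_mldata and the mldata.org mirror have since been removed from scikit-learn; on recent versions the loading step would look roughly like this sketch of mine, with the rest of the cell unchanged.)
# Equivalent loading step for recent scikit-learn versions (fetch_mldata is gone);
# fetch_openml returns the same 70000x784 MNIST data, with labels as strings.
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"].astype(int)
# ... the X_train/X_test and X_play splits stay the same as above.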
In the next cell I set up the KNN classifier and the parameter grid for GridSearchCV:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
knn_clf = KNeighborsClassifier()
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]
Then I ran 8 cells, one for each n_jobs value from 1 to 8. My CPU is an i7-4770 with 4 cores / 8 threads.
grid_search = GridSearchCV(knn_clf, param_grid, cv=3, verbose=3, n_jobs=N_JOB_1_TO_8)
grid_search.fit(X_play, y_play)
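For reference, the same comparison can be run in a single cell with a loop instead of 8 separate cells; this is my condensed version of the experiment, assuming knn_clf, param_grid, X_play and y_play from the cells above.
# Run the same grid search once per n_jobs value and record wall-clock time.
import time

for n_jobs in range(1, 9):
    grid_search = GridSearchCV(knn_clf, param_grid, cv=3, verbose=3, n_jobs=n_jobs)
    start = time.perf_counter()
    grid_search.fit(X_play, y_play)
    print(f"n_jobs={n_jobs}: {time.perf_counter() - start:.1f}s")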
Results
[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 2.0min finished
[Parallel(n_jobs=2)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=3)]: Done 18 out of 18 | elapsed: 1.3min finished
[Parallel(n_jobs=4)]: Done 18 out of 18 | elapsed: 1.3min finished
[Parallel(n_jobs=5)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=6)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=7)]: Done 18 out of 18 | elapsed: 1.4min finished
[Parallel(n_jobs=8)]: Done 18 out of 18 | elapsed: 1.4min finished
Second test
Scaling with a Random Forest classifier was much better. The test size was 0.5, i.e. 30000 images.
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
param_grid = [{'n_estimators': [20, 30, 40, 50, 60], 'max_features': [100, 200, 300, 400, 500], 'criterion': ['gini', 'entropy']}]
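The grid search itself follows the same pattern as in the KNN test (150 fits = 50 parameter combinations x 3 folds); this is my reconstruction of the presumably identical call, assuming the same cv=3 and the 0.5-split X_play/y_play described above.
# Same pattern as the KNN test, just with the Random Forest and its grid.
grid_search = GridSearchCV(rf_clf, param_grid, cv=3, verbose=3, n_jobs=N_JOB_1_TO_8)
grid_search.fit(X_play, y_play)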
[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 110.9min finished
[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed: 56.8min finished
[Parallel(n_jobs=3)]: Done 150 out of 150 | elapsed: 39.3min finished
[Parallel(n_jobs=4)]: Done 150 out of 150 | elapsed: 35.3min finished
[Parallel(n_jobs=5)]: Done 150 out of 150 | elapsed: 36.0min finished
[Parallel(n_jobs=6)]: Done 150 out of 150 | elapsed: 34.4min finished
[Parallel(n_jobs=7)]: Done 150 out of 150 | elapsed: 32.1min finished
[Parallel(n_jobs=8)]: Done 150 out of 150 | elapsed: 30.1min finished
Answer 1:
Here are some reasons which might be causing this behaviour:
- With an increasing number of workers there is an apparent overhead incurred for initializing and releasing each one. I ran your code on my i7-7700HQ and saw the following behaviour for each increasing n_jobs value, where "time per thread" means the time GridSearchCV needs per model evaluation (fully training and testing one model):
  - n_jobs=1 and n_jobs=2: time per thread was 2.9s (overall time ~2 min)
  - n_jobs=3: time per thread was 3.4s (overall time 1.4 min)
  - n_jobs=4: time per thread was 3.8s (overall time 58 s)
  - n_jobs=5: time per thread was 4.2s (overall time 51 s)
  - n_jobs=6: time per thread was 4.2s (overall time ~49 s)
  - n_jobs=7: time per thread was 4.2s (overall time ~49 s)
  - n_jobs=8: time per thread was 4.2s (overall time ~49 s)
  Now as you can see, the time per thread increased, but the overall time decreased (although beyond n_jobs=4 the gain was not exactly linear) and stayed roughly constant for n_jobs>=6. This is due to the cost incurred in initializing and releasing threads. See this github issue and this issue.
- Also, there might be other bottlenecks: the data may be too large to broadcast to all workers at the same time, workers may pre-empt each other over RAM (or other resources), the way data is pushed into each worker, etc.
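To make that fixed overhead visible in isolation, here is a toy demonstration of my own (not from the original answer): dispatching 18 trivially cheap tasks through joblib, where spawning and tearing down workers costs more than the work itself, so extra workers give no speedup.
# Toy demonstration of fixed parallel overhead: the tasks are so cheap that
# worker startup, data transfer and teardown dominate the total runtime.
import time
from joblib import Parallel, delayed

def tiny_task(x):
    return x * x

for n_jobs in (1, 2, 4, 8):
    start = time.perf_counter()
    Parallel(n_jobs=n_jobs)(delayed(tiny_task)(i) for i in range(18))
    print(f"n_jobs={n_jobs}: {time.perf_counter() - start:.2f}s")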
I suggest you read about Amdahl's Law, which states that there is a theoretical bound on the speedup achievable through parallelization, given by the formula
S(s) = 1 / ((1 - p) + p / s)
where p is the fraction of the workload that can be parallelized and s is the speedup of that parallel part (e.g. the number of workers). (Source: Amdahl's Law, Wikipedia.)
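As a quick worked example (my own arithmetic, not from the original answer), plugging the reported Random Forest timings into that formula:
# Amdahl's Law: S(s) = 1 / ((1 - p) + p / s), where p is the parallelizable
# fraction of the workload and s is the speedup of that part (here: workers).
def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Estimate p from the Random Forest results: 110.9 min -> 35.3 min on 4 cores.
observed = 110.9 / 35.3               # ~3.1x speedup
p = (1 - 1 / observed) / (1 - 1 / 4)  # ~0.91, i.e. roughly 9% of the work is serial
print(amdahl_speedup(p, 8))           # even 8 ideal workers would only give ~4.9x
So a seemingly small serial fraction already caps how much extra cores can help, which matches what you observed.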
Finally, it might be due to the data size and the complexity of the model you use for training as well.
Here is a blog post explaining the same issue regarding multithreading.
Source: https://stackoverflow.com/questions/50993867/increasing-n-jobs-has-no-effect-on-gridsearchcv