Question
From what I've found, there is one other question like this (Speed-up nested cross-validation); however, installing MPI does not work for me, even after trying several fixes suggested on this site and by Microsoft, so I am hoping there is another package or an answer to this question.
I am looking to compare multiple algorithms and grid search a wide range of parameters (maybe too many?). What ways are there, besides mpi4py, to speed up my code? As I understand it, I cannot use n_jobs=-1, as that would then not be nested?
Also to note, I have not been able to run this with the many parameters I am trying to look at below (it runs longer than I have time for); I only get results after 2 hours if I give each model just 2 parameters to compare. I run this code on a dataset of 252 rows (one per gene) and 25 feature columns, predicting one of 4 categories ('certain', 'likely', 'possible', or 'unknown') for whether a gene affects a disease. Using SMOTE increases the sample size to 420, which is then what goes into use.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

dataset = pd.read_csv('data.csv')
data = dataset.drop(["gene"], axis=1)
df = data.iloc[:, 0:24]    # feature columns only
df = df.fillna(0)
X = MinMaxScaler().fit_transform(df)
le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["certain", "likely", "possible", "unlikely"])  # note: the refit below overwrites this encoder
Y = le.fit_transform(data["category"])
sm = SMOTE(random_state=100)
X_res, y_res = sm.fit_resample(X, Y)
seed = 7

logreg = LogisticRegression(penalty='l1', solver='liblinear', multi_class='auto')
LR_par = {'penalty': ['l1'], 'C': [0.5, 1, 5, 10], 'max_iter': [500, 1000, 5000]}

rfc = RandomForestClassifier()
param_grid = {'bootstrap': [True, False],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [1, 2, 4, 25],
              'min_samples_split': [2, 5, 10, 25],
              'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

mlp = MLPClassifier(random_state=seed)
parameter_space = {'hidden_layer_sizes': [(10, 20), (10, 20, 10), (50,)],
                   'activation': ['tanh', 'relu'],
                   'solver': ['adam', 'sgd'],
                   'max_iter': [10000],
                   'alpha': [0.1, 0.01, 0.001],
                   'learning_rate': ['constant', 'adaptive']}

gbm = GradientBoostingClassifier(min_samples_split=25, min_samples_leaf=25)
param = {"loss": ["deviance"],
         "learning_rate": [0.15, 0.1, 0.05, 0.01, 0.005, 0.001],
         "min_samples_split": [2, 5, 10, 25],
         "min_samples_leaf": [1, 2, 4, 25],
         "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
         "max_features": ['auto', 'sqrt'],
         "criterion": ["friedman_mse"],
         "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

svm = SVC(gamma="scale", probability=True)
tuned_parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 0.25, 0.5, 0.75)}
def baseline_model(optimizer='adam', learn_rate=0.01):
    model = Sequential()
    model.add(Dense(100, input_dim=X_res.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(50, activation='relu'))  # 50 hidden units in the second layer
    model.add(Dense(4, activation='softmax'))
    # note: learn_rate is accepted (so it can be grid searched) but is not applied here;
    # only the optimizer name string is passed to compile()
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

keras = KerasClassifier(build_fn=baseline_model, batch_size=32, epochs=100, verbose=0)
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
kerasparams = dict(optimizer=optimizer, learn_rate=learn_rate)
inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
models = []
models.append(('GBM', GridSearchCV(gbm, param, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('RFC', GridSearchCV(rfc, param_grid, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('LR', GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('MLP', GridSearchCV(mlp, parameter_space, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('Keras', GridSearchCV(estimator=keras, param_grid=kerasparams, cv=inner_cv, iid=False, n_jobs=1)))
results = []
names = []
scoring = 'accuracy'
X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)
for name, model in models:
    nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
    results.append(nested_cv_results)
    names.append(name)
    msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
    print(msg)
    model.fit(X_train, Y_train)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100), '%')
    print("Best Parameters: \n{}\n".format(model.best_params_))
    print("Best CV Score: \n{}\n".format(model.best_score_))
As an example, most of the dataset is binary and looks like this:
gene Tissue Druggable Eigenvalue CADDvalue Catalogpresence Category
ACE 1 1 1 0 1 Certain
ABO 1 0 0 0 0 Likely
TP53 1 1 0 0 0 Possible
Any guidance on how I could speed this up would be appreciated.
Edit: I have also tried using parallel processing with dask, but I am not sure I am doing it right, and it doesn't seem to run any faster:
for name, model in models:
    with joblib.parallel_backend('dask'):
        nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
        results.append(nested_cv_results)
        names.append(name)
        msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
        print(msg)
        model.fit(X_train, Y_train)
        print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100), '%')
        #print("Best Estimator: \n{}\n".format(model.best_estimator_))
        print("Best Parameters: \n{}\n".format(model.best_params_))
        print("Best CV Score: \n{}\n".format(model.best_score_))  # average over CV folds for a single combination of the parameters you specify
Edit: also to note, on reducing the grid search: I have tried, for example, 5 parameters per model, but this still takes several hours to complete. So while trimming down the number will help, I would be grateful for any advice on efficiency beyond that.
Answer 1:
Two things:
1. Instead of GridSearch, try using HyperOpt - it's a Python library for serial and parallel optimization (a sketch follows below).
2. I would reduce the dimensionality by using UMAP or PCA. UMAP is probably the better choice.
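For point 1, here is a minimal sketch of what a hyperopt search could look like for the SVM from the question, reusing X_res, y_res, and inner_cv; the search space and max_evals are illustrative choices, not recommendations:
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

space = {
    'C': hp.loguniform('C', -2, 2),                 # roughly 0.14 to 7.4
    'kernel': hp.choice('kernel', ['linear', 'rbf']),
}

def objective(params):
    clf = SVC(gamma='scale', **params)
    # hyperopt minimizes, so return the negated mean CV accuracy
    return -cross_val_score(clf, X_res, y_res, cv=inner_cv, scoring='accuracy').mean()

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)  # note: hp.choice entries are reported as indices here
Unlike an exhaustive grid, TPE spends its evaluation budget on promising regions of the space, which is where the speedup comes from.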
After you apply SMOTE:
import umap

# min_dist and n_neighbors are placeholders - set them to your chosen values
dim_reduced = umap.UMAP(
    min_dist=min_dist,
    n_neighbors=neighbours,
    random_state=1234,
).fit_transform(smote_output)
You can then use dim_reduced for the train/test split. Reducing the dimensionality helps remove noise from the data: instead of dealing with 25 features, you bring them down to 2 (using UMAP) or to however many components you choose (using PCA), which should have a significant impact on performance.
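If you go the PCA route instead, a minimal sketch (n_components=10 is an illustrative value, not a recommendation; X_res is the SMOTE output from the question):
from sklearn.decomposition import PCA

pca = PCA(n_components=10, random_state=1234)
dim_reduced = pca.fit_transform(X_res)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained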
Answer 2:
Dask-ML has scalable implementations of GridSearchCV and RandomizedSearchCV that are, I believe, drop-in replacements for the Scikit-Learn versions. They were developed alongside Scikit-Learn developers.
- https://ml.dask.org/hyper-parameter-search.html
They can be faster for two reasons:
- They avoid repeating shared work between different stages of a Pipeline
- They can scale out to a cluster anywhere you can deploy Dask (which is easy on most cluster infrastructure)
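A minimal sketch of the drop-in usage with a hypothetical scaler + SVC pipeline and toy data; the claim that shared pipeline stages are fit only once is per the dask-ml hyper-parameter search docs:
import numpy as np
import dask_ml.model_selection as dcv
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The grid varies only the final step; dask-ml avoids refitting the shared
# scaler stage for every candidate.
pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC(gamma='scale'))])
grid = {'svc__C': [0.25, 0.5, 0.75, 1], 'svc__kernel': ['linear', 'rbf']}

X, y = np.random.rand(100, 10), np.random.randint(0, 3, 100)  # toy data
search = dcv.GridSearchCV(pipe, grid, cv=5)
search.fit(X, y)
print(search.best_params_)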
Answer 3:
There is an easy win in your situation and that is .... start using parallel processing :). dask will help you if you have a cluster (it will work on a single machine, but the improvement compared to the default scheduling in sklearn is not significant), but if you plan to run it on a single machine (but have several cores/threads and "enough" memory) then you can run nested CV in parallel. The only trick is that sklearn will not allow you to run the outer CV loop in multiple processes. However, it will allow you to run the inner loop in multiple threads.
At the moment you have n_jobs=None in the outer CV loop (that's the default in cross_val_score), which means n_jobs=1 and that's the only option that you can use with sklearn in the nested CV.
However, you can achieve an easy gain by setting n_jobs=some_reasonable_number in all the GridSearchCV objects that you use. some_reasonable_number does not have to be -1 (but it is a good starting point). Some algorithms either plateau at n_jobs=n_cores instead of n_threads (for example, xgboost), or already have built-in multiprocessing (RandomForestClassifier, for example), and there may be clashes if you spawn too many processes.
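A minimal sketch of that setup, reusing the svm, tuned_parameters, inner_cv, outer_cv, X_res, and y_res objects from the question; n_jobs=4 is a placeholder, pick something that suits your core count:
from sklearn.model_selection import GridSearchCV, cross_val_score

# inner loop: parallel grid search; outer loop: left serial (n_jobs default)
inner_gs = GridSearchCV(svm, tuned_parameters, cv=inner_cv, n_jobs=4)
nested_scores = cross_val_score(inner_gs, X_res, y_res, cv=outer_cv, scoring='accuracy')
print(nested_scores.mean(), nested_scores.std())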
Answer 4:
IIUC, you are trying to parallelize this example from the sklearn docs. If this is the case, then here is one possible approach to address
why dask is not working
and
Any kind of constructive guidance or further knowledge on this problem
General imports
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn import preprocessing
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, train_test_split
from sklearn.neural_network import MLPClassifier
import dask_ml.model_selection as dcv
import time
Data
- I defined 3 datasets to try out the implementation of dask_ml
  - the code below works for all 3 datasets
  - Dataset 1 is a slightly longer version of the sample data in the SO question
  - the size (# rows) of the third one (Dataset 3) is adjustable and can be arbitrarily increased depending on your computing power
  - I timed execution of dask_ml using Dataset 3 only
#### Dataset 1 - longer version of data in the question
d = """gene Tissue Druggable Eigenvalue CADDvalue Catalogpresence Category
ACE 1 1 1 0 1 Certain
ABO 1 0 0 0 0 Likely
TP53 1 1 0 0 0 Possible"""
data = pd.DataFrame([x.split(' ') for x in d.split('\n')])
data.columns = data.loc[0,:]
data.drop(0, axis=0, inplace=True)
data = pd.concat([data]*15)
data = data.drop(["gene"], axis=1)
df = data.iloc[:,0:5]
X = MinMaxScaler().fit_transform(df)
le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["Certain", "Likely", "Possible"])
Y = le.fit_transform(data["Category"])
sm = SMOTE(random_state=100)
X_res, y_res = sm.fit_resample(X, Y)
#### Dataset 2 - iris dataset from example in sklearn nested cross validation docs
# Load the dataset
from sklearn.datasets import load_iris
iris = load_iris()
X_res = iris.data
y_res = iris.target
#### Dataset 3 - size (#rows, #columns) is adjustable (I used this to time code execution)
X_res = pd.DataFrame(np.random.rand(300,50), columns=['col_'+str(c+1) for c in list(range(50))])
from random import shuffle
cats = ["paris", "barcelona", "kolkata", "new york", 'sydney']
y_values = cats*int(len(X_res)/len(cats))
shuffle(y_values)
y_res = pd.Series(y_values)
Instantiate classifiers - no changes from code in the question
seed = 7

logreg = LogisticRegression(penalty='l1', solver='liblinear', multi_class='auto')
LR_par = {'penalty': ['l1'], 'C': [0.5, 1, 5, 10], 'max_iter': [500, 1000, 5000]}

mlp = MLPClassifier(random_state=seed)
parameter_space = {'hidden_layer_sizes': [(10, 20), (10, 20, 10), (50,)],
                   'activation': ['tanh', 'relu'],
                   'solver': ['adam', 'sgd'],
                   'max_iter': [10000],
                   'alpha': [0.1, 0.01, 0.001],
                   'learning_rate': ['constant', 'adaptive']}

rfc = RandomForestClassifier()
param_grid = {'bootstrap': [True, False],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [1, 2, 4, 25],
              'min_samples_split': [2, 5, 10, 25],
              'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

gbm = GradientBoostingClassifier(min_samples_split=25, min_samples_leaf=25)
param = {"loss": ["deviance"],
         "learning_rate": [0.15, 0.1, 0.05, 0.01, 0.005, 0.001],
         "min_samples_split": [2, 5, 10, 25],
         "min_samples_leaf": [1, 2, 4, 25],
         "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
         "max_features": ['auto', 'sqrt'],
         "criterion": ["friedman_mse"],
         "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

svm = SVC(gamma="scale", probability=True)
tuned_parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 0.25, 0.5, 0.75)}
inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
Use GridSearchCV as implemented by dask_ml (as originally suggested by @MRocklin here) - see the dask_ml docs for dask_ml.model_selection.GridSearchCV
- for brevity, I am excluding KerasClassifier and the helper function baseline_model(), but my approach to handling the former would be the same as for the others
models = []
models.append(('MLP', dcv.GridSearchCV(mlp, parameter_space, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('GBM', dcv.GridSearchCV(gbm, param, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('RFC', dcv.GridSearchCV(rfc, param_grid, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('LR', dcv.GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', dcv.GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))
Initialize an extra blank list to hold non-nested CV results
non_nested_results = []
nested_results = []
names = []
scoring = 'accuracy'
X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)
Joblib and dask client setup
- I created the cluster on my local machine
- see single machine dask.distributed
# Create a local cluster
from dask.distributed import Client
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='6GB')
from sklearn.externals import joblib
Perform Nested CV per the sklearn docs example
- first perform GridSearchCV
- second use cross_val_score
- note that, for demonstration purposes, I have only used 1 sklearn model (SVC) from the list of models in the example code in the question
start = time.time()
for name, model in [models[-1]]:
    # Non-nested parameter search and scoring
    with joblib.parallel_backend('dask'):
        model.fit(X_train, Y_train)
    non_nested_results.append(model.best_score_)

    # Nested CV with parameter optimization
    nested_score = cross_val_score(model, X=X_train, y=Y_train, cv=outer_cv)
    nested_results.append(nested_score.mean())
    names.append(name)

    msg = "Nested CV Accuracy %s: %f (+/- %f )" %\
          (name, np.mean(nested_results)*100, np.std(nested_results)*100)
    print(msg)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100), '%')
    print("Best Estimator: \n{}\n".format(model.best_estimator_))
    print("Best Parameters: \n{}\n".format(model.best_params_))
    print("Best CV Score: \n{}\n".format(model.best_score_))

score_difference = [a_i - b_i for a_i, b_i in zip(non_nested_results, nested_results)]
print("Average difference of {0:6f} with std. dev. of {1:6f}."
      .format(np.mean(score_difference), np.std(score_difference)))
print('Total running time of the script: {:.2f} seconds'.format(time.time()-start))

client.close()
Below are the outputs (with script execution timing) using dataset 3
Output+Timing without dask [1]
Nested CV Accuracy SVM: 20.416667 (+/- 0.000000 )
Test set accuracy: 16.67 %
Best Estimator:
SVC(C=0.75, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
max_iter=-1, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Best Parameters:
{'C': 0.75, 'kernel': 'linear'}
Best CV Score:
0.2375
Average difference of 0.033333 with std. dev. of 0.000000.
Total running time of the script: 23.96 seconds
Output+Timing with dask (using n_workers=1 and threads_per_worker=4) [2]
Nested CV Accuracy SVM: 18.750000 (+/- 0.000000 )
Test set accuracy: 13.33 %
Best Estimator:
SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Best Parameters:
{'C': 0.5, 'kernel': 'rbf'}
Best CV Score:
0.1916666666666667
Average difference of 0.004167 with std. dev. of 0.000000.
Total running time of the script: 8.84 seconds
Output+Timing with dask (using n_workers=4 and threads_per_worker=4) [2]
Nested CV Accuracy SVM: 23.333333 (+/- 0.000000 )
Test set accuracy: 21.67 %
Best Estimator:
SVC(C=0.25, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
max_iter=-1, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Best Parameters:
{'C': 0.25, 'kernel': 'linear'}
Best CV Score:
0.25
Average difference of 0.016667 with std. dev. of 0.000000.
Total running time of the script: 7.52 seconds
Output+Timing with dask (using n_workers=1 and threads_per_worker=8) [2]
Nested CV Accuracy SVM: 20.416667 (+/- 0.000000 )
Test set accuracy: 18.33 %
Best Estimator:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Best Parameters:
{'C': 1, 'kernel': 'rbf'}
Best CV Score:
0.23333333333333334
Average difference of 0.029167 with std. dev. of 0.000000.
Total running time of the script: 7.06 seconds
[1] uses sklearn.model_selection.GridSearchCV() and does not use joblib
[2] uses dask_ml.model_selection.GridSearchCV() in place of sklearn.model_selection.GridSearchCV(), together with joblib
Notes about code and output in this answer
- I noticed that in your question you had the order of sklearn.model_selection.GridSearchCV() and cross_val_score inverted, compared to the example in the docs; I'm not sure if this affects your question much, but I thought I would mention it
- I do not have experience with nested cross-validation, so I cannot comment on whether Client(..., n_workers=n, threads_per_worker=m), with n>1 and/or m=4 or m=8, is acceptable/incorrect
General comments about usage of dask_ml (as I understand it)
- Case 1: if the training data is small enough to fit into memory on a single machine, but the testing dataset does not fit into memory, you can use the wrapper ParallelPostFit (see the sketch after this list) to
  - read testing data in parallel onto a cluster
  - make predictions on testing data in parallel, using all workers on the cluster
  - IIUC, this case is not relevant to your question
- Case 2: if you want to use joblib to train a large scikit-learn model on a cluster (but the training/testing data fits in memory) - a.k.a. distributed scikit-learn - then you can use the cluster to do the training; the skeleton code (per the dask_ml docs) is the Client + joblib.parallel_backend('dask') pattern shown above
  - IIUC this case is
    - relevant to your question
    - the approach that I have used in this answer
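For completeness, a minimal sketch of the Case 1 wrapper with hypothetical toy data; ParallelPostFit fits the estimator in memory as usual and parallelizes only predict/transform:
import numpy as np
import dask.array as da
from dask_ml.wrappers import ParallelPostFit
from sklearn.svm import SVC

# hypothetical small training set and "large" chunked test set
X_small, y_small = np.random.rand(100, 5), np.random.randint(0, 2, 100)
X_big = da.from_array(np.random.rand(10000, 5), chunks=1000)

clf = ParallelPostFit(estimator=SVC(gamma='scale'))
clf.fit(X_small, y_small)     # ordinary in-memory fit
preds = clf.predict(X_big)    # lazy dask array, evaluated across workers
print(preds.compute()[:10])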
System Details (used for executing code)
dask==1.2.0
dask-ml==0.12.0
numpy==1.16.2+mkl
pandas==0.24.0
scikit-learn==0.20.3
sklearn==0.0
OS==Windows 8 (64-bit)
Python version (import platform; print(platform.python_version()))==3.7.2
Source: https://stackoverflow.com/questions/55808504/how-to-speed-up-nested-cross-validation-in-python