Question
From what I've found, there is one other question like this (Speed-up nested cross-validation); however, installing MPI does not work for me, even after trying several fixes suggested on this site and by Microsoft, so I am hoping there is another package or an answer to this question.
I am looking to compare multiple algorithms and grid search a wide range of parameters (maybe too many?). What ways are there, besides mpi4py, to speed up my code? As I understand it, I cannot use n_jobs=-1, as that would then not be nested?
Also to note, I have not been able to run this with the many parameters I am trying to look at below (it runs longer than I have time for); I only get results after 2 hours if I give each model just 2 parameters to compare. I run this code on a dataset of 252 rows (one per gene) and 25 feature columns, predicting one of 4 categories ('certain', 'likely', 'possible', or 'unknown') for whether a gene affects a disease. Using SMOTE increases the sample size to 420, which is then what goes into use.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

dataset = pd.read_csv('data.csv')
data = dataset.drop(["gene"], axis=1)
df = data.iloc[:, 0:24]    # feature columns only
df = df.fillna(0)
X = MinMaxScaler().fit_transform(df)
le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["certain", "likely", "possible", "unlikely"])  # note: the refit below overwrites this encoder
Y = le.fit_transform(data["category"])
sm = SMOTE(random_state=100)
X_res, y_res = sm.fit_resample(X, Y)
seed = 7

logreg = LogisticRegression(penalty='l1', solver='liblinear', multi_class='auto')
LR_par = {'penalty': ['l1'], 'C': [0.5, 1, 5, 10], 'max_iter': [500, 1000, 5000]}

rfc = RandomForestClassifier()
param_grid = {'bootstrap': [True, False],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [1, 2, 4, 25],
              'min_samples_split': [2, 5, 10, 25],
              'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

mlp = MLPClassifier(random_state=seed)
parameter_space = {'hidden_layer_sizes': [(10, 20), (10, 20, 10), (50,)],
                   'activation': ['tanh', 'relu'],
                   'solver': ['adam', 'sgd'],
                   'max_iter': [10000],
                   'alpha': [0.1, 0.01, 0.001],
                   'learning_rate': ['constant', 'adaptive']}

gbm = GradientBoostingClassifier(min_samples_split=25, min_samples_leaf=25)
param = {"loss": ["deviance"],
         "learning_rate": [0.15, 0.1, 0.05, 0.01, 0.005, 0.001],
         "min_samples_split": [2, 5, 10, 25],
         "min_samples_leaf": [1, 2, 4, 25],
         "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
         "max_features": ['auto', 'sqrt'],
         "criterion": ["friedman_mse"],
         "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

svm = SVC(gamma="scale", probability=True)
tuned_parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 0.25, 0.5, 0.75)}
def baseline_model(optimizer='adam', learn_rate=0.01):
    model = Sequential()
    model.add(Dense(100, input_dim=X_res.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(50, activation='relu'))  # 50 hidden units in the second layer
    model.add(Dense(4, activation='softmax'))
    # note: learn_rate is accepted (so it can be grid searched) but is not applied here;
    # only the optimizer name string is passed to compile()
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

keras = KerasClassifier(build_fn=baseline_model, batch_size=32, epochs=100, verbose=0)
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
kerasparams = dict(optimizer=optimizer, learn_rate=learn_rate)
inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
models = []
models.append(('GBM', GridSearchCV(gbm, param, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('RFC', GridSearchCV(rfc, param_grid, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('LR', GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('MLP', GridSearchCV(mlp, parameter_space, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('Keras', GridSearchCV(estimator=keras, param_grid=kerasparams, cv=inner_cv, iid=False, n_jobs=1)))
results = []
names = []
scoring = 'accuracy'
X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)
for name, model in models:
    nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
    results.append(nested_cv_results)
    names.append(name)
    msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
    print(msg)
    model.fit(X_train, Y_train)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100), '%')
    print("Best Parameters: \n{}\n".format(model.best_params_))
    print("Best CV Score: \n{}\n".format(model.best_score_))
As an example, most of the dataset is binary and looks like this:
gene Tissue Druggable Eigenvalue CADDvalue Catalogpresence Category
ACE 1 1 1 0 1 Certain
ABO 1 0 0 0 0 Likely
TP53 1 1 0 0 0 Possible
Any guidance on how I could speed this up would be appreciated.
Edit: I have also tried using parallel processing with dask, but I am not sure I am doing it right, and it doesn't seem to run any faster:
for name, model in models:
    with joblib.parallel_backend('dask'):
        nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
        results.append(nested_cv_results)
        names.append(name)
        msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
        print(msg)
        model.fit(X_train, Y_train)
        print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100), '%')
        #print("Best Estimator: \n{}\n".format(model.best_estimator_))
        print("Best Parameters: \n{}\n".format(model.best_params_))
        print("Best CV Score: \n{}\n".format(model.best_score_))  # average over CV folds for a single combination of the parameters you specify
Edit: also to note, on reducing the grid search: I have tried, for example, 5 parameters per model, but this still takes several hours to complete. So while trimming down the number will help, I would be grateful for any advice on efficiency beyond that.
Answer 1:
Two things:
1. Instead of GridSearch, try using HyperOpt - it's a Python library for serial and parallel optimization (a sketch follows below).
2. I would reduce the dimensionality by using UMAP or PCA. UMAP is probably the better choice.
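For point 1, here is a minimal sketch of what a hyperopt search could look like for the SVM from the question, reusing X_res, y_res, and inner_cv; the search space and max_evals are illustrative choices, not recommendations:
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

space = {
    'C': hp.loguniform('C', -2, 2),                 # roughly 0.14 to 7.4
    'kernel': hp.choice('kernel', ['linear', 'rbf']),
}

def objective(params):
    clf = SVC(gamma='scale', **params)
    # hyperopt minimizes, so return the negated mean CV accuracy
    return -cross_val_score(clf, X_res, y_res, cv=inner_cv, scoring='accuracy').mean()

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)  # note: hp.choice entries are reported as indices here
Unlike an exhaustive grid, TPE spends its evaluation budget on promising regions of the space, which is where the speedup comes from.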
After you apply SMOTE:
import umap

# min_dist and n_neighbors are placeholders - set them to your chosen values
dim_reduced = umap.UMAP(
    min_dist=min_dist,
    n_neighbors=neighbours,
    random_state=1234,
).fit_transform(smote_output)
You can then use dim_reduced for the train/test split. Reducing the dimensionality helps remove noise from the data: instead of dealing with 25 features, you bring them down to 2 (using UMAP) or to however many components you choose (using PCA), which should have a significant impact on performance.
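If you go the PCA route instead, a minimal sketch (n_components=10 is an illustrative value, not a recommendation; X_res is the SMOTE output from the question):
from sklearn.decomposition import PCA

pca = PCA(n_components=10, random_state=1234)
dim_reduced = pca.fit_transform(X_res)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained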
Answer 2:
Dask-ML has scalable implementations of GridSearchCV and RandomizedSearchCV that are, I believe, drop-in replacements for the Scikit-Learn versions. They were developed alongside Scikit-Learn developers.
- https://ml.dask.org/hyper-parameter-search.html
They can be faster for two reasons:
- They avoid repeating shared work between different stages of a Pipeline
- They can scale out to a cluster anywhere you can deploy Dask (which is easy on most cluster infrastructure)
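A minimal sketch of the drop-in usage with a hypothetical scaler + SVC pipeline and toy data; the claim that shared pipeline stages are fit only once is per the dask-ml hyper-parameter search docs:
import numpy as np
import dask_ml.model_selection as dcv
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The grid varies only the final step; dask-ml avoids refitting the shared
# scaler stage for every candidate.
pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC(gamma='scale'))])
grid = {'svc__C': [0.25, 0.5, 0.75, 1], 'svc__kernel': ['linear', 'rbf']}

X, y = np.random.rand(100, 10), np.random.randint(0, 3, 100)  # toy data
search = dcv.GridSearchCV(pipe, grid, cv=5)
search.fit(X, y)
print(search.best_params_)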
Answer 3:
There is an easy win in your situation and that is .... start using parallel processing :). dask will help you if you have a cluster (it will work on a single machine, but the improvement compared to the default scheduling in sklearn is not significant), but if you plan to run it on a single machine (but have several cores/threads and "enough" memory) then you can run nested CV in parallel. The only trick is that sklearn will not allow you to run the outer CV loop in multiple processes. However, it will allow you to run the inner loop in multiple threads.
At the moment you have n_jobs=None in the outer CV loop (that's the default in cross_val_score), which means n_jobs=1 and that's the only option that you can use with sklearn in the nested CV.
However, you can achieve an easy gain by setting n_jobs=some_reasonable_number in all the GridSearchCV objects that you use. some_reasonable_number does not have to be -1 (but it is a good starting point). Some algorithms either plateau at n_jobs=n_cores instead of n_threads (for example, xgboost), or already have built-in multiprocessing (RandomForestClassifier, for example), and there may be clashes if you spawn too many processes.
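A minimal sketch of that setup, reusing the svm, tuned_parameters, inner_cv, outer_cv, X_res, and y_res objects from the question; n_jobs=4 is a placeholder, pick something that suits your core count:
from sklearn.model_selection import GridSearchCV, cross_val_score

# inner loop: parallel grid search; outer loop: left serial (n_jobs default)
inner_gs = GridSearchCV(svm, tuned_parameters, cv=inner_cv, n_jobs=4)
nested_scores = cross_val_score(inner_gs, X_res, y_res, cv=outer_cv, scoring='accuracy')
print(nested_scores.mean(), nested_scores.std())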
Answer 4:
IIUC, you are trying to parallelize this example from the sklearn docs. If this is the case, then here is one possible approach to address
why dask is not working
and
Any kind of constructive guidance or further knowledge on this problem
General imports
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn import preprocessing
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, train_test_split
from sklearn.neural_network import MLPClassifier
import dask_ml.model_selection as dcv
import time
Data
- I defined 3 datasets to try out the implementation of dask_ml
  - the code below works for all 3 datasets
  - Dataset 1 is a slightly longer version of the sample data in the SO question
  - the size (# rows) of the third one (Dataset 3) is adjustable and can be arbitrarily increased depending on your computing power
  - I timed execution of dask_ml using Dataset 3 only
#### Dataset 1 - longer version of data in the question
d = """gene Tissue Druggable Eigenvalue CADDvalue Catalogpresence Category
ACE 1 1 1 0 1 Certain
ABO 1 0 0 0 0 Likely
TP53 1 1 0 0 0 Possible"""
data = pd.DataFrame([x.split(' ') for x in d.split('\n')])
data.columns = data.loc[0,:]
data.drop(0, axis=0, inplace=True)
data = pd.concat([data]*15)
data = data.drop(["gene"], axis=1)
df = data.iloc[:,0:5]
X = MinMaxScaler().fit_transform(df)
le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["Certain", "Likely", "Possible"])
Y = le.fit_transform(data["Category"])
sm = SMOTE(random_state=100)
X_res, y_res = sm.fit_resample(X, Y)
#### Dataset 2 - iris dataset from example in sklearn nested cross validation docs
# Load the dataset
from sklearn.datasets import load_iris
iris = load_iris()
X_res = iris.data
y_res = iris.target
#### Dataset 3 - size (#rows, #columns) is adjustable (I used this to time code execution)
X_res = pd.DataFrame(np.random.rand(300,50), columns=['col_'+str(c+1) for c in list(range(50))])
from random import shuffle
cats = ["paris", "barcelona", "kolkata", "new york", 'sydney']
y_values = cats*int(len(X_res)/len(cats))
shuffle(y_values)
y_res = pd.Series(y_values)
Instantiate classifiers - no changes from code in the question
seed = 7

logreg = LogisticRegression(penalty='l1', solver='liblinear', multi_class='auto')
LR_par = {'penalty': ['l1'], 'C': [0.5, 1, 5, 10], 'max_iter': [500, 1000, 5000]}

mlp = MLPClassifier(random_state=seed)
parameter_space = {'hidden_layer_sizes': [(10, 20), (10, 20, 10), (50,)],
                   'activation': ['tanh', 'relu'],
                   'solver': ['adam', 'sgd'],
                   'max_iter': [10000],
                   'alpha': [0.1, 0.01, 0.001],
                   'learning_rate': ['constant', 'adaptive']}

rfc = RandomForestClassifier()
param_grid = {'bootstrap': [True, False],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [1, 2, 4, 25],
              'min_samples_split': [2, 5, 10, 25],
              'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

gbm = GradientBoostingClassifier(min_samples_split=25, min_samples_leaf=25)
param = {"loss": ["deviance"],
         "learning_rate": [0.15, 0.1, 0.05, 0.01, 0.005, 0.001],
         "min_samples_split": [2, 5, 10, 25],
         "min_samples_leaf": [1, 2, 4, 25],
         "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
         "max_features": ['auto', 'sqrt'],
         "criterion": ["friedman_mse"],
         "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

svm = SVC(gamma="scale", probability=True)
tuned_parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 0.25, 0.5, 0.75)}
inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
Use GridSearchCV as implemented by dask_ml (as originally suggested by @MRocklin here) - see the dask_ml docs for dask_ml.model_selection.GridSearchCV
- for brevity, I am excluding KerasClassifier and the helper function baseline_model(), but my approach to handling the former would be the same as for the others
models = []
models.append(('MLP', dcv.GridSearchCV(mlp, parameter_space, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('GBM', dcv.GridSearchCV(gbm, param, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('RFC', dcv.GridSearchCV(rfc, param_grid, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('LR', dcv.GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', dcv.GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))
Initialize an extra blank list to hold non-nested CV results
non_nested_results = []
nested_results = []
names = []
scoring = 'accuracy'
X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)
Joblib and dask client setup
- I created the cluster on my local machine
- see single machine dask.distributed
# Create a local cluster
from dask.distributed import Client
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='6GB')
from sklearn.externals import joblib
Perform Nested CV per the sklearn docs example
- first perform GridSearchCV
- second use cross_val_score
- note that, for demonstration purposes, I have only used 1 sklearn model (SVC) from the list of models in the example code in the question
start = time.time()
for name, model in [models[-1]]:
    # Non-nested parameter search and scoring
    with joblib.parallel_backend('dask'):
        model.fit(X_train, Y_train)
    non_nested_results.append(model.best_score_)

    # Nested CV with parameter optimization
    nested_score = cross_val_score(model, X=X_train, y=Y_train, cv=outer_cv)
    nested_results.append(nested_score.mean())
    names.append(name)

    msg = "Nested CV Accuracy %s: %f (+/- %f )" %\
          (name, np.mean(nested_results)*100, np.std(nested_results)*100)
    print(msg)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100), '%')
    print("Best Estimator: \n{}\n".format(model.best_estimator_))
    print("Best Parameters: \n{}\n".format(model.best_params_))
    print("Best CV Score: \n{}\n".format(model.best_score_))

score_difference = [a_i - b_i for a_i, b_i in zip(non_nested_results, nested_results)]
print("Average difference of {0:6f} with std. dev. of {1:6f}."
      .format(np.mean(score_difference), np.std(score_difference)))
print('Total running time of the script: {:.2f} seconds'.format(time.time()-start))

client.close()
Below are the outputs (with script execution timing) using dataset 3
Output+Timing without dask [1]
Nested CV Accuracy SVM: 20.416667 (+/- 0.000000 )
Test set accuracy: 16.67 %
Best Estimator:
SVC(C=0.75, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
max_iter=-1, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Best Parameters:
{'C': 0.75, 'kernel': 'linear'}
Best CV Score:
0.2375
Average difference of 0.033333 with std. dev. of 0.000000.
Total running time of the script: 23.96 seconds
Output+Timing with dask (using n_workers=1 and threads_per_worker=4) [2]
Nested CV Accuracy SVM: 18.750000 (+/- 0.000000 )
Test set accuracy: 13.33 %
Best Estimator:
SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Best Parameters:
{'C': 0.5, 'kernel': 'rbf'}
Best CV Score:
0.1916666666666667
Average difference of 0.004167 with std. dev. of 0.000000.
Total running time of the script: 8.84 seconds
Output+Timing with dask (using n_workers=4 and threads_per_worker=4) [2]
Nested CV Accuracy SVM: 23.333333 (+/- 0.000000 )
Test set accuracy: 21.67 %
Best Estimator:
SVC(C=0.25, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
max_iter=-1, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Best Parameters:
{'C': 0.25, 'kernel': 'linear'}
Best CV Score:
0.25
Average difference of 0.016667 with std. dev. of 0.000000.
Total running time of the script: 7.52 seconds
Output+Timing with dask (using n_workers=1 and threads_per_worker=8) [2]
Nested CV Accuracy SVM: 20.416667 (+/- 0.000000 )
Test set accuracy: 18.33 %
Best Estimator:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Best Parameters:
{'C': 1, 'kernel': 'rbf'}
Best CV Score:
0.23333333333333334
Average difference of 0.029167 with std. dev. of 0.000000.
Total running time of the script: 7.06 seconds
[1] uses sklearn.model_selection.GridSearchCV() and does not use joblib
[2] uses dask_ml.model_selection.GridSearchCV() in place of sklearn.model_selection.GridSearchCV(), together with joblib
Notes about code and output in this answer
- I noticed that in your question you had the order of sklearn.model_selection.GridSearchCV() and cross_val_score inverted, compared to the example in the docs; I'm not sure if this affects your question much, but I thought I would mention it
- I do not have experience with nested cross-validation, so I cannot comment on whether Client(..., n_workers=n, threads_per_worker=m), with n>1 and/or m=4 or m=8, is acceptable/incorrect
General comments about usage of dask_ml (as I understand it)
- Case 1: if the training data is small enough to fit into memory on a single machine, but the testing dataset does not fit into memory, you can use the wrapper ParallelPostFit (see the sketch after this list) to
  - read testing data in parallel onto a cluster
  - make predictions on testing data in parallel, using all workers on the cluster
  - IIUC, this case is not relevant to your question
- Case 2: if you want to use joblib to train a large scikit-learn model on a cluster (but the training/testing data fits in memory) - a.k.a. distributed scikit-learn - then you can use the cluster to do the training; the skeleton code (per the dask_ml docs) is the Client + joblib.parallel_backend('dask') pattern shown above
  - IIUC this case is
    - relevant to your question
    - the approach that I have used in this answer
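For completeness, a minimal sketch of the Case 1 wrapper with hypothetical toy data; ParallelPostFit fits the estimator in memory as usual and parallelizes only predict/transform:
import numpy as np
import dask.array as da
from dask_ml.wrappers import ParallelPostFit
from sklearn.svm import SVC

# hypothetical small training set and "large" chunked test set
X_small, y_small = np.random.rand(100, 5), np.random.randint(0, 2, 100)
X_big = da.from_array(np.random.rand(10000, 5), chunks=1000)

clf = ParallelPostFit(estimator=SVC(gamma='scale'))
clf.fit(X_small, y_small)     # ordinary in-memory fit
preds = clf.predict(X_big)    # lazy dask array, evaluated across workers
print(preds.compute()[:10])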
System Details (used for executing code)
dask==1.2.0
dask-ml==0.12.0
numpy==1.16.2+mkl
pandas==0.24.0
scikit-learn==0.20.3
sklearn==0.0
OS==Windows 8 (64-bit)
Python version (import platform; print(platform.python_version()))==3.7.2
Source: https://stackoverflow.com/questions/55808504/how-to-speed-up-nested-cross-validation-in-python