In sklearn, a serial pipeline can be defined to get the best combination of hyperparameters for all consecutive parts of the pipeline. A serial pipeline can be implemented as follows:
from sklearn.svm import SVC
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
digits = datasets.load_digits()
X_train = digits.data
y_train = digits.target
#Use Principal Component Analysis to reduce dimensionality
# and improve generalization
pca = decomposition.PCA()
# Use a support vector classifier; the kernel is chosen by the grid search
svm = SVC()
# Combine PCA and SVC to a pipeline
pipe = Pipeline(steps=[('pca', pca), ('svm', svm)])
# Candidate values to search over
n_components = [20, 40, 64]
params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca__n_components': n_components,
}
But what if I want to try different algorithms for each step of the pipeline? How can I, for example, grid search over
(Principal Component Analysis OR Singular Value Decomposition) AND (Support Vector Machine OR Random Forest)?
This would require some kind of second-level or "meta" grid search, since the type of model would itself be one of the hyperparameters. Is that possible in sklearn?
Pipeline supports None in its steps (the list of estimators), which lets you toggle a part of the pipeline off.
You can set a named step of the pipeline to None in the params passed to GridSearchCV, so that this estimator is not used.
Let's assume you want to choose between PCA and TruncatedSVD.
pca = decomposition.PCA()
svd = decomposition.TruncatedSVD()
svm = SVC()
n_components = [20, 40, 64]
Add svd to the pipeline:
pipe = Pipeline(steps=[('pca', pca), ('svd', svd), ('svm', svm)])
# Change params_grid: instead of a dict, make it a list of dicts.
# In the first element, pass `svd = None`, and in the second `pca = None`
params_grid = [{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca__n_components': n_components,
    'svd': [None]
},
{
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'pca': [None],
    'svd__n_components': n_components,
    'svd__algorithm': ['randomized']
}]
Now just pass the pipeline object to GridSearchCV:
grd = GridSearchCV(pipe, param_grid = params_grid)
Calling grd.fit() will search over both elements of the params_grid list, using all values from one dictionary at a time.
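To make the toggling concrete, here is a minimal end-to-end sketch of the approach above. The reduced grid, the 300-sample subset, and `cv=3` are my own assumptions to keep it quick to run; they are not part of the original setup.

```python
from sklearn import datasets
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small subset of the digits data (assumption: only to keep the run short)
X, y = datasets.load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# Both reducers are present as steps; each grid switches one of them off
pipe = Pipeline(steps=[("pca", PCA()), ("svd", TruncatedSVD()), ("svm", SVC())])

param_grid = [
    # PCA active, SVD disabled
    {"pca__n_components": [10, 20], "svd": [None], "svm__C": [1, 10]},
    # SVD active, PCA disabled
    {"pca": [None], "svd__n_components": [10, 20], "svm__C": [1, 10]},
]

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)

# Every candidate has exactly one of the two reducers set to None
for params in search.cv_results_["params"]:
    print(params)
print("best:", search.best_params_)
```

As far as I know, recent scikit-learn versions also accept the string 'passthrough' in place of None for skipping a pipeline step.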
Simplification if parameters have same name
If both estimators in your "OR" share the same parameter names, as in this case where PCA and TruncatedSVD both have n_components (or if you only want to search over this parameter), this can be simplified as:
#Here I have changed the name to `preprocessor`
pipe = Pipeline(steps=[('preprocessor', pca), ('svm', svm)])
#Now assign both estimators to `preprocessor` as below:
params_grid = {
'svm__C': [1, 10, 100, 1000],
'svm__kernel': ['linear', 'rbf'],
'svm__gamma': [0.001, 0.0001],
'preprocessor':[pca, svd],
'preprocessor__n_components': n_components,
}
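A runnable sketch of this simplified form, showing how to check which estimator ended up in the `preprocessor` slot. The smaller grid, the data subset, and `cv=3` are assumptions made to keep the example fast.

```python
from sklearn import datasets
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small subset of the digits data (assumption: only to keep the run short)
X, y = datasets.load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# The estimator given here is only a placeholder for the `preprocessor` step
pipe = Pipeline(steps=[("preprocessor", PCA()), ("svm", SVC())])

param_grid = {
    # The step itself is a hyperparameter: try both reducers
    "preprocessor": [PCA(), TruncatedSVD()],
    # Works because both reducers expose `n_components`
    "preprocessor__n_components": [10, 20],
    "svm__C": [1, 10],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X, y)

# Inspect which reducer the best pipeline actually uses
chosen = search.best_estimator_.named_steps["preprocessor"]
print(type(chosen).__name__, chosen.n_components)
```

If the candidate estimators do not share parameter names, this single-dict form no longer applies, and a list of dicts (one per estimator) is needed instead.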
Generalization of this scheme
We can write a function that automatically populates the param_grid to be supplied to GridSearchCV with the appropriate values:
import itertools

def make_param_grids(steps, param_grids):
    final_params = []
    # itertools.product builds the Cartesian product, so that
    # (pca OR svd) AND (svm OR rf) becomes ->
    # (pca, svm), (pca, rf), (svd, svm), (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}
        # step_name and estimator_name must correspond,
        # i.e. the preprocessor must be one of pca and select.
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            # .items() works on Python 3 (the Python 2 .iteritems() does not)
            for param, value in param_grids[estimator_name].items():
                if param == 'object':
                    # Set the actual estimator in the pipeline
                    current_grid[step_name] = [value]
                else:
                    # Set the parameters belonging to that estimator
                    current_grid[step_name + '__' + param] = value
        # Append this dictionary to the final params
        final_params.append(current_grid)
    return final_params
This function can then be used with any number of transformers and estimators:
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest

# Add all the estimators you want to "OR" under a single key:
# use OR between `pca` and `select`,
# use OR between `svm` and `rf`.
# Different keys are evaluated as serial steps in the pipeline.
pipeline_steps = {'preprocessor': ['pca', 'select'],
                  'classifier': ['svm', 'rf']}
# Fill in the parameters to be searched in this dict
all_param_grids = {'svm': {'object': SVC(),
                           'C': [0.1, 0.2]
                           },
                   'rf': {'object': RandomForestClassifier(),
                          'n_estimators': [10, 20]
                          },
                   'pca': {'object': PCA(),
                           'n_components': [10, 20]
                           },
                   'select': {'object': SelectKBest(),
                              'k': [5, 10]
                              }
                   }
# Call the function on the variables declared above
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
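To see what the generated list looks like, here is a standalone sketch of the same expansion. It repeats a compact Python 3 version of the function and, as an assumption for readability, uses plain strings such as 'PCA()' in place of real estimator objects so that it runs without scikit-learn.

```python
import itertools

def make_param_grids(steps, param_grids):
    """Expand (A OR B) AND (C OR D) into one grid dict per combination."""
    final_params = []
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids[estimator_name].items():
                if param == "object":
                    # The estimator itself becomes the value of the step
                    current_grid[step_name] = [value]
                else:
                    current_grid[step_name + "__" + param] = value
        final_params.append(current_grid)
    return final_params

pipeline_steps = {"preprocessor": ["pca", "select"],
                  "classifier": ["svm", "rf"]}

# Strings stand in for estimator instances (assumption, for illustration only)
all_param_grids = {
    "svm": {"object": "SVC()", "C": [0.1, 0.2]},
    "rf": {"object": "RandomForestClassifier()", "n_estimators": [10, 20]},
    "pca": {"object": "PCA()", "n_components": [10, 20]},
    "select": {"object": "SelectKBest()", "k": [5, 10]},
}

grids = make_param_grids(pipeline_steps, all_param_grids)

# One dict per combination:
# (pca, svm), (pca, rf), (select, svm), (select, rf)
for grid in grids:
    print(grid)
```

Each dict is a self-contained grid, so parameters such as `preprocessor__k` are only ever combined with the estimator that actually has them.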
Now initialize a pipeline object with the same step names as used in pipeline_steps above:
# The PCA() and SVC() used here are just to initialize the pipeline,
# actual estimators will be used from our `param_grids_list`
pipe = Pipeline(steps=[('preprocessor',PCA()), ('classifier', SVC())])
Finally, set up the GridSearchCV object and fit the data:
grd = GridSearchCV(pipe, param_grid = param_grids_list)
grd.fit(X_train, y_train)
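Once the search has been fitted, the combinations can be compared through `cv_results_`. The sketch below is a reduced stand-in, not the full setup: it hand-writes only two of the four grids, uses a 300-sample subset of the digits data, and sets `cv=3` and `random_state=0`, all of which are my assumptions to keep it short and repeatable.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small subset of the digits data (assumption: only to keep the run short)
X, y = datasets.load_digits(return_X_y=True)
X, y = X[:300], y[:300]

pipe = Pipeline(steps=[("preprocessor", PCA()), ("classifier", SVC())])

# Hand-written stand-in for two entries of `param_grids_list`
param_grids_list = [
    {"preprocessor": [PCA()], "preprocessor__n_components": [10, 20],
     "classifier": [SVC()], "classifier__C": [0.1, 0.2]},
    {"preprocessor": [SelectKBest()], "preprocessor__k": [5, 10],
     "classifier": [RandomForestClassifier(random_state=0)],
     "classifier__n_estimators": [10, 20]},
]

grd = GridSearchCV(pipe, param_grid=param_grids_list, cv=3)
grd.fit(X, y)

# Mean cross-validated score of every candidate, labelled by estimator type
results = grd.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    pre = type(params["preprocessor"]).__name__
    clf = type(params["classifier"]).__name__
    print(f"{pre} + {clf}: {score:.3f}")

print("best:", grd.best_params_, grd.best_score_)
```

SelectKBest may emit warnings about constant features on this dataset; they do not stop the search.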
Source: https://stackoverflow.com/questions/42266737/parallel-pipeline-to-get-best-model-using-gridsearch