“Parallel” pipeline to get best model using gridsearch

时光取名叫无心 2020-12-15 15:01

In sklearn, a serial pipeline can be defined to get the best combination of hyperparameters for all consecutive parts of the pipeline. A serial pipeline can be implemented a

1 answer
  • 2020-12-15 15:35

    Pipeline accepts None as an entry in its steps (the list of estimators), which lets you switch a given part of the pipeline off.

    To skip an estimator, set its step name to None in the params passed to GridSearchCV. Recent scikit-learn versions also accept the string 'passthrough' for the same purpose.

    Let's assume you want to try both PCA and TruncatedSVD.

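    from sklearn import decomposition
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC
    pca = decomposition.PCA()
    svd = decomposition.TruncatedSVD()
    svm = SVC()
    n_components = [20, 40, 64]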
    

    Add svd to the pipeline as a separate step:

    pipe = Pipeline(steps=[('pca', pca), ('svd', svd), ('svm', svm)])
    
    # Change params_grid: instead of a single dict, make it a list of dicts.
    # The first dict sets `svd` to None, the second sets `pca` to None.
    params_grid = [{
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'pca__n_components': n_components,
        'svd': [None],
    },
    {
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'pca': [None],
        'svd__n_components': n_components,
        'svd__algorithm': ['randomized'],
    }]
    

    Now pass the pipeline object to GridSearchCV:

    grd = GridSearchCV(pipe, param_grid=params_grid)
    

    Calling grd.fit() searches every element of the params_grid list in turn, trying all combinations of the values within one dict at a time. Values from different dicts are never mixed.
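
    To see what "one dict at a time" means for the size of the search, here is a minimal standard-library sketch of the expansion. The helper `expand_grid` is a hypothetical stand-in written for this illustration; scikit-learn does this itself (its `ParameterGrid` class accepts the same dict or list of dicts), so this is only meant to make the counting explicit.

```python
import itertools


def expand_grid(param_grid):
    """Yield one dict per candidate for a dict or a list of dicts (illustrative helper)."""
    grids = [param_grid] if isinstance(param_grid, dict) else param_grid
    for grid in grids:
        names = sorted(grid)
        # Cartesian product of the value lists inside this one dict only
        for values in itertools.product(*(grid[name] for name in names)):
            yield dict(zip(names, values))


n_components = [20, 40, 64]
svm_params = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
}
params_grid = [
    {**svm_params, 'pca__n_components': n_components, 'svd': [None]},
    {**svm_params, 'pca': [None], 'svd__n_components': n_components,
     'svd__algorithm': ['randomized']},
]

candidates = list(expand_grid(params_grid))
print(len(candidates))  # 48 candidates with PCA + 48 with TruncatedSVD = 96
```

    Each candidate switches off exactly one of the two reducers, so no fit ever runs PCA and TruncatedSVD back to back.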

    Simplification if parameters have same name

    If both estimators in your "OR" share the same parameter names, as here where PCA and TruncatedSVD both have n_components (or if you only want to search over that shared parameter), this can be simplified to:

    # Here the step is renamed to `preprocessor`
    pipe = Pipeline(steps=[('preprocessor', pca), ('svm', svm)])

    # Now assign both estimators to `preprocessor`:
    params_grid = {
        'svm__C': [1, 10, 100, 1000],
        'svm__kernel': ['linear', 'rbf'],
        'svm__gamma': [0.001, 0.0001],
        'preprocessor': [pca, svd],
        'preprocessor__n_components': n_components,
    }
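
    As a sanity check that the single-dict form covers the same search space as the two-dict version, the candidates can be counted with the standard library alone. This is only a sketch: the strings 'pca' and 'svd' are placeholders for the estimator objects, and the expansion is done by hand rather than by scikit-learn.

```python
import itertools
from collections import Counter

params_grid = {
    'svm__C': [1, 10, 100, 1000],
    'svm__kernel': ['linear', 'rbf'],
    'svm__gamma': [0.001, 0.0001],
    'preprocessor': ['pca', 'svd'],  # placeholder strings, not real estimators
    'preprocessor__n_components': [20, 40, 64],
}

names = list(params_grid)
# Every combination of one value per key
candidates = [dict(zip(names, values))
              for values in itertools.product(*params_grid.values())]

# How many candidates use each preprocessor
per_preprocessor = Counter(c['preprocessor'] for c in candidates)
print(len(candidates), dict(per_preprocessor))  # 96 {'pca': 48, 'svd': 48}
```

    The shortcut depends on every estimator assigned to `preprocessor` accepting n_components; if the parameter names differ, the list-of-dicts form is needed.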
    

    Generalization of this scheme

    We can write a function that automatically builds the param_grid to pass to GridSearchCV from the appropriate values:

    import itertools

    def make_param_grids(steps, param_grids):

        final_params = []

        # itertools.product yields the Cartesian product, so
        # (pca OR select) AND (svm OR rf) becomes ->
        # (pca, svm), (pca, rf), (select, svm), (select, rf)
        for estimator_names in itertools.product(*steps.values()):
            current_grid = {}

            # step_name and estimator_name must correspond,
            # e.g. 'preprocessor' is paired with 'pca' or 'select'.
            for step_name, estimator_name in zip(steps.keys(), estimator_names):
                for param, value in param_grids.get(estimator_name).items():
                    if param == 'object':
                        # Set the actual estimator for this pipeline step
                        current_grid[step_name] = [value]
                    else:
                        # Set the parameters belonging to that estimator
                        current_grid[step_name + '__' + param] = value
            # Append this dictionary to the final params
            final_params.append(current_grid)

        return final_params
    

    Then use this function with any number of transformers and estimators:

    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest
    from sklearn.svm import SVC

    # Put all the estimators you want to "OR" under a single key:
    # OR between `pca` and `select`,
    # OR between `svm` and `rf`.
    # Different keys are evaluated as serial steps of the pipeline.
    pipeline_steps = {'preprocessor': ['pca', 'select'],
                      'classifier': ['svm', 'rf']}

    # Fill in the parameters to be searched in this dict
    all_param_grids = {'svm': {'object': SVC(),
                               'C': [0.1, 0.2]
                              },

                       'rf': {'object': RandomForestClassifier(),
                              'n_estimators': [10, 20]
                             },

                       'pca': {'object': PCA(),
                               'n_components': [10, 20]
                              },

                       'select': {'object': SelectKBest(),
                                  'k': [5, 10]
                                 }
                      }

    # Call the function on the variables declared above
    param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
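
    To show the shape of the list this call produces, here is a condensed, standard-library-only rerun of the same function. As an assumption for the illustration, plain strings such as 'PCA()' stand in for the real estimator objects so the output is easy to read.

```python
import itertools


def make_param_grids(steps, param_grids):
    """Build one param-grid dict per combination of estimators (condensed copy)."""
    final_params = []
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}
        for step_name, estimator_name in zip(steps, estimator_names):
            for param, value in param_grids[estimator_name].items():
                if param == 'object':
                    current_grid[step_name] = [value]
                else:
                    current_grid[step_name + '__' + param] = value
        final_params.append(current_grid)
    return final_params


pipeline_steps = {'preprocessor': ['pca', 'select'],
                  'classifier': ['svm', 'rf']}

# Strings are placeholders for the estimator instances used in the answer
all_param_grids = {
    'svm': {'object': 'SVC()', 'C': [0.1, 0.2]},
    'rf': {'object': 'RandomForestClassifier()', 'n_estimators': [10, 20]},
    'pca': {'object': 'PCA()', 'n_components': [10, 20]},
    'select': {'object': 'SelectKBest()', 'k': [5, 10]},
}

grids = make_param_grids(pipeline_steps, all_param_grids)
for grid in grids:
    print(grid)
# First line printed:
# {'preprocessor': ['PCA()'], 'preprocessor__n_components': [10, 20],
#  'classifier': ['SVC()'], 'classifier__C': [0.1, 0.2]}
```

    The result is four dicts, one per (preprocessor, classifier) pair, and each dict expands to 2 x 2 = 4 candidates, so the search covers 16 parameter settings in total.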
    

    Now initialize a pipeline object whose step names match the keys used in pipeline_steps above:

    # The PCA() and SVC() used here are just to initialize the pipeline,
    # actual estimators will be used from our `param_grids_list`
    pipe = Pipeline(steps=[('preprocessor',PCA()), ('classifier', SVC())])  
    

    Finally, set up the GridSearchCV object and fit the data:

    grd = GridSearchCV(pipe, param_grid=param_grids_list)
    grd.fit(X, y)
    