XGBoost with GridSearchCV, Scaling, PCA, and Early-Stopping in sklearn Pipeline

Posted by 夙愿已清 on 2020-06-09 11:31:45

Question


I want to combine an XGBoost model with input scaling and feature-space reduction via PCA. In addition, the model's hyperparameters as well as the number of PCA components should be tuned using cross-validation, and early stopping should be added to prevent the model from overfitting.

To combine the various steps, I decided to use sklearn's Pipeline functionality.

At the beginning, I had some problems making sure that the PCA is also applied to the validation set, but I think using XGB__eval_set does the trick.

The code actually runs without any errors, but it seems to run forever (at some point the CPU usage of all cores drops to zero, yet the processes keep running for hours; I had to kill the session eventually).
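One plausible cause, not confirmed in this thread, is nested parallelism: GridSearchCV forks five worker processes while XGBoost's own OpenMP threads already use all cores, which can oversubscribe the machine or deadlock after a fork. A common workaround is to pin XGBoost to a single thread:

from xgboost import XGBRegressor

# Assumption, not from the original post: one thread per XGBoost instance,
# so the 5 GridSearchCV worker processes do not fight over cores.
single_threaded_xgb = XGBRegressor(n_jobs=1)   # older releases use nthread=1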

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor   

# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X_with_features, y, test_size=0.2, random_state=123)

# Train / Validation split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=123)

# Pipeline
pipe = Pipeline(steps=[("Scale", StandardScaler()),
                       ("PCA", PCA()),
                       ("XGB", XGBRegressor())])

# Hyper-parameter grid (Test only)
grid_param_pipe = {'PCA__n_components': [5],
                   'XGB__n_estimators': [1000],
                   'XGB__max_depth': [3],
                   'XGB__reg_alpha': [0.1],
                   'XGB__reg_lambda': [0.1]}

# Grid object
grid_search_pipe = GridSearchCV(estimator=pipe,
                                param_grid=grid_param_pipe,
                                scoring="neg_mean_squared_error",
                                cv=5,
                                n_jobs=5,
                                verbose=3)

# Run CV
grid_search_pipe.fit(X_train, y_train, XGB__early_stopping_rounds=10, XGB__eval_metric="rmse", XGB__eval_set=[(X_val, y_val)])
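For reference, here is a minimal sketch, written only for illustration, of what this call effectively does: Pipeline.fit pushes the training data through the transformers, but forwards the XGB__-prefixed fit parameters to the final estimator verbatim, so X_val arrives at XGBoost unscaled and without the PCA projection.

# Roughly what pipe.fit(...) expands to for one fit -- illustration only:
Xt = pipe.named_steps["Scale"].fit_transform(X_train)
Xt = pipe.named_steps["PCA"].fit_transform(Xt)
pipe.named_steps["XGB"].fit(Xt, y_train,
                            early_stopping_rounds=10,
                            eval_metric="rmse",
                            eval_set=[(X_val, y_val)])   # X_val is still raw!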

Answer 1:


The problem is that the fit method requires an evaluation set created externally, but the pipeline cannot transform such a set before it reaches XGBoost.

This is a bit hacky, but the idea is to create a thin wrapper around the xgboost regressor/classifier that prepares the evaluation set internally.

from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor, XGBClassifier

class XGBoostWithEarlyStop(BaseEstimator):
    def __init__(self, early_stopping_rounds=5, test_size=0.1, 
                 eval_metric='mae', **estimator_params):
        self.early_stopping_rounds = early_stopping_rounds
        self.test_size = test_size
        self.eval_metric = eval_metric
        # self.estimator is set by the subclass __init__ before this runs
        if self.estimator is not None:
            self.set_params(**estimator_params)

    def set_params(self, **params):
        # Delegate all hyperparameters to the wrapped XGBoost estimator
        return self.estimator.set_params(**params)

    def get_params(self, **params):
        return self.estimator.get_params()

    def fit(self, X, y):
        # Split off an internal validation set; at this point X has already
        # been transformed by the pipeline's preceding steps
        x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=self.test_size)
        self.estimator.fit(x_train, y_train, 
                           early_stopping_rounds=self.early_stopping_rounds, 
                           eval_metric=self.eval_metric, eval_set=[(x_val, y_val)])
        return self

    def predict(self, X):
        return self.estimator.predict(X)

class XGBoostRegressorWithEarlyStop(XGBoostWithEarlyStop):
    def __init__(self, *args, **kwargs):
        self.estimator = XGBRegressor()
        super(XGBoostRegressorWithEarlyStop, self).__init__(*args, **kwargs)

class XGBoostClassifierWithEarlyStop(XGBoostWithEarlyStop):
    def __init__(self, *args, **kwargs):
        self.estimator = XGBClassifier()
        super(XGBoostClassifierWithEarlyStop, self).__init__(*args, **kwargs)
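As a quick sanity check, the wrapper can also be used standalone like a plain regressor. A minimal usage sketch, assuming X_train, y_train, and X_test already exist:

model = XGBoostRegressorWithEarlyStop(early_stopping_rounds=10,
                                      test_size=0.2,
                                      eval_metric='rmse',
                                      n_estimators=500)
model.fit(X_train, y_train)   # splits off its own 20% eval set internally
preds = model.predict(X_test)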

Below is a test.

from sklearn.datasets import load_diabetes
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

x, y = load_diabetes(return_X_y=True)
print(x.shape, y.shape)
# (442, 10) (442,)

pipe = Pipeline([
    ('pca', PCA(5)),
    ('xgb', XGBoostRegressorWithEarlyStop())
])

param_grid = {
    'pca__n_components': [3, 5, 7],
    'xgb__n_estimators': [10, 20, 30, 50]
}

grid = GridSearchCV(pipe, param_grid, scoring='neg_mean_absolute_error')
grid.fit(x, y)
print(grid.best_params_)

As a feature request to the developers, the easiest extension would be to allow XGBRegressor to create an evaluation set internally when none is provided. That way, no extension to scikit-learn would be necessary (I guess).
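To make that concrete, such an API could mirror what scikit-learn's own gradient boosting already offers; the xgboost parameter sketched below is hypothetical and does not exist:

# Hypothetical API sketch -- validation_fraction is NOT a real XGBRegressor parameter:
# reg = XGBRegressor(validation_fraction=0.1, early_stopping_rounds=10)
# reg.fit(X, y)   # would split off the eval set internally, analogous to
# sklearn.ensemble.GradientBoostingRegressor(validation_fraction=0.1, n_iter_no_change=10)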



Source: https://stackoverflow.com/questions/50824326/xgboost-with-gridsearchcv-scaling-pca-and-early-stopping-in-sklearn-pipeline
