XGboost: cannot pass validation data for eval_set in pipeline

问题

I want to implement GridSearchCV for XGboost model in pipeline. I have preprocessor for data, defined above the code, some grid params

XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('XGBmodel', XGBmodel)
])

And I want to pass these fit params

fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)], 
              "XGBmodel__early_stopping_rounds": 10, 
              "XGBmodel__verbose": False}

I am trying to fit model

searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)

but I get error on the line with eval_set: DataFrame.dtypes for data must be int, float or bool

I guess it is because validation data aren't going through the preprocessing, but when I google I find that everywhere it is done by this way and seems it should work. Also I tried to find a way to apply preprocessor for validation data separately, but it is not possible to transform validation data without fitting train data before it.

Full code

columns = num_cols + cat_cols
X_train = X_full_train[columns].copy()
X_valid = X_full_valid[columns].copy()

num_preprocessor = SimpleImputer(strategy = 'mean')
cat_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', num_preprocessor, num_cols),
    ('cat', cat_preprocessor, cat_cols)
])

XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('XGBmodel', XGBmodel)
])

param_grid = {
    "XGBmodel__n_estimators": [10, 50, 100, 500],
    "XGBmodel__learning_rate": [0.1, 0.5, 1],
}

fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)], 
              "XGBmodel__early_stopping_rounds": 10, 
              "XGBmodel__verbose": False}

searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)

Is there any way to preprocess validation data in pipeline? Or maybe completely different way to implement this thing?

回答1:

There is no good way. If you have a long pipeline of transformers before fitting a model, then you can consider to fit those in the pipeline and then apply the model separately.

The underlying issue is that a pipeline has no notion of a validation set used in the model fitting. You can see a discussion on LightGBM github here. Their proposal is to pre-train transformers and apply those to the validation data before you fit the full pipeline. This can be fine, if you use fast transformers, but can double CPU time in an extreme scenario.

来源：https://stackoverflow.com/questions/56346632/xgboost-cannot-pass-validation-data-for-eval-set-in-pipeline

标签

python-3.x

machine-learning

scikit-learn

xgboost