XGboost: cannot pass validation data for eval_set in pipeline

守給你的承諾、 提交于 2021-01-22 12:10:53

问题


I want to implement GridSearchCV for XGboost model in pipeline. I have preprocessor for data, defined above the code, some grid params

XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('XGBmodel', XGBmodel)
])

And I want to pass these fit params

fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)], 
              "XGBmodel__early_stopping_rounds": 10, 
              "XGBmodel__verbose": False}

I am trying to fit model

searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)

but I get error on the line with eval_set: DataFrame.dtypes for data must be int, float or bool

I guess it is because validation data aren't going through the preprocessing, but when I google I find that everywhere it is done by this way and seems it should work. Also I tried to find a way to apply preprocessor for validation data separately, but it is not possible to transform validation data without fitting train data before it.

Full code

columns = num_cols + cat_cols
X_train = X_full_train[columns].copy()
X_valid = X_full_valid[columns].copy()

num_preprocessor = SimpleImputer(strategy = 'mean')
cat_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', num_preprocessor, num_cols),
    ('cat', cat_preprocessor, cat_cols)
])

XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('XGBmodel', XGBmodel)
])

param_grid = {
    "XGBmodel__n_estimators": [10, 50, 100, 500],
    "XGBmodel__learning_rate": [0.1, 0.5, 1],
}

fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)], 
              "XGBmodel__early_stopping_rounds": 10, 
              "XGBmodel__verbose": False}

searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)

Is there any way to preprocess validation data in pipeline? Or maybe completely different way to implement this thing?


回答1:


There is no good way. If you have a long pipeline of transformers before fitting a model, then you can consider to fit those in the pipeline and then apply the model separately.

The underlying issue is that a pipeline has no notion of a validation set used in the model fitting. You can see a discussion on LightGBM github here. Their proposal is to pre-train transformers and apply those to the validation data before you fit the full pipeline. This can be fine, if you use fast transformers, but can double CPU time in an extreme scenario.



来源:https://stackoverflow.com/questions/56346632/xgboost-cannot-pass-validation-data-for-eval-set-in-pipeline

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!