Question
I noticed that there are two possible ways to train an XGBoost model in Python: the native learning API (xgboost.train) and the scikit-learn wrapper (XGBRegressor).
When I ran the same dataset through both implementations with the same hyperparameters, the results were different.
Code
import xgboost
from xgboost import XGBRegressor
import pandas as pd
import numpy as np
from sklearn import datasets
# Note: load_boston was removed in scikit-learn 1.2; this snippet assumes an older version
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df['target'] = pd.Series(boston_data.target)
Y = df["target"]
X = df.drop('target', axis=1)
#### Code using the native implementation of XGBoost
# Note: missing=0.0 tells XGBoost to treat every 0.0 in X as a missing value
dtrain = xgboost.DMatrix(X, label=Y, missing=0.0)
params = {'max_depth': 3, 'learning_rate': 0.05, 'min_child_weight': 4, 'subsample': 0.8}
evallist = [(dtrain, 'eval'), (dtrain, 'train')]
model = xgboost.train(params, dtrain, num_boost_round=200, evals=evallist)
predictions = model.predict(dtrain)
#### Code using the sklearn wrapper for XGBoost
model2 = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05, min_child_weight=4, subsample=0.8)
# model2 = model2.fit(X, Y, eval_set=[(X, Y), (X, Y)], eval_metric='rmse', verbose=True)
model2 = model2.fit(X, Y)
predictions2 = model2.predict(X)
print(np.absolute(predictions-predictions2).sum())
Absolute difference sum using the sklearn Boston dataset:
62.687134
When I ran the same comparison on other datasets, such as the sklearn diabetes dataset, the difference was much smaller.
Absolute difference sum using the sklearn diabetes dataset:
0.0011711121
Answer 1:
Make sure the random seeds are the same. With subsample=0.8 each implementation randomly samples 80% of the rows for every tree, so unless the seeds match you are comparing two different random draws. For both approaches set the same seed:
param['seed'] = 123
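For example, a minimal sketch of setting the same seed in both APIs (random_state is the wrapper's name for the native seed parameter):

# Native API: the seed goes into the params dict
params = {'max_depth': 3, 'learning_rate': 0.05, 'min_child_weight': 4,
          'subsample': 0.8, 'seed': 123}
model = xgboost.train(params, dtrain, num_boost_round=200)

# Sklearn wrapper: random_state maps to the same seed
model2 = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05,
                      min_child_weight=4, subsample=0.8, random_state=123)
model2 = model2.fit(X, Y)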
EDIT: then there are a couple of other things to check. First, is n_estimators also 200 (it maps to num_boost_round in the native API)? Are you handling missing values the same way, i.e. does the wrapper also treat 0 as missing? And are the other default values the same? (For that last one I think yes, because it is a wrapper, but check the first two.)
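A sketch of how you might check those two points: XGBRegressor accepts a missing argument, so passing missing=0.0 reproduces what DMatrix(X, label=Y, missing=0.0) does on the native side, and save_config() (available in XGBoost 1.0+) dumps the fully resolved parameters of each booster so the remaining defaults can be compared directly:

import json

# Make the wrapper treat zeros as missing, matching DMatrix(..., missing=0.0);
# by default XGBRegressor treats only np.nan as missing
model2 = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05,
                      min_child_weight=4, subsample=0.8,
                      random_state=123, missing=0.0)
model2 = model2.fit(X, Y)

# Dump the resolved parameters of both boosters for side-by-side comparison
native_cfg = json.loads(model.save_config())             # Booster from xgboost.train
wrapper_cfg = json.loads(model2.get_booster().save_config())
print(json.dumps(native_cfg, indent=2))
print(json.dumps(wrapper_cfg, indent=2))

The missing handling is the likely culprit here: Boston columns such as ZN and CHAS contain many genuine zeros, so missing=0.0 discards real information in the native run, while the sklearn diabetes features are mean-centered and contain almost no exact zeros, which would explain why the difference is so much smaller there.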
Source: https://stackoverflow.com/questions/59395651/difference-is-value-between-xgb-train-and-xgb-xgbregressor-in-python-for-certain