Question
I'm reaching out to all SO C++ geniuses.
I've trained (and successfully tested) an xgboost model in python like so:
dtrain = xgb.DMatrix(np.asmatrix(X_train), label=np.asarray(y_train, dtype=np.int), feature_names=feat_names)
optimal_model = xgb.train(plst, dtrain)
dtest = xgb.DMatrix(np.asmatrix(X_test),feature_names=feat_names)
optimal_model.save_model('sigdet.model')
I've followed a post on XGBoost (see link) which explains the correct way to load the model and apply prediction in C++:
// Load Model
g_learner = std::make_unique<Learner>(Learner::Create({}));
std::unique_ptr<dmlc::Stream> fi(dmlc::Stream::Create(filename, "r"));
g_learner->Load(fi.get());
// Predict
DMatrixHandle h_test;
XGDMatrixCreateFromMat((float *)features, 1, numFeatures , -999.9f, &h_test);
xgboost::bst_ulong out_len;
std::vector<float> preds;
g_learner->Predict((DMatrix*)h_test,true, &preds);
My problem (1): I need to create a DMatrix*, however I only have a DMatrixHandle. How do I properly create a DMatrix with my data?
My problem (2): When I tried the following prediction method:
DMatrixHandle h_test;
XGDMatrixCreateFromMat((float *)features, 1, numFeatures , -999.9f, &h_test);
xgboost::bst_ulong out_len;
int res = XGBoosterPredict(g_modelHandle, h_test, 1, 0, &out_len, (const float**)&scores);
I'm getting completely different scores than when I load the exact same model and use it for prediction in Python.
Whoever helps me achieve consistent results across C++ and Python will probably go to heaven. BTW, I need to apply prediction in C++ for a real-time application; otherwise I would use a different language.
Answer 1:
To get the DMatrix you can do this:
g_learner->Predict(static_cast<std::shared_ptr<xgboost::DMatrix>*>(h_test)->get(), true, &pred);
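For context, here is a minimal sketch of how that cast fits together, assuming a raw Learner* (the question's g_learner.get()) plus the features and numFeatures variables from the question; the exact Learner::Predict signature differs between XGBoost versions, so treat this as illustrative rather than the definitive API:
#include <memory>
#include <vector>
#include <xgboost/c_api.h>
#include <xgboost/data.h>     // xgboost::DMatrix
#include <xgboost/learner.h>  // xgboost::Learner

void predict_one_row(xgboost::Learner* g_learner,
                     const float* features, int numFeatures) {
  // Build the handle exactly as in the question.
  DMatrixHandle h_test;
  XGDMatrixCreateFromMat(features, 1, numFeatures, -999.9f, &h_test);

  // A DMatrixHandle is a void* pointing at a std::shared_ptr<DMatrix>,
  // so the raw DMatrix* can be recovered like this:
  xgboost::DMatrix* dmat =
      static_cast<std::shared_ptr<xgboost::DMatrix>*>(h_test)->get();

  std::vector<float> preds;
  g_learner->Predict(dmat, /*output_margin=*/true, &preds);

  XGDMatrixFree(h_test);
}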
For problem (2), I don't have an answer. This is actually the same problem I have: I've got an XGBRegressor in Python and I obtain different results with the same features in C++.
Answer 2:
So the method you are using to serialize your model:
optimal_model.save_model('sigdet.model')
This method strips the model of all of its feature names (see https://github.com/dmlc/xgboost/issues/3089).
When you load the model into C++ for prediction, the column feature ordering is not necessarily maintained. You can verify this by calling the .dump_model() method.
Additionally, calling .dump_model() on both your Python and C++ model objects will yield the same decision trees, but the Python one will have all the feature names while the C++ one will likely have f0, f1, f2, .... You can compare the two to recover your actual column ordering, and then your predictions will match across languages (not exactly, because of rounding).
I do not know how the columns get ordered, but it seems to be a stable process that maintains ordering even when you retrain the same model on a sliding data window. I am not 100% confident here, and would also appreciate clarity.
This problem affects many XGBoost models that are trained in Python and used for prediction from another language. I have faced it in Java, and it doesn't seem like there is a way to persist the feature column ordering across the different bindings of XGBoost.
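On the C++ side, one way to get the equivalent of .dump_model() for this comparison is XGBoosterDumpModel from the C API. A minimal sketch, assuming a booster handle already loaded from the saved model file (the empty feature-map string is why features come back as f0, f1, ...):
#include <iostream>
#include <xgboost/c_api.h>

void dump_trees(BoosterHandle booster) {
  bst_ulong num_trees = 0;
  const char** dump = nullptr;
  // With no feature map file the dump uses the generic names f0, f1, f2, ...
  XGBoosterDumpModel(booster, /*fmap=*/"", /*with_stats=*/0, &num_trees, &dump);
  for (bst_ulong i = 0; i < num_trees; ++i)
    std::cout << "tree[" << i << "]:\n" << dump[i] << std::endl;
}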
Answer 3:
Here is an example; the predictions this program produces are consistent:
const int cols=3,rows=100;
float train[rows][cols];
for (int i=0;i<rows;i++)
    for (int j=0;j<cols;j++)
        train[i][j] = (i+1) * (j+1);
float train_labels[rows];
for (int i=0;i<50;i++)
    train_labels[i] = 0;
for (int i=50;i<rows;i++)
    train_labels[i] = 1;
// convert to DMatrix
DMatrixHandle h_train[1];
XGDMatrixCreateFromMat((float *) train, rows, cols, -1, &h_train[0]);
// load the labels
XGDMatrixSetFloatInfo(h_train[0], "label", train_labels, rows);
// read back the labels, just a sanity check
bst_ulong bst_result;
const float *out_floats;
XGDMatrixGetFloatInfo(h_train[0], "label" , &bst_result, &out_floats);
for (unsigned int i=0;i<bst_result;i++)
    std::cout << "label[" << i << "]=" << out_floats[i] << std::endl;
// create the booster and load some parameters
BoosterHandle h_booster;
XGBoosterCreate(h_train, 1, &h_booster);
XGBoosterSetParam(h_booster, "objective", "binary:logistic");
XGBoosterSetParam(h_booster, "eval_metric", "error");
XGBoosterSetParam(h_booster, "silent", "0");
XGBoosterSetParam(h_booster, "max_depth", "9");
XGBoosterSetParam(h_booster, "eta", "0.1");
XGBoosterSetParam(h_booster, "min_child_weight", "3");
XGBoosterSetParam(h_booster, "gamma", "0.6");
XGBoosterSetParam(h_booster, "colsample_bytree", "1");
XGBoosterSetParam(h_booster, "subsample", "1");
XGBoosterSetParam(h_booster, "reg_alpha", "10");
// perform 10 learning iterations
for (int iter=0; iter<10; iter++)
    XGBoosterUpdateOneIter(h_booster, iter, h_train[0]);
// predict
const int sample_rows = 100;
float test[sample_rows][cols];
for (int i=0;i<sample_rows;i++)
    for (int j=0;j<cols;j++)
        test[i][j] = (i+1) * (j+1);
DMatrixHandle h_test;
XGDMatrixCreateFromMat((float *) test, sample_rows, cols, -1, &h_test);
bst_ulong out_len;
const float *f;
XGBoosterPredict(h_booster, h_test, 0,0,&out_len,&f);
for (unsigned int i=0;i<out_len;i++)
    std::cout << "prediction[" << i << "]=" << f[i] << std::endl;
// free xgboost internal structures
XGDMatrixFree(h_train[0]);
XGDMatrixFree(h_test);
XGBoosterFree(h_booster);
Answer 4:
Regarding problem (2): you train the model in Python and run prediction in C++, where the feature vector is a float* array:
DMatrixHandle h_test;
XGDMatrixCreateFromMat((float *)features, 1, numFeatures , -999.9f, &h_test);
xgboost::bst_ulong out_len;
int res = XGBoosterPredict(g_modelHandle, h_test, 1, 0, &out_len, (const float**)&scores);
So your model needs to be trained using the dense matrix format (a NumPy array). Below is the Python snippet from the official docs:
data = np.random.rand(5, 10) # 5 entities, each contains 10 features
label = np.random.randint(2, size=5) # binary target
dtrain = xgb.DMatrix(data, label=label)
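For completeness, a minimal sketch of the matching C++ side, assuming the model was saved as 'sigdet.model' as in the question (numFeatures and the dummy feature values are placeholders). One detail worth checking: XGBoosterPredict's option_mask of 0 returns transformed predictions, like Python's predict(), whereas the 1 used in the question requests raw margins, which alone can make the scores look different.
#include <iostream>
#include <vector>
#include <xgboost/c_api.h>

int main() {
  // Load the model saved from Python.
  BoosterHandle booster;
  XGBoosterCreate(nullptr, 0, &booster);
  XGBoosterLoadModel(booster, "sigdet.model");

  // One dense row; numFeatures and the values are placeholders and must
  // match the training data layout.
  const int numFeatures = 4;
  std::vector<float> features(numFeatures, 0.5f);
  DMatrixHandle h_test;
  XGDMatrixCreateFromMat(features.data(), 1, numFeatures, -999.9f, &h_test);

  bst_ulong out_len = 0;
  const float* scores = nullptr;
  // option_mask = 0: transformed predictions (what Python's predict() gives);
  // 1 would return the raw margin instead.
  XGBoosterPredict(booster, h_test, 0, 0, &out_len, &scores);
  for (bst_ulong i = 0; i < out_len; ++i)
    std::cout << "score[" << i << "]=" << scores[i] << std::endl;

  XGDMatrixFree(h_test);
  XGBoosterFree(booster);
  return 0;
}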
Source: https://stackoverflow.com/questions/39335051/xgboost-load-model-in-c-python-c-prediction-scores-mismatch