Question
I'm reaching out to all SO C++ geniuses.
I've trained (and successfully tested) an xgboost model in python like so:
dtrain = xgb.DMatrix(np.asmatrix(X_train), label=np.asarray(y_train, dtype=np.int), feature_names=feat_names)
optimal_model = xgb.train(plst, dtrain)
dtest = xgb.DMatrix(np.asmatrix(X_test),feature_names=feat_names)
optimal_model.save_model('sigdet.model')
I've followed a post on XGBoost (see link) which explains the correct way to load the model and apply prediction in C++:
// Load Model
g_learner = std::make_unique<Learner>(Learner::Create({}));
std::unique_ptr<dmlc::Stream> fi(dmlc::Stream::Create(filename, "r"));
g_learner->Load(fi.get());
// Predict
DMatrixHandle h_test;
XGDMatrixCreateFromMat((float *)features, 1, numFeatures , -999.9f, &h_test);
xgboost::bst_ulong out_len;
std::vector<float> preds;
g_learner->Predict((DMatrix*)h_test,true, &preds);
My problem (1): I need to create a DMatrix*, however I only have a DMatrixHandle. How do I properly create a DMatrix with my data?
My problem (2): When I tried the following prediction method:
DMatrixHandle h_test;
XGDMatrixCreateFromMat((float *)features, 1, numFeatures , -999.9f, &h_test);
xgboost::bst_ulong out_len;
int res = XGBoosterPredict(g_modelHandle, h_test, 1, 0, &out_len, (const float**)&scores);
I'm getting completely different scores than when I load the exact same model and use it for prediction in Python.
Whoever helps me achieve consistent results across C++ and Python will probably go to heaven. BTW, I need to apply prediction in C++ for a real-time application; otherwise I would use a different language.
Answer 1:
To get the DMatrix you can do this:
g_learner->Predict(static_cast<std::shared_ptr<xgboost::DMatrix>*>(h_test)->get(), true, &pred);
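For context, here is a minimal sketch of how that cast fits together, assuming a raw Learner* (the question's g_learner.get()) plus the features and numFeatures variables from the question; the exact Learner::Predict signature differs between XGBoost versions, so treat this as illustrative rather than the definitive API:
#include <memory>
#include <vector>
#include <xgboost/c_api.h>
#include <xgboost/data.h>     // xgboost::DMatrix
#include <xgboost/learner.h>  // xgboost::Learner

void predict_one_row(xgboost::Learner* g_learner,
                     const float* features, int numFeatures) {
  // Build the handle exactly as in the question.
  DMatrixHandle h_test;
  XGDMatrixCreateFromMat(features, 1, numFeatures, -999.9f, &h_test);

  // A DMatrixHandle is a void* pointing at a std::shared_ptr<DMatrix>,
  // so the raw DMatrix* can be recovered like this:
  xgboost::DMatrix* dmat =
      static_cast<std::shared_ptr<xgboost::DMatrix>*>(h_test)->get();

  std::vector<float> preds;
  g_learner->Predict(dmat, /*output_margin=*/true, &preds);

  XGDMatrixFree(h_test);
}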
For problem (2), I don't have an answer. This is actually the same problem I have: I've got an XGBRegressor in Python and I obtain different results with the same features in C++.
Answer 2:
So the method you are using to serialize your model:
optimal_model.save_model('sigdet.model')
This method strips the model of all of its feature names (see https://github.com/dmlc/xgboost/issues/3089).
When you load the model into C++ for prediction, the column feature ordering is not necessarily maintained. You can verify this by calling the .dump_model() method.
Additionally, calling .dump_model() on both your Python and C++ model objects will yield the same decision trees, but the Python one will have all the feature names while the C++ one will likely have f0, f1, f2, .... You can compare the two to recover your actual column ordering, and then your predictions will match across languages (not exactly, because of rounding).
I do not know how the columns get ordered, but it seems to be a stable process that maintains ordering even when you retrain the same model on a sliding data window. I am not 100% confident here, and would also appreciate clarity.
This problem affects many XGBoost models that are trained in Python and used for prediction from another language. I have faced it in Java, and it doesn't seem like there is a way to persist the feature column ordering across the different bindings of XGBoost.
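On the C++ side, one way to get the equivalent of .dump_model() for this comparison is XGBoosterDumpModel from the C API. A minimal sketch, assuming a booster handle already loaded from the saved model file (the empty feature-map string is why features come back as f0, f1, ...):
#include <iostream>
#include <xgboost/c_api.h>

void dump_trees(BoosterHandle booster) {
  bst_ulong num_trees = 0;
  const char** dump = nullptr;
  // With no feature map file the dump uses the generic names f0, f1, f2, ...
  XGBoosterDumpModel(booster, /*fmap=*/"", /*with_stats=*/0, &num_trees, &dump);
  for (bst_ulong i = 0; i < num_trees; ++i)
    std::cout << "tree[" << i << "]:\n" << dump[i] << std::endl;
}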
Answer 3:
Here is an example; the predictions this program produces are consistent:
const int cols=3,rows=100;
float train[rows][cols];
for (int i=0;i<rows;i++)
    for (int j=0;j<cols;j++)
        train[i][j] = (i+1) * (j+1);
float train_labels[rows];
for (int i=0;i<50;i++)
    train_labels[i] = 0;
for (int i=50;i<rows;i++)
    train_labels[i] = 1;
// convert to DMatrix
DMatrixHandle h_train[1];
XGDMatrixCreateFromMat((float *) train, rows, cols, -1, &h_train[0]);
// load the labels
XGDMatrixSetFloatInfo(h_train[0], "label", train_labels, rows);
// read back the labels, just a sanity check
bst_ulong bst_result;
const float *out_floats;
XGDMatrixGetFloatInfo(h_train[0], "label" , &bst_result, &out_floats);
for (unsigned int i=0;i<bst_result;i++)
    std::cout << "label[" << i << "]=" << out_floats[i] << std::endl;
// create the booster and load some parameters
BoosterHandle h_booster;
XGBoosterCreate(h_train, 1, &h_booster);
XGBoosterSetParam(h_booster, "objective", "binary:logistic");
XGBoosterSetParam(h_booster, "eval_metric", "error");
XGBoosterSetParam(h_booster, "silent", "0");
XGBoosterSetParam(h_booster, "max_depth", "9");
XGBoosterSetParam(h_booster, "eta", "0.1");
XGBoosterSetParam(h_booster, "min_child_weight", "3");
XGBoosterSetParam(h_booster, "gamma", "0.6");
XGBoosterSetParam(h_booster, "colsample_bytree", "1");
XGBoosterSetParam(h_booster, "subsample", "1");
XGBoosterSetParam(h_booster, "reg_alpha", "10");
// perform 10 learning iterations
for (int iter=0; iter<10; iter++)
    XGBoosterUpdateOneIter(h_booster, iter, h_train[0]);
// predict
const int sample_rows = 100;
float test[sample_rows][cols];
for (int i=0;i<sample_rows;i++)
    for (int j=0;j<cols;j++)
        test[i][j] = (i+1) * (j+1);
DMatrixHandle h_test;
XGDMatrixCreateFromMat((float *) test, sample_rows, cols, -1, &h_test);
bst_ulong out_len;
const float *f;
XGBoosterPredict(h_booster, h_test, 0,0,&out_len,&f);
for (unsigned int i=0;i<out_len;i++)
    std::cout << "prediction[" << i << "]=" << f[i] << std::endl;
// free xgboost internal structures
XGDMatrixFree(h_train[0]);
XGDMatrixFree(h_test);
XGBoosterFree(h_booster);
Answer 4:
Regarding problem (2): you train the model in Python and run prediction in C++, where the feature vector is a float* array:
DMatrixHandle h_test;
XGDMatrixCreateFromMat((float *)features, 1, numFeatures , -999.9f, &h_test);
xgboost::bst_ulong out_len;
int res = XGBoosterPredict(g_modelHandle, h_test, 1, 0, &out_len, (const float**)&scores);
So your model needs to be trained using the dense matrix format (a NumPy array). Below is the Python snippet from the official docs:
data = np.random.rand(5, 10) # 5 entities, each contains 10 features
label = np.random.randint(2, size=5) # binary target
dtrain = xgb.DMatrix(data, label=label)
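For completeness, a minimal sketch of the matching C++ side, assuming the model was saved as 'sigdet.model' as in the question (numFeatures and the dummy feature values are placeholders). One detail worth checking: XGBoosterPredict's option_mask of 0 returns transformed predictions, like Python's predict(), whereas the 1 used in the question requests raw margins, which alone can make the scores look different.
#include <iostream>
#include <vector>
#include <xgboost/c_api.h>

int main() {
  // Load the model saved from Python.
  BoosterHandle booster;
  XGBoosterCreate(nullptr, 0, &booster);
  XGBoosterLoadModel(booster, "sigdet.model");

  // One dense row; numFeatures and the values are placeholders and must
  // match the training data layout.
  const int numFeatures = 4;
  std::vector<float> features(numFeatures, 0.5f);
  DMatrixHandle h_test;
  XGDMatrixCreateFromMat(features.data(), 1, numFeatures, -999.9f, &h_test);

  bst_ulong out_len = 0;
  const float* scores = nullptr;
  // option_mask = 0: transformed predictions (what Python's predict() gives);
  // 1 would return the raw margin instead.
  XGBoosterPredict(booster, h_test, 0, 0, &out_len, &scores);
  for (bst_ulong i = 0; i < out_len; ++i)
    std::cout << "score[" << i << "]=" << scores[i] << std::endl;

  XGDMatrixFree(h_test);
  XGBoosterFree(booster);
  return 0;
}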
Source: https://stackoverflow.com/questions/39335051/xgboost-load-model-in-c-python-c-prediction-scores-mismatch