XGBoost script is not outputting binary properly

偶尔善良 submitted on 2019-12-13 02:57:32

Question


I'm learning to use xgboost, and I have read through the documentation. However, I don't understand why the output of my script falls between 0 and 2. At first I thought it should be either 0 or 1, since this is a binary classification. Then I read that it comes out as a probability, but some outputs are above 1.5 (at least in the CSV), which doesn't make sense to me.

I'm unsure whether the problem is in the xgboost parameters or in the CSV creation. In particular, I'm not sure the call np.expm1(preds) should be np.expm1, but I don't know what to change it to.

In conclusion, my question is:

Why is the output not 0 or 1, but instead values like 0.0xxx and 1.xxx?

Here is my script:

import numpy as np
import xgboost as xgb
import pandas as pd

train = pd.read_csv('../dataset/train.csv')
train = train.drop('ID', axis=1)

y = train['TARGET']

train = train.drop('TARGET', axis=1)
x = train

dtrain = xgb.DMatrix(x.values, label=y.tolist())  # .values replaces as_matrix(), removed in pandas 1.0

test = pd.read_csv('../dataset/test.csv')

test = test.drop('ID', axis=1)
dtest = xgb.DMatrix(test.values)  # .values replaces as_matrix(), removed in pandas 1.0


# XGBoost params:
def get_params():
    params = {}
    params["objective"] = "binary:logistic"
    params["booster"] = "gbtree"
    params["eval_metric"] = "auc"
    params["eta"] = 0.3
    params["subsample"] = 0.50
    params["colsample_bytree"] = 1.0
    params["max_depth"] = 20
    params["nthread"] = 4
    plst = list(params.items())
    return plst


bst = xgb.train(get_params(), dtrain, 1000)

preds = bst.predict(dtest)

print(np.max(preds))
print(np.min(preds))
print(np.average(preds))

# Make Submission
test_aux = pd.read_csv('../dataset/test.csv')
result = pd.DataFrame({"Id": test_aux["ID"], 'TARGET': np.expm1(preds)})

result.to_csv("xgboost_submission.csv", index=False)

Answer 1:


When you run an xgb model with objective binary:logistic, predict returns one probability per sample: the estimated chance that the sample belongs to class 1. With a multi-class objective such as multi:softprob, you instead get an array of probabilities for each sample, one for each class i.

Let's say you have 3 classes [A, B, C]. An output like [0.2, 0.6, 0.2] for a sample indicates that it most probably belongs to class B.

If you want just the most probable class in the multi-class case, take the index of the maximum element of the probability array, for example with the numpy function argmax. In the binary case, compare the probability against a threshold such as 0.5.

You can find more information in the XGBoost parameter documentation.
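As for the values above 1 in the CSV: np.expm1(p) computes exp(p) - 1, so applying it to probabilities in [0, 1] stretches them onto [0, e - 1], roughly up to 1.718. A minimal sketch of that effect and of turning probabilities into 0/1 labels follows. The preds array is made-up data standing in for the output of bst.predict(dtest), and the 0.5 cutoff is an assumption, not something the question specifies.

```python
import numpy as np

# Made-up probabilities standing in for bst.predict(dtest) output.
preds = np.array([0.02, 0.35, 0.50, 0.93, 1.00])

# expm1(p) = exp(p) - 1 maps [0, 1] onto [0, e - 1], so values can exceed 1.
distorted = np.expm1(preds)

# Hard 0/1 labels using an assumed cutoff of 0.5.
labels = (preds > 0.5).astype(int)
```

If the submission expects probabilities, writing preds to the CSV directly (without np.expm1) keeps every value in [0, 1]. If it expects class labels, write labels instead.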




Answer 2:


You just need to use the scikit-learn style wrapper:

from xgboost import XGBClassifier

Call predict and the output will be 0 or 1. If you call predict_proba, the output will be the probabilities of each class.

Sorry for my english.



Source: https://stackoverflow.com/questions/35826948/xgboost-script-is-not-outputing-binary-properly
