Question
I'm learning to use xgboost, and I have read through the documentation. However, I don't understand why the output of my script falls between 0 and 2.
At first, I thought the output should be either 0 or 1, since it's a binary classification. Then I read that it comes as a probability of class 0 or 1. However, some outputs are 1.5 or higher (at least in the CSV), which doesn't make sense to me.
I'm unsure whether the problem is in the xgboost parameters or in the CSV creation. In particular, I'm not sure that the call np.expm1(preds) should be np.expm1, but I don't know what to change it to.
In conclusion, my question is: why is the output not 0 or 1, but instead values like 0.0xxx and 1.xxx?
Here is my script:
import numpy as np
import xgboost as xgb
import pandas as pd
train = pd.read_csv('../dataset/train.csv')
train = train.drop('ID', axis=1)
y = train['TARGET']
train = train.drop('TARGET', axis=1)
x = train
dtrain = xgb.DMatrix(x.values, label=y.tolist())
test = pd.read_csv('../dataset/test.csv')
test = test.drop('ID', axis=1)
dtest = xgb.DMatrix(test.values)
# XGBoost params:
def get_params():
    params = {}
    params["objective"] = "binary:logistic"
    params["booster"] = "gbtree"
    params["eval_metric"] = "auc"
    params["eta"] = 0.3
    params["subsample"] = 0.50
    params["colsample_bytree"] = 1.0
    params["max_depth"] = 20
    params["nthread"] = 4
    plst = list(params.items())
    return plst
bst = xgb.train(get_params(), dtrain, 1000)
preds = bst.predict(dtest)
print(np.max(preds))
print(np.min(preds))
print(np.average(preds))
# Make Submission
test_aux = pd.read_csv('../dataset/test.csv')
result = pd.DataFrame({"Id": test_aux["ID"], 'TARGET': np.expm1(preds)})
result.to_csv("xgboost_submission.csv", index=False)
Answer 1:
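When you run an xgb model with the objective binary:logistic, predict returns one probability per sample: the probability that the sample belongs to class 1. With a multi-class objective such as multi:softprob, you instead get an array of probabilities for each sample, where entry i is the probability that the sample belongs to class i.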
Let's say you have 3 classes [A, B, C]. An output for a sample like [0.2, 0.6, 0.2] indicates that this sample most probably belongs to class B.
If you just want the most probable class, take the index of the maximum element of that probability array, for example with numpy's argmax function.
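A minimal sketch of that step, using made-up probabilities rather than real model output. It assumes that a binary:logistic prediction is a single probability of class 1 per sample, so a cutoff (0.5 here, chosen arbitrarily) turns it into a label, while a per-class probability array is reduced with np.argmax:

```python
import numpy as np

# Hypothetical binary:logistic output: one probability of class 1 per sample.
binary_preds = np.array([0.03, 0.91, 0.48, 0.50])
# Threshold at 0.5 (an assumed cutoff; tune it for your problem).
binary_labels = (binary_preds > 0.5).astype(int)

# Hypothetical multi-class output: one row per sample,
# one column per class [A, B, C].
multi_preds = np.array([[0.2, 0.6, 0.2],
                        [0.7, 0.1, 0.2]])
# Index of the largest probability in each row.
multi_labels = np.argmax(multi_preds, axis=1)

print(binary_labels)
print(multi_labels)
```

Here index 1 in multi_labels corresponds to class B and index 0 to class A.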
You can find more information in the xgb package's parameter documentation.
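As for the values above 1 in the CSV: np.expm1(x) computes e^x - 1. It is the inverse of np.log1p and is normally applied only when the target was log-transformed before training, which the script in the question does not do. A small sketch with made-up probabilities suggests what it does to values that are already in [0, 1]:

```python
import numpy as np

# Hypothetical probabilities, as binary:logistic would return them.
preds = np.array([0.0, 0.05, 0.5, 0.95, 1.0])

# expm1(x) = exp(x) - 1, so the interval [0, 1] is stretched to [0, e - 1].
transformed = np.expm1(preds)

print(preds.max(), transformed.max())
```

Writing preds to the CSV directly, without np.expm1, keeps the values in [0, 1].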
Answer 2:
You just need to do this:
from xgboost import XGBClassifier
Call predict and the output will be 0 or 1; if you call predict_proba, the output will be the class probabilities.
Sorry for my English.
Source: https://stackoverflow.com/questions/35826948/xgboost-script-is-not-outputing-binary-properly