xgboost

XGBoost predictor in R predicts the same value for all rows [duplicate]

喜夏-厌秋, submitted on 2019-12-24 09:47:44
Question: This question already has answers here: xgboost predict method returns the same predicted value for all rows (5 answers). Closed last year. I looked into the post on the same issue in Python, but I want a solution in R. I'm working on the Titanic dataset from Kaggle, and it looks like this:

'data.frame': 891 obs. of 13 variables:
 $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : num 0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Age        :

An approach to tuning XGBoost parameters

一笑奈何, submitted on 2019-12-24 09:10:36
Contents: Advantages of XGBoost · Parameter tuning · General parameters · Booster parameters · Learning-task parameters · Overall procedure

Advantages of XGBoost

1. Regularization. Standard GBM (Gradient Boosting Machine) implementations have no regularization step like XGBoost's, so overfitting is often harder to control; XGBoost is well known for its "regularized boosting" technique.

2. Parallel processing. XGBoost supports parallel processing and is dramatically faster than GBM. Note: boosting itself is still sequential; the parallelism comes from preprocessing the data once and storing it in blocks, which avoids repeating the preprocessing on every call.

3. Strong compatibility. It can work directly with low-level numpy and scipy data, and in particular can train directly on sparse matrices for some large datasets.

4. Built-in cross-validation. XGBoost allows cross-validation at every boosting iteration, so the optimal number of boosting rounds can be obtained conveniently (a minimal sketch follows this list). Grid search with GBM has the major drawback that it only searches within the range the user supplies.

5. Flexibility. (1) Users can define custom optimization objectives and evaluation metrics, which opens up a whole new dimension of model usage without restricting what users can do. (2) It can handle missing values automatically, avoiding tedious preprocessing: the user supplies a value different from all other samples and passes it in as a parameter, and it is treated as the missing value. XGBoost applies different handling of missing values at different nodes and learns how to handle missing values it encounters in the future.
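As a concrete illustration of point 4, here is a minimal sketch (in Python, on a synthetic dataset, with parameter values chosen only for illustration) of using XGBoost's built-in cross-validation to pick the number of boosting rounds:

```python
import numpy as np
import xgboost as xgb

# Synthetic binary-classification data, used only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 4}  # illustrative values

# xgb.cv runs k-fold cross-validation at every boosting round and can stop early,
# so the number of rows in the result gives a data-driven choice of num_boost_round.
cv_results = xgb.cv(params, dtrain, num_boost_round=500, nfold=5,
                    metrics="logloss", early_stopping_rounds=20, seed=0)
print("best number of rounds:", len(cv_results))
```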

XGBoost can't find sklearn

萝らか妹, submitted on 2019-12-24 07:39:47
Question: I'm experimenting with XGBoost and am blocked by an error I can't figure out. I have sklearn installed in the active environment and can verify it by training a sklearn RandomForestClassifier in the same notebook. When I try to train an XGBoost model I get the error XGBoostError: sklearn needs to be installed in order to use this module.

This works: clf = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)

This throws the exception: clf = xgb.XGBClassifier(max_depth=3, n
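This error usually means xgboost's scikit-learn wrapper cannot import sklearn inside the interpreter the notebook kernel is actually running, even if another environment has it installed. A minimal diagnostic sketch (nothing here is specific to the asker's setup) is to check which interpreter and which package versions the kernel sees:

```python
import sys
print(sys.executable)        # the interpreter the notebook kernel is running

import sklearn, xgboost
print(sklearn.__version__)   # if this import fails here, xgboost's sklearn wrapper fails too
print(xgboost.__version__)
```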

Jupyter notebook xgboost import

痞子三分冷, submitted on 2019-12-23 07:52:08
Question: I have the problem below (I'm on a Mac). I can import xgboost from python2.7 or python3.6 in my Terminal, but I cannot import it in my Jupyter notebook:

import xgboost as xgb
ModuleNotFoundError   Traceback (most recent call last)
----> 1 import xgboost as xgb
ModuleNotFoundError: No module named 'xgboost'

Even though I run !pip3 install xgboost, it prints: Requirement already satisfied: xgboost in /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6
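A common cause is that the Jupyter kernel runs a different Python interpreter than the pip3 on the shell PATH, so the package is installed into an environment the kernel never looks at. A hedged sketch of one way to check and fix this from inside the notebook:

```python
import sys
print(sys.executable)   # the interpreter the kernel actually uses

# Install into that same interpreter rather than whichever pip3 is first on the PATH.
# In a notebook cell this can be run as:
#   !{sys.executable} -m pip install xgboost
```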

Difference in values between xgb.train and xgb.XGBRegressor in Python for certain cases

 ̄綄美尐妖づ, submitted on 2019-12-23 01:53:26
Question: I noticed that there are two possible implementations of XGBoost in Python, as discussed here and here. When I ran the same dataset through the two implementations, I noticed that the results were different. Code:

import xgboost as xgb
from xgboost.sklearn import XGBRegressor
import xgboost
import pandas as pd
import numpy as np
from sklearn import datasets
boston_data = datasets.load_boston()
df = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
df['target'] =
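A frequent source of such discrepancies is that the two interfaces do not necessarily share the same defaults (learning rate, number of rounds, and so on), so unless every parameter is set explicitly on both sides the boosters being trained differ. A minimal sketch of aligning the two; the specific parameter values are illustrative assumptions, not the defaults of either API:

```python
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBRegressor

# Tiny synthetic regression problem, only to make the comparison runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

params = {"objective": "reg:squarederror", "eta": 0.1, "max_depth": 3}
n_rounds = 50

# Native interface
booster = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=n_rounds)
pred_native = booster.predict(xgb.DMatrix(X))

# Scikit-learn wrapper with the same settings spelled out explicitly
reg = XGBRegressor(objective="reg:squarederror", learning_rate=0.1,
                   max_depth=3, n_estimators=n_rounds)
reg.fit(X, y)
pred_sklearn = reg.predict(X)

print(np.abs(pred_native - pred_sklearn).max())  # expected to be ~0 when all settings match
```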

GPU support for XGBoost and LightGBM

白昼怎懂夜的黑, submitted on 2019-12-23 00:27:17
GPU support for XGBoost and LightGBM. GBDT is a powerhouse in tabular data-mining competitions; its core idea is to iteratively train weak learners (decision trees) to obtain an optimal model, which trains well and is relatively resistant to overfitting. XGBoost and LightGBM are two frameworks that implement the GBDT algorithm. To speed up model training, this article records the process of building XGBoost and LightGBM with GPU support. The build environment is CentOS 7.2.

Installation Guide for XGBoost GPU support. Building XGBoost from source consists of two steps: build the shared library from the C++ code (libxgboost.so for Linux/OSX and xgboost.dll for Windows), then install the language package (e.g. Python).

Building the Shared Library. When building the shared library on CentOS, distributed GPU training is disabled by default and only a single GPU will be used. To enable distributed GPU training, set the option USE_NCCL=ON when building with CMake; distributed GPU training depends on NCCL2, which can be obtained from https://developer.nvidia.com/nccl
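Once a GPU-enabled build is installed, using it from Python is mostly a matter of selecting a GPU tree method. A minimal sketch, assuming a single-GPU build; the dataset and parameter values are illustrative:

```python
import numpy as np
import xgboost as xgb

# Synthetic data so the sketch is self-contained; replace with your own.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",  # requires an XGBoost build with GPU support
    "max_depth": 6,             # illustrative value
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```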

How to enforce monotonic constraints in XGBoost with scikit-learn?

时光总嘲笑我的痴心妄想, submitted on 2019-12-22 10:55:55
Question: I built an XGBoost model using scikit-learn and I am pretty happy with it. As fine-tuning to avoid overfitting, I'd like to ensure monotonicity of some features, but there I start facing some difficulties... As far as I understand, there is no documentation in scikit-learn about xgboost (which I confess really surprises me, given that this situation has lasted for several months). The only documentation I found is directly on http://xgboost.readthedocs.io. On this website, I found
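For reference, the monotone_constraints parameter documented for the native XGBoost interface can also be passed through the scikit-learn wrapper; whether it is accepted directly as a constructor argument depends on the installed xgboost version, so treat the exact spelling below as an assumption to check against your version. A minimal sketch:

```python
import numpy as np
from xgboost import XGBRegressor

# Toy data: the target increases with feature 0 and decreases with feature 1.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.05, size=300)

# One constraint per feature: +1 increasing, -1 decreasing, 0 unconstrained.
model = XGBRegressor(n_estimators=100, max_depth=3,
                     monotone_constraints="(1,-1)")
model.fit(X, y)
```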

xgb.plot.tree layout in R

无人久伴, submitted on 2019-12-22 10:44:39
Question: I was reading an xgb notebook, and the xgb.plot.tree command in the example produces a picture like this. However, when I do the same thing I get a picture like this, which is two separate graphs in different colors too. Is that normal? Are the two graphs two trees?

Answer 1: I have the same issue. According to an issue on the xgboost GitHub repository, this could be due to a change in the DiagrammeR library used by xgboost for rendering trees. https://github.com/dmlc/xgboost/issues/2640 Instead of

How are the gradient and Hessian of logarithmic loss computed in the custom objective function example script in xgboost's github repository?

[亡魂溺海], submitted on 2019-12-22 05:48:18
Question: I would like to understand how the gradient and Hessian of the logloss function are computed in an xgboost sample script. I've simplified the function to take numpy arrays, and generated y_hat and y_true, which are a sample of the values used in the script. Here is a simplified example:

import numpy as np

def loglikelihoodloss(y_hat, y_true):
    prob = 1.0 / (1.0 + np.exp(-y_hat))
    grad = prob - y_true
    hess = prob * (1.0 - prob)
    return grad, hess

y_hat = np.array([1.80087972, -1.82414818, -1
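For context, these two lines follow directly from differentiating the logistic loss with respect to the raw score. Writing p for the sigmoid of y_hat, a short standard derivation (not specific to the script) is:

```latex
% Logistic loss, with p the sigmoid of the raw score \hat{y}
\ell(\hat{y}, y) = -\bigl[\,y \log p + (1-y)\log(1-p)\,\bigr],
\qquad p = \sigma(\hat{y}) = \frac{1}{1 + e^{-\hat{y}}}

% Chain rule, using dp/d\hat{y} = p(1-p):
\frac{\partial \ell}{\partial \hat{y}}
  = -\Bigl(\frac{y}{p} - \frac{1-y}{1-p}\Bigr)\, p(1-p)
  = p - y

% Differentiating once more gives the second derivative (the "Hessian" term):
\frac{\partial^2 \ell}{\partial \hat{y}^2} = p(1-p)
```

which is exactly grad = prob - y_true and hess = prob * (1.0 - prob) in the snippet above.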

Parallel processing with xgboost and caret

青春壹個敷衍的年華, submitted on 2019-12-20 19:34:39
Question: I want to parallelize the model-fitting process for xgboost while using caret. From what I have seen in xgboost's documentation, the nthread parameter controls the number of threads used while fitting the models, in the sense of building the trees in parallel. Caret's train function performs parallelization in a different sense, for example running a process for each iteration of a k-fold CV. Is this understanding correct? If yes, is it better to: Register the number of cores (for