xgboost

How to install Python XGBoost package in virtualenv on mac

Submitted by 孤街浪徒 on 2020-01-02 05:38:07

Question: I have been trying to install the Python 2.7 XGBoost package on my Mac. I am running a framework build of Python installed via brew and am trying to install into a virtualenv. I have tried the following methods: the manual build described here: https://github.com/dmlc/xgboost/blob/master/doc/build.md#python-package-installation This results in the following error: error: Error: setup script specifies an absolute path: /Users/username/git/xgboost/python-package/xgboost/../../lib/libxgboost.so setup() arguments must *always* be /

How to get the params from a saved XGBoost model

Submitted by 北战南征 on 2020-01-01 20:52:10

Question: I'm trying to train an XGBoost model using the params below: xgb_params = { 'objective': 'binary:logistic', 'eval_metric': 'auc', 'lambda': 0.8, 'alpha': 0.4, 'max_depth': 10, 'max_delta_step': 1, 'verbose': True } Since my input data is too big to be fully loaded into memory, I use incremental training: xgb_clf = xgb.train(xgb_params, input_data, num_boost_round=rounds_per_batch, xgb_model=model_path) The code for prediction is xgb_clf = xgb.XGBClassifier() booster = xgb.Booster()
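The excerpt above is cut off; as a general note, the parameters of a saved booster can usually be recovered from its stored configuration. A minimal sketch, assuming xgboost >= 1.0 and a model written with save_model() (the file name below is hypothetical, not from the question):

```python
import json
import xgboost as xgb

# Load a previously saved booster and dump its full training configuration.
booster = xgb.Booster()
booster.load_model("saved_model.json")   # hypothetical path

config = json.loads(booster.save_config())
# The exact layout varies slightly between versions, but the objective and
# related settings live under learner_train_param, and tree parameters such
# as max_depth under gradient_booster.
print(config["learner"]["learner_train_param"])
print(config["learner"]["gradient_booster"])
```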

Create RMSLE metric in caret in R

Submitted by 混江龙づ霸主 on 2020-01-01 19:58:09

Question: Could someone please help me with the following: I need to change my xgboost training model with the caret package to use a non-default metric, RMSLE. By default, caret and xgboost train and measure with RMSE. Here are the lines of code: # create custom summary function in caret format custom_summary = function(data, lev = NULL, model = NULL){ out = rmsle(data[, "obs"], data[, "pred"]) names(out) = c("rmsle") out } # create control object control = trainControl(method = "cv", number = 2, summaryFunction =
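The caret snippet above is truncated. For comparison only, here is a hedged Python sketch of the same idea, a custom RMSLE evaluation metric, using xgboost's native API rather than caret; the rmsle_eval name and the commented training call are illustrative and not from the question:

```python
import numpy as np
import xgboost as xgb

def rmsle_eval(preds, dtrain):
    """Custom RMSLE metric in the shape expected by xgb.train's feval argument."""
    labels = dtrain.get_label()
    preds = np.clip(preds, 0, None)   # log1p needs non-negative predictions
    return "rmsle", float(np.sqrt(np.mean((np.log1p(preds) - np.log1p(labels)) ** 2)))

# Illustrative usage (dtrain/dvalid would be xgb.DMatrix objects built from your data):
# bst = xgb.train(params, dtrain, num_boost_round=100,
#                 evals=[(dvalid, "valid")], feval=rmsle_eval)
```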

multiclass classification in xgboost (python)

Submitted by 给你一囗甜甜゛ on 2020-01-01 09:48:29

Question: I can't figure out how to pass the number of classes or the eval metric to xgb.XGBClassifier with the objective function 'multi:softmax'. I looked at a lot of documentation, but it only talks about the sklearn wrapper, which accepts n_class/num_class. My current setup looks like kf = cross_validation.KFold(y_data.shape[0], \ n_folds=10, shuffle=True, random_state=30) err = [] # to hold cross val errors # xgb instance xgb_model = xgb.XGBClassifier(n_estimators=_params['n_estimators'], \ max_depth=params[
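A hedged sketch of one way to do this with the sklearn wrapper (not the asker's code): the wrapper infers the number of classes from y, so num_class does not have to be passed explicitly, and the evaluation metric can be set on the estimator. The iris data is purely illustrative; note that xgboost versions before 1.6 take eval_metric in fit() rather than in the constructor.

```python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30)

# num_class is inferred from the three labels in y; multi:softmax only needs
# to be named as the objective.
clf = xgb.XGBClassifier(objective="multi:softmax", n_estimators=100,
                        max_depth=4, eval_metric="mlogloss")
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
print(clf.score(X_test, y_test))
```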

Notes on avoiding pitfalls when running XGBoost on Spark

Submitted by こ雲淡風輕ζ on 2020-01-01 02:47:29

Pitfalls: Spark XGBoost is very sensitive to missing values in a Spark DataFrame; if the DataFrame contains nulls (null, "NaN"), XGBoost throws an error. Also, after Spark 2.4.4's VectorAssembler transforms the DataFrame, rows with many zeros are converted to sparse vectors by default, which likewise makes XGBoost fail. Sample code: val schema = new StructType(Array( StructField("BIZ_DATE", StringType, true), StructField("SKU", StringType, true), StructField("WINDGUST", DoubleType, true), StructField("WINDSPEED", DoubleType, true))) val predictDF = spark.read.schema(schema) .format("csv") .option("header", "true") .option("delimiter", ",") .load("/mnt/parquet/smaller.csv") import scala.collection.mutable.ArrayBuffer val featureColsBuffer=ArrayBuffer
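A hedged PySpark analogue of the Scala snippet above (not the author's code), showing the usual way to keep nulls away from XGBoost: fill missing feature values before assembling, and be explicit about how the assembler handles invalid rows. The feature column subset is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/parquet/smaller.csv"))           # same path as the snippet above

feature_cols = ["WINDGUST", "WINDSPEED"]          # illustrative subset of the columns
df = df.na.fill(0.0, subset=feature_cols)         # no nulls ever reach XGBoost

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features",
                            handleInvalid="keep") # explicit since Spark 2.4
assembled = assembler.transform(df)
```

For the sparse-vector pitfall, a commonly reported workaround is to densify the assembled vector or to configure XGBoost's missing value as 0, so that zeros dropped by the sparse representation are not misread as missing.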

[Learning from Each Other] A roundup of score-boosting tips from fellow contestants: improvement really is this simple!

Submitted by 风流意气都作罢 on 2020-01-01 00:38:50

[Learning from Each Other] A roundup of score-boosting tips from fellow contestants: improvement really is this simple! The big event we mentioned last week has finally arrived: the voting campaign for contest experience write-ups is now live. Many contestants who improved dramatically have shared their tips for raising their scores, and the editors have picked the following 10 submissions from the comments for the vote. They are full of practical advice; remember to vote for the contributors after reading.

1 First of all, thanks to Tencent for hosting the Tencent Social Ads College Algorithm Competition, which gave us a chance to exchange ideas with machine-learning and data-mining enthusiasts from universities across the country. Here I will briefly describe my experience and what I gained over the dozen or so days I spent in the competition. In early May I learned through various channels that Tencent was going to hold an algorithm competition starting on May 10. Because I was busy with my advisor's paper at the start of May, and carelessly failed to notice that the competition data had long been available for download, I did not obtain the data until May 9. I took a first look at it that same day. Since the prizes were generous (the top 300 submissions received a commemorative T-shirt), on May 10 I submitted an all-zero result to test the waters and, at the same time, to verify the positive/negative sample ratio of the test set. Such a result was obviously poor (it ranked 251st), which incidentally set up the conditions for me to later qualify as the fastest-improving contestant. Because I had previously taken part in a Kaggle competition in which xgboost worked well, I decided to use xgboost as my classifier for the time being. Over the following days I concentrated on analyzing the data. Here are a few points I took away from that analysis: 1. The data in this competition is fairly complex, and the relationships among the various IDs need careful attention

Grid Search and Early Stopping Using Cross Validation with XGBoost in SciKit-Learn

Submitted by 牧云@^-^@ on 2019-12-31 21:43:10

Question: I am fairly new to scikit-learn and have been trying to hyper-parameter tune XGBoost. My aim is to use grid search to tune the model parameters and early stopping to control the number of trees and avoid overfitting. Since I am using cross-validation for the grid search, I was hoping to also use cross-validation in the early-stopping criteria. The code I have so far looks like this: import numpy as np import pandas as pd from sklearn import model_selection import xgboost
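A hedged sketch of one common way to combine the two (not the asker's code): grid search with cross-validation for the hyper-parameters, while early stopping watches a single held-out validation split that GridSearchCV forwards to every fold's fit() call. The synthetic data and parameter grid are illustrative; xgboost >= 1.6 accepts early_stopping_rounds in the constructor, older versions take it in fit().

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(n_estimators=500, eval_metric="logloss",
                          early_stopping_rounds=20)
param_grid = {"max_depth": [3, 5], "learning_rate": [0.05, 0.1]}

search = GridSearchCV(model, param_grid, cv=3)
# eval_set is passed through to each fold's fit(), so every fold stops early
# against the same validation split rather than a per-fold one.
search.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print(search.best_params_)
```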

How can I implement incremental training for xgboost?

Submitted by 安稳与你 on 2019-12-29 02:52:19

Question: The problem is that my training data cannot fit into RAM because of its size. So I need a method that first builds one tree on the whole training set, computes residuals, builds another tree, and so on (as gradient boosted trees do). Obviously, if I call model = xgb.train(param, batch_dtrain, 2) in a loop it will not help, because in that case it just rebuilds the whole model for each batch. Answer 1: Disclaimer: I'm new to xgboost as well, but I think I figured this out. Try saving your
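The answer above is cut off after "Try saving your"; a hedged sketch of the approach it leads into, which relies on xgb.train's xgb_model argument to continue boosting from an existing model. The synthetic batches are illustrative stand-ins, not from the post.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
params = {"objective": "reg:squarederror", "max_depth": 4}

booster = None
for _ in range(5):                                 # five stand-in data batches
    X = rng.normal(size=(1000, 10))
    y = X[:, 0] + rng.normal(scale=0.1, size=1000)
    dtrain = xgb.DMatrix(X, label=y)
    # xgb_model=None on the first pass trains from scratch; afterwards the
    # existing trees are kept and new boosting rounds are added on top.
    booster = xgb.train(params, dtrain, num_boost_round=10, xgb_model=booster)

booster.save_model("incremental_model.json")       # illustrative file name
```

Note that continuing this way only adds new trees fitted to each batch; the trees already built are kept unchanged.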

XGBoost regression code and LightGBM parameter notes

Submitted by 允我心安 on 2019-12-27 18:49:32

Understanding xgboost: I found a diagram online (credit to the original author; it will be removed on request): https://pic3.zhimg.com/v2-07783eb41e619927e8911b85442b9e38_r.jpg Training a regression model with xgboost is straightforward once the xgboost library has been installed as described in the earlier post. The xgboost parameters are explained in the following code: params={ 'booster':'gbtree', 'objective': 'multi:softmax', # multi-class classification 'num_class':10, # number of classes, used together with multi:softmax 'gamma':0.1, # controls post-pruning; the larger the value, the more conservative the model, usually around 0.1 or 0.2 'max_depth':12, # depth of the trees; deeper trees overfit more easily 'lambda':2, # L2 regularization on the weights, controlling model complexity; the larger the value, the less prone the model is to overfitting 'subsample':0.7, # random sampling of training rows 'colsample_bytree':0.7, # column sampling when building each tree 'min_child_weight':3, # default is 1: the minimum sum of instance weights (the hessian h) required in a leaf; for imbalanced 0-1 classification, # if h is around 0.01, min_child_weight = 1 means a leaf must contain at least 100 samples.
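To complement the parameter notes above, a minimal regression sketch in the same native-API style; the synthetic data is purely illustrative, and reg:squarederror replaces the multi-class objective shown in the blog's parameter dictionary.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=500)

params = {"booster": "gbtree", "objective": "reg:squarederror",
          "max_depth": 6, "eta": 0.1,
          "subsample": 0.7, "colsample_bytree": 0.7}

dtrain = xgb.DMatrix(X[:400], label=y[:400])
dvalid = xgb.DMatrix(X[400:], label=y[400:])

bst = xgb.train(params, dtrain, num_boost_round=200,
                evals=[(dvalid, "valid")], early_stopping_rounds=20)
print(bst.best_iteration, bst.best_score)
```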