xgboost

Do I need to normalize (or scale) the data for Random Forest (DRF) or Gradient Boosting Machine (GBM) in H2O, or in general? [closed]

蓝咒 submitted on 2019-12-11 05:39:30
Question: Closed. This question needs details or clarity. It is not currently accepting answers. Want to improve this question? Add details and clarify the problem by editing this post. Closed last year. I am creating classification and regression models using Random Forest (DRF) and GBM in H2O.ai. I believe I don't need to normalize (or scale) the data, since it is unnecessary for tree-based models and may even be harmful, as it might smooth out the nonlinear nature of the model. Could you please confirm if my …
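
For intuition, here is a minimal sketch (using scikit-learn rather than H2O, purely for illustration) of why scaling is unnecessary for tree-based models: a GBM fit on raw features and one fit on standardized features produce essentially identical predictions, because trees split on thresholds and are invariant to monotonic transformations of each feature.

    # A sketch (scikit-learn, not H2O) showing that a GBM is insensitive to scaling:
    # standardizing the features leaves the fitted trees, and hence the predictions,
    # essentially unchanged, because splits are thresholds on each feature.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_scaled = StandardScaler().fit_transform(X)

    raw = GradientBoostingClassifier(random_state=0).fit(X, y)
    scaled = GradientBoostingClassifier(random_state=0).fit(X_scaled, y)

    # The predicted probabilities should agree up to tiny numerical differences.
    print(np.allclose(raw.predict_proba(X), scaled.predict_proba(X_scaled), atol=1e-6))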

How can I tell if H2O 3.11.0.266 is running with GPUs?

半世苍凉 submitted on 2019-12-11 05:37:59
Question: I've installed H2O 3.11.0.266 on an Ubuntu 16.04 machine with CUDA 8.0 and libcudnn.so.5.1.10, so I believe H2O should be able to find my GPUs. However, when I run h2o.init() in Python, I see no evidence that it is actually using my GPUs. I see: H2O cluster total cores: 8, H2O cluster allowed cores: 8, which is the same as I had in the previous (pre-GPU) version. Also, http://127.0.0.1:54321/flow/index.html shows only 8 cores as well. I wonder if I don't have something properly installed or …
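
H2O itself does not report GPU usage in the excerpt above, so one hedged, indirect check (not an H2O API, just an assumption that nvidia-smi is available on the machine) is to watch GPU utilization while a model is training; if it stays at 0%, the installation is almost certainly running on CPU only.

    # Not an H2O API: an indirect check that assumes nvidia-smi is on the PATH.
    # Run this while an H2O model is training; if GPU utilization stays at 0%,
    # the build is almost certainly CPU-only.
    import subprocess
    import time

    for _ in range(5):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        print(out.stdout.strip())
        time.sleep(2)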

How to get prediction p-values of an XGBClassifier?

痴心易碎 submitted on 2019-12-11 05:37:36
Question: I'd like to know how confident an XGBClassifier is in each prediction it makes. Is it possible to get such a value? Or is predict_proba already, indirectly, the confidence of the model? Answer 1: Your intuition is indeed correct: predict_proba returns the probability of each example being of a given class; from the docs: "predict_proba(data, output_margin=False, ntree_limit=0): Predict the probability of each data example being of a given class." This probability in turn is routinely …
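
As a short illustration of the answer, the sketch below (using the scikit-learn wrapper and the Iris data, both chosen only for demonstration) shows that predict_proba returns one column per class, with each row summing to 1, and can be read as the model's confidence in each class.

    # Iris and the scikit-learn wrapper are used here only for illustration.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = XGBClassifier(n_estimators=50).fit(X_train, y_train)
    proba = clf.predict_proba(X_test)   # shape (n_samples, n_classes)
    print(proba[:3])                    # per-class probabilities for 3 examples
    print(proba[:3].sum(axis=1))        # each row sums to 1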

feature_names mismatch in xgboost despite having the same columns

与世无争的帅哥 submitted on 2019-12-11 04:26:54
Question: I have a training set (X) and a test set (test_data_process) with the same columns in the same order, as indicated below. But when I run predictions = my_model.predict(test_data_process), it gives the following error: ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34'] ['MSSubClass', …
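
A minimal, self-contained sketch of the usual fix follows; the column names are illustrative, loosely based on the question. The error above suggests the model saw generic names ('f0', 'f1', ...) on one side and real column names on the other, so the fix is to make sure both sides are DataFrames with exactly the same columns in exactly the same order before calling predict.

    # Column names below are illustrative, not the questioner's real data.
    import numpy as np
    import pandas as pd
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    cols = ["MSSubClass", "LotArea", "YearBuilt"]
    X = pd.DataFrame(rng.normal(size=(100, 3)), columns=cols)
    y = rng.normal(size=100)
    model = XGBRegressor(n_estimators=20).fit(X, y)

    # A test frame with the same columns in a different order (or a bare NumPy
    # array, whose features become 'f0', 'f1', ...) can trigger the
    # feature_names mismatch. Realigning to the training columns avoids it.
    test = pd.DataFrame(rng.normal(size=(5, 3)),
                        columns=["LotArea", "YearBuilt", "MSSubClass"])
    test = test[X.columns]              # same names, same order as training
    print(model.predict(test))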

XGBRegressor: changing random_state has no effect

走远了吗. submitted on 2019-12-11 02:35:17
Question: xgboost.XGBRegressor seems to produce the same results even when a new random seed is given. According to the xgboost documentation for xgboost.XGBRegressor: seed : int. Random number seed. (Deprecated, please use random_state.) random_state : int. Random number seed. (Replaces seed.) So random_state is the one to be used; however, no matter what random_state or seed I use, the model produces the same results. A bug? from xgboost import XGBRegressor; from sklearn.datasets import load_boston …
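
A small sketch of the likely explanation (an assumption based on how XGBoost uses randomness): with the default settings the booster has no stochastic component, so random_state changes nothing; once row or column subsampling is enabled, different seeds give different models.

    # With the defaults there is nothing random to seed, so random_state has no
    # visible effect; enabling row subsampling makes the seed matter.
    import numpy as np
    from sklearn.datasets import make_regression
    from xgboost import XGBRegressor

    X, y = make_regression(n_samples=200, n_features=10, random_state=0)

    a = XGBRegressor(random_state=1).fit(X, y).predict(X)
    b = XGBRegressor(random_state=2).fit(X, y).predict(X)
    print(np.allclose(a, b))    # True: training is deterministic by default

    c = XGBRegressor(subsample=0.5, random_state=1).fit(X, y).predict(X)
    d = XGBRegressor(subsample=0.5, random_state=2).fit(X, y).predict(X)
    print(np.allclose(c, d))    # typically False: the seed now changes the model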

peculiar installation warning causing packages to malfunction

寵の児 submitted on 2019-12-10 18:27:57
Question: I want to install the xgboost package in R as per the instructions: install.packages("drat", repos="https://cran.rstudio.com"); drat:::addRepo("dmlc"); install.packages("xgboost", repos="http://dmlc.ml/drat/", type = "source"). The installation of the first two packages seems to work fine: install.packages("drat", repos="https://cran.rstudio.com") prints the usual curl download-progress table (% Total, % Received, % Xferd, Average Speed, Time, etc.) …

XGBoost-4j by DMLC on Spark-1.6.1

▼魔方 西西 submitted on 2019-12-10 15:56:32
Question: I am trying to use the XGBoost implementation by DMLC on Spark 1.6.1. I am able to train on my data with XGBoost, but I am facing difficulties with prediction. I actually want to do prediction the way it can be done with the Apache Spark MLlib libraries, which helps with the calculation of training error, precision, recall, specificity, etc. I am posting the code below, along with the error I am getting. I used xgboost4j-spark-0.5-jar-with-dependencies.jar in spark-shell to start. import org.apache.spark.mllib …
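
For reference, the sketch below computes the quantities the question mentions (training error, precision, recall, specificity) with scikit-learn rather than Spark MLlib; it is not the XGBoost4j-Spark code the question is about, just an illustration of how the metrics fall out of a confusion matrix, with made-up labels and predictions.

    # Illustrative labels/predictions only; not the Spark/XGBoost4j code itself.
    import numpy as np
    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
    y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("error      :", (fp + fn) / len(y_true))
    print("precision  :", precision_score(y_true, y_pred))
    print("recall     :", recall_score(y_true, y_pred))
    print("specificity:", tn / (tn + fp))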

Does oversampling happen before or after cross-validation using imblearn pipelines?

荒凉一梦 submitted on 2019-12-10 15:42:26
Question: I have split my data into train/test sets before doing cross-validation on the training data to validate my hyperparameters. I have an imbalanced dataset and want to perform SMOTE oversampling on each iteration, so I have set up a pipeline using imblearn. My understanding is that oversampling should be done after dividing the data into k folds, to prevent information leakage. Is this order of operations (data split into k folds, k-1 folds oversampled, predict on the remaining fold) preserved when …
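
A hedged sketch of the setup the question describes: when SMOTE sits inside an imblearn Pipeline, its fit_resample step runs only when the pipeline is fitted, i.e. only on the k-1 training folds inside cross_val_score, while the held-out fold is scored on its original, un-resampled distribution.

    # SMOTE sits inside the pipeline, so cross_val_score resamples only the
    # training folds; the held-out fold keeps its original class balance.
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    pipe = Pipeline([
        ("smote", SMOTE(random_state=0)),
        ("clf", XGBClassifier(n_estimators=100)),
    ])

    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    print(scores.mean())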

inputs for nDCG in sklearn

本秂侑毒 submitted on 2019-12-10 15:01:30
Question: I'm unable to understand the input format of sklearn's nDCG score: http://sklearn.apachecn.org/en/0.19.0/modules/generated/sklearn.metrics.ndcg_score.html My problem is the following: I have multiple queries, and for each of them the ranking probabilities have been calculated successfully. Now I need to calculate nDCG on the test set, for which I would like to use sklearn's nDCG. The example given at the link: >>> y_true = [1, 0, 2] >>> y_score = [[0.15, 0.55, 0.2], [0.7, 0.2, 0.1] …
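
The sketch below assumes a modern scikit-learn (0.22 or later), whose ndcg_score signature differs from the 0.19 docs linked above: it takes 2-D arrays with one row per query, y_true holding the graded relevance of each candidate document and y_score holding the model's scores for the same documents; the values here are illustrative.

    # Assumes scikit-learn >= 0.22; values are illustrative.
    import numpy as np
    from sklearn.metrics import ndcg_score

    # Two queries, three candidate documents each: y_true holds graded relevance,
    # y_score holds the model's ranking scores for the same documents.
    y_true = np.array([[1, 0, 2],
                       [2, 1, 0]])
    y_score = np.array([[0.15, 0.55, 0.30],
                        [0.70, 0.20, 0.10]])

    print(ndcg_score(y_true, y_score))        # mean nDCG over the two queries
    print(ndcg_score(y_true, y_score, k=2))   # nDCG@2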

gcc via homebrew has no --without-multilib option

断了今生、忘了曾经 submitted on 2019-12-10 14:21:21
Question: I want to install xgboost for Python 3.5. This requires a gcc that supports the -fopenmp option, and the default gcc does not support it. So I am using brew install gcc --without-multilib, but I get: Warning: gcc: this formula has no '--without-multilib' option so it will be ignored! Any ideas? Answer 1: The option no longer exists; it was removed in August 2017. Many older third-party xgboost instructions are outdated. Just do brew install gcc without any options and be amazed that everything still works. Source: https://stackoverflow.com