random-forest

How to extract feature importances from an Sklearn pipeline

Posted by 喜你入骨 on 2019-12-04 20:40:10
Question: I've built a pipeline in Scikit-Learn with two steps: the first constructs features, and the second is a RandomForestClassifier. While I can save that pipeline and inspect its steps and the parameters set on them, I'd like to be able to examine the feature importances of the resulting model. Is that possible? Answer 1: Ah, yes it is. You just identify the step holding the estimator you want to check. For instance: pipeline.steps[1] Which returns: ('predictor', RandomForestClassifier
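Since the answer is cut off, here is a minimal self-contained sketch of the idea, assuming the second step is named 'predictor' as in the output above (the 'features' step below is a stand-in):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipeline = Pipeline([
    ("features", StandardScaler()),                  # stand-in feature step
    ("predictor", RandomForestClassifier(random_state=0)),
])
pipeline.fit(X, y)

# pipeline.steps[1] is the ('predictor', RandomForestClassifier(...)) pair;
# named_steps looks the estimator up by name instead of position.
rf = pipeline.named_steps["predictor"]
print(rf.feature_importances_)                       # one value per input feature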

ROC for random forest

Posted by 随声附和 on 2019-12-04 19:35:48
I understand that the ROC is drawn between tpr and fpr, but I am having difficulty determining which parameters I should vary to get different tpr/fpr pairs. Soren Havelund Welling: I wrote this answer on a similar question. Basically, you can increase the weighting of certain classes, and/or downsample other classes, and/or change the vote-aggregating rule. [EDITED 13:15 CEST, 1 July 2015] @ "the two classes are very balanced – Suryavansh": in that case, since your data is balanced, you should mainly go with option 3 (changing the aggregation rule). In randomForest this can be accessed with the cutoff parameter
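The thread is about R's randomForest, but the same idea can be sketched in scikit-learn terms: each cutoff applied to the predicted class-1 probability yields one (fpr, tpr) pair. A minimal sketch, assuming a binary problem:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]                 # P(class 1) per test row

for cutoff in np.arange(0.1, 1.0, 0.1):              # each cutoff -> one ROC point
    pred = (proba >= cutoff).astype(int)
    tpr = ((pred == 1) & (y_te == 1)).sum() / (y_te == 1).sum()
    fpr = ((pred == 1) & (y_te == 0)).sum() / (y_te == 0).sum()
    print(f"cutoff={cutoff:.1f}  tpr={tpr:.2f}  fpr={fpr:.2f}")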

Multi-output regression in xgboost

Posted by 回眸只為那壹抹淺笑 on 2019-12-04 17:56:33
Question: Is it possible to train a model in Xgboost that has multiple continuous outputs (multi-output regression)? What would be the objective for training such a model? Thanks in advance for any suggestions. Answer 1: My suggestion is to use sklearn.multioutput.MultiOutputRegressor as a wrapper around xgb.XGBRegressor. MultiOutputRegressor trains one regressor per target and only requires that the regressor implement fit and predict, which xgboost happens to support. # get some noised linear data X = np.random
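A minimal self-contained sketch of the suggested wrapper (the noised linear data below is invented to stand in for the truncated snippet):

import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ rng.rand(3, 2) + 0.05 * rng.randn(200, 2)    # two continuous targets

model = MultiOutputRegressor(XGBRegressor(n_estimators=50))
model.fit(X, y)                 # internally fits one XGBRegressor per column of y
print(model.predict(X[:3]))     # shape (3, 2): one prediction per target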

Stratified sampling doesn't seem to change randomForest results

Posted by 让人想犯罪 __ on 2019-12-04 15:09:23
Question: I am using the randomForest package in R to build several species distribution models. My response variable is binary (0 - absence or 1 - presence) and pretty unbalanced - for some species the ratio of absences:presences is 37:1. This imbalance (or zero-inflation) leads to questionable out-of-bag error estimates - the larger the ratio of absences to presences, the lower my out-of-bag (OOB) error estimate. To compensate for this imbalance, I wanted to implement stratified sampling such that each
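The question concerns R's randomForest (its strata and sampsize arguments); as a hedged scikit-learn analogue of compensating per tree, class_weight='balanced_subsample' reweights each bootstrap sample:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Roughly 37:1 absences:presences, mimicking the question.
X, y = make_classification(n_samples=3800, weights=[0.974], random_state=0)
rf = RandomForestClassifier(class_weight="balanced_subsample",
                            oob_score=True, random_state=0)
rf.fit(X, y)
# Plain OOB accuracy is still dominated by the majority class, which is
# the questionable-error-estimate effect the question describes.
print(rf.oob_score_)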

My r-squared score is coming out negative but my accuracy score using k-fold cross-validation is about 92%

Posted by 江枫思渺然 on 2019-12-04 13:32:29
Question: For the code below, my r-squared score comes out negative, but my accuracy score using k-fold cross-validation comes out to about 92%. How is this possible? I'm using the random forest regression algorithm to predict some data. The dataset is available at the link below: https://www.kaggle.com/ludobenistant/hr-analytics import numpy as np import pandas as pd from sklearn.preprocessing import LabelEncoder,OneHotEncoder dataset = pd.read_csv("HR_comma_sep.csv") x = dataset.iloc[:,
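A sketch of why the two numbers can diverge, using synthetic data in place of the HR CSV: cross-validated r-squared can go negative (worse than predicting the mean) while "accuracy" on a heavily imbalanced 0/1 target stays high:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(300, 4)                              # pure noise features
y = (rng.rand(300) > 0.92).astype(float)          # ~8% ones, like a rare label

model = RandomForestRegressor(random_state=0)
print(cross_val_score(model, X, y, cv=10, scoring="r2").mean())   # often < 0

# "Accuracy" after rounding the regression output to 0/1 is a different
# metric, and the majority class alone pushes it toward ~92%.
pred = cross_val_predict(model, X, y, cv=10)
print((np.round(pred) == y).mean())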

How to use RandomForest in Spark Pipeline

Posted by 对着背影说爱祢 on 2019-12-04 13:16:26
Question: I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model must be put in a pipeline; the official pipeline demo uses LogisticRegression as the base model, which can be instantiated as an object. However, the RandomForest model cannot be instantiated by client code, so it seems impossible to use RandomForest in the pipeline API. I don't want to reinvent the wheel, so can anybody give some advice? Thanks. Answer 1: "However, the RandomForest model cannot be new by
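For reference, the newer spark.ml API exposes RandomForestClassifier as an ordinary Estimator that drops straight into a Pipeline with CrossValidator; a sketch (the 'features' and 'label' column names are assumptions):

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rf-grid-search").getOrCreate()

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[rf])                  # feature stages would go first

grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 100])
        .addGrid(rf.maxDepth, [5, 10])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)
# model = cv.fit(train_df)   # train_df: a DataFrame with the assumed columns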

Issues with tuneGrid parameter in random forest

Posted by 喜你入骨 on 2019-12-04 12:28:42
Question: I've been dealing with some extremely imbalanced data, and I would like to use stratified sampling to create more balanced random forests. Right now I'm using the caret package, mainly for tuning the random forests. So I tried to set up a tuneGrid to pass the mtry and sampsize parameters into caret's train method, as follows: mtryGrid <- data.frame(.mtry = 100, .sampsize = 80) rfTune <- train(x = trainX, y = trainY, method = "rf", trControl = ctrl, metric = "Kappa", ntree = 1000, tuneGrid =
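As a hedged scikit-learn analogue of this setup (mtry roughly maps to max_features, sampsize to max_samples), both knobs can go into one search grid:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.9], random_state=0)
grid = {"max_features": [4, 8, 12],     # caret's mtry, roughly
        "max_samples": [80, 160]}       # per-tree sample size, like sampsize
search = GridSearchCV(RandomForestClassifier(n_estimators=200, random_state=0),
                      grid, scoring="balanced_accuracy", cv=3)
search.fit(X, y)
print(search.best_params_)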

ROC curve for classification from randomForest

Posted by 痞子三分冷 on 2019-12-04 11:51:47
Question: I am using the randomForest package on the R platform for a classification task. rf_object <- randomForest(data_matrix, label_factor, cutoff = c(k, 1-k)) where k ranges from 0.1 to 0.9. pred <- predict(rf_object, test_data_matrix) I have the output from the random forest classifier and I compared it with the labels. So I have performance measures like accuracy, MCC, sensitivity, specificity, etc. for 9 cutoff points. Now I want to plot the ROC curve and obtain the area under the ROC curve to see how
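A hedged sketch in scikit-learn terms: rather than nine discrete cutoff points, feed the continuous predicted probabilities to roc_curve and integrate:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
scores = rf.predict_proba(X_te)[:, 1]    # continuous scores, not hard labels

fpr, tpr, thresholds = roc_curve(y_te, scores)
print(auc(fpr, tpr))                     # area under the ROC curve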

matplotlib: Plot Feature Importance with feature names

Posted by 夙愿已清 on 2019-12-04 11:04:25
Question: In R there are pre-built functions to plot the feature importance of a random forest model, but in Python such a method seems to be missing, so I searched for one in matplotlib. model.feature_importances_ gives me the following: array([ 2.32421835e-03, 7.21472336e-04, 2.70491223e-03, 3.34521084e-03, 4.19443238e-03, 1.50108737e-03, 3.29160540e-03, 4.82320256e-01, 3.14117333e-03]) Then using the following plotting call: >>> pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_) >>>
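A minimal sketch of pairing names with importances and plotting a labeled horizontal bar chart (the feature names are hypothetical placeholders):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

names = np.array([f"feature_{i}" for i in range(9)])   # hypothetical names
X, y = make_classification(n_samples=200, n_features=9, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

order = np.argsort(model.feature_importances_)         # sort for readability
plt.barh(names[order], model.feature_importances_[order])
plt.xlabel("importance")
plt.tight_layout()
plt.show()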

How to handle categorical features for Decision Tree, Random Forest in spark ml?

Posted by 流过昼夜 on 2019-12-04 10:38:34
I am trying to build decision tree and random forest classifiers on the UCI bank marketing data -> https://archive.ics.uci.edu/ml/datasets/bank+marketing . There are many categorical features (with string values) in the data set. The Spark ML documentation mentions that categorical variables can be converted to numeric by indexing with either StringIndexer or VectorIndexer. I chose StringIndexer (VectorIndexer requires a vector feature, and VectorAssembler, which converts features to a vector, accepts only numeric types). Using this approach, each level of a
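A sketch of the StringIndexer route on this data set ('job', 'marital', and 'age' are real columns in the UCI file; the numeric 'label' column is an assumption, since the raw string target 'y' would itself need indexing):

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bank-marketing").getOrCreate()

# One StringIndexer per categorical column.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
            for c in ["job", "marital"]]
assembler = VectorAssembler(inputCols=["job_idx", "marital_idx", "age"],
                            outputCol="features")
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=indexers + [assembler, dt])
# model = pipeline.fit(df)   # df: DataFrame read from the UCI CSV, with the
#                            # string target 'y' already indexed into 'label'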