random-forest

What does a negative %IncMSE in the randomForest package mean?

无人久伴 submitted on 2019-12-04 08:36:36
Question: I used randomForest for a regression problem. I used importance(rf, type=1) to get the %IncMSE for the variables, and one of them has a negative %IncMSE. Does this mean that this variable is bad for the model? I searched the Internet for answers but didn't find a clear one. I also found something strange in the model's summary (attached below): it seems that only one tree was used, although I set ntree to 800. Model: rf <- randomForest(var1~va2+var3+..+var35, data=d7depo, ntree=800
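A minimal sketch of the analogous idea in Python/scikit-learn (a swapped-in stack, not the R randomForest call from the question): %IncMSE is a permutation-based importance, and permutation importances can legitimately come out negative for a variable that carries no signal, which usually means "no better than noise" rather than "actively harmful".

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: only 2 of the 5 features are informative.
X, y = make_regression(n_samples=500, n_features=5, n_informative=2, random_state=0)
rf = RandomForestRegressor(n_estimators=800, random_state=0).fit(X, y)

# Shuffling an uninformative column can, by chance, slightly *reduce* the error,
# which shows up as a small negative importance.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # small negative values are expected for noise features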

Extract and Visualize Model Trees from Sparklyr

邮差的信 submitted on 2019-12-04 07:44:29
Does anyone have any advice on how to convert the tree information from sparklyr's ml_decision_tree_classifier, ml_gbt_classifier, or ml_random_forest_classifier models into (a) a format that can be understood by other R tree-related libraries and, ultimately, (b) a visualization of the trees for non-technical consumption? This would include the ability to map the substituted string-index values produced by the vector assembler back to the actual feature names. The following code is copied liberally from a sparklyr blog post for the purpose of providing an example:
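A minimal sketch of the same goal in Python/scikit-learn rather than sparklyr (a swapped-in stack, named as such): turning a fitted tree into a human-readable structure with the original feature names attached, which is the step the question is after.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# export_text maps node split indices back to feature names, so the printed
# rules read "petal width (cm) <= 0.8" instead of "feature_3 <= 0.8".
print(export_text(tree, feature_names=list(iris.feature_names)))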

Why is the connection terminating?

自古美人都是妖i submitted on 2019-12-04 06:06:17
I'm trying to fit a random forest classification model with the H2O library from R on a training set with 70 million rows and 25 numeric features. The training file is 5.6 GB and the validation file is 1 GB. I have 16 GB of RAM and an 8-core CPU on my system. Both files are read into H2O objects successfully. Then I run the command below to build the model: model <- h2o.randomForest(x = c(1:18,20:25), y = 19, training_frame = traindata, validation_frame = testdata, ntrees = 150, mtries = 6) But after a few minutes (without generating any tree), I get the following error: "Error
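With a 5.6 GB training file and 16 GB of RAM, the first thing worth checking is usually the memory given to the H2O cluster. A minimal sketch using the H2O Python client (an assumption; the question uses the R client, and the file paths and column name below are hypothetical):

import h2o
from h2o.estimators import H2ORandomForestEstimator

# Give the H2O JVM an explicit memory budget instead of relying on the default.
h2o.init(max_mem_size="12G")

train = h2o.import_file("train.csv")   # hypothetical paths
valid = h2o.import_file("valid.csv")

predictors = [c for c in train.columns if c != "target"]
model = H2ORandomForestEstimator(ntrees=150, mtries=6)
model.train(x=predictors, y="target", training_frame=train, validation_frame=valid)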

How can the scikit-learn random forest sub-sample size be equal to the original training data size?

我们两清 submitted on 2019-12-03 15:03:38
The documentation of the scikit-learn random forest classifier states that "The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default)." What I don't understand is: if the sample size is always the same as the input sample size, then how can we talk about a random selection? There is no selection here, because we use all the (and naturally the same) samples at each training. Am I missing something here? Lol4t0: I believe this part of the docs answers your question: In random forests (see RandomForestClassifier
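A minimal sketch (my own illustration, not from the question or the answer) of why a bootstrap sample of size n drawn with replacement is still a random selection: each tree sees duplicates of some rows and misses others entirely.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
bootstrap_idx = rng.integers(0, n, size=n)        # n draws *with replacement*

unique_fraction = np.unique(bootstrap_idx).size / n
print(unique_fraction)  # ~0.632, so roughly 37% of rows are left out of bag for this tree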

Incremental training of random forest model using python sklearn

家住魔仙堡 submitted on 2019-12-03 14:45:33
I am using the code below to save a random forest model; I use cPickle to save the trained model. As I see new data, can I train the model incrementally? Currently, the training set has about 2 years of data. Is there a way to train on another 2 years and (kind of) append it to the existing saved model? rf = RandomForestRegressor(n_estimators=100) print ("Trying to fit the Random Forest model --> ") if os.path.exists('rf.pkl'): print ("Trained model already pickled -- >") with open('rf.pkl', 'rb') as f: rf = cPickle.load(f) else: df_x_train = x_train[col_feature] rf.fit(df_x_train,y_train)
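RandomForestRegressor has no partial_fit, but one common workaround is to grow extra trees on the new data with warm_start. A minimal sketch, assuming hypothetical arrays X_old/y_old and X_new/y_new; note the new trees see only the new data, so this is an approximation rather than true incremental learning.

import pickle
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, warm_start=True)
rf.fit(X_old, y_old)             # first two years (hypothetical arrays)

rf.n_estimators += 100           # request 100 additional trees
rf.fit(X_new, y_new)             # the extra trees are grown on the new data only

with open('rf.pkl', 'wb') as f:  # standard-library pickle; cPickle is Python 2 only
    pickle.dump(rf, f)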

How to use classwt in randomForest in R?

坚强是说给别人听的谎言 submitted on 2019-12-03 14:43:01
Question: I have a highly imbalanced data set with target class instances in the ratio 60000:1000:1000:50 (i.e. four classes in total). I want to use randomForest to predict the target class. To reduce the class imbalance, I played with the sampsize parameter, setting it to c(5000, 1000, 1000, 50) and some other values, but it was not of much use. In fact, the accuracy of the first class decreased while I played with sampsize, though the improvement in the other class
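A minimal sketch of the analogous knob in Python/scikit-learn (a swapped-in stack, not the R classwt argument itself): per-class weights that make errors on the rare classes more costly. The labels, weights, and training arrays below are hypothetical, loosely mirroring the 60000:1000:1000:50 ratio.

from sklearn.ensemble import RandomForestClassifier

# Let sklearn derive weights inversely proportional to class frequency...
rf = RandomForestClassifier(n_estimators=500, class_weight="balanced")

# ...or spell the weights out per class label.
rf = RandomForestClassifier(
    n_estimators=500,
    class_weight={0: 1, 1: 60, 2: 60, 3: 1200},
)
rf.fit(X_train, y_train)  # hypothetical training data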

How does sklearn's random forest index feature_importances_?

微笑、不失礼 submitted on 2019-12-03 13:48:18
I have used the RandomForestClassifier in sklearn to determine the important features in my dataset. How can I return the actual feature names (my variables are labeled x1, x2, x3, etc.) rather than their indices (it tells me the important features are '12', '22', etc.)? Below is the code that I am currently using to return the important features. important_features = [] for x,i in enumerate(rf.feature_importances_): if i>np.average(rf.feature_importances_): important_features.append(str(x)) print important_features Additionally, in an effort to understand the indexing, I was
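A minimal sketch (the column names and the fitted rf are assumptions, not taken from the question) of mapping feature_importances_ back to names: the i-th importance corresponds to the i-th column of the training matrix, so keep the names alongside and index into them.

import numpy as np

feature_names = ["x1", "x2", "x3"]      # must be in the same order as the columns of X
importances = rf.feature_importances_    # rf: an already fitted RandomForestClassifier

order = np.argsort(importances)[::-1]    # most important first
for idx in order:
    print(feature_names[idx], importances[idx])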

R: using ranger with caret, tuneGrid argument

℡╲_俬逩灬. submitted on 2019-12-03 13:33:36
Question: I'm using the caret package to analyse random forest models built using ranger. I can't figure out how to call the train function with the tuneGrid argument to tune the model parameters. I think I'm calling the tuneGrid argument incorrectly, but I can't figure out why. Any help would be appreciated. data(iris) library(ranger) model_ranger <- ranger(Species ~ ., data = iris, num.trees = 500, mtry = 4, importance = 'impurity') library(caret) # my tuneGrid object: tgrid <- expand.grid( num
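A minimal sketch of the same tuning idea in Python/scikit-learn rather than caret+ranger (a swapped-in stack, named as such): a cross-validated grid over max_features, the scikit-learn counterpart of ranger's mtry.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=0),
    param_grid={"max_features": [1, 2, 3, 4], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)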

R: unclear behaviour of tuneRF function (randomForest package)

大兔子大兔子 submitted on 2019-12-03 13:06:09
Question: I'm unsure about the meaning of the stepFactor parameter of the tuneRF function, which is used for tuning the mtry parameter that is then passed to the randomForest function. The documentation of tuneRF says that stepFactor is the magnitude by which the chosen mtry gets deflated or inflated. Obviously, since mtry is the number of variables chosen at random, it has to be an integer; however, I have seen many examples on the net using stepFactor=1.5. At first I thought that R by default uses the next mtry
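A minimal sketch (my own illustration in Python/scikit-learn, not the tuneRF source) of how a non-integer step factor can still yield integer mtry values: multiply the candidate by the factor, round it, and score each candidate by out-of-bag error.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

mtry = 4.0          # starting value, roughly sqrt(p)
step_factor = 1.5
for _ in range(3):
    m = max(1, int(round(mtry)))      # rounding keeps the candidate an integer
    rf = RandomForestClassifier(
        n_estimators=200, max_features=m, oob_score=True, random_state=0
    ).fit(X, y)
    print(m, 1 - rf.oob_score_)       # OOB error for this candidate mtry
    mtry *= step_factor               # inflate by the step factor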

How to extract feature importances from an Sklearn pipeline

為{幸葍}努か submitted on 2019-12-03 12:59:24
I've built a pipeline in Scikit-Learn with two steps: one to construct features, and the second is a RandomForestClassifier. While I can save that pipeline and inspect the various steps and the parameters set in them, I'd like to be able to examine the feature importances from the resulting model. Is that possible? Ah, yes it is. You just identify the step whose estimator you want to check. For instance: pipeline.steps[1] Which returns: ('predictor', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None,
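A minimal sketch (the step names and the stand-in feature step are assumptions, not taken from the question) of pulling the importances out of a fitted Pipeline: index the final step, then read its feature_importances_ attribute.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("features", StandardScaler()),                        # stand-in feature-construction step
    ("predictor", RandomForestClassifier(random_state=0)),
])
pipeline.fit(X, y)

# named_steps (or pipeline.steps[1][1]) gives the fitted final estimator.
importances = pipeline.named_steps["predictor"].feature_importances_
print(importances)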