random-forest

What does a negative %IncMSE in the randomForest package mean?

无人久伴 submitted on 2019-12-04 08:36:36
Question: I used randomForest for a regression problem. I used importance(rf, type=1) to get the %IncMSE for the variables, and one of them has a negative %IncMSE. Does this mean that this variable is bad for the model? I searched the Internet for answers but didn't find a clear one. I also found something strange in the model's summary (attached below): it seems that only one tree was used, although I set ntree to 800. Model: rf <- randomForest(var1~va2+var3+..+var35, data=d7depo, ntree=800
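A minimal sketch of the analogous idea in Python/scikit-learn (a swapped-in stack, not the R randomForest call from the question): %IncMSE is a permutation-based importance, and permutation importances can legitimately come out negative for a variable that carries no signal, which usually means "no better than noise" rather than "actively harmful".

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: only 2 of the 5 features are informative.
X, y = make_regression(n_samples=500, n_features=5, n_informative=2, random_state=0)
rf = RandomForestRegressor(n_estimators=800, random_state=0).fit(X, y)

# Shuffling an uninformative column can, by chance, slightly *reduce* the error,
# which shows up as a small negative importance.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # small negative values are expected for noise features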

Extract and Visualize Model Trees from Sparklyr

邮差的信 submitted on 2019-12-04 07:44:29
Does anyone have any advice on how to convert the tree information from sparklyr's ml_decision_tree_classifier, ml_gbt_classifier, or ml_random_forest_classifier models into (a) a format that can be understood by other R tree-related libraries and, ultimately, (b) a visualization of the trees for non-technical consumption? This would include the ability to map the substituted string-index values produced by the vector assembler back to the actual feature names. The following code is copied liberally from a sparklyr blog post for the purpose of providing an example:
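A minimal sketch of the same goal in Python/scikit-learn rather than sparklyr (a swapped-in stack, named as such): turning a fitted tree into a human-readable structure with the original feature names attached, which is the step the question is after.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# export_text maps node split indices back to feature names, so the printed
# rules read "petal width (cm) <= 0.8" instead of "feature_3 <= 0.8".
print(export_text(tree, feature_names=list(iris.feature_names)))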

Why is the connection terminating?

自古美人都是妖i submitted on 2019-12-04 06:06:17
I'm trying to fit a random forest classification model with the H2O library from R on a training set with 70 million rows and 25 numeric features. The training file is 5.6 GB and the validation file is 1 GB. I have 16 GB of RAM and an 8-core CPU on my system. Both files are read into H2O objects successfully. Then I run the command below to build the model: model <- h2o.randomForest(x = c(1:18,20:25), y = 19, training_frame = traindata, validation_frame = testdata, ntrees = 150, mtries = 6) But after a few minutes (without generating any tree), I get the following error: "Error
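With a 5.6 GB training file and 16 GB of RAM, the first thing worth checking is usually the memory given to the H2O cluster. A minimal sketch using the H2O Python client (an assumption; the question uses the R client, and the file paths and column name below are hypothetical):

import h2o
from h2o.estimators import H2ORandomForestEstimator

# Give the H2O JVM an explicit memory budget instead of relying on the default.
h2o.init(max_mem_size="12G")

train = h2o.import_file("train.csv")   # hypothetical paths
valid = h2o.import_file("valid.csv")

predictors = [c for c in train.columns if c != "target"]
model = H2ORandomForestEstimator(ntrees=150, mtries=6)
model.train(x=predictors, y="target", training_frame=train, validation_frame=valid)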

How can the scikit-learn random forest sub-sample size be equal to the original training data size?

我们两清 submitted on 2019-12-03 15:03:38
The documentation of the scikit-learn random forest classifier states that "The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default)." What I don't understand is: if the sample size is always the same as the input sample size, then how can we talk about a random selection? There is no selection here, because we use all the (and naturally the same) samples at each training. Am I missing something here? Lol4t0: I believe this part of the docs answers your question: In random forests (see RandomForestClassifier
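A minimal sketch (my own illustration, not from the question or the answer) of why a bootstrap sample of size n drawn with replacement is still a random selection: each tree sees duplicates of some rows and misses others entirely.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
bootstrap_idx = rng.integers(0, n, size=n)        # n draws *with replacement*

unique_fraction = np.unique(bootstrap_idx).size / n
print(unique_fraction)  # ~0.632, so roughly 37% of rows are left out of bag for this tree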

Incremental training of random forest model using python sklearn

家住魔仙堡 submitted on 2019-12-03 14:45:33
I am using the code below to save a random forest model; I use cPickle to save the trained model. As I see new data, can I train the model incrementally? Currently, the training set has about 2 years of data. Is there a way to train on another 2 years and (kind of) append it to the existing saved model? rf = RandomForestRegressor(n_estimators=100) print ("Trying to fit the Random Forest model --> ") if os.path.exists('rf.pkl'): print ("Trained model already pickled -- >") with open('rf.pkl', 'rb') as f: rf = cPickle.load(f) else: df_x_train = x_train[col_feature] rf.fit(df_x_train,y_train)
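RandomForestRegressor has no partial_fit, but one common workaround is to grow extra trees on the new data with warm_start. A minimal sketch, assuming hypothetical arrays X_old/y_old and X_new/y_new; note the new trees see only the new data, so this is an approximation rather than true incremental learning.

import pickle
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, warm_start=True)
rf.fit(X_old, y_old)             # first two years (hypothetical arrays)

rf.n_estimators += 100           # request 100 additional trees
rf.fit(X_new, y_new)             # the extra trees are grown on the new data only

with open('rf.pkl', 'wb') as f:  # standard-library pickle; cPickle is Python 2 only
    pickle.dump(rf, f)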

How to use classwt in randomForest in R?

坚强是说给别人听的谎言 submitted on 2019-12-03 14:43:01
Question: I have a highly imbalanced data set with target class instances in the ratio 60000:1000:1000:50 (i.e. four classes in total). I want to use randomForest to predict the target class. To reduce the class imbalance, I played with the sampsize parameter, setting it to c(5000, 1000, 1000, 50) and some other values, but it was not of much use. In fact, the accuracy of the first class decreased while I played with sampsize, though the improvement in the other class
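A minimal sketch of the analogous knob in Python/scikit-learn (a swapped-in stack, not the R classwt argument itself): per-class weights that make errors on the rare classes more costly. The labels, weights, and training arrays below are hypothetical, loosely mirroring the 60000:1000:1000:50 ratio.

from sklearn.ensemble import RandomForestClassifier

# Let sklearn derive weights inversely proportional to class frequency...
rf = RandomForestClassifier(n_estimators=500, class_weight="balanced")

# ...or spell the weights out per class label.
rf = RandomForestClassifier(
    n_estimators=500,
    class_weight={0: 1, 1: 60, 2: 60, 3: 1200},
)
rf.fit(X_train, y_train)  # hypothetical training data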

How does sklearn's random forest index feature_importances_?

微笑、不失礼 submitted on 2019-12-03 13:48:18
I have used the RandomForestClassifier in sklearn to determine the important features in my dataset. How can I return the actual feature names (my variables are labeled x1, x2, x3, etc.) rather than their indices (it tells me the important features are '12', '22', etc.)? Below is the code that I am currently using to return the important features. important_features = [] for x,i in enumerate(rf.feature_importances_): if i>np.average(rf.feature_importances_): important_features.append(str(x)) print important_features Additionally, in an effort to understand the indexing, I was
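A minimal sketch (the column names and the fitted rf are assumptions, not taken from the question) of mapping feature_importances_ back to names: the i-th importance corresponds to the i-th column of the training matrix, so keep the names alongside and index into them.

import numpy as np

feature_names = ["x1", "x2", "x3"]      # must be in the same order as the columns of X
importances = rf.feature_importances_    # rf: an already fitted RandomForestClassifier

order = np.argsort(importances)[::-1]    # most important first
for idx in order:
    print(feature_names[idx], importances[idx])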

R: using ranger with caret, tuneGrid argument

℡╲_俬逩灬. submitted on 2019-12-03 13:33:36
Question: I'm using the caret package to analyse random forest models built using ranger. I can't figure out how to call the train function with the tuneGrid argument to tune the model parameters. I think I'm calling the tuneGrid argument incorrectly, but I can't figure out why. Any help would be appreciated. data(iris) library(ranger) model_ranger <- ranger(Species ~ ., data = iris, num.trees = 500, mtry = 4, importance = 'impurity') library(caret) # my tuneGrid object: tgrid <- expand.grid( num
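A minimal sketch of the same tuning idea in Python/scikit-learn rather than caret+ranger (a swapped-in stack, named as such): a cross-validated grid over max_features, the scikit-learn counterpart of ranger's mtry.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=0),
    param_grid={"max_features": [1, 2, 3, 4], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)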

R: unclear behaviour of tuneRF function (randomForest package)

大兔子大兔子 submitted on 2019-12-03 13:06:09
Question: I'm unsure about the meaning of the stepFactor parameter of the tuneRF function, which is used for tuning the mtry parameter that is then passed to the randomForest function. The documentation of tuneRF says that stepFactor is the magnitude by which the chosen mtry gets deflated or inflated. Obviously, since mtry is the number of variables chosen at random, it has to be an integer; however, I have seen many examples on the net using stepFactor=1.5. At first I thought that R by default uses the next mtry
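A minimal sketch (my own illustration in Python/scikit-learn, not the tuneRF source) of how a non-integer step factor can still yield integer mtry values: multiply the candidate by the factor, round it, and score each candidate by out-of-bag error.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

mtry = 4.0          # starting value, roughly sqrt(p)
step_factor = 1.5
for _ in range(3):
    m = max(1, int(round(mtry)))      # rounding keeps the candidate an integer
    rf = RandomForestClassifier(
        n_estimators=200, max_features=m, oob_score=True, random_state=0
    ).fit(X, y)
    print(m, 1 - rf.oob_score_)       # OOB error for this candidate mtry
    mtry *= step_factor               # inflate by the step factor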

How to extract feature importances from an Sklearn pipeline

為{幸葍}努か submitted on 2019-12-03 12:59:24
I've built a pipeline in Scikit-Learn with two steps: one to construct features, and the second is a RandomForestClassifier. While I can save that pipeline and inspect the various steps and the parameters set in them, I'd like to be able to examine the feature importances from the resulting model. Is that possible? Ah, yes it is. You just identify the step whose estimator you want to check. For instance: pipeline.steps[1] Which returns: ('predictor', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None,
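A minimal sketch (the step names and the stand-in feature step are assumptions, not taken from the question) of pulling the importances out of a fitted Pipeline: index the final step, then read its feature_importances_ attribute.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("features", StandardScaler()),                        # stand-in feature-construction step
    ("predictor", RandomForestClassifier(random_state=0)),
])
pipeline.fit(X, y)

# named_steps (or pipeline.steps[1][1]) gives the fitted final estimator.
importances = pipeline.named_steps["predictor"].feature_importances_
print(importances)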