random-forest | 易学教程

Plot trees for a Random Forest in Python with Scikit-Learn

Question: I want to plot a decision tree from a random forest, so I wrote the following code:

    clf = RandomForestClassifier(n_estimators=100)
    import pydotplus
    import six
    from sklearn import tree
    dotfile = six.StringIO()
    i_tree = 0
    for tree_in_forest in clf.estimators_:
        if (i_tree < 1):
            tree.export_graphviz(tree_in_forest, out_file=dotfile)
            pydotplus.graph_from_dot_data(dotfile.getvalue()).write_png('dtree' + str(i_tree) + '.png')
            i_tree = i_tree + 1

But it doesn't generate anything. Do you have any idea how
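A minimal sketch of exporting one tree of a forest to Graphviz DOT text. The excerpt above does not show a call to fit, and `estimators_` only exists on a fitted forest, so the sketch fits first; the iris data is only a stand-in (an assumption, not the asker's data). Rendering the DOT text to PNG is left as a comment because it needs the Graphviz binaries installed.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Stand-in data (assumption): any labelled dataset works here.
X, y = load_iris(return_X_y=True)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)  # estimators_ is only populated after fitting

# Take the first tree of the forest and export it as DOT text.
first_tree = clf.estimators_[0]
dot_source = export_graphviz(first_tree, out_file=None)  # out_file=None returns a string

# To render (requires Graphviz), one option is:
#   pydotplus.graph_from_dot_data(dot_source).write_png("dtree0.png")
print(dot_source[:60])
```

Indexing `clf.estimators_[0]` replaces the counter-and-loop pattern when only one tree is wanted.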

R is there a way to find Inf/-Inf values?

I'm trying to run randomForest on a large-ish data set (5000x300). Unfortunately, I'm getting the following error message:

    > RF <- randomForest(prePrior1, postPrior1[,6]
    + ,,do.trace=TRUE,importance=TRUE,ntree=100,,forest=TRUE)
    Error in randomForest.default(prePrior1, postPrior1[, 6], , do.trace = TRUE, :
      NA/NaN/Inf in foreign function call (arg 1)

So I tried to find any NAs using:

    > df2 <- prePrior1[is.na(prePrior1)]
    > df2
    character(0)
    > df2 <- postPrior1[is.na(postPrior1[,6])]
    > df2
    numeric(0)

which leads me to believe that Infs are the problem, as there don't seem to be any NAs.
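The idea of scanning a table for non-finite entries can be sketched as follows. In R the built-ins `is.finite()` and `is.infinite()` do this check directly; the Python below only illustrates the same scan, and the helper name `non_finite_cells` and the sample data are hypothetical. Note also that the `character(0)` result above may hint that the data contains character columns, which would be worth checking separately.

```python
import math

def non_finite_cells(rows):
    """Return (row, col, value) for every NaN, Inf or -Inf entry in a table."""
    bad = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            # Only numeric cells can be non-finite; strings are skipped.
            if isinstance(value, (int, float)) and not math.isfinite(value):
                bad.append((i, j, value))
    return bad

# Hypothetical sample table with one Inf, one -Inf and one NaN.
data = [
    [1.0, 2.0, float("inf")],
    [0.5, float("-inf"), 3.0],
    [float("nan"), 1.0, 1.0],
]
bad = non_finite_cells(data)
positions = [(i, j) for i, j, _ in bad]
print(positions)
```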

Combining random forests built with different training sets in R

Question: I am new to R (day 2) and have been tasked with building a forest of random forests. Each individual random forest will be built using a different training set, and we will combine all the forests at the end to make predictions. I am implementing this in R and am having some difficulty combining two forests that were not built using the same set. My attempt is as follows:

    d1 = read.csv("../data/rr/train/10/chunk0.csv",header=TRUE)
    d2 = read.csv("../data/rr/train/10/chunk1.csv",header=TRUE)
    rf1 =
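One way to think about combining forests trained on different chunks is to pool their votes at prediction time, weighting each forest by its number of trees. The sketch below shows only that arithmetic for a single sample; it is not the randomForest package's API (which offers its own `combine()` function), and the name `combine_votes` and the vote fractions are made up for illustration.

```python
def combine_votes(forest_outputs):
    """Pool per-class vote fractions from several forests for one sample.

    forest_outputs: list of (n_trees, {class_label: vote_fraction}) pairs.
    Each forest is weighted by its tree count, which is equivalent to
    counting the votes of all trees together.
    """
    total_trees = sum(n_trees for n_trees, _ in forest_outputs)
    combined = {}
    for n_trees, fractions in forest_outputs:
        for label, fraction in fractions.items():
            combined[label] = combined.get(label, 0.0) + fraction * n_trees / total_trees
    return combined

# Hypothetical outputs of two forests (100 and 50 trees) for one sample.
votes = combine_votes([
    (100, {"a": 0.7, "b": 0.3}),
    (50, {"a": 0.4, "b": 0.6}),
])
predicted = max(votes, key=votes.get)
print(votes, predicted)
```

This only makes sense when every forest was trained on the same predictors and the same set of class labels.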

Python RandomForest - Unknown label Error

I have trouble using the RandomForest fit function. This is my training set:

          P1  Tp1  IrrPOA    Gz  Drz2
    0    0.0  7.7     0.0  -1.4  -0.3
    1    0.0  7.7     0.0  -1.4  -0.3
    2    ...  ...     ...   ...   ...
    3   49.4  7.5     0.0  -1.4  -0.3
    4   47.4  7.5     0.0  -1.4  -0.3
    ... (10k rows)

I want to predict P1 from all the other variables using sklearn.ensemble RandomForest:

    colsRes = ['P1']
    X_train = train.drop(colsRes, axis = 1)
    Y_train = pd.DataFrame(train[colsRes])
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X_train, Y_train)

Here is the error I get:

    ValueError: Unknown label type: array([[ 0. ], [ 0. ], [ 0. ], ..., [ 49.4], [ 47.4],

I
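The target P1 in the excerpt looks continuous (0.0, 49.4, 47.4), and a classifier rejects continuous targets as labels. Assuming the goal is to predict a numeric value, a sketch with RandomForestRegressor and a 1-D target follows; the synthetic arrays stand in for the asker's DataFrame and are an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in (assumption): 4 predictors and a continuous target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = 10.0 * X_train[:, 0] + rng.normal(size=200)  # 1-D array, not a 2-D frame

# A regressor accepts continuous targets; a classifier expects discrete labels.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

preds = rf.predict(X_train[:5])
print(preds)
```

With pandas, passing the column as a Series (for example `train['P1']`) rather than a one-column DataFrame gives the 1-D target used here.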

Using the predict_proba() function of RandomForestClassifier in the safe and right way

I'm using scikit-learn to apply machine learning algorithms to my datasets. Sometimes I need the probabilities of labels/classes instead of the labels/classes themselves. Instead of having Spam/Not Spam as labels of emails, I would like to have, for example, a 0.78 probability that a given email is Spam. For this purpose, I'm using predict_proba() with RandomForestClassifier as follows:

    clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=1, random_state=0)
    scores = cross_val_score(clf, X, y)
    print(scores.mean())
    classifier = clf.fit(X,y)
    predictions = classifier
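A hedged sketch of reading one class's probability out of predict_proba(). The columns follow the order of `clf.classes_`, so the column is looked up by label rather than assumed. The synthetic data and the choice of label 1 as "Spam" are assumptions; `min_samples_split` is set to 2 because current scikit-learn releases reject an integer value of 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (assumption); label 1 plays the role of "Spam".
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X[:5])          # shape (n_samples, n_classes)
spam_col = list(clf.classes_).index(1)    # column that corresponds to label 1
p_spam = proba[:, spam_col]
print(p_spam)
```

Probabilities computed on the training rows, as here, are optimistic; held-out data gives a fairer picture.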

R Random Forests Variable Importance

Question: I am trying to use the random forests package for classification in R. The variable importance measures listed are:

    mean raw importance score of variable x for class 0
    mean raw importance score of variable x for class 1
    MeanDecreaseAccuracy
    MeanDecreaseGini

Now I know what these "mean", as in I know their definitions. What I want to know is how to use them. What I really want to know is what these values mean in only the context of how accurate they are, what is a good value, what is a bad

R: Plot trees from h2o.randomForest() and h2o.gbm()

Looking for an efficient way to plot trees in RStudio, H2O's Flow, or a local HTML page from h2o's RF and GBM models, similar to the one in the image in the link below. Specifically, how do you plot trees for the objects (fitted models) rf1 and gbm2 produced by the code below, perhaps by parsing h2o.download_pojo(rf1) or h2o.download_pojo(gbm1)?

    # # The following two commands remove any previously installed H2O packages for R.
    # if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
    # if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
    # # Next, we download

What does the parameter 'classwt' in RandomForest function in RandomForest package in R stand for?

The help page for randomForest::randomForest() says: "classwt - Priors of the classes. Need not add up to one. Ignored for regression." Could setting the classwt parameter help when you have heavily unbalanced data, i.e. when the priors of the classes differ strongly? How should I set classwt when training a model on a dataset with 3 classes with a vector of priors equal to (p1,p2,p3), when the priors in the test set are (q1,q2,q3)?

"Could setting the classwt parameter help when you have heavily unbalanced data, where the priors of the classes differ strongly?"

Yes, setting values of classwt could be useful for unbalanced datasets. And
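As an illustration of what class weights for unbalanced data can look like, the sketch below computes inverse-frequency weights from the training labels. This is one common heuristic, not the randomForest package's definition of classwt, and the helper name and label counts are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    This is a heuristic (assumption), not the classwt semantics.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical 3-class training labels with strong imbalance (80 / 15 / 5).
labels = ["a"] * 80 + ["b"] * 15 + ["c"] * 5
weights = inverse_frequency_weights(labels)
print(weights)
```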

Recursive feature elimination on Random Forest using scikit-learn

I'm trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created during the recursive process. However, when I try to use the RFECV method, I get an error saying:

    AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'

Random forests don't have coefficients per se, but they do have rankings by Gini score. So I'm wondering how to get around this problem. Please note that I want to use a method that will explicitly tell me what features from my pandas DataFrame were selected in the
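A sketch under stated assumptions: recent scikit-learn releases let RFECV rank features by `feature_importances_` when the estimator has no `coef_`, so a random forest can be passed directly. Note that this scores subsets with cross-validated ROC AUC, not the OOB ROC the question asks for, and the synthetic data and feature names `f0`..`f7` are made up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in data (assumption) with hypothetical column names.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    step=1,              # drop one feature per round
    cv=3,
    scoring="roc_auc",   # cross-validated AUC, swapped in for OOB ROC
)
selector.fit(X, y)

# support_ is a boolean mask over the input columns.
selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
print(selected)
```

With a pandas DataFrame `df` of predictors, `df.columns[selector.support_]` would list the kept columns in the same way.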

Proximity Matrix in sklearn.ensemble.RandomForestClassifier

I'm trying to perform clustering in Python using Random Forests. In the R implementation of Random Forests, there is a flag you can set to get the proximity matrix. I can't seem to find anything similar in the Python scikit-learn version of Random Forest. Does anyone know if there is an equivalent calculation for the Python version?

We don't implement proximity matrix in Scikit-Learn (yet). However, this could be done by relying on the apply function provided in our implementation of decision trees. That is, for all pairs of samples in your dataset, iterate over the decision trees in the forest
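The pairwise idea in the answer can be sketched as follows, assuming proximity is defined as the fraction of trees in which two samples land in the same leaf. The input is a table of leaf indices, which a fitted scikit-learn forest returns from `forest.apply(X)` with shape (n_samples, n_estimators); the helper name and the small leaf table are hypothetical.

```python
def proximity_matrix(leaf_indices):
    """Fraction of trees in which each pair of samples shares a leaf.

    leaf_indices[i][t] is the leaf reached by sample i in tree t,
    e.g. forest.apply(X).tolist() for a fitted scikit-learn forest.
    """
    n_samples = len(leaf_indices)
    n_trees = len(leaf_indices[0])
    prox = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(i, n_samples):
            shared = sum(1 for a, b in zip(leaf_indices[i], leaf_indices[j]) if a == b)
            prox[i][j] = prox[j][i] = shared / n_trees  # matrix is symmetric
    return prox

# Hypothetical leaf indices for 3 samples in a 3-tree forest.
leaves = [
    [1, 4, 2],
    [1, 4, 3],
    [2, 5, 3],
]
prox = proximity_matrix(leaves)
print(prox)
```

The double loop is quadratic in the number of samples, so for large datasets a vectorised or sparse formulation would be preferable.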