random-forest

Plot trees for a Random Forest in Python with Scikit-Learn

爱⌒轻易说出口 posted on 2019-11-30 12:03:11
Question: I want to plot a decision tree from a random forest, so I wrote the following code:

    clf = RandomForestClassifier(n_estimators=100)
    import pydotplus
    import six
    from sklearn import tree
    dotfile = six.StringIO()
    i_tree = 0
    for tree_in_forest in clf.estimators_:
        if (i_tree <1):
            tree.export_graphviz(tree_in_forest, out_file=dotfile)
            pydotplus.graph_from_dot_data(dotfile.getvalue()).write_png('dtree'+ str(i_tree) +'.png')
            i_tree = i_tree + 1

But it doesn't generate anything. Have you an idea how
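
One thing worth checking in the snippet above is whether the forest was fitted before `clf.estimators_` is read, since that attribute only exists after `fit`. The following is a minimal sketch, not the original poster's solution. It assumes the iris dataset as stand-in training data and passes `out_file=None`, which makes `export_graphviz` return the DOT text as a string instead of appending to a shared buffer. The helper name `forest_to_dot` and the limit of one tree are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Stand-in data (assumption): the original question does not show its data.
X, y = load_iris(return_X_y=True)

# estimators_ is only populated once the forest has been fitted.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)


def forest_to_dot(forest, max_trees=1):
    """Return DOT source strings for the first max_trees trees (hypothetical helper)."""
    dot_sources = []
    for tree_in_forest in forest.estimators_[:max_trees]:
        # out_file=None returns the DOT text, so every tree gets a fresh string.
        dot_sources.append(export_graphviz(tree_in_forest, out_file=None))
    return dot_sources


dot_sources = forest_to_dot(clf, max_trees=1)
print(dot_sources[0].splitlines()[0])
```

Each string can then be handed to `pydotplus.graph_from_dot_data(...).write_png(...)` as in the question, provided Graphviz is installed on the machine.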

R is there a way to find Inf/-Inf values?

拜拜、爱过 posted on 2019-11-30 11:50:00
I'm trying to run a randomForest on a large-ish data set (5000x300). Unfortunately, I'm getting the following error message:

    > RF <- randomForest(prePrior1, postPrior1[,6]
    +                    ,,do.trace=TRUE,importance=TRUE,ntree=100,,forest=TRUE)
    Error in randomForest.default(prePrior1, postPrior1[, 6], , do.trace = TRUE,  :
      NA/NaN/Inf in foreign function call (arg 1)

So I tried to find any NAs using:

    > df2 <- prePrior1[is.na(prePrior1)]
    > df2
    character(0)
    > df2 <- postPrior1[is.na(postPrior1[,6])]
    > df2
    numeric(0)

which leads me to believe that Inf values are the problem, as there don't seem to be any NAs.

Combining random forests built with different training sets in R

跟風遠走 posted on 2019-11-30 11:37:36
Question: I am new to R (day 2) and have been tasked with building a forest of random forests. Each individual random forest will be built using a different training set, and we will combine all the forests at the end to make predictions. I am implementing this in R and am having some difficulty combining two forests that were not built using the same set. My attempt is as follows:

    d1 = read.csv("../data/rr/train/10/chunk0.csv",header=TRUE)
    d2 = read.csv("../data/rr/train/10/chunk1.csv",header=TRUE)
    rf1 =

Python RandomForest - Unknown label Error

北慕城南 posted on 2019-11-30 11:19:37
I have trouble using the RandomForest fit function. This is my training set:

           P1   Tp1  IrrPOA    Gz  Drz2
    0     0.0   7.7     0.0  -1.4  -0.3
    1     0.0   7.7     0.0  -1.4  -0.3
    2     ...   ...     ...   ...   ...
    3    49.4   7.5     0.0  -1.4  -0.3
    4    47.4   7.5     0.0  -1.4  -0.3
    ... (10k rows)

I want to predict P1 from all the other variables using the sklearn.ensemble RandomForest:

    colsRes = ['P1']
    X_train = train.drop(colsRes, axis = 1)
    Y_train = pd.DataFrame(train[colsRes])
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X_train, Y_train)

Here is the error I get:

    ValueError: Unknown label type: array([[ 0. ], [ 0. ], [ 0. ], ..., [ 49.4], [ 47.4],

I
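
scikit-learn classifiers raise `Unknown label type` when the target holds continuous floating-point values, as `P1` does here (0.0, 49.4, 47.4). A minimal sketch of the usual remedy follows, assuming `P1` is meant to be predicted as a number: switch to `RandomForestRegressor` and pass the target as a 1-D array. The synthetic data is a made-up stand-in for the poster's `train` frame.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption) for the four predictor columns
# Tp1, IrrPOA, Gz, Drz2 and the continuous target P1.
X_train = rng.normal(size=(200, 4))
y_train = 10.0 * X_train[:, 0] + rng.normal(scale=0.5, size=200)

# A regressor accepts continuous targets; a classifier expects discrete labels.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)  # y_train is 1-D, shape (n_samples,)

predictions = rf.predict(X_train[:5])
print(predictions)
```

With a pandas frame, the 1-D target would be `train['P1']` rather than `pd.DataFrame(train[colsRes])`.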

Using the predict_proba() function of RandomForestClassifier in the safe and right way

落爺英雄遲暮 posted on 2019-11-30 06:21:13
I'm using Scikit-learn to apply machine learning algorithms to my datasets. Sometimes I need the probabilities of the labels/classes instead of the labels/classes themselves. Instead of having Spam/Not Spam as labels of emails, I wish to have only, for example, a 0.78 probability that a given email is Spam. For this purpose, I'm using predict_proba() with RandomForestClassifier as follows:

    clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=1, random_state=0)
    scores = cross_val_score(clf, X, y)
    print(scores.mean())
    classifier = clf.fit(X,y)
    predictions = classifier
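
A minimal sketch of reading class probabilities from `predict_proba`, assuming a synthetic two-class dataset in place of the poster's `X` and `y`. Each row of the result sums to 1 and the columns follow the order of `clf.classes_`, so looking up the column by label is safer than hard-coding an index. `min_samples_split=2` is used here because current scikit-learn releases reject an integer value of 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in (assumption): label 1 plays the role of "Spam".
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X[:5])  # shape (5, n_classes)

# Find the column that belongs to label 1 instead of assuming its position.
spam_column = list(clf.classes_).index(1)
spam_probability = proba[:, spam_column]
print(spam_probability)
```

Note that probabilities computed on the training rows, as here, tend to look overconfident; held-out data gives a more honest picture.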

R Random Forests Variable Importance

谁都会走 posted on 2019-11-30 06:09:28
Question: I am trying to use the random forests package for classification in R. The Variable Importance Measures listed are:

    mean raw importance score of variable x for class 0
    mean raw importance score of variable x for class 1
    MeanDecreaseAccuracy
    MeanDecreaseGini

Now I know what these "mean", as in I know their definitions. What I want to know is how to use them. What I really want to know is what these values mean in only the context of how accurate they are, what is a good value, what is a bad

R: Plot trees from h2o.randomForest() and h2o.gbm()

最后都变了- posted on 2019-11-30 05:05:49
Looking for an efficient way to plot trees in RStudio, H2O's Flow, or a local HTML page from h2o's RF and GBM models, similar to the one in the image in link below. Specifically, how do you plot trees for the objects (fitted models) rf1 and gbm2 produced by the code below, perhaps by parsing h2o.download_pojo(rf1) or h2o.download_pojo(gbm1)?

    #
    # The following two commands remove any previously installed H2O packages for R.
    # if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
    # if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
    #
    # Next, we download

What does the parameter 'classwt' in RandomForest function in RandomForest package in R stand for?

我怕爱的太早我们不能终老 posted on 2019-11-30 04:46:37
The help page for randomForest::randomForest() says: "classwt - Priors of the classes. Need not add up to one. Ignored for regression." Could setting the classwt parameter help when you have heavily unbalanced data, i.e. when the priors of the classes differ strongly? How should I set classwt when training a model on a dataset with 3 classes whose priors are (p1,p2,p3), while in the test set the priors are (q1,q2,q3)?

"Could setting the classwt parameter help when you have heavily unbalanced data, i.e. when the priors of the classes differ strongly?"

Yes, setting values of classwt could be useful for unbalanced datasets. And

Recursive feature elimination on Random Forest using scikit-learn

不羁的心 posted on 2019-11-30 03:55:03
I'm trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created during the recursive process. However, when I try to use the RFECV method, I get the following error:

    AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'

Random Forests don't have coefficients per se, but they do have rankings by Gini score. So, I'm wondering how to get around this problem. Please note that I want to use a method that will explicitly tell me what features from my pandas DataFrame were selected in the
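
In recent scikit-learn releases, `RFE` and `RFECV` rank features by `feature_importances_` when the estimator has no `coef_`, so a random forest can be passed in directly. A minimal sketch under that assumption follows. It scores subsets with cross-validated ROC AUC rather than the OOB ROC the question asks for, and the feature names are hypothetical stand-ins for DataFrame columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in data (assumption) with hypothetical column names.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
feature_names = np.array(["f%d" % i for i in range(X.shape[1])])

forest = RandomForestClassifier(n_estimators=20, random_state=0)

# Drop one feature per round; each subset is scored by 3-fold ROC AUC.
selector = RFECV(estimator=forest, step=1, cv=3, scoring="roc_auc")
selector.fit(X, y)

# support_ is a boolean mask over the original columns.
selected = feature_names[selector.support_]
print(selected)
```

With a pandas DataFrame `X_df` (hypothetical name), `X_df.columns[selector.support_]` would give the selected column labels in the same way.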

Proximity Matrix in sklearn.ensemble.RandomForestClassifier

▼魔方 西西 posted on 2019-11-30 03:46:32
I'm trying to perform clustering in Python using Random Forests. In the R implementation of Random Forests, there is a flag you can set to get the proximity matrix. I can't seem to find anything similar in the Python scikit-learn version of Random Forest. Does anyone know if there is an equivalent calculation for the Python version?

We don't implement the proximity matrix in Scikit-Learn (yet). However, this could be done by relying on the apply function provided in our implementation of decision trees. That is, for all pairs of samples in your dataset, iterate over the decision trees in the forest
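
The approach described in the answer can be sketched as follows. This is an illustration under assumptions, not scikit-learn's own implementation: it uses the forest's `apply` method to get the leaf index of every sample in every tree, then defines the proximity of two samples as the fraction of trees in which they share a leaf. The helper name `proximity_matrix` and the iris data are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def proximity_matrix(forest, X):
    """Fraction of trees in which each pair of samples lands in the same leaf."""
    leaves = forest.apply(X)  # shape (n_samples, n_trees)
    n_samples, n_trees = leaves.shape
    proximity = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        # Pairwise comparison of leaf ids for tree t.
        same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
        proximity += same_leaf
    return proximity / n_trees


X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
prox = proximity_matrix(forest, X)
print(prox.shape)
```

For clustering, `1 - prox` can serve as a dissimilarity matrix. The result needs memory quadratic in the number of samples, so this sketch only suits small to medium datasets.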