random-forest

How are feature_importances in RandomForestClassifier determined?

Submitted by 人走茶凉 on 2019-11-27 02:30:59
I have a classification task with a time series as the input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result, I would like to find out which attributes/dates contribute to the result, and to what extent. For this I am using feature_importances_, which works well for me. However, I would like to know how the importances are calculated and which measure/algorithm is used. Unfortunately I could not find any documentation on this topic.

There are indeed several ways to get feature "importances". As often, there is no strict
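For scikit-learn's RandomForestClassifier, feature_importances_ is the impurity-based ("Gini") importance: the total decrease in node impurity brought by each feature, weighted by the fraction of samples reaching the node, averaged over all trees, and normalized to sum to 1. A minimal sketch on synthetic data (the dataset here is made up for illustration, matching the question's 23 attributes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 23-attribute time-series data in the question
X, y = make_classification(n_samples=200, n_features=23, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance score per feature; normalized so they sum to 1.0
importances = clf.feature_importances_
```

Higher values mean the feature produced larger impurity reductions across the forest; note this measure can be biased toward high-cardinality features, which is why permutation importance is sometimes preferred.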

r random forest error - type of predictors in new data do not match

Submitted by 筅森魡賤 on 2019-11-27 00:14:56
Question: I am trying to use the quantile regression forest function in R (quantregForest), which is built on the randomForest package. I am getting a type-mismatch error that I can't quite figure out. I train the model with qrf <- quantregForest(x = xtrain, y = ytrain), which works without a problem, but when I try to predict on new data with quant.newdata <- predict(qrf, newdata = xtest), it gives the following error:

    Error in predict.quantregForest(qrf, newdata = xtest) : Type of predictors in new data do

Predict classes or class probabilities?

Submitted by 99封情书 on 2019-11-26 22:25:28
Question: I am currently using H2O for a classification dataset. I am testing it out with H2ORandomForestEstimator in a Python 3.6 environment. I noticed that the results of the predict method were values between 0 and 1 (I am assuming this is the probability). In my data set the target attribute is numeric, i.e. True values are 1 and False values are 0. I made sure I converted the type to category for the target attribute, but I was still getting the same result. Then I modified the code to
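In H2O the usual fix is to convert the target column with asfactor() so the estimator treats the task as classification rather than regression. The underlying distinction the question is circling, hard class labels versus per-class probabilities, can be sketched with scikit-learn (used here as a stand-in, since an H2O example needs a running H2O cluster; the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

labels = clf.predict(X[:3])       # hard class labels, here 0 or 1
probs = clf.predict_proba(X[:3])  # one probability per class; each row sums to 1
```

A classifier gives you both views; values strictly between 0 and 1, as in the question, indicate the model is producing probabilities (or doing regression on a numeric target).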

How can I use the row.names attribute to order the rows of my dataframe in R?

Submitted by 偶尔善良 on 2019-11-26 20:36:45
Question: I created a random forest and predicted the classes of my test set, which are living happily in a dataframe:

    row.names  class
    564028     1
    275747     1
    601137     0
    922930     1
    481988     1
    ...

The row.names attribute tells me which row is which, since various operations scrambled the order of the rows during the process. So far so good. Now I would like to get a general feel for the accuracy of my predictions. To do this, I need to take this dataframe and reorder it in ascending order according to the

Why is Random Forest with a single tree much better than a Decision Tree classifier?

Submitted by 眉间皱痕 on 2019-11-26 17:48:32
Question: I am learning machine learning with the scikit-learn library. I apply a decision tree classifier and a random forest classifier to my data with this code:

    def decision_tree(train_X, train_Y, test_X, test_Y):
        clf = tree.DecisionTreeClassifier()
        clf.fit(train_X, train_Y)
        return clf.score(test_X, test_Y)

    def random_forest(train_X, train_Y, test_X, test_Y):
        clf = RandomForestClassifier(n_estimators=1)
        clf = clf.fit(X, Y)
        return clf.score(test_X, test_Y)

Why are the results so much better for the
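Two things are in play here. First, the random_forest function in the question fits on X, Y rather than train_X, train_Y, so if X contains the test rows the score is inflated by leakage. Second, even with that fixed, a one-tree forest is not identical to a plain DecisionTreeClassifier: by default it trains each tree on a bootstrap sample and, for classification, considers only a random subset of features at each split. A sketch on synthetic data showing that the two models legitimately differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

dt = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
# n_estimators=1 still uses bootstrap sampling and random feature subsets
forest1 = RandomForestClassifier(n_estimators=1, random_state=0).fit(Xtr, ytr)

dt_score = dt.score(Xte, yte)
forest_score = forest1.score(Xte, yte)
```

Setting bootstrap=False and max_features=None on the forest makes its single tree match the plain decision tree much more closely.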

Scikit learn - fit_transform on the test set

Submitted by 折月煮酒 on 2019-11-26 16:52:58
Question: I am struggling to use random forest in Python with scikit-learn. My problem is that I use it for text classification (in 3 classes: positive/negative/neutral), and the features that I extract are mainly words/unigrams, so I need to convert these to numerical features. I found a way to do it with DictVectorizer's fit_transform:

    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import classification_report
    from sklearn.feature_extraction import DictVectorizer
    vec =
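The usual rule with scikit-learn vectorizers is: call fit_transform on the training set only, then plain transform on the test set, so the test matrix uses the vocabulary learned from training and the columns line up. A minimal sketch with toy word-count dicts (the feature dicts are invented for illustration):

```python
from sklearn.feature_extraction import DictVectorizer

train = [{"good": 1, "movie": 1}, {"bad": 1, "movie": 1}]
test = [{"good": 1, "unseen": 1}]

vec = DictVectorizer()
X_train = vec.fit_transform(train)  # learns the vocabulary: bad, good, movie
X_test = vec.transform(test)        # reuses it; the unknown word "unseen" is dropped
```

Calling fit_transform on the test set instead would build a different vocabulary, producing a matrix whose columns no longer match the trained model.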

What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn?

Submitted by 倖福魔咒の on 2019-11-26 16:02:22
Question: I've read in this documentation that: "Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value." But it is still unclear to me how this works. If I set sample_weight with an array of only two possible values, 1s and 2s, does this mean that the samples with 2s will get sampled twice as often as the samples with 1s when doing the bagging? I cannot
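For a single DecisionTreeClassifier (no bagging involved), sample_weight does not resample anything: it scales each sample's contribution to the impurity and leaf counts, so a weight of 2 behaves like duplicating that sample. A small sketch demonstrating the equivalence on toy data (the data is invented; in the forest's bagging step the interaction is subtler, since the bootstrap draw itself is uniform and the weights are multiplied in afterwards):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Give the sample at x=2.0 weight 2 ...
w = np.array([1, 1, 2, 1])
clf_weighted = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)

# ... which is equivalent to duplicating that sample in the training set
X_dup = np.vstack([X, [[2.0]]])
y_dup = np.append(y, 1)
clf_dup = DecisionTreeClassifier(random_state=0).fit(X_dup, y_dup)

pred_weighted = clf_weighted.predict(X)
pred_dup = clf_dup.predict(X)
```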

Suggestions for speeding up Random Forests

Submitted by 馋奶兔 on 2019-11-26 12:05:35
Question: I'm doing some work with the randomForest package, and while it works well, it can be time-consuming. Does anyone have suggestions for speeding things up? I'm using a Windows 7 box with a dual-core AMD chip. I know that R is not multi-threaded/multi-processor, but I was curious whether any of the parallel packages (rmpi, snow, snowfall, etc.) work for randomForest. Thanks.

EDIT: I'm using rF for some classification work (0's and 1's). The data has about 8-12 variable columns and the
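The question is about R's randomForest, where the usual route is to grow chunks of trees on separate workers (e.g. with foreach/doParallel) and merge them with randomForest::combine. As a point of comparison only, not a fix for the R setup, scikit-learn builds the same parallelism in directly via the n_jobs parameter, since trees in a forest are independent and embarrassingly parallel:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# n_jobs=-1 grows the trees on all available cores
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0).fit(X, y)
train_score = clf.score(X, y)
```

Either way, the win comes from splitting the ntree/n_estimators work across cores; reducing the number of trees or sampled features helps on any backend.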

How to extract the decision rules from scikit-learn decision-tree?

Submitted by 爷,独闯天下 on 2019-11-26 00:32:34
Question: Can I extract the underlying decision rules (or 'decision paths') from a trained decision tree as a textual list? Something like:

    if A > 0.4 then
        if B < 0.2 then
            if C > 0.8 then
                class = 'X'

Thanks for your help.

Answer 1: I believe that this answer is more correct than the other answers here:

    from sklearn.tree import _tree

    def tree_to_code(tree, feature_names):
        tree_ = tree.tree_
        feature_name = [
            feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
            for i in tree_.feature
        ]
        print
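Alongside the manual tree_ traversal sketched in the answer, scikit-learn (0.21 and later) ships a built-in helper, sklearn.tree.export_text, that prints exactly this kind of nested rule list. A short example on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Renders the tree as indented "feature <= threshold" rules with class leaves
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

For tasks beyond display, walking tree_.feature, tree_.threshold, and tree_.children_left/children_right, as the answer's tree_to_code does, remains the way to turn the rules into executable code.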