random-forest

How are feature_importances in RandomForestClassifier determined?

Submitted by 人走茶凉 on 2019-11-27 02:30:59
I have a classification task with a time series as the input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result, I would like to find out which attributes/dates contribute to the result, and to what extent. For this I am using feature_importances_, which works well for me. However, I would like to know how the importances are calculated and which measure/algorithm is used. Unfortunately I could not find any documentation on this topic.

There are indeed several ways to get feature "importances". As often, there is no strict
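For scikit-learn's RandomForestClassifier, feature_importances_ is the impurity-based ("Gini") importance: the total decrease in node impurity brought by each feature, weighted by the fraction of samples reaching the node, averaged over all trees, and normalized to sum to 1. A minimal sketch on synthetic data (the dataset here is made up for illustration, matching the question's 23 attributes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 23-attribute time-series data in the question
X, y = make_classification(n_samples=200, n_features=23, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance score per feature; normalized so they sum to 1.0
importances = clf.feature_importances_
```

Higher values mean the feature produced larger impurity reductions across the forest; note this measure can be biased toward high-cardinality features, which is why permutation importance is sometimes preferred.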

r random forest error - type of predictors in new data do not match

Submitted by 筅森魡賤 on 2019-11-27 00:14:56
Question: I am trying to use the quantile regression forest function in R (quantregForest), which is built on the randomForest package. I am getting a type-mismatch error that I can't quite figure out. I train the model with qrf <- quantregForest(x = xtrain, y = ytrain), which works without a problem, but when I try to predict on new data with quant.newdata <- predict(qrf, newdata = xtest), it gives the following error:

    Error in predict.quantregForest(qrf, newdata = xtest) : Type of predictors in new data do

Predict classes or class probabilities?

Submitted by 99封情书 on 2019-11-26 22:25:28
Question: I am currently using H2O for a classification dataset. I am testing it out with H2ORandomForestEstimator in a Python 3.6 environment. I noticed that the results of the predict method were values between 0 and 1 (I am assuming this is the probability). In my data set the target attribute is numeric, i.e. True values are 1 and False values are 0. I made sure I converted the type to category for the target attribute, but I was still getting the same result. Then I modified the code to
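In H2O the usual fix is to convert the target column with asfactor() so the estimator treats the task as classification rather than regression. The underlying distinction the question is circling, hard class labels versus per-class probabilities, can be sketched with scikit-learn (used here as a stand-in, since an H2O example needs a running H2O cluster; the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

labels = clf.predict(X[:3])       # hard class labels, here 0 or 1
probs = clf.predict_proba(X[:3])  # one probability per class; each row sums to 1
```

A classifier gives you both views; values strictly between 0 and 1, as in the question, indicate the model is producing probabilities (or doing regression on a numeric target).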

How can I use the row.names attribute to order the rows of my dataframe in R?

Submitted by 偶尔善良 on 2019-11-26 20:36:45
Question: I created a random forest and predicted the classes of my test set, which are living happily in a dataframe:

    row.names  class
    564028     1
    275747     1
    601137     0
    922930     1
    481988     1
    ...

The row.names attribute tells me which row is which, since various operations scrambled the order of the rows during the process. So far so good. Now I would like to get a general feel for the accuracy of my predictions. To do this, I need to take this dataframe and reorder it in ascending order according to the

Why is Random Forest with a single tree much better than a Decision Tree classifier?

Submitted by 眉间皱痕 on 2019-11-26 17:48:32
Question: I am learning machine learning with the scikit-learn library. I apply a decision tree classifier and a random forest classifier to my data with this code:

    def decision_tree(train_X, train_Y, test_X, test_Y):
        clf = tree.DecisionTreeClassifier()
        clf.fit(train_X, train_Y)
        return clf.score(test_X, test_Y)

    def random_forest(train_X, train_Y, test_X, test_Y):
        clf = RandomForestClassifier(n_estimators=1)
        clf = clf.fit(X, Y)
        return clf.score(test_X, test_Y)

Why are the results so much better for the
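Two things are in play here. First, the random_forest function in the question fits on X, Y rather than train_X, train_Y, so if X contains the test rows the score is inflated by leakage. Second, even with that fixed, a one-tree forest is not identical to a plain DecisionTreeClassifier: by default it trains each tree on a bootstrap sample and, for classification, considers only a random subset of features at each split. A sketch on synthetic data showing that the two models legitimately differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

dt = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
# n_estimators=1 still uses bootstrap sampling and random feature subsets
forest1 = RandomForestClassifier(n_estimators=1, random_state=0).fit(Xtr, ytr)

dt_score = dt.score(Xte, yte)
forest_score = forest1.score(Xte, yte)
```

Setting bootstrap=False and max_features=None on the forest makes its single tree match the plain decision tree much more closely.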

Scikit learn - fit_transform on the test set

Submitted by 折月煮酒 on 2019-11-26 16:52:58
Question: I am struggling to use random forest in Python with scikit-learn. My problem is that I use it for text classification (in 3 classes: positive/negative/neutral), and the features that I extract are mainly words/unigrams, so I need to convert these to numerical features. I found a way to do it with DictVectorizer's fit_transform:

    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import classification_report
    from sklearn.feature_extraction import DictVectorizer
    vec =
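The usual rule with scikit-learn vectorizers is: call fit_transform on the training set only, then plain transform on the test set, so the test matrix uses the vocabulary learned from training and the columns line up. A minimal sketch with toy word-count dicts (the feature dicts are invented for illustration):

```python
from sklearn.feature_extraction import DictVectorizer

train = [{"good": 1, "movie": 1}, {"bad": 1, "movie": 1}]
test = [{"good": 1, "unseen": 1}]

vec = DictVectorizer()
X_train = vec.fit_transform(train)  # learns the vocabulary: bad, good, movie
X_test = vec.transform(test)        # reuses it; the unknown word "unseen" is dropped
```

Calling fit_transform on the test set instead would build a different vocabulary, producing a matrix whose columns no longer match the trained model.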

What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn?

Submitted by 倖福魔咒の on 2019-11-26 16:02:22
Question: I've read in this documentation that: "Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value." But it is still unclear to me how this works. If I set sample_weight with an array of only two possible values, 1s and 2s, does this mean that the samples with 2s will get sampled twice as often as the samples with 1s when doing the bagging? I cannot
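For a single DecisionTreeClassifier (no bagging involved), sample_weight does not resample anything: it scales each sample's contribution to the impurity and leaf counts, so a weight of 2 behaves like duplicating that sample. A small sketch demonstrating the equivalence on toy data (the data is invented; in the forest's bagging step the interaction is subtler, since the bootstrap draw itself is uniform and the weights are multiplied in afterwards):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Give the sample at x=2.0 weight 2 ...
w = np.array([1, 1, 2, 1])
clf_weighted = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)

# ... which is equivalent to duplicating that sample in the training set
X_dup = np.vstack([X, [[2.0]]])
y_dup = np.append(y, 1)
clf_dup = DecisionTreeClassifier(random_state=0).fit(X_dup, y_dup)

pred_weighted = clf_weighted.predict(X)
pred_dup = clf_dup.predict(X)
```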

Suggestions for speeding up Random Forests

Submitted by 馋奶兔 on 2019-11-26 12:05:35
Question: I'm doing some work with the randomForest package, and while it works well, it can be time-consuming. Does anyone have suggestions for speeding things up? I'm using a Windows 7 box with a dual-core AMD chip. I know that R is not multi-threaded/multi-processor, but I was curious whether any of the parallel packages (rmpi, snow, snowfall, etc.) work for randomForest. Thanks.

EDIT: I'm using rF for some classification work (0's and 1's). The data has about 8-12 variable columns and the
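The question is about R's randomForest, where the usual route is to grow chunks of trees on separate workers (e.g. with foreach/doParallel) and merge them with randomForest::combine. As a point of comparison only, not a fix for the R setup, scikit-learn builds the same parallelism in directly via the n_jobs parameter, since trees in a forest are independent and embarrassingly parallel:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# n_jobs=-1 grows the trees on all available cores
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0).fit(X, y)
train_score = clf.score(X, y)
```

Either way, the win comes from splitting the ntree/n_estimators work across cores; reducing the number of trees or sampled features helps on any backend.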

How to extract the decision rules from scikit-learn decision-tree?

Submitted by 爷,独闯天下 on 2019-11-26 00:32:34
Question: Can I extract the underlying decision rules (or 'decision paths') from a trained decision tree as a textual list? Something like:

    if A > 0.4 then
        if B < 0.2 then
            if C > 0.8 then
                class = 'X'

Thanks for your help.

Answer 1: I believe that this answer is more correct than the other answers here:

    from sklearn.tree import _tree

    def tree_to_code(tree, feature_names):
        tree_ = tree.tree_
        feature_name = [
            feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
            for i in tree_.feature
        ]
        print
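Alongside the manual tree_ traversal sketched in the answer, scikit-learn (0.21 and later) ships a built-in helper, sklearn.tree.export_text, that prints exactly this kind of nested rule list. A short example on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Renders the tree as indented "feature <= threshold" rules with class leaves
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

For tasks beyond display, walking tree_.feature, tree_.threshold, and tree_.children_left/children_right, as the answer's tree_to_code does, remains the way to turn the rules into executable code.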