random-forest | 易学教程

Recursive feature elimination on Random Forest using scikit-learn

Question: I'm trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created during the recursive process. However, when I try to use the RFECV method, I get the error AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'. Random forests don't have coefficients per se, but they do have rankings by Gini score. So, I'm wondering how to get around this problem. Please note that I want to
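A minimal sketch of one way around the error, assuming a recent scikit-learn release in which RFECV can rank features through feature_importances_ as well as coef_. It scores each subset with cross-validated ROC AUC, not the OOB ROC the question asks for, and the synthetic data and parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic data (assumption): 10 features, 3 of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=30, random_state=0)

# RFECV drops the weakest feature at each step, ranking features by the
# forest's Gini-based feature_importances_.
selector = RFECV(estimator=rf, step=1, cv=3, scoring="roc_auc")
selector.fit(X, y)

print("Features kept:", selector.n_features_)
print("Selection mask:", selector.support_)
```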

PySpark & MLLib: Class Probabilities of Random Forest Predictions

Question: I'm trying to extract the class probabilities of a random forest object I have trained using PySpark. However, I do not see an example of it anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract class probabilities from a RandomForestModel classifier in PySpark? Here's the sample code provided in the documentation, which returns only the final class (not the probability):
    from pyspark.mllib.tree import RandomForest
    from pyspark.mllib.util import MLUtils
    #

Combining random forest models in scikit learn

I have two RandomForestClassifier models, and I would like to combine them into one meta model. They were both trained using similar, but different, data. How can I do this?
    rf1  # this is my first fitted RandomForestClassifier object, with 250 trees
    rf2  # this is my second fitted RandomForestClassifier object, also with 250 trees
I want to create big_rf with all trees combined into one 500-tree model. I believe this is possible by modifying the estimators_ and n_estimators attributes on the RandomForestClassifier object. Each tree in the forest is stored as a DecisionTreeClassifier object, and
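The idea described in the question can be sketched as follows. This is only an illustration under stated assumptions: both forests were fitted on the same feature columns and saw the same set of classes, combine_forests is a hypothetical helper name, and the synthetic data and the tree count (25 instead of 250) are chosen to keep the example fast.

```python
from copy import deepcopy

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Two forests trained on similar but different slices of synthetic data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rf1 = RandomForestClassifier(n_estimators=25, random_state=1).fit(X[:120], y[:120])
rf2 = RandomForestClassifier(n_estimators=25, random_state=2).fit(X[80:], y[80:])


def combine_forests(first, second):
    """Return a copy of `first` that also holds the trees of `second`."""
    combined = deepcopy(first)
    # Each entry of estimators_ is a fitted DecisionTreeClassifier.
    combined.estimators_ = list(first.estimators_) + list(second.estimators_)
    combined.n_estimators = len(combined.estimators_)
    return combined


big_rf = combine_forests(rf1, rf2)
print(big_rf.n_estimators)
```

Attributes computed during fitting, such as an OOB score, are copied from the first forest and would not describe the combined model.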

PySpark & MLLib: Random Forest Feature Importances

I'm trying to extract the feature importances of a random forest object I have trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract feature importances from a RandomForestModel regressor or classifier in PySpark? Here's the sample code provided in the documentation to get us started; however, there is no mention of feature importances in it.
    from pyspark.mllib.tree import RandomForest
    from pyspark.mllib.util import MLUtils
    # Load and parse the data file into an RDD of LabeledPoint.
    data =

How to tune parameters in Random Forest, using Scikit Learn?

Question:
    class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
I'm using a random forest model with 9 samples and about 7000 attributes. The samples fall into 3 categories that my classifier recognizes. I know this is far from ideal

Numpy Array Get row index searching by a row

I am new to numpy and I am implementing clustering with random forest in Python. My question is: how can I find the index of an exact row in an array? For example, given
    [[ 0. 5. 2.]
     [ 0. 0. 3.]
     [ 0. 0. 0.]]
I look for [0. 0. 3.] and want 1 as the result (the index of the second row). Any suggestions? The code follows (not working...):
    for index, element in enumerate(leaf_node.x):
        for index_second_element, element_two in enumerate(leaf_node.x):
            if (index <= index_second_element):
                index_row = np.where(X == element)
                index_column = np.where(X == element_two)
                self.similarity_matrix[index_row][index_column
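One common way to do this lookup is a row-wise comparison followed by np.where. The sketch below reuses the array from the question; the variable names are illustrative, and exact equality is assumed to be acceptable for the float values involved.

```python
import numpy as np

X = np.array([[0., 5., 2.],
              [0., 0., 3.],
              [0., 0., 0.]])
target = np.array([0., 0., 3.])

# Broadcasting compares `target` with every row; all(axis=1) keeps only
# the rows in which every column matches.
matches = np.where((X == target).all(axis=1))[0]

# Take the first matching row index, or None when the row is absent.
row_index = int(matches[0]) if matches.size else None
print(row_index)  # 1
```

For values produced by floating-point arithmetic, np.isclose(X, target).all(axis=1) may be a safer comparison than ==.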

How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)

I'm running GridSearchCV to optimize the parameters of a classifier in scikit-learn. Once I'm done, I'd like to know which parameters were chosen as the best. Whenever I do so I get an AttributeError: 'RandomForestClassifier' object has no attribute 'best_estimator_', and can't tell why, as it seems to be a legitimate attribute in the documentation.
    from sklearn.grid_search import GridSearchCV
    X = data[usable_columns]
    y = data[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    rfc = RandomForestClassifier(n_jobs=-1,max_features= 'sqrt' ,n_estimators=50
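The error message shows that best_estimator_ was read from the RandomForestClassifier itself; the attribute belongs to the fitted GridSearchCV object. A minimal sketch, assuming a current scikit-learn release (where GridSearchCV is imported from sklearn.model_selection, since sklearn.grid_search has been removed); the synthetic data and the parameter grid are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for data[usable_columns] and data[target].
X, y = make_classification(n_samples=150, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rfc = RandomForestClassifier(max_features="sqrt", n_estimators=20,
                             random_state=0)
param_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 3]}  # hypothetical

search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=3)
search.fit(X_train, y_train)

# The results live on the search object, not on rfc.
print(search.best_params_)
best_rfc = search.best_estimator_
print(best_rfc.score(X_test, y_test))
```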

RandomForestClassifier.fit(): ValueError: could not convert string to float

Question: Given a simple CSV file:
    A,B,C
    Hello,Hi,0
    Hola,Bueno,1
Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so:
    cols = ['A','B','C']
    col_types = {'A': str, 'B': str, 'C': int}
    test = pd.read_csv('test.csv', dtype=col_types)
    train_y = test['C'] == 1
    train_x = test[cols]
    clf_rf = RandomForestClassifier(n_estimators=50)
    clf_rf.fit(train_x, train_y)
But I just get this traceback when
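scikit-learn's random forest expects numeric input, so string columns have to be encoded first. The sketch below uses pandas.get_dummies for one-hot encoding, which is one option among several and not necessarily what the original thread settled on. It builds the DataFrame inline instead of reading test.csv, and it leaves the target column C out of the features, which is a change from the question's code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same content as the CSV in the question, built inline.
test = pd.DataFrame({"A": ["Hello", "Hola"],
                     "B": ["Hi", "Bueno"],
                     "C": [0, 1]})

train_y = test["C"] == 1

# One-hot encode the string columns; C is the target, so it is excluded.
train_x = pd.get_dummies(test[["A", "B"]])
print(list(train_x.columns))  # ['A_Hello', 'A_Hola', 'B_Bueno', 'B_Hi']

clf_rf = RandomForestClassifier(n_estimators=50, random_state=0)
clf_rf.fit(train_x, train_y)
```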

Unbalanced classification using RandomForestClassifier in sklearn

I have a dataset where the classes are unbalanced. The classes are either '1' or '0', and the ratio of class '1' to class '0' is 5:1. How do you calculate the prediction error for each class and then rebalance the weights accordingly in sklearn with Random Forest, along the lines of the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
alko: You can pass the sample_weight argument to the Random Forest fit method:
    sample_weight : array-like, shape = [n_samples] or None
    Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or
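The sample_weight suggestion can be sketched as follows, assuming scikit-learn's compute_sample_weight helper with the "balanced" heuristic. The synthetic 5:1 data is an illustrative assumption, and the per-class error is computed on the training set for brevity, so it will be optimistic compared with a held-out estimate.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic data with a 5:1 ratio of class 1 to class 0.
rng = np.random.RandomState(0)
X = rng.normal(size=(600, 4))
y = np.array([1] * 500 + [0] * 100)
X[y == 0] += 1.0  # shift the minority class so it is learnable

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_sample_weight(class_weight="balanced", y=y)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=weights)

# Rows of the confusion matrix are true classes, so the per-class error
# is one minus the diagonal divided by the row sums.
cm = confusion_matrix(y, clf.predict(X), labels=[0, 1])
per_class_error = 1 - cm.diagonal() / cm.sum(axis=1)
print(per_class_error)
```

Passing class_weight="balanced" to the RandomForestClassifier constructor is an alternative that applies the same heuristic without building the weight array by hand.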

Random Forest Feature Importance Chart using Python

Question: I am working with RandomForestRegressor in Python and I want to create a chart that illustrates the ranking of feature importances. This is the code I used:
    from sklearn.ensemble import RandomForestRegressor
    MT = pd.read_csv("MT_reduced.csv")
    df = MT.reset_index(drop=False)
    columns2 = df.columns.tolist()
    # Filter the columns to remove ones we don't want.
    columns2 = [c for c in columns2 if c not in ["Violent_crime_rate", "Change_Property_crime_rate", "State", "Year"]]
    # Store the variable we