random-forest

Recursive feature elimination on Random Forest using scikit-learn

旧城冷巷雨未停 posted on 2019-11-29 01:03:26
Question: I'm trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created during the recursive process. However, when I try to use the RFECV method, I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'. Random forests don't have coefficients per se, but they do have rankings by Gini score. So, I'm wondering how to get around this problem. Please note that I want to
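For readers hitting the same error: recent scikit-learn releases let RFECV fall back to feature_importances_ when the estimator has no coef_, so a random forest can be passed in directly. The sketch below assumes such a release and uses synthetic data from make_classification. It scores each subset with cross-validated ROC AUC rather than the OOB ROC the question asks for, so it is a substitute and not the asker's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic binary problem (an assumption; the asker's data is not shown).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=25, random_state=0)

# RFECV ranks features with forest.feature_importances_ and drops one per step.
selector = RFECV(forest, step=1, cv=3, scoring="roc_auc")
selector.fit(X, y)

print("features kept:", selector.n_features_)
print("mask of kept features:", selector.support_)
```

The selected columns can then be extracted with selector.transform(X).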

PySpark & MLLib: Class Probabilities of Random Forest Predictions

会有一股神秘感。 posted on 2019-11-29 01:00:37
Question: I'm trying to extract the class probabilities of a random forest object I have trained using PySpark. However, I do not see an example of it anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract class probabilities from a RandomForestModel classifier in PySpark? Here's the sample code provided in the documentation that only provides the final class (not the probability):
    from pyspark.mllib.tree import RandomForest
    from pyspark.mllib.util import MLUtils
    #

Combining random forest models in scikit learn

假如想象 posted on 2019-11-28 20:46:56
I have two RandomForestClassifier models, and I would like to combine them into one meta model. They were both trained using similar, but different, data. How can I do this?
    rf1  # this is my first fitted RandomForestClassifier object, with 250 trees
    rf2  # this is my second fitted RandomForestClassifier object, also with 250 trees
I want to create big_rf with all trees combined into one 500-tree model. I believe this is possible by modifying the estimators_ and n_estimators attributes on the RandomForestClassifier object. Each tree in the forest is stored as a DecisionTreeClassifier object, and
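The idea described in the question (concatenating estimators_ and updating n_estimators) can be sketched as follows. This is a minimal illustration, not an official scikit-learn API: the helper name combine_forests is hypothetical, the data is synthetic, and it assumes both forests were fitted on the same feature set and the same set of class labels.

```python
from copy import deepcopy

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def combine_forests(rf_a, rf_b):
    """Return a new forest holding the fitted trees of both inputs.

    Hypothetical helper: assumes identical features and classes_ in both.
    """
    combined = deepcopy(rf_a)
    combined.estimators_ = list(rf_a.estimators_) + list(rf_b.estimators_)
    combined.n_estimators = len(combined.estimators_)
    return combined


# Two forests trained on different halves of a synthetic dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rf1 = RandomForestClassifier(n_estimators=10, random_state=1).fit(X[:150], y[:150])
rf2 = RandomForestClassifier(n_estimators=10, random_state=2).fit(X[150:], y[150:])

big_rf = combine_forests(rf1, rf2)
print(len(big_rf.estimators_), "trees in the combined model")
```

Because both inputs hold the same number of trees here, the combined model's predicted probabilities equal the average of the two forests' probabilities. Attributes such as oob_score_ are copied from the first forest and no longer describe the combined model.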

PySpark & MLLib: Random Forest Feature Importances

余生颓废 posted on 2019-11-28 20:45:20
I'm trying to extract the feature importances of a random forest object I have trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract feature importances from a RandomForestModel regressor or classifier in PySpark? Here's the sample code provided in the documentation to get us started; however, there is no mention of feature importances in it.
    from pyspark.mllib.tree import RandomForest
    from pyspark.mllib.util import MLUtils
    # Load and parse the data file into an RDD of LabeledPoint.
    data =

How to tune parameters in Random Forest, using Scikit Learn?

无人久伴 posted on 2019-11-28 18:49:09
Question:
    class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
I'm using a random forest model with 9 samples and about 7000 attributes. Of these samples, there are 3 categories that my classifier recognizes. I know this is far from ideal

Numpy Array Get row index searching by a row

微笑、不失礼 posted on 2019-11-28 18:23:06
I am new to numpy and I am implementing clustering with random forest in Python. My question is: how can I find the index of an exact row in an array? For example, given
    [[ 0. 5. 2.]
     [ 0. 0. 3.]
     [ 0. 0. 0.]]
I look for [0. 0. 3.] and want to get 1 as the result (the index of the second row). Any suggestions? The code follows (not working...):
    for index, element in enumerate(leaf_node.x):
        for index_second_element, element_two in enumerate(leaf_node.x):
            if (index <= index_second_element):
                index_row = np.where(X == element)
                index_column = np.where(X == element_two)
                self.similarity_matrix[index_row][index_column
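One common way to do the row lookup from the example is to compare every row against the target and keep the rows where all columns match. The sketch below uses the array from the question; the variable names are illustrative.

```python
import numpy as np

X = np.array([[0., 5., 2.],
              [0., 0., 3.],
              [0., 0., 0.]])
target = np.array([0., 0., 3.])

# Broadcasting compares target to each row; all(axis=1) keeps full-row matches.
matches = np.where((X == target).all(axis=1))[0]

print(matches)  # prints [1]
```

matches is an array because a row may occur more than once; matches[0] gives the first hit. For floating-point data produced by arithmetic, np.isclose(X, target).all(axis=1) is a more tolerant comparison.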

How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)

前提是你 posted on 2019-11-28 17:28:14
I'm running GridSearchCV to optimize the parameters of a classifier in scikit-learn. Once I'm done, I'd like to know which parameters were chosen as the best. Whenever I do so I get an AttributeError: 'RandomForestClassifier' object has no attribute 'best_estimator_', and can't tell why, as it seems to be a legitimate attribute in the documentation.
    from sklearn.grid_search import GridSearchCV
    X = data[usable_columns]
    y = data[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt', n_estimators=50
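The error message suggests best_estimator_ was read from the classifier itself; that attribute belongs to the fitted GridSearchCV object. The following is a minimal sketch on synthetic data, with a small illustrative parameter grid that is not the asker's. It imports GridSearchCV from sklearn.model_selection, its location in current scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for the asker's DataFrame (an assumption).
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

rfc = RandomForestClassifier(n_estimators=20, random_state=0)
param_grid = {"max_depth": [2, 4], "max_features": ["sqrt", "log2"]}

search = GridSearchCV(rfc, param_grid, cv=3)
search.fit(X, y)

# Results live on the search object, not on rfc.
print(search.best_params_)
best_rfc = search.best_estimator_
```

With the default refit=True, best_rfc is already refitted on the full data passed to fit and can be used for prediction directly.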

RandomForestClassifier.fit(): ValueError: could not convert string to float

两盒软妹~` posted on 2019-11-28 16:37:35
Question: Given a simple CSV file:
    A,B,C
    Hello,Hi,0
    Hola,Bueno,1
Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so:
    cols = ['A','B','C']
    col_types = {'A': str, 'B': str, 'C': int}
    test = pd.read_csv('test.csv', dtype=col_types)
    train_y = test['C'] == 1
    train_x = test[cols]
    clf_rf = RandomForestClassifier(n_estimators=50)
    clf_rf.fit(train_x, train_y)
But I just get this traceback when
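scikit-learn's random forest expects numeric input, so string columns such as A and B have to be encoded first. The sketch below is one possible approach, not necessarily the accepted answer: it one-hot encodes the strings with pandas.get_dummies. It builds the frame inline instead of reading test.csv, and it leaves column C out of the features because C is the label; both choices are assumptions made for the illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same content as the CSV in the question, built inline for a self-contained run.
test = pd.DataFrame({"A": ["Hello", "Hola"], "B": ["Hi", "Bueno"], "C": [0, 1]})

train_y = test["C"] == 1
# One indicator column per distinct string value in A and B.
train_x = pd.get_dummies(test[["A", "B"]])

clf_rf = RandomForestClassifier(n_estimators=50, random_state=0)
clf_rf.fit(train_x, train_y)

print(list(train_x.columns))
```

Any new data must be encoded into the same indicator columns before calling predict.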

Unbalanced classification using RandomForestClassifier in sklearn

僤鯓⒐⒋嵵緔 posted on 2019-11-28 16:22:48
I have a dataset where the classes are unbalanced. The classes are either '1' or '0', where the ratio of class '1' to class '0' is 5:1. How do you calculate the prediction error for each class and then rebalance the weights accordingly in sklearn with Random Forest, along the lines of the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
alko: You can pass the sample weights argument to the Random Forest fit method:
    sample_weight : array-like, shape = [n_samples] or None
    Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or
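As a rough illustration of per-class error and reweighting, the sketch below trains on synthetic data with approximately the 5:1 ratio from the question, uses class_weight="balanced" (an alternative to building a sample_weight array by hand), and derives each class's error rate from the confusion matrix. It does not reproduce the iterative procedure on the linked Breiman page.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Roughly one sample of class 0 for every five of class 1 (synthetic data).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[1 / 6, 5 / 6], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# "balanced" weights each class inversely to its frequency in y_train.
clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)
clf.fit(X_train, y_train)

# Rows of the confusion matrix are true classes; the diagonal holds correct hits.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1])
per_class_error = 1.0 - np.diag(cm) / cm.sum(axis=1)
print(dict(zip([0, 1], per_class_error)))
```

If the minority-class error stays too high, an explicit dictionary such as class_weight={0: 5, 1: 1} can be tried and the per-class errors compared again.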

Random Forest Feature Importance Chart using Python

£可爱£侵袭症+ posted on 2019-11-28 16:22:19
Question: I am working with RandomForestRegressor in Python and I want to create a chart that will illustrate the ranking of feature importance. This is the code I used:
    from sklearn.ensemble import RandomForestRegressor
    MT = pd.read_csv("MT_reduced.csv")
    df = MT.reset_index(drop = False)
    columns2 = df.columns.tolist()
    # Filter the columns to remove ones we don't want.
    columns2 = [c for c in columns2 if c not in ["Violent_crime_rate","Change_Property_crime_rate","State","Year"]]
    # Store the variable we
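A chart of this kind can be drawn by sorting feature_importances_ and passing the result to a horizontal bar plot. The sketch below assumes matplotlib is installed and uses synthetic regression data with placeholder feature names, since MT_reduced.csv is not available here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the asker's data; the feature names are placeholders.
X, y = make_regression(n_samples=200, n_features=6, n_informative=3, random_state=0)
feature_names = np.array([f"feature_{i}" for i in range(X.shape[1])])

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Ascending sort so the most important feature ends up at the top of the chart.
order = np.argsort(model.feature_importances_)

fig, ax = plt.subplots()
ax.barh(feature_names[order], model.feature_importances_[order])
ax.set_xlabel("Importance")
ax.set_title("Random forest feature importances")
fig.tight_layout()
```

Call fig.savefig with a file name, or plt.show() with an interactive backend, to view the result. With a DataFrame, the column names used for training can replace the placeholder feature_names.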