random-forest

What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn?

随声附和 · Submitted on 2019-11-27 12:41:14
I've read in this documentation that: "Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value." But it is still unclear to me how this works. If I set sample_weight to an array with only two possible values, 1's and 2's, does this mean that the samples with 2's will get sampled twice as often as the samples with 1's when doing the bagging? I cannot think of a practical example for this. Matt Hancock: So I spent a little time looking at the sklearn
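A minimal sketch (data and variable names are mine) of what `sample_weight` actually does in a single `DecisionTreeClassifier`: the weights scale each sample's contribution to the impurity computation, which for a weight of 2 behaves like duplicating that sample. Bagging/resampling is a separate mechanism that belongs to the forest, not the tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Giving a sample weight 2 is equivalent to duplicating it in the training set
w = np.array([1.0, 1.0, 2.0, 1.0])
clf_weighted = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)

X_dup = np.vstack([X, [[2.0]]])   # duplicate the sample that had weight 2
y_dup = np.append(y, 1)
clf_dup = DecisionTreeClassifier(random_state=0).fit(X_dup, y_dup)

print(np.array_equal(clf_weighted.predict(X), clf_dup.predict(X)))  # True
```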

Combining random forest models in scikit learn

夙愿已清 · Submitted on 2019-11-27 12:28:36
Question: I have two RandomForestClassifier models, and I would like to combine them into one meta model. They were both trained using similar, but different, data. How can I do this? rf1 # this is my first fitted RandomForestClassifier object, with 250 trees rf2 # this is my second fitted RandomForestClassifier object, also with 250 trees I want to create big_rf with all 500 trees combined into one model. Answer 1: I believe this is possible by modifying the estimators_ and n_estimators attributes on the
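A sketch of the approach the answer describes, assuming both forests were fit on the same set of classes (data and names are mine): concatenate the fitted trees from `estimators_` and update `n_estimators` to match.

```python
from copy import deepcopy

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
rf1 = RandomForestClassifier(n_estimators=250, random_state=1).fit(X, y)
rf2 = RandomForestClassifier(n_estimators=250, random_state=2).fit(X, y)

# Pool the fitted trees from both forests into one copy
big_rf = deepcopy(rf1)
big_rf.estimators_ += rf2.estimators_
big_rf.n_estimators = len(big_rf.estimators_)

print(big_rf.n_estimators)  # 500
```

Predictions from `big_rf` then average over all 500 trees; this only makes sense if both forests saw the same label set.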

Numpy Array Get row index searching by a row

醉酒当歌 · Submitted on 2019-11-27 11:18:00
Question: I am new to numpy and I am implementing clustering with random forest in Python. My question is: how can I find the index of an exact row in an array? For example, given [[0. 5. 2.] [0. 0. 3.] [0. 0. 0.]], if I look for [0. 0. 3.] I should get 1 (the index of the second row) as the result. Any suggestions? The (non-working) code follows: for index, element in enumerate(leaf_node.x): for index_second_element, element_two in enumerate(leaf_node.x): if (index <= index_second_element): index_row = np.where
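The nested-loop attempt can be replaced by one vectorized comparison; a minimal sketch (array contents taken from the question):

```python
import numpy as np

a = np.array([[0., 5., 2.],
              [0., 0., 3.],
              [0., 0., 0.]])
target = np.array([0., 0., 3.])

# Broadcasting compares every row to the target at once;
# all(axis=1) keeps only rows where every element matches
matches = np.where((a == target).all(axis=1))[0]
print(matches)  # [1]
```

For floating-point data produced by computation, `np.isclose(a, target).all(axis=1)` is safer than exact `==`.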

How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)

跟風遠走 · Submitted on 2019-11-27 10:16:43
Question: I'm running GridSearchCV to optimize the parameters of a classifier in scikit. Once I'm done, I'd like to know which parameters were chosen as the best. Whenever I do so I get an AttributeError: 'RandomForestClassifier' object has no attribute 'best_estimator_', and I can't tell why, as it seems to be a legitimate attribute in the documentation. from sklearn.grid_search import GridSearchCV X = data[usable_columns] y = data[target] X_train, X_test, y_train, y_test = train_test_split(X, y, test
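The error suggests the attribute is being read from the classifier rather than from the fitted search object. A minimal sketch (data and parameter grid are mine; note that the old `sklearn.grid_search` module has since moved to `sklearn.model_selection`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, random_state=0)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [5, 10]}, cv=3)
grid.fit(X, y)

# best_estimator_ and best_params_ live on the fitted GridSearchCV object,
# not on the RandomForestClassifier passed into it
print(grid.best_params_)
```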

How to use random forests in R with missing values?

故事扮演 · Submitted on 2019-11-27 09:11:29
Question: library(randomForest) rf.model <- randomForest(WIN ~ ., data = learn) I would like to fit a random forest model, but I get this error: Error in na.fail.default(list(WIN = c(2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, : missing values in object. I have a data frame learn with 16 numeric attributes, and WIN is a factor with levels 0 and 1. Answer 1: My initial reaction to this question was that it didn't show much research effort, since "everyone" knows that random forests don't handle missing values in predictors. But

How to extract the decision rules of a random forest in Python

元气小坏坏 · Submitted on 2019-11-27 08:36:42
Question: I heard from someone that in R you can use extra packages to extract the decision rules implemented in a random forest. I tried to google the same thing for Python, but without luck. Any help on how to achieve this would be appreciated, thanks in advance! Answer 1: Assuming that you use sklearn's RandomForestClassifier, you can find the individual decision trees as .estimators_. Each tree stores the decision nodes as a number of NumPy arrays under tree_. Here is some example code which just
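A short sketch of the route the answer points at (data and names are mine): iterate over `estimators_` and render each tree's rules; `sklearn.tree.export_text` prints them as readable if/else conditions without touching the raw `tree_` arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
rf = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0).fit(X, y)

# Each fitted DecisionTreeClassifier lives in rf.estimators_;
# export_text renders its split rules as text
for i, tree in enumerate(rf.estimators_):
    print(f"--- tree {i} ---")
    print(export_text(tree))
```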

Suggestions for speeding up Random Forests

落爺英雄遲暮 · Submitted on 2019-11-27 06:38:22
I'm doing some work with the randomForest package and while it works well, it can be time-consuming. Anyone have any suggestions for speeding things up? I'm using a Windows 7 box with a dual-core AMD chip. I know R is not multi-threaded, but I was curious whether any of the parallel packages ( rmpi , snow , snowfall , etc.) work with randomForest. Thanks. EDIT: I'm using randomForest for some classification work (0's and 1's). The data has about 8-12 variable columns and the training set is a sample of 10k lines, so it's a decent size but not crazy. I'm running 500 trees and an mtry of 2,
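The question is about R, where the standard trick is to grow sub-forests on separate workers and merge them with randomForest's combine(). The same idea is built into scikit-learn, for comparison (data and names are mine): trees are independent, so `n_jobs` grows them in parallel across cores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# n_jobs=-1 uses all available cores; each tree is grown independently
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X, y)
print(len(rf.estimators_))  # 100
```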

Can sklearn random forest directly handle categorical features?

倾然丶 夕夏残阳落幕 · Submitted on 2019-11-27 05:14:29
Question: Say I have a categorical feature, color, which takes the values ['red', 'blue', 'green', 'orange'], and I want to use it to predict something in a random forest. If I one-hot encode it (i.e. change it into four dummy variables), how do I tell sklearn that the four dummy variables are really one variable? Specifically, when sklearn randomly selects features to use at different nodes, it should either include the red, blue, green and orange dummies together, or it shouldn't include any of
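There is no way to group dummy columns in sklearn's trees; each dummy is treated as an independent feature. One commonly suggested workaround, sketched here with made-up data, is an integer encoding, which keeps the column as a single feature at the cost of imposing an arbitrary order on the categories:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

colors = np.array([["red"], ["blue"], ["green"], ["orange"], ["red"], ["blue"]])
y = np.array([1, 0, 0, 1, 1, 0])

# Integer-encode the category: the forest then sees ONE column, not four dummies
X = OrdinalEncoder().fit_transform(colors)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(rf.n_features_in_)  # 1
```

Trees can often recover useful splits from such an encoding, but the induced ordering ('blue' < 'green' < ...) is arbitrary, which is exactly the trade-off the question is circling.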

Print the decision path of a specific sample in a random forest classifier

魔方 西西 · Submitted on 2019-11-27 02:56:02
Question: How do I print the decision path of a random forest for a specific sample, rather than the paths of its individual trees? import numpy as np import pandas as pd from sklearn.datasets import make_classification from sklearn.ensemble import RandomForestClassifier X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, n_classes=2, random_state=0, shuffle=False) # Creating a dataFrame df = pd.DataFrame({'Feature 1':X[:,0], 'Feature 2':X[:,1], 'Feature 3':X[:,2],
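A sketch of the forest-level API involved (a smaller forest than the question's, names are mine): `RandomForestClassifier.decision_path` returns one sparse indicator row per sample over the stacked nodes of all trees, plus a pointer array marking where each tree's nodes begin.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=6, random_state=0)
rf = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0).fit(X, y)

sample = X[:1]
indicator, n_nodes_ptr = rf.decision_path(sample)

# indicator: (n_samples, total_nodes) sparse matrix; a 1 means the sample
# visited that node. n_nodes_ptr[t]:n_nodes_ptr[t+1] is tree t's node slice.
for t in range(rf.n_estimators):
    start, stop = n_nodes_ptr[t], n_nodes_ptr[t + 1]
    node_ids = indicator[0, start:stop].nonzero()[1]
    print(f"tree {t} visits nodes {node_ids.tolist()}")
```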

parRF on caret not working for more than one core

感情迁移 · Submitted on 2019-11-27 02:54:44
Question: parRF from the caret R package is not working for me with more than one core, which is quite ironic, given that the par in parRF stands for parallel. I'm on a Windows machine, if that is a relevant piece of information. I checked that I'm using the latest and greatest versions of caret and doParallel. I made a minimal example and give the results below. Any ideas? Source code: library(caret) library(doParallel) trCtrl <- trainControl( method = "repeatedcv" , number = 2 , repeats = 5 ,