random-forest | 易学教程

How do I solve overfitting in random forest of Python sklearn?


Question: I am using the RandomForestClassifier from the Python sklearn package to build a binary classification model. Below are the cross-validation results:
Fold 1: Train: 164, Test: 40, Train Accuracy: 0.914634146341, Test Accuracy: 0.55
Fold 2: Train: 163, Test: 41, Train Accuracy: 0.871165644172, Test Accuracy: 0.707317073171
Fold 3: Train: 163, Test: 41, Train Accuracy: 0.889570552147, Test Accuracy: 0.585365853659
Fold 4: Train: 163, Test: 41, Train Accuracy: 0.871165644172, Test Accuracy: 0

How to improve randomForest performance?


Question: I have a training set of size 38 MB (12 attributes, 420000 rows). I am running the R snippet below to train a model with randomForest, and it takes hours. rf.model <- randomForest( Weekly_Sales~., data=newdata, keep.forest=TRUE, importance=TRUE, ntree=200, do.trace=TRUE, na.action=na.roughfix ) I think it takes so long to execute because of na.roughfix: there are many NAs in the training set. Could someone let me know how I can improve the performance? My system

How to use random forests in R with missing values?


library(randomForest) rf.model <- randomForest(WIN ~ ., data = learn) I would like to fit a random forest model, but I get this error: Error in na.fail.default(list(WIN = c(2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, : missing values in object I have a data frame learn with 16 numeric attributes, and WIN is a factor with levels 0 and 1.
My initial reaction to this question was that it didn't show much research effort, since "everyone" knows that random forests don't handle missing values in predictors. But upon checking ?randomForest I must confess that it could be much more explicit about this. (Although,

How to get the probability per instance in classifications models in spark.mllib


I'm using spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD} and spark.mllib.tree.RandomForest for classification. With these packages I produce classification models, but the models only predict a specific class per instance. In Weka, we can get the exact probability of each instance belonging to each class. How can we do this with these packages? In LogisticRegressionModel we can set the threshold, so I've created a function that checks the results for each point at different thresholds. But this cannot be done for RandomForest (see How to set cutoff while training

How to extract the decision rules of a random forest in Python


I have one question though. I heard from someone that in R you can use extra packages to extract the decision rules implemented in a random forest. I tried to google the same thing for Python, but without luck. Is there any help on how to achieve that? Thanks in advance!
Assuming that you use sklearn RandomForestClassifier, you can find the individual decision trees as .estimators_ . Each tree stores the decision nodes as a number of NumPy arrays under tree_ . Here is some example code which just prints each node in order of the array. In a typical application one would instead traverse by following the
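The example code mentioned in the answer is cut off in this excerpt. The following is a minimal sketch of the idea, not the original answer's code: it fits a small forest on synthetic data (the dataset, `n_estimators=3`, `max_depth=2` and the helper name `describe_nodes` are assumptions made for illustration) and prints every node of every tree in array order, using the `children_left`, `children_right`, `feature`, `threshold` and `value` arrays of `tree_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and small forest, chosen only to keep the output short.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0).fit(X, y)


def describe_nodes(estimator):
    """Return one text line per node of a fitted decision tree, in array order."""
    tree = estimator.tree_
    lines = []
    for node_id in range(tree.node_count):
        left = tree.children_left[node_id]
        right = tree.children_right[node_id]
        if left == right:
            # Leaves have both child pointers set to -1.
            value = tree.value[node_id].ravel().tolist()
            lines.append(f"node {node_id}: leaf, value={value}")
        else:
            feature = tree.feature[node_id]
            threshold = tree.threshold[node_id]
            lines.append(
                f"node {node_id}: if x[{feature}] <= {threshold:.3f} "
                f"go to node {left} else go to node {right}"
            )
    return lines


for i, estimator in enumerate(forest.estimators_):
    print(f"tree {i}")
    for line in describe_nodes(estimator):
        print("  " + line)
```

To turn this listing into readable rules, one would start at node 0 and follow the child pointers recursively instead of iterating over the arrays.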

R - Random Forest and more than 53 categories


Question: I know, RandomForest is not able to handle more than 53 categories. Sadly, I have to analyze data in which one column has 165 levels, and I want to use RandomForest for classification. My problem is that I cannot remove this column, since this predictor is really important and known to be valuable. The predictor is a factor with 165 levels. Are there any tips on how I can handle this? Since we are talking about film genre, I have no idea. Are there alternative packages for big data? A

scikit-learn GridSearchCV n_jobs != 1 freezing


Question: I'm running a grid search on random forests and trying to use an n_jobs value other than one, but the kernel freezes and there is no CPU usage. With n_jobs=1 it works fine. I can't even stop the command with Ctrl-C and have to restart the kernel. I'm running on Windows 7. I saw that there is a similar problem on OS X, but that solution is not relevant for Windows 7. from sklearn.ensemble import RandomForestClassifier rf_tfdidf = Pipeline([('vect',tfidf), ('clf', RandomForestClassifier(n_estimators=50,

Print the decision path of a specific sample in a random forest classifier


How can I print the decision path of a random forest for a specific sample, rather than the paths of the individual trees in the forest? import numpy as np import pandas as pd from sklearn.datasets import make_classification from sklearn.ensemble import RandomForestClassifier X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, n_classes=2, random_state=0, shuffle=False) # Creating a dataFrame df = pd.DataFrame({'Feature 1':X[:,0], 'Feature 2':X[:,1], 'Feature 3':X[:,2], 'Feature 4':X[:,3], 'Feature 5':X[:,4], 'Feature 6':X[:,5], 'Class':y}) y_train = df['Class'] X_train = df
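As a hedged illustration of what sklearn offers here: a forest has no single path for a sample, but `RandomForestClassifier.decision_path` returns one combined node-indicator matrix together with the column offsets of each tree, so the per-tree paths can be collected in one call. The sketch below reuses the `make_classification` settings from the question; the forest size (`n_estimators=5`, `max_depth=3`) and the helper name `forest_paths` are assumptions made for illustration, not part of the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same synthetic data as in the question.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)
# Small, shallow forest (an assumption) so the printed paths stay short.
forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0).fit(X, y)


def forest_paths(model, sample):
    """Return, for each tree, the list of node ids that `sample` passes through."""
    # indicator has one row per sample and one column per node of every tree;
    # n_nodes_ptr[t]:n_nodes_ptr[t + 1] are the columns belonging to tree t.
    indicator, n_nodes_ptr = model.decision_path(sample.reshape(1, -1))
    row = indicator.toarray().ravel()
    paths = []
    for t in range(len(model.estimators_)):
        start, stop = n_nodes_ptr[t], n_nodes_ptr[t + 1]
        paths.append(np.flatnonzero(row[start:stop]).tolist())
    return paths


paths = forest_paths(forest, X[0])
for t, path in enumerate(paths):
    print(f"tree {t}: " + " -> ".join(str(node) for node in path))
```

Each printed path starts at the root (node 0) and ends at the leaf that `forest.apply` reports for that tree; the split feature and threshold at each node can be looked up in the corresponding `estimators_[t].tree_`.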

parRF on caret not working for more than one core


parRF from the caret R package is not working for me with more than one core, which is quite ironic, given that the par in parRF stands for parallel. I'm on a Windows machine, if that is relevant. I checked that I'm using the latest and greatest versions of caret and doParallel. I made a minimal example and give the results below. Any ideas? Source code library(caret) library(doParallel) trCtrl <- trainControl( method = "repeatedcv" , number = 2 , repeats = 5 , allowParallel = TRUE ) # WORKS registerDoParallel(1) train(form = Species~., data=iris, trControl = trCtrl, method=

How to cross validate RandomForest model?


I want to evaluate a random forest trained on some data. Is there a utility in Apache Spark for this, or do I have to perform cross-validation manually?
Answer (zero323): ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows: import org.apache.spark.ml.Pipeline import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator} import org.apache.spark.ml.classification.RandomForestClassifier import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator