random-forest

How do I solve overfitting in random forest of Python sklearn?

你离开我真会死。 submitted on 2019-11-28 16:01:59
Question: I am using the RandomForestClassifier implemented in the Python sklearn package to build a binary classification model. Below are the cross-validation results:

Fold 1: Train: 164 Test: 40 Train Accuracy: 0.914634146341 Test Accuracy: 0.55
Fold 2: Train: 163 Test: 41 Train Accuracy: 0.871165644172 Test Accuracy: 0.707317073171
Fold 3: Train: 163 Test: 41 Train Accuracy: 0.889570552147 Test Accuracy: 0.585365853659
Fold 4: Train: 163 Test: 41 Train Accuracy: 0.871165644172 Test Accuracy: 0
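A gap this wide between training accuracy (roughly 0.87 to 0.91) and test accuracy (0.55 to 0.71 in the complete folds) is the usual sign of overfitting. One common response is to limit how complex each tree can get and then compare train and test accuracy under cross-validation. The sketch below does that on synthetic data. The data from make_classification, the helper name train_test_accuracy, and the values chosen for max_depth and min_samples_leaf are assumptions for illustration, not values taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for a small binary dataset (assumption: 204 rows, 20 features).
X, y = make_classification(n_samples=204, n_features=20, n_informative=5,
                           random_state=0)


def train_test_accuracy(**params):
    """Return (mean train accuracy, mean test accuracy) over 5-fold CV."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0, **params)
    res = cross_validate(clf, X, y, cv=5, return_train_score=True)
    return res["train_score"].mean(), res["test_score"].mean()


# Default forest: trees grow until the leaves are pure.
unconstrained = train_test_accuracy()

# Shallower trees with larger leaves (illustrative values, to be tuned).
constrained = train_test_accuracy(max_depth=3, min_samples_leaf=5,
                                  max_features="sqrt")

print("unconstrained train/test: %.3f / %.3f" % unconstrained)
print("constrained   train/test: %.3f / %.3f" % constrained)
```

Whether the constrained forest actually scores better on held-out folds depends on the data. With only about 200 rows, the per-fold test accuracy is also noisy, so repeated or stratified cross-validation may give a steadier picture.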

How to improve randomForest performance?

核能气质少年 submitted on 2019-11-28 15:41:22
Question: I have a training set of size 38 MB (12 attributes, 420000 rows). I am running the R snippet below to train a model using randomForest, and it is taking hours:

    rf.model <- randomForest(
      Weekly_Sales ~ .,
      data = newdata,
      keep.forest = TRUE,
      importance = TRUE,
      ntree = 200,
      do.trace = TRUE,
      na.action = na.roughfix
    )

I think na.roughfix is the reason it takes so long to execute, since there are so many NAs in the training set. Could someone let me know how I can improve the performance? My system

How to use random forests in R with missing values?

╄→尐↘猪︶ㄣ submitted on 2019-11-28 15:20:40
    library(randomForest)
    rf.model <- randomForest(WIN ~ ., data = learn)

I would like to fit a random forest model, but I get this error:

    Error in na.fail.default(list(WIN = c(2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, : missing values in object

I have a data frame learn with 16 numeric attributes, and WIN is a factor with levels 0 and 1.

My initial reaction to this question was that it didn't show much research effort, since "everyone" knows that random forests don't handle missing values in predictors. But upon checking ?randomForest I must confess that it could be much more explicit about this. (Although,

How to get the probability per instance in classifications models in spark.mllib

你说的曾经没有我的故事 submitted on 2019-11-28 14:30:57
I'm using spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD} and spark.mllib.tree.RandomForest for classification. With these packages I produce classification models, but the models only predict a specific class per instance. In Weka, we can get the exact probability of each instance belonging to each class. How can we do that with these packages? In LogisticRegressionModel we can set the threshold, so I've created a function that checks the results for each point at different thresholds. But this cannot be done for RandomForest (see How to set cutoff while training

How to extract the decision rules of a random forest in Python

感情迁移 submitted on 2019-11-28 14:25:33
I have one question, though. I heard from someone that in R you can use extra packages to extract the decision rules implemented in an RF. I tried to google the same thing for Python, but without luck. Is there any help on how to achieve that? Thanks in advance!

Assuming that you use sklearn's RandomForestClassifier, you can find the individual decision trees in .estimators_. Each tree stores its decision nodes as a number of NumPy arrays under tree_. Here is some example code which just prints each node in order of the array. In a typical application one would instead traverse by following the
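The traversal idea mentioned in the answer can be sketched as follows. This is not the answer's own code (which is cut off above), only a minimal illustration that walks the children_left and children_right arrays of each tree_ and collects one rule per leaf. The helper name tree_rules, the iris data, and the small forest settings are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def tree_rules(estimator, feature_names, class_labels):
    """Return one (conditions, predicted_label) pair per leaf of a fitted tree."""
    t = estimator.tree_
    rules = []

    def walk(node, conditions):
        # In sklearn's tree arrays a leaf has both children set to -1.
        if t.children_left[node] == t.children_right[node]:
            label = class_labels[int(np.argmax(t.value[node][0]))]
            rules.append((conditions, label))
            return
        name = feature_names[t.feature[node]]
        thr = t.threshold[node]
        # Samples with feature <= threshold go left, the rest go right.
        walk(t.children_left[node], conditions + [f"{name} <= {thr:.3f}"])
        walk(t.children_right[node], conditions + [f"{name} > {thr:.3f}"])

    walk(0, [])  # node 0 is the root
    return rules


iris = load_iris()
# Small, shallow forest so the printed rules stay readable (illustrative values).
forest = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0)
forest.fit(iris.data, iris.target)

for i, est in enumerate(forest.estimators_):
    print(f"Tree {i}")
    for conditions, label in tree_rules(est, iris.feature_names, forest.classes_):
        print("  IF " + " AND ".join(conditions) + f" THEN class {label}")
```

Each tree yields its own rule set; the forest's prediction comes from combining the trees, so there is no single rule list for the ensemble as a whole.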

R - Random Forest and more than 53 categories

自古美人都是妖i submitted on 2019-11-28 12:59:06
Question: I know that RandomForest is not able to handle more than 53 categories. Sadly, I have to analyze data in which one column has 165 levels, and I want to use RandomForest for a classification. My problem is that I cannot remove this column, since the predictor is really important and known to be valuable. It is a factor with 165 levels. Are there any tips on how I can handle this? Since we are talking about film genre I have no idea. Are there alternative packages for big data? A

scikit-learn GridSearchCV n_jobs != 1 freezing

自作多情 submitted on 2019-11-28 12:38:40
Question: I'm running a grid search on random forests and trying to use an n_jobs value other than 1, but the kernel freezes and there is no CPU usage. With n_jobs=1 it works fine. I can't even stop the command with Ctrl-C and have to restart the kernel. I'm running on Windows 7. I saw that there is a similar problem on OS X, but the solution is not relevant for Windows 7.

    from sklearn.ensemble import RandomForestClassifier
    rf_tfdidf = Pipeline([('vect',tfidf), ('clf', RandomForestClassifier(n_estimators=50,

Print the decision path of a specific sample in a random forest classifier

我只是一个虾纸丫 submitted on 2019-11-28 09:28:36
How can I print the decision path of a random forest as a whole, rather than the paths of the individual trees in the forest, for a specific sample?

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, n_classes=2, random_state=0, shuffle=False)

    # Creating a dataFrame
    df = pd.DataFrame({'Feature 1':X[:,0], 'Feature 2':X[:,1], 'Feature 3':X[:,2], 'Feature 4':X[:,3], 'Feature 5':X[:,4], 'Feature 6':X[:,5], 'Class':y})
    y_train = df['Class']
    X_train = df
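For reference, RandomForestClassifier has a decision_path method that returns the paths of all trees at once: a sparse indicator matrix plus an array of column offsets, one block of columns per tree. A forest has no single path of its own, so the closest thing to a forest-level path is the collection of per-tree paths. The sketch below prints them for one sample. The forest size, the depth limit, and the choice of the first row as the sample are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same synthetic data as in the question; the small forest is an assumption.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)
forest = RandomForestClassifier(n_estimators=3, max_depth=3, random_state=0)
forest.fit(X, y)

sample = X[:1]  # decision_path expects a 2-D array, so keep one row
indicator, n_nodes_ptr = forest.decision_path(sample)

paths = []
for i, est in enumerate(forest.estimators_):
    # Columns n_nodes_ptr[i] .. n_nodes_ptr[i + 1] belong to tree i.
    start, stop = int(n_nodes_ptr[i]), int(n_nodes_ptr[i + 1])
    node_ids = np.flatnonzero(indicator[:, start:stop].toarray()[0])
    paths.append(node_ids)

    tree = est.tree_
    print(f"Tree {i}:")
    for node in node_ids:
        if tree.children_left[node] == tree.children_right[node]:
            print(f"  leaf {node}")
        else:
            f, thr = tree.feature[node], tree.threshold[node]
            op = "<=" if sample[0, f] <= thr else ">"
            print(f"  node {node}: Feature {f + 1} = {sample[0, f]:.3f} {op} {thr:.3f}")
```

The node ids are local to each tree, and ascending order corresponds to root-to-leaf order because sklearn assigns child nodes higher ids than their parents.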

parRF on caret not working for more than one core

故事扮演 submitted on 2019-11-28 09:26:54
parRF from the caret R package is not working for me with more than one core, which is quite ironic, given that the par in parRF stands for parallel. I'm on a Windows machine, if that is a relevant piece of information. I checked that I'm using the latest and greatest versions of caret and doParallel. I made a minimal example and give the results below. Any ideas?

Source code

    library(caret)
    library(doParallel)
    trCtrl <- trainControl(
      method = "repeatedcv"
      , number = 2
      , repeats = 5
      , allowParallel = TRUE
    )
    # WORKS
    registerDoParallel(1)
    train(form = Species~., data=iris, trControl = trCtrl, method=

How to cross validate RandomForest model?

房东的猫 submitted on 2019-11-28 08:16:14
I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do this, or do I have to perform cross-validation manually?

zero323

ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator