random-forest

Predict using randomForest package in R

Submitted by 懵懂的女人 on 2019-12-20 04:17:03
Question: How can I use the result of a randomForest call in R to predict labels on some unlabeled data (e.g. real-world input to be classified)? Code:

```r
train_data = read.csv("train.csv")
input_data = read.csv("input.csv")
result_forest = randomForest(Label ~ ., data = train_data)
labeled_input = result_forest.predict(input_data)  # I need something like this
```

train.csv:

```
a;b;c;label;
1;1;1;a;
2;2;2;b;
1;2;1;c;
```

input.csv:

```
a;b;c;
1;1;1;
2;1;2;
```

I need to get something like this:

```
a;b;c;label;
1;1;1;a;
2;1;2;b;
```

Answer 1:
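The last line of the question's code is not valid R: `predict()` is a standalone generic function, not a method called on the fitted object. A minimal sketch (assuming the semicolon-separated files above, ignoring their trailing semicolons, and a factor column named `label`):

```r
library(randomForest)

# read.csv2 reads semicolon-separated files; the target column must be a factor
train_data <- read.csv2("train.csv", stringsAsFactors = TRUE)
input_data <- read.csv2("input.csv")

result_forest <- randomForest(label ~ ., data = train_data)

# predict() is the generic; pass the fitted forest and the new data to it
input_data$label <- predict(result_forest, newdata = input_data)
```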

How to use string variables in VectorAssembler in Pyspark

Submitted by 大兔子大兔子 on 2019-12-20 03:19:43
Question: I want to run the Random Forests algorithm on Pyspark. The Pyspark documentation mentions that VectorAssembler accepts only numerical or boolean datatypes. So, if my data contains StringType variables, say names of cities, should I one-hot encode them in order to proceed further with Random Forests classification/regression? Here is the code I have been trying (the input file is here):

```python
train = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('filename')
drop
```
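A common pattern for this (a sketch, not taken from the original answer) is to run StringIndexer, then OneHotEncoder, and only then VectorAssembler. The column names here are assumptions, and the `OneHotEncoder(inputCols=..., outputCols=...)` form assumes Spark 3; on Spark 2.3+ the equivalent class was `OneHotEncoderEstimator`:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Map the string column to category indices, then one-hot encode the indices
indexer = StringIndexer(inputCol="city", outputCol="city_index")
encoder = OneHotEncoder(inputCols=["city_index"], outputCols=["city_vec"])

# Assemble the numeric columns plus the encoded vector into one feature vector
assembler = VectorAssembler(inputCols=["age", "income", "city_vec"],
                            outputCol="features")

pipeline = Pipeline(stages=[indexer, encoder, assembler])
model = pipeline.fit(train)
train_features = model.transform(train)
```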

maxCategories not working as expected in VectorIndexer when using RandomForestClassifier in pyspark.ml

Submitted by 亡梦爱人 on 2019-12-19 19:50:06
Question: Background: I'm doing a simple binary classification using RandomForestClassifier from pyspark.ml. Before feeding the data to training, I used VectorIndexer to decide whether features should be treated as numerical or categorical by providing the maxCategories argument. Problem: Even though I used VectorIndexer with maxCategories set to 30, I was still getting an error during the training pipeline:

```
An error occurred while calling o15371.fit.
: java.lang.IllegalArgumentException:
```
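As background on the two knobs involved (a sketch under assumed column names, not the original answer): VectorIndexer's `maxCategories` only decides which features are *flagged* as categorical, while the tree learner has its own `maxBins` limit, and an `IllegalArgumentException` at fit time is often the result of `maxBins` being smaller than the largest category count:

```python
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.classification import RandomForestClassifier

# Features with more than maxCategories distinct values are treated as continuous
indexer = VectorIndexer(inputCol="features", outputCol="indexed",
                        maxCategories=30)

# maxBins must be at least as large as the number of categories in any
# categorical feature, otherwise fit() raises IllegalArgumentException
rf = RandomForestClassifier(featuresCol="indexed", labelCol="label",
                            maxBins=64)
```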

Improve h2o DRF runtime on a multi-node cluster

Submitted by 大城市里の小女人 on 2019-12-19 10:51:07
Question: I am currently running h2o's DRF algorithm on a 3-node EC2 cluster (the h2o server spans all 3 nodes). My data set has 1M rows and 41 columns (40 predictors and 1 response). I use the R bindings to control the cluster, and the RF call is as follows:

```r
model = h2o.randomForest(x = x, y = y,
                         ignore_const_cols = TRUE,
                         training_frame = train_data,
                         seed = 1234,
                         mtries = 7,
                         ntrees = 2000,
                         max_depth = 15,
                         min_rows = 50,
                         stopping_rounds = 3,
                         stopping_metric = "MSE",
                         stopping_tolerance = 2e-5)
```

For the 3-node cluster (c4
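Before tuning the model call itself, it is worth confirming that the R session is actually attached to the multi-node cluster rather than a single local JVM. A minimal sketch (the IP address is a placeholder):

```r
library(h2o)

# Connect to the existing 3-node cluster instead of starting a local instance
h2o.init(ip = "10.0.0.1", port = 54321, startH2O = FALSE)

# Should report all three nodes and their memory/CPU allocation
h2o.clusterInfo()
```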

How to calculate the OOB of random forest?

Submitted by 女生的网名这么多〃 on 2019-12-19 04:56:23
Question: I am comparing some models to find the best one. Now I want to get the OOB error of a random forest model to compare it with the cross-validation errors of some other models. Can I make this comparison? If I can, how can I get the OOB error in R code?

Answer 1: To get the OOB error of a random forest model in R you can:

```r
library(randomForest)
set.seed(1)
model <- randomForest(Species ~ ., data = iris)
```

The OOB error is in:

```r
model$err.rate[, 1]
```

where the i-th element is the (OOB) error rate for all trees up to the i-th.
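If a single number is wanted for comparison against cross-validation errors, a sketch continuing the answer's code is to take the last element of that vector, i.e. the OOB error of the complete forest:

```r
library(randomForest)
set.seed(1)
model <- randomForest(Species ~ ., data = iris)

# OOB error rate of the full forest (all ntree trees)
oob_error <- tail(model$err.rate[, "OOB"], 1)
```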

Difference of prediction results in random forest model

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-19 04:55:16
Question: I have built a random forest model, and I got two different prediction results when I wrote two different lines of code to generate the prediction. I wonder which one is the right one. Here is my example dataframe and the code used:

```r
dat <- read.table(text = "
cats birds wolfs snakes
0 3 9 7
1 3 8 4
1 1 2 8
0 1 2 3
0 1 8 3
1 6 1 2
0 6 7 1
1 6 1 5
0 5 9 7
1 3 8 7
1 4 2 7
0 1 2 3
0 7 6 3
1 6 1 1
0 6 3 9
1 6 1 1
", header = TRUE)
```

I've built a random forest model:

```r
model <- randomForest(snakes
```
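The difference the question is likely seeing (stated here as a general property of the randomForest package, not as the original answer): `predict(model)` with no `newdata` returns *out-of-bag* predictions for the training rows, while `predict(model, newdata = dat)` lets every tree vote on every row, so the two results can disagree on training data. A sketch, with the model formula assumed since the question's call is cut off:

```r
library(randomForest)
set.seed(1)
# Hypothetical target: the question's call is truncated after "snakes"
model <- randomForest(factor(cats) ~ birds + wolfs + snakes, data = dat)

p1 <- predict(model)                 # out-of-bag predictions (each tree only
                                     # votes on rows it did not train on)
p2 <- predict(model, newdata = dat)  # all trees vote on every row
```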

Implementing custom stopping metrics to optimize during training in H2O model directly from R

Submitted by 好久不见. on 2019-12-18 16:59:35
Question: I'm trying to implement the FBeta_Score() of the MLmetrics R package:

```r
FBeta_Score <- function(y_true, y_pred, positive = NULL, beta = 1) {
  Confusion_DF <- ConfusionDF(y_pred, y_true)
  if (is.null(positive) == TRUE)
    positive <- as.character(Confusion_DF[1, 1])
  Precision <- Precision(y_true, y_pred, positive)
  Recall <- Recall(y_true, y_pred, positive)
  Fbeta_Score <- (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
  return(Fbeta_Score)
}
```

in the H2O distributed random forest model
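For reference, the formula in that function can be checked numerically. A small Python sketch with made-up labels (positive class "b", beta = 1, so this reduces to the ordinary F1 score):

```python
# Numeric check of the F-beta formula used in the R function above
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]

tp = sum(t == p == "b" for t, p in zip(y_true, y_pred))          # true positives
fp = sum(t != "b" and p == "b" for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == "b" and p != "b" for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
beta = 1
fbeta = (1 + beta**2) * (precision * recall) / (beta**2 * precision + recall)
print(fbeta)  # 0.8
```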

What does the parameter 'classwt' in the randomForest function of the randomForest package in R stand for?

Submitted by 岁酱吖の on 2019-12-18 11:45:34
Question: The help page for randomForest::randomForest() says: "classwt - Priors of the classes. Need not add up to one. Ignored for regression." Could setting the classwt parameter help when you have heavily unbalanced data, i.e. when the class priors differ strongly? How should I set classwt when training a model on a dataset with 3 classes whose priors form a vector (p1, p2, p3), while in the test set the priors are (q1, q2, q3)?

Answer 1: Could setting the classwt parameter help when you have heavily unbalanced data -
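Mechanically, classwt takes one numeric weight per class, and, as the help page quoted above says, the values need not sum to one. A minimal sketch on iris (the weights here are illustrative, not a recommendation):

```r
library(randomForest)

# One weight per class, in the order of the factor levels of the response
model <- randomForest(Species ~ ., data = iris,
                      classwt = c(1, 2, 3))
```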

Do I need to normalize (or scale) data for randomForest (R package)?

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-18 10:03:37
Question: I am doing a regression task. Do I need to normalize (or scale) the data for randomForest (the R package)? Is it also necessary to scale the target values? If so, I want to use the scale function from the caret package, but I did not find how to get the data back (descale, denormalize). Do you know of some other function (in any package) that is helpful for normalization/denormalization? Thanks, Milan

Answer 1: No, scaling is not necessary for random forests. The nature of RF is such that convergence and
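On the "how to get the data back" part of the question: base R's scale() stores the constants it used as attributes on its result, so inverting it needs no extra package. A sketch:

```r
x <- c(10, 20, 30, 40)
s <- scale(x)  # centers by mean(x) and divides by sd(x)

# The centering and scaling constants are kept as attributes of the result
center <- attr(s, "scaled:center")
spread <- attr(s, "scaled:scale")

# Invert the transformation to recover the original values
original <- s * spread + center
```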

Random forest package in R shows error during prediction() if there are new factor levels present in test data. Is there any way to avoid this error?

Submitted by 不打扰是莪最后的温柔 on 2019-12-18 04:22:07
Question: I have 30 factor levels for a predictor in my training data. I again have 30 factor levels of the same predictor in my test data, but some of the levels are different, and randomForest does not predict unless the levels match exactly. It shows an error:

```
Error in predict.randomForest(model, test)
New factor levels not present in the training data
```

Answer 1: One workaround I've found is to first convert the factor variables in your train and test sets into characters:

```r
test$factor <- as.character(test
```
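The answer's snippet is cut off; the usual shape of this workaround (the `train`, `test`, and column names are assumptions) is to round-trip through character and re-create the test factor using the training levels:

```r
# Re-factor the test column with the training levels; rows whose level was
# never seen in training become NA instead of raising an error at predict time
test$factor <- factor(as.character(test$factor),
                      levels = levels(train$factor))
```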