random-forest

Spark RandomForest training StackOverflow error

风格不统一 posted on 2020-01-02 09:04:15
Question: I am training my model and I get a StackOverflowError whenever I increase maxDepth over 12. Everything works correctly for 5, 10, and 11. I am using Spark 2.0.2 (and I cannot upgrade it for the next couple of weeks). I have more than 3M data points, 200 features, and 2500 trees, and I would like to improve the accuracy by increasing the max depth. Is there a way to overcome this problem? Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in

How does sklearn random forest index feature_importances_

和自甴很熟 posted on 2020-01-01 05:17:08
Question: I have used the RandomForestClassifier in sklearn to determine the important features in my dataset. How can I return the actual feature names (my variables are labeled x1, x2, x3, etc.) rather than their relative positions (it tells me the important features are '12', '22', etc.)? Below is the code that I am currently using to return the important features.

    important_features = []
    for x, i in enumerate(rf.feature_importances_):
        if i > np.average(rf.feature_importances_):
            important_features
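
A minimal sketch of one way to recover the names, assuming the list of names is in the same order as the columns the model was trained on (sklearn's feature_importances_ follows the column order of the training matrix). The helper name names_above_average and the example values are hypothetical and are not taken from the original question.

```python
from statistics import mean


def names_above_average(importances, feature_names):
    """Return (name, importance) pairs whose importance exceeds the mean, highest first."""
    values = [float(v) for v in importances]
    if len(values) != len(feature_names):
        raise ValueError("importances and feature_names must have the same length")
    threshold = mean(values)
    # Pair each importance with its column name instead of its positional index.
    selected = [(name, imp) for name, imp in zip(feature_names, values) if imp > threshold]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Illustrative values only. With a fitted model one would pass
# rf.feature_importances_ and the training column names (e.g. list(X.columns)).
example_importances = [0.05, 0.40, 0.10, 0.45]
example_names = ["x1", "x2", "x3", "x4"]
top_features = names_above_average(example_importances, example_names)
```

The key design choice is to zip names with importances rather than collect the loop index, so the result stays readable even if columns are later reordered.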

Spark 1.5.1, MLLib Random Forest Probability

北战南征 posted on 2019-12-30 07:12:22
Question: I am using Spark 1.5.1 with MLlib. I built a random forest model using MLlib and now use the model to make predictions. I can find the predicted category (0.0 or 1.0) using the .predict function. However, I can't find the function to retrieve the probability (see the attached screenshot). I thought Spark 1.5.1 random forest would provide the probability; am I missing anything here? Answer 1: Unfortunately, the feature is not available in the older Spark MLlib 1.5.1. You can, however, find it in the recent

How to set seed for random simulations with foreach and doMC packages?

一个人想着一个人 posted on 2019-12-30 00:25:28
Question: I need to do some simulations, and for debugging purposes I want to use set.seed to get the same result. Here is an example of what I am trying to do:

    library(foreach)
    library(doMC)
    registerDoMC(2)
    set.seed(123)
    a <- foreach(i=1:2,.combine=cbind) %dopar% {rnorm(5)}
    set.seed(123)
    b <- foreach(i=1:2,.combine=cbind) %dopar% {rnorm(5)}

Objects a and b should be identical, i.e. sum(abs(a-b)) should be zero, but this is not the case. Am I doing something wrong, or have I stumbled onto some feature

setting values for ntree and mtry for random forest regression model

耗尽温柔 posted on 2019-12-29 10:28:32
Question: I'm using the R package randomForest to do a regression on some biological data. My training data size is 38772 x 201. I just wondered, what would be a good value for the number of trees, ntree, and the number of variables tried at each split, mtry? Is there an approximate formula to find such parameter values? Each row in my input data is a 200-character string representing the amino acid sequence, and I want to build a regression model that uses such sequences to predict the distances between the proteins.
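
As a starting point rather than a tuned answer, the randomForest package's own defaults can be written down: mtry defaults to max(floor(p/3), 1) for regression and floor(sqrt(p)) for classification, where p is the number of predictors, and ntree defaults to 500. The sketch below computes those defaults in Python. The function name default_mtry is hypothetical, and p = 200 is an assumption (201 columns minus one response column).

```python
import math


def default_mtry(num_predictors, regression=True):
    """Default mtry used by R's randomForest for a given number of predictors."""
    if num_predictors < 1:
        raise ValueError("num_predictors must be at least 1")
    if regression:
        # Regression default: one third of the predictors, but never fewer than 1.
        return max(num_predictors // 3, 1)
    # Classification default: floor of the square root of the predictor count.
    return max(math.isqrt(num_predictors), 1)


# Assumption: 200 predictors, i.e. the 201 columns minus the response.
mtry_regression = default_mtry(200, regression=True)
mtry_classification = default_mtry(200, regression=False)
```

These values are only defaults. Trying a few mtry values around them and comparing out-of-bag error is a common way to choose between them.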

Unbalanced classification using RandomForestClassifier in sklearn

天涯浪子 posted on 2019-12-29 02:43:32
Question: I have a dataset where the classes are unbalanced. The classes are either '1' or '0', and the ratio of class '1' to class '0' is 5:1. How do you calculate the prediction error for each class and then rebalance the weights accordingly in sklearn with Random Forest, along the lines of the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance Answer 1: You can pass the sample_weight argument to the Random Forest fit method: sample_weight : array-like, shape = [n_samples] or None Sample
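
A minimal sketch of how per-sample weights and per-class error could be computed, assuming the common "balanced" heuristic n_samples / (n_classes * class_count), which is also what sklearn applies for class_weight="balanced". The function names and the toy labels below are hypothetical. The resulting list could then be passed as fit(X, y, sample_weight=weights).

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Return (per-sample weights, per-class weights) using n / (k * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    class_weight = {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
    return [class_weight[y] for y in labels], class_weight


def per_class_error(y_true, y_pred):
    """Fraction of misclassified samples within each true class."""
    totals = Counter(y_true)
    wrong = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {cls: wrong[cls] / totals[cls] for cls in totals}


# Toy labels with the 5:1 ratio from the question (illustrative only).
labels = [1, 1, 1, 1, 1, 0]
weights, class_weight = balanced_sample_weights(labels)

# A classifier that always predicts the majority class fails every minority sample.
errors = per_class_error(labels, [1, 1, 1, 1, 1, 1])
```

With these weights, each class contributes the same total weight to training, so the minority class is no longer dominated by the majority class.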

How to get around randomForest Error in R (Predictors in new data do not match)

北战南征 posted on 2019-12-25 18:51:34
Question: I am having a hard time troubleshooting the error message below. I am trying to fit a random forest model on a Titanic data set. Is there a way to get around this error? Is there code to check the levels in the tree? Error in predict.randomForest(my_rf_model, test1) : Type of predictors in new data do not match that of the training data. Answer 1: This is probably occurring because one of the predictor variables in test1 is a factor variable that has a value not present in the original data set.
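
The check described in the answer can be sketched as follows. This is written in Python purely to illustrate the logic; in R the same comparison would be made on the factor levels of each column. The function name unseen_levels, the column names, and the values are all hypothetical.

```python
def unseen_levels(train_columns, new_columns):
    """Map each shared column to the values that appear only in the new data."""
    report = {}
    for name, train_values in train_columns.items():
        if name not in new_columns:
            continue
        # Values present in the new data but never seen during training.
        extra = set(new_columns[name]) - set(train_values)
        if extra:
            report[name] = sorted(extra)
    return report


# Illustrative categorical columns (not taken from the original question).
train = {"Embarked": ["S", "C", "Q", "S"], "Sex": ["male", "female"]}
test1 = {"Embarked": ["S", "C", ""], "Sex": ["female", "male"]}
mismatches = unseen_levels(train, test1)
```

Any column listed in the result has values the model never saw, which is the situation the answer identifies as the likely cause of the error.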

How would you interpret an ensemble tree model?

橙三吉。 posted on 2019-12-25 18:21:48
Question: In machine learning, ensemble tree models such as random forest are common. These models consist of an ensemble of so-called decision tree models. How can we analyse, however, what those models have specifically learned? Answer 1: You cannot, at least not in the sense that you can simply plot a single decision tree. Only extremely simple models can be easily investigated. More complex methods require more complex tools, which are just approximations, general ideas of what to look for. So for ensembles you can try

TypeError when training Tensorflow Random Forest using TensorForestEstimator

[亡魂溺海] posted on 2019-12-25 09:29:26
Question: I get a TypeError when attempting to train a TensorFlow Random Forest using TensorForestEstimator. TypeError: Input 'input_data' of 'CountExtremelyRandomStats' Op has type float64 that does not match expected type of float32. I've tried using Python 2.7 and Python 3, and I've tried using tf.cast() to put everything in float32, but it doesn't help. I have checked the data type on execution and it's float32. The problem doesn't seem to be the data I provide (a csv of all floats), so I'm not sure