random-forest

Do I need to normalize (or scale) data for randomForest (R package)?

末鹿安然 submitted on 2019-11-29 19:50:27
Question: I am doing a regression task. Do I need to normalize (or scale) the data for randomForest (the R package)? And is it necessary to also scale the target values? I want to use the scale function from the caret package, but I did not find how to get the data back (descale, denormalize). Do you know of some other function (in any package) that helps with normalization/denormalization? Thanks, Milan

Answer 1: No, scaling is not necessary for random forests. The nature of RF is such that the convergence and numerical precision issues which can sometimes trip up the algorithms used in logistic and linear regression simply do not matter here: each split looks only at the ordering of a single feature, so any monotonic rescaling of the inputs (or of the target) leaves the fitted trees unchanged.
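As for the "descale" half of the question: R's scale() stores the values it used as the attributes "scaled:center" and "scaled:scale", so the transformation can always be inverted from those. To illustrate the same round trip, here is a minimal sketch in Python using scikit-learn's StandardScaler; the data is hypothetical, and since the question is about R/caret, treat this as an analogue rather than the caret API:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    y = np.array([[3.1], [7.4], [2.8], [9.0]])    # hypothetical target values
    scaler = StandardScaler().fit(y)              # learns the mean and std
    y_scaled = scaler.transform(y)                # "normalize"
    y_back = scaler.inverse_transform(y_scaled)   # "denormalize": recovers the originals
    assert np.allclose(y, y_back)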

How to output RandomForest Classifier from python?

ε祈祈猫儿з submitted on 2019-11-29 18:38:24
Question: I have trained a RandomForestClassifier from the Python scikit-learn module on a very big dataset, but the question is: how can I save this model and let other people apply it on their end? Thank you!

Answer 1: The recommended method is to use joblib; this will result in a much smaller file than a pickle:

    from sklearn.externals import joblib
    joblib.dump(clf, 'filename.pkl')
    # then your colleagues can load it
    clf = joblib.load('filename.pkl')

See the online docs.

Answer 2: Have you tried pickling the classifier with Python's built-in pickle module?
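Note that the sklearn.externals.joblib import in the answer above is dated: it was deprecated and later removed from scikit-learn, and joblib is now installed as a standalone package. A self-contained sketch with the current import (the file name and toy data are illustrative):

    import joblib   # standalone package; sklearn.externals.joblib no longer exists in newer releases
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, random_state=0)
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

    joblib.dump(clf, "rf_model.pkl")           # save to disk
    clf_loaded = joblib.load("rf_model.pkl")   # colleagues reload it on their end
    print(clf_loaded.predict(X[:5]))

Pickled models are sensitive to library versions, so whoever loads the file should run a scikit-learn version close to the one used for training.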

R: is there a way to find Inf/-Inf values?

六眼飞鱼酱① submitted on 2019-11-29 17:38:15
Question: I'm trying to run a randomForest on a largish data set (5000 x 300). Unfortunately I'm getting an error message, as follows:

    > RF <- randomForest(prePrior1, postPrior1[,6],
    +                    do.trace=TRUE, importance=TRUE, ntree=100, forest=TRUE)
    Error in randomForest.default(prePrior1, postPrior1[, 6], , do.trace = TRUE, :
      NA/NaN/Inf in foreign function call (arg 1)

So I try to find any NAs using:

    > df2 <- prePrior1[is.na(prePrior1)]
    > df2
    character(0)
    > df2 <- postPrior1[is.na(postPrior1[,6])]
    > df2

Since the NA checks come up empty, the remaining suspects are Inf/-Inf values, hence the question in the title: how do I locate them?
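In base R, is.infinite() answers the title question element-wise (and is.finite() catches NA, NaN, and Inf all at once). For readers who hit the same error with scikit-learn instead, the analogous check with numpy/pandas looks like this; the data frame below is hypothetical:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1.0, np.inf, 3.0],
                       "b": [4.0, 5.0, -np.inf]})           # hypothetical data

    print(df.columns[np.isinf(df).any()].tolist())           # columns containing +/-Inf
    rows_with_inf = df[np.isinf(df).any(axis=1)]             # the offending rows
    df_clean = df.replace([np.inf, -np.inf], np.nan).dropna()  # drop them before fitting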

Python RandomForest - Unknown label Error

て烟熏妆下的殇ゞ submitted on 2019-11-29 16:59:39
Question: I have trouble using the RandomForest fit function. This is my training set:

          P1   Tp1  IrrPOA    Gz  Drz2
    0    0.0   7.7     0.0  -1.4  -0.3
    1    0.0   7.7     0.0  -1.4  -0.3
    2    ...   ...     ...   ...   ...
    3   49.4   7.5     0.0  -1.4  -0.3
    4   47.4   7.5     0.0  -1.4  -0.3
    ... (10k rows)

I want to predict P1 from all the other variables, using sklearn.ensemble's RandomForest:

    colsRes = ['P1']
    X_train = train.drop(colsRes, axis = 1)
    Y_train = pd.DataFrame(train[colsRes])
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X_train, Y_train)

Calling fit is what raises the unknown-label error from the title.
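The excerpt is cut off before any answer, but the error named in the title ("unknown label") is what scikit-learn classifiers raise when the target is continuous: RandomForestClassifier expects discrete class labels, and P1 looks like a continuous measurement. Assuming that is the situation here, a sketch of the usual fix is to switch to the regressor and pass the target as a 1-D series; the tiny frame below just mirrors the rows shown above:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # hypothetical stand-in for the training frame above
    train = pd.DataFrame({"P1":     [0.0, 0.0, 49.4, 47.4],
                          "Tp1":    [7.7, 7.7, 7.5, 7.5],
                          "IrrPOA": [0.0, 0.0, 0.0, 0.0],
                          "Gz":     [-1.4, -1.4, -1.4, -1.4],
                          "Drz2":   [-0.3, -0.3, -0.3, -0.3]})

    X_train = train.drop(columns=["P1"])
    y_train = train["P1"]                         # a 1-D Series, not a one-column DataFrame
    rf = RandomForestRegressor(n_estimators=100)  # regressor: P1 is continuous, not a class label
    rf.fit(X_train, y_train)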

What does the value of 'leaf' in the following xgboost model tree diagram mean?

元气小坏坏 submitted on 2019-11-29 13:01:22
Question: I am guessing that it is a conditional probability, given that the tree-branch conditions above it hold. However, I am not clear on it. If you want to read more about the data used, or about how the diagram is produced, go to: http://machinelearningmastery.com/visualize-gradient-boosting-decision-trees-xgboost-python/

Answer 1: The leaf attribute is the predicted value. In other words, if the evaluation of a tree model ends at that terminal node (aka leaf node), then this is the value that is returned. In a boosted ensemble, the leaf values of all trees are summed to form the raw prediction (the margin); for a binary:logistic objective that sum is then passed through the sigmoid to yield a probability, so an individual leaf value is a log-odds contribution rather than a probability itself.
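To see the "leaf values sum to the margin" relationship concretely, here is a small self-contained sketch (toy data and hypothetical parameters) checking that the sigmoid of the summed raw scores reproduces the predicted probability:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.random((200, 4))
    y = (X[:, 0] > 0.5).astype(int)                   # toy binary problem

    dtrain = xgb.DMatrix(X, label=y)
    bst = xgb.train({"objective": "binary:logistic", "max_depth": 2},
                    dtrain, num_boost_round=3)

    margin = bst.predict(dtrain, output_margin=True)  # summed leaf values across trees
    prob = bst.predict(dtrain)                        # the usual probability output
    assert np.allclose(prob, 1.0 / (1.0 + np.exp(-margin)))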

How to deal with multiple class ROC analysis in R (pROC package)?

只愿长相守 submitted on 2019-11-29 11:51:31
When I use the multiclass.roc function in R (pROC package), for instance after training a random forest on a data set, here is my code:

    # randomForest & pROC packages should be installed:
    # install.packages(c('randomForest', 'pROC'))
    data(iris)
    library(randomForest)
    library(pROC)
    set.seed(1000)
    # 3-class response variable
    rf = randomForest(Species ~ ., data = iris, ntree = 100)
    # predict(.., type = 'prob') returns a probability matrix
    multiclass.roc(iris$Species, predict(rf, iris, type = 'prob'))

And the result is:

    Call:
    multiclass.roc.default(response = iris$Species, predictor = predict(rf, iris, type = 'prob'))
    ...
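The result excerpt is truncated before the reported AUC. For comparison, scikit-learn users can get a similar one-vs-rest multiclass summary; a minimal sketch on the same iris setup (note that, exactly like the R code above, it evaluates on the training data, so the score will be optimistic):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    X, y = load_iris(return_X_y=True)
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    proba = rf.predict_proba(X)                       # (n_samples, 3) probability matrix
    print(roc_auc_score(y, proba, multi_class="ovr")) # one-vs-rest, macro-averaged by default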

Using the predict_proba() function of RandomForestClassifier in the safe and right way

邮差的信 submitted on 2019-11-29 06:09:30
Question: I'm using scikit-learn to apply machine learning algorithms to my datasets. Sometimes I need the probabilities of the labels/classes instead of the labels/classes themselves. Instead of having Spam/Not Spam as the label of an email, I wish to have only, for example: 0.78 probability that a given email is Spam. For that purpose, I'm using predict_proba() with RandomForestClassifier as follows:

    clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                                 min_samples_split=1, random_state=0)
    scores ...
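The main "safe and right way" caveat with predict_proba() is that the columns of the returned matrix are ordered according to clf.classes_, not according to the order labels happen to appear in your data, so the Spam probability should be looked up through that attribute. A minimal sketch with synthetic data in place of the email features:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    proba = clf.predict_proba(X[:1])            # shape (1, n_classes)
    print(dict(zip(clf.classes_, proba[0])))    # columns follow clf.classes_

Also note that probability estimates from very few trees are noisy; n_estimators=10, as in the question, gives much less stable probabilities than a few hundred trees would.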

Random forest package in R shows error during prediction() if there are new factor levels present in test data. Is there any way to avoid this error?

a 夏天 submitted on 2019-11-29 05:10:09
I have 30 factor levels of a predictor in my training data. I again have 30 factor levels of the same predictor in my test data, but some of the levels are different, and randomForest will not predict unless the levels match exactly. It shows this error:

    Error in predict.randomForest(model, test)
    New factor levels not present in the training data

One workaround I've found is to first convert the factor variables in the train and test sets into characters:

    test$factor <- as.character(test$factor)

Then add a column to each with a flag for test/train, i.e.

    test$isTest <- rep(1, nrow(test))
    train$isTest <- rep(0, nrow(train))

so that the two sets can be combined, re-factored with the union of all levels, and split apart again using the flag.
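scikit-learn users hit the same problem in a different guise (a categorical level that appears only in the test set), and the common remedy there is one-hot encoding with handle_unknown="ignore", which silently encodes unseen levels as all zeros. A sketch with made-up data; this is an analogue, not the randomForest fix above:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    train = pd.DataFrame({"city": ["a", "b", "c", "a"], "x": [1.0, 2.0, 3.0, 4.0]})
    y = [0, 1, 0, 1]
    test = pd.DataFrame({"city": ["a", "d"], "x": [1.5, 2.5]})  # "d" never seen in training

    pre = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["city"])],
        remainder="passthrough")
    model = Pipeline([("pre", pre),
                      ("rf", RandomForestClassifier(n_estimators=10, random_state=0))])
    print(model.fit(train, y).predict(test))  # unseen level becomes an all-zeros row; no error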

R: Plot trees from h2o.randomForest() and h2o.gbm()

此生再无相见时 submitted on 2019-11-29 02:54:29
Question: I am looking for an efficient way to plot trees in RStudio, H2O's Flow, or in a local HTML page from h2o's RF and GBM models, similar to the one in the image at the link below. Specifically, how do you plot trees for the fitted model objects rf1 and gbm2 produced by the code below, perhaps by parsing h2o.download_pojo(rf1) or h2o.download_pojo(gbm1)?

    # The following two commands remove any previously installed H2O packages for R.
    if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
    ...
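Plotting trees out of h2o models is its own topic (the question above is specifically about parsing the POJO, and the excerpt ends before any answer), but for readers who just want to inspect individual trees of a forest, scikit-learn exposes them directly; a minimal sketch, purely as an analogue:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import plot_tree

    X, y = load_iris(return_X_y=True)
    rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

    plt.figure(figsize=(12, 6))
    plot_tree(rf.estimators_[0], filled=True)  # estimators_ holds the individual fitted trees
    plt.show()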

Proximity Matrix in sklearn.ensemble.RandomForestClassifier

蓝咒 submitted on 2019-11-29 01:25:23
Question: I'm trying to perform clustering in Python using random forests. In the R implementation of random forests, there is a flag you can set to get the proximity matrix, but I can't seem to find anything similar in the Python scikit-learn version of random forest. Does anyone know if there is an equivalent calculation for the Python version?

Answer 1: We don't implement a proximity matrix in scikit-learn (yet). However, this can be done by relying on the apply function provided in our implementation of decision trees: apply returns the index of the leaf each sample falls into, and two samples that land in the same leaf in many trees are, by the usual definition, highly proximate.
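Following that answer, a proximity matrix can be assembled from apply(): the proximity of two samples is the fraction of trees in which they share a leaf. A sketch (note the broadcasting step builds an n x n x n_trees boolean array, so for large datasets you would loop over trees instead):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    leaves = rf.apply(X)   # (n_samples, n_trees): leaf index per sample per tree
    # prox[i, j] = fraction of trees in which samples i and j fall in the same leaf
    prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    print(prox.shape)      # (150, 150)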