random-forest

Spark Multiclass Classification Example

故事扮演 submitted on 2019-11-28 06:06:40
Do you know where I can find examples of multiclass classification in Spark? I spent a lot of time searching in books and on the web, and so far I only know that it is possible since the latest version, according to the documentation. zero323: ML (recommended in Spark 2.0+). We'll use the same data as in the MLlib example below. There are two basic options. If the Estimator supports multiclass classification out of the box (for example, random forest), you can use it directly:

val trainRawDf = trainRaw.toDF
import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer, StringIndexer}
import org.apache
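A minimal sketch of that pipeline in PySpark rather than Scala (the data, column names, and numTrees value here are illustrative assumptions, not taken from the answer):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for trainRawDf: a text column and a multiclass string label.
train_df = spark.createDataFrame(
    [("spark can classify text", "tech"),
     ("the cat sat on the mat", "pets"),
     ("gpus train models fast", "tech"),
     ("dogs chase the ball", "pets"),
     ("markets closed higher", "finance")],
    ["text", "category"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")
vectorizer = CountVectorizer(inputCol="words", outputCol="features")
indexer = StringIndexer(inputCol="category", outputCol="label")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)

model = Pipeline(stages=[tokenizer, vectorizer, indexer, rf]).fit(train_df)
model.transform(train_df).select("category", "prediction").show()

Random forest handles the multiclass label directly, so no one-vs-rest wrapper is needed here.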

How to deal with multiple class ROC analysis in R (pROC package)?

一曲冷凌霜 submitted on 2019-11-28 05:09:47
Question: When I use the multiclass.roc function in R (pROC package), for instance, I train a model on a data set with random forest. Here is my code:

# randomForest & pROC packages should be installed:
# install.packages(c('randomForest', 'pROC'))
data(iris)
library(randomForest)
library(pROC)
set.seed(1000)
# 3-class response variable
rf = randomForest(Species~., data = iris, ntree = 100)
# predict(.., type = 'prob') returns a probability matrix
multiclass.roc(iris$Species, predict(rf, iris, type = 'prob'))

And
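For comparison, a multiclass ROC-AUC can be computed in scikit-learn as well; this sketch uses one-vs-rest averaging, which is an assumption and not necessarily the same aggregation that pROC's multiclass.roc performs:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=1000).fit(X, y)

proba = rf.predict_proba(X)                        # probability matrix, one column per class
print(roc_auc_score(y, proba, multi_class="ovr"))  # one-vs-rest averaged AUC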

r random forest error - type of predictors in new data do not match

依然范特西╮ submitted on 2019-11-28 04:10:07
I am trying to use the quantile regression forest function in R (quantregForest), which is built on the randomForest package. I am getting a type mismatch error that I can't quite figure out. I train the model with

qrf <- quantregForest(x = xtrain, y = ytrain)

which works without a problem, but when I try to predict on new data with

quant.newdata <- predict(qrf, newdata = xtest)

it gives the following error:

Error in predict.quantregForest(qrf, newdata = xtest) : Type of predictors in new data do not match types of the training data.

My training and testing data are coming from separate files
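This error usually means a column's type (or factor levels) differs between the training and test frames. A rough pandas analogue of that check, shown only to illustrate the idea (the original issue is in R, and the data here is made up):

import pandas as pd

# Two frames whose schemas drifted apart (illustrative data).
xtrain = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})
xtest = pd.DataFrame({"a": [4, 5], "b": ["x", "y"]})   # "a" is int here, not float

print(xtrain.dtypes.compare(xtest.dtypes))             # shows the mismatching column(s)

xtest = xtest.astype(xtrain.dtypes.to_dict())          # align the test schema to training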

Can sklearn random forest directly handle categorical features?

徘徊边缘 submitted on 2019-11-28 03:55:12
Say I have a categorical feature, color, which takes the values ['red', 'blue', 'green', 'orange'], and I want to use it to predict something in a random forest. If I one-hot encode it (i.e. I change it to four dummy variables), how do I tell sklearn that the four dummy variables are really one variable? Specifically, when sklearn is randomly selecting features to use at different nodes, it should either include the red, blue, green and orange dummies together, or it shouldn't include any of them. I've heard that there's no way to do this, but I'd imagine there must be a way to deal with
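The common workaround is simply to one-hot encode and accept that the forest treats each dummy as an independent feature, which is exactly what the question wants to avoid. A minimal sketch of that encoding (data and column names are illustrative):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative data with one categorical and one numeric feature.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "orange", "red", "green"],
    "size": [1.0, 2.5, 3.1, 0.7, 2.2, 1.9],
})
y = [0, 1, 1, 0, 0, 1]

# Expand "color" into four dummy columns and pass "size" through unchanged.
pre = ColumnTransformer([("onehot", OneHotEncoder(), ["color"])], remainder="passthrough")
clf = make_pipeline(pre, RandomForestClassifier(n_estimators=100, random_state=0))
clf.fit(df, y)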

Random Forest with classes that are very unbalanced

陌路散爱 submitted on 2019-11-27 19:27:17
I am using random forests on a big-data problem with a very unbalanced response class, so I read the documentation and found the following parameters:

strata
sampsize

The documentation for these parameters is sparse (or I didn't have the luck to find it), and I really don't understand how to use them. I am using the following code:

randomForest(x = predictors, y = response, data = train.data,
             mtry = lista.params[1], ntree = lista.params[2],
             na.action = na.omit, nodesize = lista.params[3],
             maxnodes = lista.params[4], sampsize = c(250000, 2000),
             do.trace = 100, importance = TRUE)

The response is a class
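For readers on the Python side, the closest scikit-learn analogue to handling this imbalance is class weighting; note this is a different mechanism from randomForest's strata/sampsize (which controls per-class sample sizes per tree). A minimal sketch with made-up data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative heavily imbalanced binary problem (roughly 99% / 1%).
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,
    class_weight="balanced_subsample",  # reweight classes within each bootstrap sample
    random_state=0,
)
clf.fit(X, y)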

Error when using predict() on a randomForest object trained with caret's train() using formula

只愿长相守 submitted on 2019-11-27 16:21:57
Question: Using R 3.2.0 with caret 6.0-41 and randomForest 4.6-10 on a 64-bit Linux machine. When trying to use the predict() method on a randomForest object trained with the train() function from the caret package using a formula, the function returns an error. When training via randomForest() and/or using x= and y= rather than a formula, it all runs smoothly. Here is a working example:

library(randomForest)
library(caret)
data(imports85)
imp85 <- imports85[, c("stroke", "price", "fuelType",

Predict classes or class probabilities?

天大地大妈咪最大 submitted on 2019-11-27 16:04:09
I am currently using H2O for a classification problem dataset. I am testing it out with H2ORandomForestEstimator in a Python 3.6 environment. I noticed the results of the predict method were giving values between 0 and 1 (I am assuming this is the probability). In my data set, the target attribute is numeric, i.e. True values are 1 and False values are 0. I made sure I converted the type to category for the target attribute, but I was still getting the same result. Then I modified the code to convert the target column to a factor using the asfactor() method on the H2OFrame; still, there wasn't any change
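A minimal sketch of the intended setup in the H2O Python API (it assumes a reachable H2O cluster; the data and column names are illustrative, not from the question): if the target is left numeric, H2O treats the problem as regression and predict() returns a single continuous value, which is likely why values between 0 and 1 appear; once the target is a factor, predict() returns a class label plus per-class probability columns.

import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()

# Illustrative data; in the real case the target comes from the user's dataset.
df = h2o.H2OFrame({
    "f1": [0.1, 0.9, 0.4, 0.8, 0.2, 0.7],
    "f2": [1.0, 0.2, 0.5, 0.1, 0.9, 0.3],
    "target": [0, 1, 0, 1, 0, 1],
})
df["target"] = df["target"].asfactor()  # factor target => classification, not regression

model = H2ORandomForestEstimator(ntrees=25, seed=1)
model.train(x=["f1", "f2"], y="target", training_frame=df)

# "predict" column holds class labels; the remaining columns hold per-class probabilities.
preds = model.predict(df)
print(preds.head())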

Understanding Spark RandomForest featureImportances results

纵然是瞬间 submitted on 2019-11-27 15:25:08
Question: I'm using RandomForest.featureImportances, but I don't understand the output. I have 12 features, and this is the output I get. I understand this might not be an apache-spark-specific question, but I cannot find anything that explains the output.

// org.apache.spark.mllib.linalg.Vector = (12,[0,1,2,3,4,5,6,7,8,9,10,11], [0.1956128039688559,0.06863606797951556,0.11302128590305296,0.091986700351889,0.03430651625283274,0.05975817050022879,0.06929766152519388,0.052654922125615934,0
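For reference, the vector is indexed by feature position and its values are normalized to sum to 1. A small PySpark sketch using the DataFrame-based API (the question uses the older mllib API; the data here is illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()

# Illustrative data: a label column and an assembled feature vector per row.
train = spark.createDataFrame(
    [(0.0, Vectors.dense([1.0, 0.1, 0.0])),
     (1.0, Vectors.dense([0.0, 0.9, 1.0])),
     (0.0, Vectors.dense([0.9, 0.2, 0.1])),
     (1.0, Vectors.dense([0.1, 0.8, 0.9]))],
    ["label", "features"],
)

model = RandomForestClassifier(numTrees=10, seed=42).fit(train)

# One entry per feature index; the importances sum to 1.0.
for i, imp in enumerate(model.featureImportances.toArray()):
    print(f"feature {i}: {imp:.3f}")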

Why is Random Forest with a single tree much better than a Decision Tree classifier?

我们两清 submitted on 2019-11-27 14:48:55
I am learning machine learning with the scikit-learn library. I apply the decision tree classifier and the random forest classifier to my data with this code:

def decision_tree(train_X, train_Y, test_X, test_Y):
    clf = tree.DecisionTreeClassifier()
    clf.fit(train_X, train_Y)
    return clf.score(test_X, test_Y)

def random_forest(train_X, train_Y, test_X, test_Y):
    clf = RandomForestClassifier(n_estimators=1)
    clf = clf.fit(X, Y)
    return clf.score(test_X, test_Y)

Why are the results so much better for the random forest classifier (for 100 runs, with randomly sampling 2/3 of the data for training and 1/3
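Two things stand out in the quoted code. First, random_forest fits on X, Y (the full data) rather than train_X, train_Y, which alone would inflate its test score whenever the test set is a subset of X. Second, even with that fixed, a one-tree forest is not identical to a plain decision tree: it bootstraps its training sample and may restrict max_features at each split. A sketch that makes the settings explicit so the two models are comparable (iris is used here purely for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
train_X, test_X, train_Y, test_Y = train_test_split(X, y, test_size=1/3, random_state=0)

tree_clf = DecisionTreeClassifier(random_state=0).fit(train_X, train_Y)
forest_clf = RandomForestClassifier(
    n_estimators=1,
    bootstrap=False,     # use the full training set, like the decision tree
    max_features=None,   # consider all features at each split
    random_state=0,
).fit(train_X, train_Y)

print(tree_clf.score(test_X, test_Y), forest_clf.score(test_X, test_Y))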

Is there easy way to grid search without cross validation in python?

≯℡__Kan透↙ submitted on 2019-11-27 13:53:29
Question: There is an extremely helpful class, GridSearchCV, in scikit-learn for doing grid search with cross-validation, but I don't want to do cross-validation. I want to do grid search without cross-validation and use the whole data set to train. To be more specific, I need to evaluate the model made by RandomForestClassifier with its "oob score" during the grid search. Is there an easy way to do it, or should I write a class myself? The points are: I'd like to do grid search in an easy way. I don't want to do cross validation
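One simple approach is to loop over ParameterGrid and score each fit with the forest's out-of-bag estimate, so the whole data set is used for training and no cross-validation happens; a minimal sketch (the parameter values and iris data are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid

X, y = load_iris(return_X_y=True)
grid = ParameterGrid({"n_estimators": [50, 100], "max_depth": [3, None]})

best_params, best_oob = None, -1.0
for params in grid:
    clf = RandomForestClassifier(oob_score=True, random_state=0, **params).fit(X, y)
    if clf.oob_score_ > best_oob:            # keep the parameters with the best OOB score
        best_params, best_oob = params, clf.oob_score_

print(best_params, best_oob)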