random-forest

When using multiple classifiers - How to measure the ensemble's performance? [SciKit Learn]

Submitted by 落爺英雄遲暮 on 2019-12-03 12:47:36
Question: I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods in order to help filter out the false positives. (The problem is in bioinformatics - classifying protein sequences as being Neuropeptide precursor sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor.) Now, the classifiers have roughly similar performance metrics…
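
A minimal sketch of one way to score several classifiers as a single ensemble in scikit-learn, assuming a prepared feature matrix X and binary labels y (both placeholder names here):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    voting="soft",  # average predicted class probabilities across members
)

# Evaluate the combined model exactly like a single classifier:
scores = cross_val_score(ensemble, X, y, cv=5, scoring="f1")
print(scores.mean(), scores.std())

Because the ensemble exposes the usual fit/predict interface, any single-model metric (F1, precision, ROC AUC) applies to it unchanged.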

My r-squared score is negative but my accuracy score using k-fold cross-validation is about 92%

Submitted by 微笑、不失礼 on 2019-12-03 09:20:17
For the code below, my r-squared score is coming out negative, but my accuracy score using k-fold cross-validation is coming out to be about 92%. How is this possible? I'm using the random forest regression algorithm to predict some data. The dataset is here: https://www.kaggle.com/ludobenistant/hr-analytics

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

dataset = pd.read_csv("HR_comma_sep.csv")
x = dataset.iloc[:, :-1].values  ## Independent variables
y = dataset.iloc[:, 9].values    ## Dependent variable
## Encoding the…
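
The two numbers do not measure the same thing: R² is a regression metric (and goes negative whenever the model predicts worse than simply returning the mean of y), while accuracy is a classification metric that is only meaningful when y holds discrete class labels. A hedged sketch of the distinction, assuming x and y as prepared in the question:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# R^2 on a regressor: can be negative on poor fits.
r2_scores = cross_val_score(RandomForestRegressor(n_estimators=100),
                            x, y, cv=10, scoring="r2")

# Accuracy on a classifier: only valid if y is categorical.
acc_scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                             x, y, cv=10, scoring="accuracy")

print(r2_scores.mean(), acc_scores.mean())

Comparing a negative R² against a 92% accuracy is therefore comparing two unrelated scales, often across two different problem formulations.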

R randomForest subsetting can't get rid of factor levels [duplicate]

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-03 08:54:34
This question already has answers here: Drop factor levels in a subsetted data frame (14 answers). I'm trying to use a randomForest to predict sales. I have 3 variables, one of which is a factor variable for storeId. I know that there are levels in the test set that are NOT in the training set. I'm trying to get a prediction for only the levels present in the training set, but can't get it to look past the new factor levels. Here's what I've tried so far:

require(randomForest)
train <- data.frame(sales = runif(10)*1000,…
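
A minimal sketch of the usual fix, assuming train and test frames with a factor storeId and a fitted model rf_model (a placeholder name): drop unused levels in the training factor, keep only the test rows whose level was seen in training, and re-base the test factor on the training levels so predict() accepts it.

train$storeId <- droplevels(train$storeId)
seen <- test$storeId %in% levels(train$storeId)
test_seen <- test[seen, ]
test_seen$storeId <- factor(test_seen$storeId, levels = levels(train$storeId))
pred <- predict(rf_model, newdata = test_seen)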

Different results with formula and non-formula for caret training

Submitted by 别来无恙 on 2019-12-03 08:36:40
I noticed that using the formula and non-formula interfaces in caret while training produces different results. Also, the time taken by the formula method is almost 10x the time taken by the non-formula method. Is this expected?

> z <- data.table(c1 = sample(1:1000, 1000, replace = T),
                   c2 = as.factor(sample(LETTERS, 1000, replace = T)))

# SYSTEM TIME WITH FORMULA METHOD
# -------------------------------
> system.time(r <- train(c1 ~ ., z, method = "rf", importance = T))
   user  system elapsed
376.233   9.241  18.190
> r
1000 samples
   1 predictors
No pre-processing
Resampling: Bootstrap (25 reps)
Summary of sample sizes:…
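
A likely explanation, sketched on the assumption that caret behaves as usual here: the formula interface runs the data through model.matrix(), expanding the 26-level factor c2 into dummy columns before randomForest() ever sees it, while the x/y interface hands the factor over as a single column that the trees split on natively, so the two calls fit genuinely different (and differently expensive) models.

# Hedged sketch of the two interfaces on the same data:
zdf <- as.data.frame(z)

# Formula interface: c2 is expanded into dummy variables first.
r_formula <- train(c1 ~ ., data = zdf, method = "rf")

# x/y interface: c2 stays one factor column.
r_xy <- train(x = zdf[, "c2", drop = FALSE], y = zdf$c1, method = "rf")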

R: using ranger with caret, tuneGrid argument

Submitted by ε祈祈猫儿з on 2019-12-03 07:50:10
I'm using the caret package to analyse Random Forest models built using ranger. I can't figure out how to call the train function with the tuneGrid argument to tune the model parameters. I think I'm calling the tuneGrid argument wrong, but can't figure out why. Any help would be appreciated.

data(iris)
library(ranger)
model_ranger <- ranger(Species ~ ., data = iris, num.trees = 500,
                       mtry = 4, importance = 'impurity')

library(caret)
# my tuneGrid object:
tgrid <- expand.grid(
  num.trees = c(200, 500, 1000),
  mtry = 2:4
)
model_caret <- train(Species ~ ., data = iris, method = "ranger…
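
In recent caret versions the "ranger" method tunes mtry, splitrule and min.node.size; num.trees is not a tunable parameter and has to be fixed and forwarded through train()'s "..." instead. A hedged sketch under that assumption:

library(caret)

tgrid <- expand.grid(
  mtry          = 2:4,
  splitrule     = "gini",
  min.node.size = 1
)

model_caret <- train(Species ~ ., data = iris,
                     method    = "ranger",
                     tuneGrid  = tgrid,
                     num.trees = 500,   # forwarded to ranger(), not tuned
                     trControl = trainControl(method = "cv", number = 5))

Older caret releases tuned only mtry for ranger, in which case the grid may contain just that one column.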

Issues with tuneGrid parameter in random forest

Submitted by ぐ巨炮叔叔 on 2019-12-03 07:14:26
I've been dealing with some extremely imbalanced data and I would like to use stratified sampling to create more balanced random forests. Right now I'm using the caret package, mainly for tuning the random forests. So I try to set up a tuneGrid to pass the mtry and sampsize parameters into caret's train method, as follows:

mtryGrid <- data.frame(.mtry = 100, .sampsize = 80)
rfTune <- train(x = trainX, y = trainY,
                method = "rf",
                trControl = ctrl,
                metric = "Kappa",
                ntree = 1000,
                tuneGrid = mtryGrid,
                importance = TRUE)

When I run this example, I get the following error: The tuning parameter grid…
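
A hedged sketch of the usual workaround: caret's "rf" method only recognises mtry in a tuning grid, so sampsize (and strata, for per-class sampling) must bypass the grid and travel through train()'s "..." straight to randomForest(). Assuming trainY is a two-class factor:

mtryGrid <- data.frame(mtry = 100)
rfTune <- train(x = trainX, y = trainY,
                method     = "rf",
                trControl  = ctrl,
                metric     = "Kappa",
                ntree      = 1000,
                tuneGrid   = mtryGrid,
                strata     = trainY,
                sampsize   = c(80, 80),  # draws per class in each tree's sample
                importance = TRUE)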

matplotlib: Plot Feature Importance with feature names

Submitted by 爷,独闯天下 on 2019-12-03 07:02:24
In R there are pre-built functions to plot the feature importance of a Random Forest model, but in Python such a method seems to be missing. I'm searching for a method in matplotlib. model.feature_importances_ gives me the following:

array([ 2.32421835e-03, 7.21472336e-04, 2.70491223e-03, 3.34521084e-03,
        4.19443238e-03, 1.50108737e-03, 3.29160540e-03, 4.82320256e-01,
        3.14117333e-03])

Then, using the following plotting calls:

>> pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
>> pyplot.show()

I get a bar plot, but I would like to get a bar plot with labels, with the importances shown…
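
A minimal sketch, assuming a fitted model and a list feature_names holding the column names (e.g. list(X.columns) for a pandas DataFrame; both names are placeholders here):

import numpy as np
from matplotlib import pyplot

importances = model.feature_importances_
order = np.argsort(importances)  # sort so bars and labels stay aligned

pyplot.barh(range(len(importances)), importances[order])
pyplot.yticks(range(len(importances)), [feature_names[i] for i in order])
pyplot.xlabel("Feature importance")
pyplot.tight_layout()
pyplot.show()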

How to compute ROC and AUC under ROC after training using caret in R?

Submitted by 一曲冷凌霜 on 2019-12-03 05:17:59
Question: I have used the caret package's train function with 10-fold cross validation. I have also obtained class probabilities for the predicted classes by setting classProbs = TRUE in trControl, as follows:

myTrainingControl <- trainControl(method = "cv",
                                  number = 10,
                                  savePredictions = TRUE,
                                  classProbs = TRUE,
                                  verboseIter = TRUE)

randomForestFit <- train(x = input[3:154],
                         y = as.factor(input$Target),
                         method = "rf",
                         trControl = myTrainingControl,
                         preProcess = c("center", "scale"),
                         ntree = 50)

The output…
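
One common follow-up, sketched with the pROC package and under the assumption of a two-class problem whose positive level is named "yes" (a placeholder): because savePredictions = TRUE keeps the held-out predictions from every fold, an ROC curve and its AUC can be computed directly from them.

library(pROC)

preds <- randomForestFit$pred           # held-out predictions across all folds
roc_obj <- roc(response  = preds$obs,   # observed classes
               predictor = preds$yes)   # probability column for the "yes" level

auc(roc_obj)
plot(roc_obj)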

Using randomForest package in R, how to get probabilities from classification model?

Submitted by 风流意气都作罢 on 2019-12-03 04:20:50
Question: TL;DR: Is there something I can flag in the original randomForest call to avoid having to re-run the predict function to get predicted categorical probabilities, instead of just the likely category? Details: I am using the randomForest package. I have a model something like:

model <- randomForest(x = out.data[train.rows, feature.cols],
                      y = out.data[train.rows, response.col],
                      xtest = out.data[test.rows, feature.cols],
                      ytest = out.data[test.rows, response.col],
                      importance = TRUE)

where out.data is a…
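
A hedged sketch of one route, assuming the call above: when xtest is supplied, the returned object already stores the test-set vote fractions, which (with the default norm.votes = TRUE) sum to 1 per row and serve as class probabilities, so no second predict() pass is needed.

# One column per class, one row per test case:
prob.matrix <- model$test$votes
head(prob.matrix)

# Alternative, if new data arrive later: keep the forest and ask
# predict() for probabilities explicitly (keep.forest = TRUE must be
# set, since it defaults to FALSE when a test set is supplied).
# model <- randomForest(..., keep.forest = TRUE)
# probs <- predict(model, newdata = new.data, type = "prob")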

Different results with randomForest() and caret's randomForest (method = “rf”)

Submitted by 青春壹個敷衍的年華 on 2019-12-03 03:15:38
I am new to caret, and I just want to ensure that I fully understand what it's doing. Towards that end, I've been attempting to replicate the results I get from a randomForest() model using caret's train() function with method = "rf". Unfortunately, I haven't been able to get matching results, and I'm wondering what I'm overlooking. I'll also add that, given that randomForest uses bootstrapping to generate the samples used to fit each of the ntree trees and estimates error based on out-of-bag predictions, I'm a little fuzzy on the difference between specifying "oob" and "boot" in the trainControl function call…
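
A sketch of how the two calls are usually brought close together, assuming a data frame dat with outcome y (placeholder names): fix ntree, pin mtry with a one-row grid, and ask caret to score with the same out-of-bag estimate randomForest() itself reports. Even then the numbers may differ slightly, since the two code paths consume the random number stream differently.

library(randomForest)
library(caret)

set.seed(42)
rf_direct <- randomForest(y ~ ., data = dat, ntree = 500, mtry = 3)

set.seed(42)
rf_caret <- train(y ~ ., data = dat, method = "rf",
                  ntree     = 500,
                  tuneGrid  = data.frame(mtry = 3),
                  trControl = trainControl(method = "oob"))

With method = "oob" caret skips bootstrap resampling entirely, whereas method = "boot" refits the model on 25 fresh bootstrap samples and aggregates held-out performance: two different estimates of the same quantity.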