random-forest

When using multiple classifiers - How to measure the ensemble's performance? [SciKit Learn]

Submitted by 落爺英雄遲暮 on 2019-12-03 12:47:36
Question: I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods in order to help filter out the false positives. (The problem is in bioinformatics - classifying protein sequences as being Neuropeptide precursor sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor.) Now, the classifiers have roughly similar performance metrics…
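
A minimal sketch of one way to score several classifiers as a single ensemble in scikit-learn, assuming a prepared feature matrix X and binary labels y (both placeholder names here):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    voting="soft",  # average predicted class probabilities across members
)

# Evaluate the combined model exactly like a single classifier:
scores = cross_val_score(ensemble, X, y, cv=5, scoring="f1")
print(scores.mean(), scores.std())

Because the ensemble exposes the usual fit/predict interface, any single-model metric (F1, precision, ROC AUC) applies to it unchanged.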

My r-squared score is negative but my accuracy score using k-fold cross-validation is about 92%

Submitted by 微笑、不失礼 on 2019-12-03 09:20:17
For the code below, my r-squared score is coming out negative, but my accuracy score using k-fold cross-validation is coming out to be about 92%. How is this possible? I'm using the random forest regression algorithm to predict some data. The dataset is here: https://www.kaggle.com/ludobenistant/hr-analytics

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

dataset = pd.read_csv("HR_comma_sep.csv")
x = dataset.iloc[:, :-1].values  ## Independent variables
y = dataset.iloc[:, 9].values    ## Dependent variable
## Encoding the…
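
The two numbers do not measure the same thing: R² is a regression metric (and goes negative whenever the model predicts worse than simply returning the mean of y), while accuracy is a classification metric that is only meaningful when y holds discrete class labels. A hedged sketch of the distinction, assuming x and y as prepared in the question:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# R^2 on a regressor: can be negative on poor fits.
r2_scores = cross_val_score(RandomForestRegressor(n_estimators=100),
                            x, y, cv=10, scoring="r2")

# Accuracy on a classifier: only valid if y is categorical.
acc_scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                             x, y, cv=10, scoring="accuracy")

print(r2_scores.mean(), acc_scores.mean())

Comparing a negative R² against a 92% accuracy is therefore comparing two unrelated scales, often across two different problem formulations.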

R randomForest subsetting can't get rid of factor levels [duplicate]

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-03 08:54:34
This question already has answers here: Drop factor levels in a subsetted data frame (14 answers). I'm trying to use a randomForest to predict sales. I have 3 variables, one of which is a factor variable for storeId. I know that there are levels in the test set that are NOT in the training set. I'm trying to get a prediction for only the levels present in the training set, but can't get it to look past the new factor levels. Here's what I've tried so far:

require(randomForest)
train <- data.frame(sales = runif(10)*1000,…
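
A minimal sketch of the usual fix, assuming train and test frames with a factor storeId and a fitted model rf_model (a placeholder name): drop unused levels in the training factor, keep only the test rows whose level was seen in training, and re-base the test factor on the training levels so predict() accepts it.

train$storeId <- droplevels(train$storeId)
seen <- test$storeId %in% levels(train$storeId)
test_seen <- test[seen, ]
test_seen$storeId <- factor(test_seen$storeId, levels = levels(train$storeId))
pred <- predict(rf_model, newdata = test_seen)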

Different results with formula and non-formula for caret training

Submitted by 别来无恙 on 2019-12-03 08:36:40
I noticed that using the formula and non-formula interfaces in caret while training produces different results. Also, the time taken by the formula method is almost 10x the time taken by the non-formula method. Is this expected?

> z <- data.table(c1 = sample(1:1000, 1000, replace = T),
                   c2 = as.factor(sample(LETTERS, 1000, replace = T)))

# SYSTEM TIME WITH FORMULA METHOD
# -------------------------------
> system.time(r <- train(c1 ~ ., z, method = "rf", importance = T))
   user  system elapsed
376.233   9.241  18.190
> r
1000 samples
   1 predictors
No pre-processing
Resampling: Bootstrap (25 reps)
Summary of sample sizes:…
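
A likely explanation, sketched on the assumption that caret behaves as usual here: the formula interface runs the data through model.matrix(), expanding the 26-level factor c2 into dummy columns before randomForest() ever sees it, while the x/y interface hands the factor over as a single column that the trees split on natively, so the two calls fit genuinely different (and differently expensive) models.

# Hedged sketch of the two interfaces on the same data:
zdf <- as.data.frame(z)

# Formula interface: c2 is expanded into dummy variables first.
r_formula <- train(c1 ~ ., data = zdf, method = "rf")

# x/y interface: c2 stays one factor column.
r_xy <- train(x = zdf[, "c2", drop = FALSE], y = zdf$c1, method = "rf")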

R: using ranger with caret, tuneGrid argument

Submitted by ε祈祈猫儿з on 2019-12-03 07:50:10
I'm using the caret package to analyse Random Forest models built using ranger. I can't figure out how to call the train function with the tuneGrid argument to tune the model parameters. I think I'm calling the tuneGrid argument wrong, but can't figure out why. Any help would be appreciated.

data(iris)
library(ranger)
model_ranger <- ranger(Species ~ ., data = iris, num.trees = 500,
                       mtry = 4, importance = 'impurity')

library(caret)
# my tuneGrid object:
tgrid <- expand.grid(
  num.trees = c(200, 500, 1000),
  mtry = 2:4
)
model_caret <- train(Species ~ ., data = iris, method = "ranger…
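
In recent caret versions the "ranger" method tunes mtry, splitrule and min.node.size; num.trees is not a tunable parameter and has to be fixed and forwarded through train()'s "..." instead. A hedged sketch under that assumption:

library(caret)

tgrid <- expand.grid(
  mtry          = 2:4,
  splitrule     = "gini",
  min.node.size = 1
)

model_caret <- train(Species ~ ., data = iris,
                     method    = "ranger",
                     tuneGrid  = tgrid,
                     num.trees = 500,   # forwarded to ranger(), not tuned
                     trControl = trainControl(method = "cv", number = 5))

Older caret releases tuned only mtry for ranger, in which case the grid may contain just that one column.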

Issues with tuneGrid parameter in random forest

Submitted by ぐ巨炮叔叔 on 2019-12-03 07:14:26
I've been dealing with some extremely imbalanced data and I would like to use stratified sampling to create more balanced random forests. Right now I'm using the caret package, mainly for tuning the random forests. So I try to set up a tuneGrid to pass the mtry and sampsize parameters into caret's train method, as follows:

mtryGrid <- data.frame(.mtry = 100, .sampsize = 80)
rfTune <- train(x = trainX, y = trainY,
                method = "rf",
                trControl = ctrl,
                metric = "Kappa",
                ntree = 1000,
                tuneGrid = mtryGrid,
                importance = TRUE)

When I run this example, I get the following error: The tuning parameter grid…
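
A hedged sketch of the usual workaround: caret's "rf" method only recognises mtry in a tuning grid, so sampsize (and strata, for per-class sampling) must bypass the grid and travel through train()'s "..." straight to randomForest(). Assuming trainY is a two-class factor:

mtryGrid <- data.frame(mtry = 100)
rfTune <- train(x = trainX, y = trainY,
                method     = "rf",
                trControl  = ctrl,
                metric     = "Kappa",
                ntree      = 1000,
                tuneGrid   = mtryGrid,
                strata     = trainY,
                sampsize   = c(80, 80),  # draws per class in each tree's sample
                importance = TRUE)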

matplotlib: Plot Feature Importance with feature names

Submitted by 爷,独闯天下 on 2019-12-03 07:02:24
In R there are pre-built functions to plot the feature importance of a Random Forest model, but in Python such a method seems to be missing. I'm searching for a method in matplotlib. model.feature_importances_ gives me the following:

array([ 2.32421835e-03, 7.21472336e-04, 2.70491223e-03, 3.34521084e-03,
        4.19443238e-03, 1.50108737e-03, 3.29160540e-03, 4.82320256e-01,
        3.14117333e-03])

Then, using the following plotting calls:

>> pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
>> pyplot.show()

I get a bar plot, but I would like to get a bar plot with labels, with the importances shown…
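
A minimal sketch, assuming a fitted model and a list feature_names holding the column names (e.g. list(X.columns) for a pandas DataFrame; both names are placeholders here):

import numpy as np
from matplotlib import pyplot

importances = model.feature_importances_
order = np.argsort(importances)  # sort so bars and labels stay aligned

pyplot.barh(range(len(importances)), importances[order])
pyplot.yticks(range(len(importances)), [feature_names[i] for i in order])
pyplot.xlabel("Feature importance")
pyplot.tight_layout()
pyplot.show()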

How to compute ROC and AUC under ROC after training using caret in R?

Submitted by 一曲冷凌霜 on 2019-12-03 05:17:59
Question: I have used the caret package's train function with 10-fold cross validation. I have also obtained class probabilities for the predicted classes by setting classProbs = TRUE in trControl, as follows:

myTrainingControl <- trainControl(method = "cv",
                                  number = 10,
                                  savePredictions = TRUE,
                                  classProbs = TRUE,
                                  verboseIter = TRUE)

randomForestFit <- train(x = input[3:154],
                         y = as.factor(input$Target),
                         method = "rf",
                         trControl = myTrainingControl,
                         preProcess = c("center", "scale"),
                         ntree = 50)

The output…
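
One common follow-up, sketched with the pROC package and under the assumption of a two-class problem whose positive level is named "yes" (a placeholder): because savePredictions = TRUE keeps the held-out predictions from every fold, an ROC curve and its AUC can be computed directly from them.

library(pROC)

preds <- randomForestFit$pred           # held-out predictions across all folds
roc_obj <- roc(response  = preds$obs,   # observed classes
               predictor = preds$yes)   # probability column for the "yes" level

auc(roc_obj)
plot(roc_obj)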

Using randomForest package in R, how to get probabilities from classification model?

Submitted by 风流意气都作罢 on 2019-12-03 04:20:50
Question: TL;DR: Is there something I can flag in the original randomForest call to avoid having to re-run the predict function to get predicted categorical probabilities, instead of just the likely category? Details: I am using the randomForest package. I have a model something like:

model <- randomForest(x = out.data[train.rows, feature.cols],
                      y = out.data[train.rows, response.col],
                      xtest = out.data[test.rows, feature.cols],
                      ytest = out.data[test.rows, response.col],
                      importance = TRUE)

where out.data is a…
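
A hedged sketch of one route, assuming the call above: when xtest is supplied, the returned object already stores the test-set vote fractions, which (with the default norm.votes = TRUE) sum to 1 per row and serve as class probabilities, so no second predict() pass is needed.

# One column per class, one row per test case:
prob.matrix <- model$test$votes
head(prob.matrix)

# Alternative, if new data arrive later: keep the forest and ask
# predict() for probabilities explicitly (keep.forest = TRUE must be
# set, since it defaults to FALSE when a test set is supplied).
# model <- randomForest(..., keep.forest = TRUE)
# probs <- predict(model, newdata = new.data, type = "prob")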

Different results with randomForest() and caret's randomForest (method = “rf”)

Submitted by 青春壹個敷衍的年華 on 2019-12-03 03:15:38
I am new to caret, and I just want to ensure that I fully understand what it's doing. Towards that end, I've been attempting to replicate the results I get from a randomForest() model using caret's train() function with method = "rf". Unfortunately, I haven't been able to get matching results, and I'm wondering what I'm overlooking. I'll also add that, given that randomForest uses bootstrapping to generate the samples used to fit each of the ntree trees and estimates error based on out-of-bag predictions, I'm a little fuzzy on the difference between specifying "oob" and "boot" in the trainControl function call…
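
A sketch of how the two calls are usually brought close together, assuming a data frame dat with outcome y (placeholder names): fix ntree, pin mtry with a one-row grid, and ask caret to score with the same out-of-bag estimate randomForest() itself reports. Even then the numbers may differ slightly, since the two code paths consume the random number stream differently.

library(randomForest)
library(caret)

set.seed(42)
rf_direct <- randomForest(y ~ ., data = dat, ntree = 500, mtry = 3)

set.seed(42)
rf_caret <- train(y ~ ., data = dat, method = "rf",
                  ntree     = 500,
                  tuneGrid  = data.frame(mtry = 3),
                  trControl = trainControl(method = "oob"))

With method = "oob" caret skips bootstrap resampling entirely, whereas method = "boot" refits the model on 25 fresh bootstrap samples and aggregates held-out performance: two different estimates of the same quantity.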