random-forest

Why does sklearn preprocessing LabelEncoder inverse_transform apply from only one column?

Submitted by 大憨熊 on 2019-12-22 01:30:16
Question: I have a random forest model built with sklearn. The model is built in one file, and I have a second file where I use joblib to load the model and apply it to new data. The data has categorical fields that are converted via sklearn's preprocessing.LabelEncoder.fit_transform. Once the prediction is made, I attempt to reverse this conversion with LabelEncoder.inverse_transform. Here is the code:

#transform the categorical rf inputs
df["method"] = le.fit_transform(df["method"])
df[…
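The usual culprit in this setup is that the scoring file calls fit_transform again, which re-fits the encoder on the new data and produces a different integer mapping than the one used at training time, so inverse_transform only lines up for a column whose categories happen to match. A minimal sketch of the standard fix, with hypothetical column values and file names (not the asker's code): fit the encoder once, persist it next to the model, and only call transform/inverse_transform afterwards.

# Sketch: fit once, save the fitted encoder, reuse its mapping when scoring.
import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# --- training file ---
train = pd.DataFrame({"method": ["GET", "POST", "PUT", "GET"]})  # hypothetical data
le = LabelEncoder()
train["method"] = le.fit_transform(train["method"])
joblib.dump(le, "method_encoder.pkl")  # hypothetical file name

# --- scoring file ---
le = joblib.load("method_encoder.pkl")
new = pd.DataFrame({"method": ["POST", "GET"]})
new["method"] = le.transform(new["method"])          # reuse the training mapping
# ... run the forest on new ...
new["method"] = le.inverse_transform(new["method"])  # round-trips correctly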

Incremental training of random forest model using python sklearn

Submitted by 人盡茶涼 on 2019-12-21 04:50:09
Question: I am using the code below to save a random forest model. I am using cPickle to save the trained model. As I see new data, can I train the model incrementally? Currently, the training set has about 2 years of data. Is there a way to train on another 2 years and (kind of) append it to the existing saved model?

rf = RandomForestRegressor(n_estimators=100)
print ("Trying to fit the Random Forest model --> ")
if os.path.exists('rf.pkl'):
    print ("Trained model already pickled -- >")
    with open('rf.pkl',…
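sklearn's forests have no partial_fit, so the pickled trees cannot be updated in place. The closest built-in option is warm_start: a later fit() keeps the existing trees and only grows the newly requested ones, on whatever data that call receives. A sketch on synthetic stand-in data (not the asker's dataset):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_old, y_old = rng.rand(200, 5), rng.rand(200)  # stand-in for the first 2 years
X_new, y_new = rng.rand(200, 5), rng.rand(200)  # stand-in for the next 2 years

rf = RandomForestRegressor(n_estimators=100, warm_start=True, random_state=0)
rf.fit(X_old, y_old)   # 100 trees fit on the old data

rf.n_estimators += 50  # request 50 additional trees...
rf.fit(X_new, y_new)   # ...which are fit only on the new batch

Caveat: the original 100 trees never see the new data, so this is an ensemble of two forests rather than true incremental retraining; whether that is acceptable depends on how much the data distribution drifts.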

Choosing random_state for sklearn algorithms

Submitted by 点点圈 on 2019-12-21 03:54:19
Question: I understand that random_state is used in various sklearn algorithms to break ties between different predictors (trees) with the same metric value (for example, in GradientBoosting). But the documentation does not clarify or detail this. For example: 1) Where else are these seeds used for random number generation? For RandomForestClassifier, random numbers can be used to find a set of random features to build a predictor. Algorithms which use subsampling can use random numbers to get…
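For the forest case specifically, random_state seeds both the bootstrap sampling of rows and the random feature subsets tried at each split, so fixing it makes a fit exactly reproducible. A small check on synthetic data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

a = RandomForestClassifier(random_state=42).fit(X, y)
b = RandomForestClassifier(random_state=42).fit(X, y)
c = RandomForestClassifier(random_state=7).fit(X, y)

print(np.array_equal(a.predict_proba(X), b.predict_proba(X)))  # True: same seed
print(np.array_equal(a.predict_proba(X), c.predict_proba(X)))  # False in general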

Exact implementation of RandomForest in Weka 3.7

Submitted by 天涯浪子 on 2019-12-21 03:15:11
Question: Having reviewed the original Breiman (2001) paper as well as some other board posts, I am slightly confused about the actual procedure used by WEKA's random forest implementation. None of the sources was sufficiently elaborate, and many even contradict each other. How does it work in detail? Which steps are carried out? My understanding so far:

- For each tree, a bootstrap sample of the same size as the training data is created
- Only a random subset of the available features of defined size…
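Those two steps match Breiman (2001): each tree is grown on a size-N bootstrap sample, and at every node (not once per tree) a fresh random subset of K features is considered for the split, with trees grown unpruned. A numpy sketch of the two sampling operations, as an illustration of the paper's procedure rather than WEKA's actual Java source:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 10))  # N=100 rows, p=10 features (synthetic)
N, p = X.shape
K = 3                      # number of features tried per node

# Per tree: sample N rows *with replacement*; the rest are out-of-bag.
bootstrap_rows = rng.integers(0, N, size=N)
oob_rows = np.setdiff1d(np.arange(N), bootstrap_rows)

# Per node: draw a fresh subset of K candidate features for the split.
node_features = rng.choice(p, size=K, replace=False)

print(len(bootstrap_rows), len(oob_rows), node_features)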

R randomForest subsetting can't get rid of factor levels [duplicate]

Submitted by 浪尽此生 on 2019-12-21 02:48:06
Question: This question already has answers here (closed 7 years ago). Possible duplicate: dropping factor levels in a subsetted data frame in R. I'm trying to use randomForest to predict sales. I have 3 variables, one of which is a factor variable for storeId. I know that there are levels in the test set that are NOT in the training set. I'm trying to get a prediction only for levels present in the training set, but I can't get it to look past the new factor levels. Here's what I've tried so far:…
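The answer in the linked duplicate is R's droplevels(): subsetting a data frame keeps the factor's full level set, so the unused levels must be dropped explicitly before predict() will accept the data. As a rough pandas analog of the same idea (the asker's code is R; this is only an illustration with made-up data):

import pandas as pd

train = pd.DataFrame({"storeId": ["a", "b", "c"], "sales": [10, 12, 9]})
test = pd.DataFrame({"storeId": ["a", "d"], "sales": [11, 15]})  # "d" is unseen

seen = set(train["storeId"])
test_known = test[test["storeId"].isin(seen)]  # keep only levels seen in training
print(test_known)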

Different results with formula and non-formula for caret training

Submitted by 微笑、不失礼 on 2019-12-21 02:36:24
Question: I noticed that using the formula and non-formula methods in caret while training produces different results. Also, the time taken by the formula method is almost 10x the time taken by the non-formula method. Is this expected?

> z <- data.table(c1=sample(1:1000,1000, replace=T), c2=as.factor(sample(LETTERS, 1000, replace=T)))

# SYSTEM TIME WITH FORMULA METHOD
# -------------------------------
> system.time(r <- train(c1 ~ ., z, method="rf", importance=T))
   user  system elapsed
376.233   9.241  18.190
> …
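This is expected: caret's formula interface runs the data through model.matrix, which expands the 26-level factor into dummy columns before randomForest ever sees it, while the non-formula interface passes the factor through as a single predictor; the two fits therefore use different representations, and the expansion also accounts for much of the extra time. A rough sklearn analog of the two encodings (the question itself is R, so this only illustrates the representational difference):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({
    "c1": range(100),
    "c2": pd.Categorical([chr(65 + i % 26) for i in range(100)]),  # 26 "letters"
})

X_dummies = pd.get_dummies(df[["c2"]])    # 26 binary columns, like the formula path
X_factor = df["c2"].cat.codes.to_frame()  # one integer column, like the non-formula path

rf = RandomForestRegressor(n_estimators=50, random_state=0)
print(rf.fit(X_dummies, df["c1"]).score(X_dummies, df["c1"]))
print(rf.fit(X_factor, df["c1"]).score(X_factor, df["c1"]))  # generally differs

Note that integer-coding a nominal factor makes sklearn's trees treat it as ordered, which is itself only an approximation of R's native factor splits; the point is just that the two representations yield different models.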

Use of scikit Random Forest sample_weights

Submitted by 99封情书 on 2019-12-20 12:41:49
Question: I've been trying to figure out scikit's Random Forest sample_weight use, and I cannot explain some of the results I'm seeing. Fundamentally, I need it to balance a classification problem with unbalanced classes. In particular, I was expecting that if I used a sample_weights array of all 1s, I would get the same result as with sample_weights=None. Additionally, I was expecting that any array of equal weights (i.e. all 1s, or all 10s, or all 0.8s...) would provide the same result. Perhaps my…
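Both expectations should hold once the randomness is pinned down: sample_weight=None is treated as all ones, and multiplying every weight by the same constant changes nothing the tree builder compares, since only relative weights enter the split criteria. A minimal check on synthetic data (assumption: differences observed without a fixed random_state come from the forest's own randomness, not from the weights):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)
w = np.ones(len(y))

a = RandomForestClassifier(random_state=0).fit(X, y)
b = RandomForestClassifier(random_state=0).fit(X, y, sample_weight=w)
c = RandomForestClassifier(random_state=0).fit(X, y, sample_weight=10 * w)

print(np.array_equal(a.predict_proba(X), b.predict_proba(X)))  # expected True
print(np.array_equal(a.predict_proba(X), c.predict_proba(X)))  # expected True

# For the actual goal, class balancing, weight inversely to class frequency:
w_bal = np.where(y == 1, (y == 0).sum() / (y == 1).sum(), 1.0)
RandomForestClassifier(random_state=0).fit(X, y, sample_weight=w_bal)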

Difference between varImp (caret) and importance (randomForest) for Random Forest

Submitted by 二次信任 on 2019-12-20 12:32:00
Question: I do not understand the difference between the varImp function (caret package) and the importance function (randomForest package) for a random forest model. I computed a simple RF classification model, and when computing variable importance I found that the "ranking" of predictors was not the same for the two functions. Here is my code:

rfImp <- randomForest(Origin ~ ., data = TAll_CS, ntree = 2000, importance = TRUE)
importance(rfImp)
       BREAST  LUNG  MeanDecreaseAccuracy  MeanDecreaseGini
Energy …
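Part of the explanation is that importance() reports several measures side by side (per-class accuracy decreases, the permutation-based MeanDecreaseAccuracy, and the impurity-based MeanDecreaseGini), and those measures need not agree on a ranking; caret's varImp also selects one measure and, depending on its scale argument, may rescale the scores. The same divergence between permutation-based and impurity-based rankings is easy to reproduce in sklearn, shown here only as an analog since the question is about R:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

gini_rank = np.argsort(rf.feature_importances_)[::-1]  # impurity-based, like MeanDecreaseGini
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
perm_rank = np.argsort(perm.importances_mean)[::-1]    # permutation-based, like MeanDecreaseAccuracy

print(gini_rank)
print(perm_rank)  # often a different order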

What is the difference between cross_val_score with scoring='roc_auc' and roc_auc_score?

Submitted by 拜拜、爱过 on 2019-12-20 12:28:45
Question: I am confused about the difference between the cross_val_score scoring metric 'roc_auc' and the roc_auc_score that I can just import and call directly. The documentation (http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) indicates that specifying scoring='roc_auc' will use sklearn.metrics.roc_auc_score. However, when I implement GridSearchCV or cross_val_score with scoring='roc_auc', I receive very different numbers than when I call roc_auc_score directly.
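The scorer really is roc_auc_score underneath; the usual sources of the gap are (a) cross_val_score reports a mean over held-out folds rather than one score on data the model has already seen, and (b) feeding roc_auc_score the labels from predict() instead of the probabilities from predict_proba(). A like-for-like comparison on synthetic data, where the two routes should agree:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
clf = RandomForestClassifier(random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

print(cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean())

aucs = []
for tr, te in cv.split(X, y):
    proba = clf.fit(X[tr], y[tr]).predict_proba(X[te])[:, 1]  # probabilities, not labels
    aucs.append(roc_auc_score(y[te], proba))
print(np.mean(aucs))  # should match the cross_val_score mean above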

How to perform random forest/cross validation in R

Submitted by 怎甘沉沦 on 2019-12-20 08:44:08
Question: I'm unable to find a way of performing cross validation on a regression random forest model that I'm trying to produce. So I have a dataset containing 1664 explanatory variables (different chemical properties), with one response variable (retention time). I'm trying to produce a regression random forest model in order to be able to predict the chemical properties of something given its retention time.

ID    RT (seconds)  1_MW    2_AMW  3_Sv   4_Se
4281  38            145.29  5.01   14.76  28.37
4952  40            132.19  6.29   11…
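In sklearn terms the whole loop is one call; the sketch below is only an analog, since the question is about R, where randomForest's built-in out-of-bag error already gives a holdout-style estimate and rfcv() runs cross-validation over shrinking feature subsets:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, random_state=0)  # synthetic stand-in
rf = RandomForestRegressor(n_estimators=300, random_state=0)

scores = cross_val_score(rf, X, y, cv=10, scoring="neg_mean_squared_error")
print(np.sqrt(-scores).mean())  # 10-fold RMSE estimate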