random-forest

The Effect of Specifying Training Data as New Data when Making Random Forest Predictions in R

Submitted by 岁酱吖の on 2020-01-07 01:31:10
Question: While using the predict function in R to get predictions from a Random Forest model, I misspecified the training data as newdata as follows: RF1pred <- predict(RF1, newdata=TrainS1, type = "class"). Used like this, I get extremely high accuracy and AUC, which I am sure is not right, but I couldn't find a good explanation for it. This thread is the closest I got, but I can't say I fully understand the explanation there. If someone could elaborate, I would be grateful. Thank you! EDIT:
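The effect the asker describes has a direct analogue outside R: calling predict on the training set scores every sample with all trees, including the trees that were fit on that very sample, while R's predict(model) with no newdata returns out-of-bag (OOB) predictions, where each sample is scored only by trees that never saw it. A minimal sketch in scikit-learn (synthetic data, not the asker's) showing the gap between the two numbers:

```python
# Training-set accuracy vs. out-of-bag accuracy for a random forest.
# rf.score(X, y) is analogous to predict(RF1, newdata=TrainS1) in R;
# rf.oob_score_ is analogous to predict(RF1) with no newdata.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

train_acc = rf.score(X, y)   # inflated: trees have memorized their samples
oob_acc = rf.oob_score_      # honest estimate: each sample scored out-of-bag
print(f"training-set accuracy: {train_acc:.3f}")
print(f"out-of-bag accuracy:   {oob_acc:.3f}")
```

The training-set number is typically near 1.0 for a random forest, which is exactly the "too high accuracy and AUC" the question reports; the OOB number is the one to trust.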

Permutation importance in h2o Random Forest

Submitted by 混江龙づ霸主 on 2020-01-06 05:47:08
Question: The CRAN implementation of random forests offers two variable importance measures: the Gini importance as well as the widely used permutation importance, defined as follows. For classification, it is the increase in the percentage of times a case is OOB and misclassified when the variable is permuted. For regression, it is the average increase in squared OOB residuals when the variable is permuted. By default h2o.varimp() computes only the former. Is there really no option in h2o to get the alternative
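The permutation importance the question defines is library-agnostic: shuffle one column, re-score, and record the drop. As a cross-language sketch (scikit-learn standing in for h2o; the function names below are sklearn's, not h2o's), both measures can be computed side by side:

```python
# Gini importance vs. permutation importance on the same fitted forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=1)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Impurity-based (Gini) importance -- what h2o.varimp() reports by default,
# per the question.
gini = rf.feature_importances_

# Permutation importance: mean score decrease when each column is shuffled.
perm = permutation_importance(rf, X, y, n_repeats=5, random_state=1)
print("Gini:       ", gini.round(3))
print("Permutation:", perm.importances_mean.round(3))
```

The two rankings often agree on the strongest features but can disagree for correlated or high-cardinality columns, which is why the question asks for the permutation variant specifically.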

How to predict correctly in sklearn RandomForestRegressor?

Submitted by 天大地大妈咪最大 on 2020-01-06 04:54:06
Question: I'm working on a big data project for school. My dataset looks like this: https://github.com/gindeleo/climate/blob/master/GlobalTemperatures.csv I'm trying to predict the next values of "LandAverageTemperature". First, I imported the csv into pandas and made a DataFrame named "df1". After getting errors on my first tries in sklearn, I converted the "dt" column from string into datetime64, then added a column named "year" that shows only the years from the date values. It's probably
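The preprocessing the asker describes can be sketched as follows. The column names "dt" and "LandAverageTemperature" come from the question; the rows here are fabricated stand-ins for the GlobalTemperatures.csv data:

```python
# Parse the date strings, extract a numeric year, and fit a regressor on it.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df1 = pd.DataFrame({
    "dt": ["1900-01-01", "1950-01-01", "2000-01-01", "2010-01-01"] * 10,
    "LandAverageTemperature": [7.5, 8.1, 8.9, 9.3] * 10,
})
df1["dt"] = pd.to_datetime(df1["dt"])   # string -> datetime64
df1["year"] = df1["dt"].dt.year         # plain integer sklearn accepts

X = df1[["year"]]
y = df1["LandAverageTemperature"]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred = model.predict(pd.DataFrame({"year": [2020]}))
```

One caveat worth knowing for this use case: a random forest cannot extrapolate beyond the training range, so predicting "the next values" for future years will plateau at roughly the most recent leaf averages; a trend model is usually combined with it for forecasting.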

R Caret Random Forest AUC too good to be true?

Submitted by 时光毁灭记忆、已成空白 on 2020-01-05 02:27:09
Question: Relative newbie to predictive modeling--most of my training/experience is in inferential stats. I'm trying to predict student college graduation in 4 years. The basic issue is that I've done data cleaning (imputing, centering, scaling); split that processed/transformed data into training (70%) and testing (30%) sets; and balanced the data using two approaches (because the data was 65%=0, 35%=1--I've found inconsistent advice on what counts as unbalanced, but one source suggested anything not
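A common cause of a "too good to be true" AUC in exactly this workflow is balancing (e.g. oversampling) before the train/test split, so duplicated minority rows leak into the test set. A hedged sketch of the safe ordering, in Python as a stand-in for the asker's caret workflow (the same order applies in R):

```python
# Split first, balance only the training fold, evaluate on untouched data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.65, 0.35],
                           random_state=2)  # ~65%/35% as in the question

# 1) split first, stratified so both folds keep the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=2)

# 2) oversample the minority class in the TRAINING set only
minority = X_tr[y_tr == 1]
n_extra = (y_tr == 0).sum() - (y_tr == 1).sum()
extra = resample(minority, n_samples=n_extra, random_state=2)
X_bal = np.vstack([X_tr, extra])
y_bal = np.concatenate([y_tr, np.ones(len(extra), dtype=int)])

# 3) score on the test set, which never saw any duplicated rows
rf = RandomForestClassifier(n_estimators=100, random_state=2).fit(X_bal, y_bal)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
print(f"test AUC: {auc:.3f}")
```

If the balanced copies were created before the split, near-identical rows would appear on both sides of it and the test AUC would approach 1.0 for the wrong reason.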

Same probability for every hour in a loop with randomForest

Submitted by 大憨熊 on 2020-01-04 07:52:35
Question: I am predicting probabilities per hour for every observation with a random forest model. But for some reason the prediction for every hour within an observation is the same. This shouldn't be the case, since the probability is different for every hour. I have masked some data for privacy reasons. Here's a sample of my data (str() output), where ti is the hours variable:
$ y        : Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 2 2 1 ...
$ geslacht : Factor w/ 2 levels "Dhr.","Mevr.": 2 2 1 1 1 2 1 1 2 2 ...
$ ti :

Parallelizing random forests

Submitted by 99封情书 on 2020-01-04 07:46:22
Question: Through searching and asking, I've found many packages I can use to make use of all the cores of my server, and many packages that can do random forests. I'm quite new at this, and I'm getting lost among all the ways to parallelize the training of my random forest. Could you give some advice on reasons to use and/or avoid each of them, or on specific combinations (with or without caret?) that have proven themselves? Packages for parallelization: doParallel, doSNOW, doSMP
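Whatever package is chosen, the underlying pattern is the same: random forest trees are grown independently, so training is embarrassingly parallel. As a cross-language sketch (scikit-learn rather than the R packages the question lists), a single argument fans tree fitting out over all cores, which is the analogue of wrapping R's randomForest in a foreach/doParallel loop and combining the sub-forests:

```python
# Parallel random forest training: trees are independent, so n_jobs=-1
# distributes them across all available cores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=4)
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=4)
rf.fit(X, y)
print(len(rf.estimators_))  # 200 independently grown trees
```

The practical trade-off in R is similar regardless of backend: fork-based backends (doParallel on Unix) avoid copying the data to each worker, while socket/snow backends must serialize it, which matters for large training sets.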

The class_weight hyperparameter in Random Forest changes the number of samples in the confusion matrix

Submitted by 社会主义新天地 on 2020-01-03 05:28:07
Question: I'm currently working on a Random Forest classification model on a dataset containing 24,000 samples, 20,000 of which belong to class 0 and 4,000 to class 1. I made a train_test_split where the test set is 0.2 of the whole dataset (around 4,800 samples in the test set). Since I'm dealing with imbalanced data, I looked at the class_weight hyperparameter, which is aimed at solving this issue. The problem I'm facing: the moment I set class_weight='balanced' and look at the confusion_matrix
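The effect the asker observes can be reproduced in a few lines: class_weight raises the penalty for misclassifying minority samples during tree growing, so more test samples end up predicted as the minority class and the counts move between confusion-matrix cells (the total stays the same). Synthetic data below, with roughly the 5:1 imbalance described in the question:

```python
# Confusion matrix with and without class_weight='balanced'.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, weights=[5/6, 1/6],
                           flip_y=0.2, random_state=5)  # ~5:1 imbalance
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=5)

for cw in [None, "balanced"]:
    rf = RandomForestClassifier(n_estimators=100, class_weight=cw,
                                random_state=5).fit(X_tr, y_tr)
    cm = confusion_matrix(y_te, rf.predict(X_te))
    print(cw, cm.ravel())  # tn, fp, fn, tp shift between the two runs
```

Nothing about the test set changes; only the model's decision behavior does, which is why the per-cell counts differ while each row of the confusion matrix still sums to the true class counts.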

Random Forest Black Box with CleverHans

Submitted by 浪子不回头ぞ on 2020-01-03 02:45:07
Question: I am new to this stuff and trying to attack a Random Forest with black-box FGSM (from CleverHans), but I'm not sure how to implement it. They have a black-box example for MNIST data, but I don't understand where I should put my random forest and where I should attack. Any help would be appreciated.
Answer 1: In the current tutorial, the black-box model is a neural network implemented with TensorFlow, and its predictions (the labels) are used to train a substitute model (a copy of the black-box model). The
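The answer's key idea does not depend on the black box being a neural network: any model that can be queried for labels can be copied by a substitute. A hedged sketch with scikit-learn standing in for the CleverHans tutorial (MNIST and TensorFlow omitted; a gradient-based attack like FGSM would then be crafted against the differentiable substitute, not against the forest):

```python
# Substitute-model training: the random forest is the black box, and a
# differentiable model is trained on the labels it returns for queries.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=800, n_features=10, random_state=6)
black_box = RandomForestClassifier(n_estimators=100, random_state=6).fit(X, y)

# Query the black box for labels -- the only access the attack assumes.
X_query = X[:400]
labels = black_box.predict(X_query)

# Train the substitute on (query inputs, black-box labels).
substitute = LogisticRegression(max_iter=1000).fit(X_query, labels)
agreement = (substitute.predict(X) == black_box.predict(X)).mean()
print(f"substitute agrees with black box on {agreement:.0%} of samples")
```

Adversarial examples computed from the substitute's gradients then tend to transfer to the black box, which is what makes the attack work without access to the forest's internals.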

Tagging columns as Categorical in Spark

Submitted by 五迷三道 on 2020-01-02 10:18:34
Question: I am currently using StringIndexer to convert a lot of columns into unique integers for classification in RandomForestModel. I am also using a pipeline for the ML process. Some queries: How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add some metadata of some sort to indicate that a column is categorical? In mllib.tree.RF there was a parameter called categoricalInfo which indicated which columns are categorical.