random-forest

The Effect of Specifying Training Data as New Data when Making Random Forest Predictions in R

Submitted by 岁酱吖の on 2020-01-07 01:31:10
Question: While using the predict function in R to get predictions from a Random Forest model, I misspecified the training data as newdata as follows: RF1pred <- predict(RF1, newdata=TrainS1, type = "class"). Used like this, I get extremely high accuracy and AUC, which I am sure is not right, but I couldn't find a good explanation for it. This thread is the closest I got, but I can't say I fully understand the explanation there. If someone could elaborate, I would be grateful. Thank you! EDIT:
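The effect the asker describes has a direct analogue outside R: calling predict on the training set scores every sample with all trees, including the trees that were fit on that very sample, while R's predict(model) with no newdata returns out-of-bag (OOB) predictions, where each sample is scored only by trees that never saw it. A minimal sketch in scikit-learn (synthetic data, not the asker's) showing the gap between the two numbers:

```python
# Training-set accuracy vs. out-of-bag accuracy for a random forest.
# rf.score(X, y) is analogous to predict(RF1, newdata=TrainS1) in R;
# rf.oob_score_ is analogous to predict(RF1) with no newdata.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

train_acc = rf.score(X, y)   # inflated: trees have memorized their samples
oob_acc = rf.oob_score_      # honest estimate: each sample scored out-of-bag
print(f"training-set accuracy: {train_acc:.3f}")
print(f"out-of-bag accuracy:   {oob_acc:.3f}")
```

The training-set number is typically near 1.0 for a random forest, which is exactly the "too high accuracy and AUC" the question reports; the OOB number is the one to trust.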

Permutation importance in h2o Random Forest

Submitted by 混江龙づ霸主 on 2020-01-06 05:47:08
Question: The CRAN implementation of random forests offers two variable importance measures: the Gini importance as well as the widely used permutation importance, defined as follows. For classification, it is the increase in the percentage of times a case is OOB and misclassified when the variable is permuted. For regression, it is the average increase in squared OOB residuals when the variable is permuted. By default h2o.varimp() computes only the former. Is there really no option in h2o to get the alternative
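The permutation importance the question defines is library-agnostic: shuffle one column, re-score, and record the drop. As a cross-language sketch (scikit-learn standing in for h2o; the function names below are sklearn's, not h2o's), both measures can be computed side by side:

```python
# Gini importance vs. permutation importance on the same fitted forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=1)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Impurity-based (Gini) importance -- what h2o.varimp() reports by default,
# per the question.
gini = rf.feature_importances_

# Permutation importance: mean score decrease when each column is shuffled.
perm = permutation_importance(rf, X, y, n_repeats=5, random_state=1)
print("Gini:       ", gini.round(3))
print("Permutation:", perm.importances_mean.round(3))
```

The two rankings often agree on the strongest features but can disagree for correlated or high-cardinality columns, which is why the question asks for the permutation variant specifically.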

How to predict correctly in sklearn RandomForestRegressor?

Submitted by 天大地大妈咪最大 on 2020-01-06 04:54:06
Question: I'm working on a big data project for school. My dataset looks like this: https://github.com/gindeleo/climate/blob/master/GlobalTemperatures.csv I'm trying to predict the next values of "LandAverageTemperature". First, I imported the csv into pandas and made a DataFrame named "df1". After getting errors on my first tries in sklearn, I converted the "dt" column from string into datetime64, then added a column named "year" that shows only the years from the date values. It's probably
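The preprocessing the asker describes can be sketched as follows. The column names "dt" and "LandAverageTemperature" come from the question; the rows here are fabricated stand-ins for the GlobalTemperatures.csv data:

```python
# Parse the date strings, extract a numeric year, and fit a regressor on it.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df1 = pd.DataFrame({
    "dt": ["1900-01-01", "1950-01-01", "2000-01-01", "2010-01-01"] * 10,
    "LandAverageTemperature": [7.5, 8.1, 8.9, 9.3] * 10,
})
df1["dt"] = pd.to_datetime(df1["dt"])   # string -> datetime64
df1["year"] = df1["dt"].dt.year         # plain integer sklearn accepts

X = df1[["year"]]
y = df1["LandAverageTemperature"]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred = model.predict(pd.DataFrame({"year": [2020]}))
```

One caveat worth knowing for this use case: a random forest cannot extrapolate beyond the training range, so predicting "the next values" for future years will plateau at roughly the most recent leaf averages; a trend model is usually combined with it for forecasting.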

R Caret Random Forest AUC too good to be true?

Submitted by 时光毁灭记忆、已成空白 on 2020-01-05 02:27:09
Question: Relative newbie to predictive modeling--most of my training/experience is in inferential stats. I'm trying to predict student college graduation in 4 years. The basic issue is that I've done data cleaning (imputing, centering, scaling); split that processed/transformed data into training (70%) and testing (30%) sets; and balanced the data using two approaches (because the data was 65%=0, 35%=1--I've found inconsistent advice on what counts as unbalanced, but one source suggested anything not
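A common cause of a "too good to be true" AUC in exactly this workflow is balancing (e.g. oversampling) before the train/test split, so duplicated minority rows leak into the test set. A hedged sketch of the safe ordering, in Python as a stand-in for the asker's caret workflow (the same order applies in R):

```python
# Split first, balance only the training fold, evaluate on untouched data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.65, 0.35],
                           random_state=2)  # ~65%/35% as in the question

# 1) split first, stratified so both folds keep the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=2)

# 2) oversample the minority class in the TRAINING set only
minority = X_tr[y_tr == 1]
n_extra = (y_tr == 0).sum() - (y_tr == 1).sum()
extra = resample(minority, n_samples=n_extra, random_state=2)
X_bal = np.vstack([X_tr, extra])
y_bal = np.concatenate([y_tr, np.ones(len(extra), dtype=int)])

# 3) score on the test set, which never saw any duplicated rows
rf = RandomForestClassifier(n_estimators=100, random_state=2).fit(X_bal, y_bal)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
print(f"test AUC: {auc:.3f}")
```

If the balanced copies were created before the split, near-identical rows would appear on both sides of it and the test AUC would approach 1.0 for the wrong reason.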

Same probability for every hour in a loop with randomForest

Submitted by 大憨熊 on 2020-01-04 07:52:35
Question: I am predicting probabilities per hour for every observation with a random forest model. But for some reason the prediction for every hour within an observation is the same. This shouldn't be the case, since the probability is different for every hour. I have masked some data for privacy reasons. Here's a sample of my data (str() output), where ti is the hours variable:
$ y        : Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 2 2 1 ...
$ geslacht : Factor w/ 2 levels "Dhr.","Mevr.": 2 2 1 1 1 2 1 1 2 2 ...
$ ti :

Parallelizing random forests

Submitted by 99封情书 on 2020-01-04 07:46:22
Question: Through searching and asking, I've found many packages I can use to make use of all the cores of my server, and many packages that can do random forests. I'm quite new at this, and I'm getting lost among all the ways to parallelize the training of my random forest. Could you give some advice on reasons to use and/or avoid each of them, or on specific combinations (with or without caret?) that have proven themselves? Packages for parallelization: doParallel, doSNOW, doSMP
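Whatever package is chosen, the underlying pattern is the same: random forest trees are grown independently, so training is embarrassingly parallel. As a cross-language sketch (scikit-learn rather than the R packages the question lists), a single argument fans tree fitting out over all cores, which is the analogue of wrapping R's randomForest in a foreach/doParallel loop and combining the sub-forests:

```python
# Parallel random forest training: trees are independent, so n_jobs=-1
# distributes them across all available cores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=4)
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=4)
rf.fit(X, y)
print(len(rf.estimators_))  # 200 independently grown trees
```

The practical trade-off in R is similar regardless of backend: fork-based backends (doParallel on Unix) avoid copying the data to each worker, while socket/snow backends must serialize it, which matters for large training sets.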

The class_weight hyperparameter in Random Forest changes the number of samples in the confusion matrix

Submitted by 社会主义新天地 on 2020-01-03 05:28:07
Question: I'm currently working on a Random Forest classification model on a dataset containing 24,000 samples, 20,000 of which belong to class 0 and 4,000 to class 1. I made a train_test_split where the test set is 0.2 of the whole dataset (around 4,800 samples in the test set). Since I'm dealing with imbalanced data, I looked at the class_weight hyperparameter, which is aimed at solving this issue. The problem I'm facing: the moment I set class_weight='balanced' and look at the confusion_matrix
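The effect the asker observes can be reproduced in a few lines: class_weight raises the penalty for misclassifying minority samples during tree growing, so more test samples end up predicted as the minority class and the counts move between confusion-matrix cells (the total stays the same). Synthetic data below, with roughly the 5:1 imbalance described in the question:

```python
# Confusion matrix with and without class_weight='balanced'.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, weights=[5/6, 1/6],
                           flip_y=0.2, random_state=5)  # ~5:1 imbalance
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=5)

for cw in [None, "balanced"]:
    rf = RandomForestClassifier(n_estimators=100, class_weight=cw,
                                random_state=5).fit(X_tr, y_tr)
    cm = confusion_matrix(y_te, rf.predict(X_te))
    print(cw, cm.ravel())  # tn, fp, fn, tp shift between the two runs
```

Nothing about the test set changes; only the model's decision behavior does, which is why the per-cell counts differ while each row of the confusion matrix still sums to the true class counts.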

Random Forest Black Box with CleverHans

Submitted by 浪子不回头ぞ on 2020-01-03 02:45:07
Question: I am new to this stuff and trying to attack a Random Forest with black-box FGSM (from CleverHans), but I'm not sure how to implement it. They have a black-box example for MNIST data, but I don't understand where I should put my random forest and where I should attack. Any help would be appreciated.
Answer 1: In the current tutorial, the black-box model is a neural network implemented with TensorFlow, and its predictions (the labels) are used to train a substitute model (a copy of the black-box model). The
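The answer's key idea does not depend on the black box being a neural network: any model that can be queried for labels can be copied by a substitute. A hedged sketch with scikit-learn standing in for the CleverHans tutorial (MNIST and TensorFlow omitted; a gradient-based attack like FGSM would then be crafted against the differentiable substitute, not against the forest):

```python
# Substitute-model training: the random forest is the black box, and a
# differentiable model is trained on the labels it returns for queries.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=800, n_features=10, random_state=6)
black_box = RandomForestClassifier(n_estimators=100, random_state=6).fit(X, y)

# Query the black box for labels -- the only access the attack assumes.
X_query = X[:400]
labels = black_box.predict(X_query)

# Train the substitute on (query inputs, black-box labels).
substitute = LogisticRegression(max_iter=1000).fit(X_query, labels)
agreement = (substitute.predict(X) == black_box.predict(X)).mean()
print(f"substitute agrees with black box on {agreement:.0%} of samples")
```

Adversarial examples computed from the substitute's gradients then tend to transfer to the black box, which is what makes the attack work without access to the forest's internals.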

Tagging columns as Categorical in Spark

Submitted by 五迷三道 on 2020-01-02 10:18:34
Question: I am currently using StringIndexer to convert a lot of columns into unique integers for classification in RandomForestModel. I am also using a pipeline for the ML process. Some queries: How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add some metadata of some sort to indicate that a column is categorical? In mllib.tree.RF there was a parameter called categoricalInfo which indicated which columns are categorical.