random-forest

Why can I not find the lowest mean absolute error using Random Forest?

十年热恋 submitted on 2019-12-24 08:01:39
Question: I am doing a Kaggle competition with the following dataset: https://www.kaggle.com/c/home-data-for-ml-course/download/train.csv According to the theory, increasing the number of estimators in a Random Forest model should lower the mean absolute error only up to some point (the sweet spot), and further increases would cause overfitting. By plotting the number of estimators against the mean absolute error we should get this red graph, where the lowest point marks the best number of estimators. I try to find the best number of …
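The sweep described above can be sketched in sklearn; synthetic data stands in for the Kaggle train.csv, and the estimator counts tried are assumptions. Note that with bagging, validation MAE typically plateaus as trees are added rather than rising sharply, so the curve may flatten instead of showing a clean U shape.

```python
# Sketch: sweep n_estimators and record validation MAE to look for the
# "sweet spot". Synthetic data stands in for train.csv here.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

maes = {}
for n in (10, 50, 100, 200):
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(X_tr, y_tr)
    maes[n] = mean_absolute_error(y_va, model.predict(X_va))

best_n = min(maes, key=maes.get)  # estimator count with lowest validation MAE
```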

All binary predictors in a classification task

天大地大妈咪最大 submitted on 2019-12-24 07:24:21
Question: I am performing my analysis using R and will be implementing four algorithms: 1. RF 2. Log Reg 3. SVM 4. LDA. I have 50 predictors and 1 target variable. All my predictors and the target variable are binary (0s and 1s). I have the following questions: Should I convert them all into factors? Converting them into factors and applying the RF algorithm gives 100% accuracy, which I was very surprised to see. Also, for the other algorithms, how should I treat my variables beforehand, before …
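A Python/sklearn analog of this setup (the data-generating rule below is an assumption): tree models handle 0/1 predictors directly without a factor-style conversion, and a surprising 100% accuracy is best checked against held-out cross-validation rather than training accuracy, since a forest of deep trees can memorize the training set.

```python
# Sketch: binary predictors, binary target; compare (near-perfect) training
# accuracy with an honest cross-validated estimate.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))            # 50 binary predictors
y = X[:, 0] ^ X[:, 1]                             # binary target (assumed rule)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
train_acc = clf.fit(X, y).score(X, y)             # often near-perfect
cv_acc = cross_val_score(clf, X, y, cv=5).mean()  # honest estimate
```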

Difference in ROC-AUC scores in sklearn RandomForestClassifier vs. auc methods

亡梦爱人 submitted on 2019-12-24 03:54:19
Question: I am receiving different ROC-AUC scores from sklearn's RandomForestClassifier and the roc_curve/auc methods, respectively. The following code got me a ROC-AUC (i.e. gs.best_score_) of 0.878: def train_model(mod = None, params = None, features = None, outcome = ...outcomes array..., metric = 'roc_auc'): gs = GridSearchCV(mod, params, scoring=metric, loss_func=None, score_func=None, fit_params=None, n_jobs=-1, iid=True, refit=True, cv=10, verbose=0, pre_dispatch='2*n_jobs', error_score='raise')
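One common cause of such a discrepancy (a hedged guess, since the full code is truncated) is feeding hard class labels from predict() into the AUC computation, whereas the 'roc_auc' scorer uses predict_proba() probabilities. A minimal contrast, on synthetic data:

```python
# Sketch: ROC-AUC from thresholded labels vs. from class probabilities.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
auc_labels = roc_auc_score(y_te, clf.predict(X_te))              # thresholded at 0.5
auc_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # what 'roc_auc' scoring uses
```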

Predict/estimate values using randomForest in R

早过忘川 submitted on 2019-12-24 00:57:04
Question: I want to predict values for my Pop_avg field in my unsurveyed areas based on surveyed areas. I am using randomForest based on a suggestion to my earlier question. My surveyed areas: > surveyed <- read.csv("summer_surveyed.csv", header = T) > surveyed_1 <- surveyed[, -c(1,2,3,5,6,7,9,10,11,12,13,15)] > head(surveyed_1, n=1) VEGETATION Pop_avg Acres_1 1 Acer rubrum-Vaccinium corymbosum-Amelanchier spp. 0 27.68884 My unsurveyed areas: > unsurveyed <- read.csv("summer_unsurveyed.csv", header = T …
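The fit-on-surveyed, predict-on-unsurveyed pattern (R's predict(fit, newdata = unsurveyed)) looks like this in sklearn; the numeric relation below is an assumption standing in for the real survey data, and the VEGETATION factor would first need encoding before a model could use it.

```python
# Sketch: fit a regressor on surveyed areas, predict Pop_avg-like values
# for unsurveyed areas that carry the same feature columns.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
acres_surveyed = rng.uniform(1, 50, size=(120, 1))              # Acres_1-like feature
pop_avg = 2.0 * acres_surveyed[:, 0] + rng.normal(0, 1, 120)    # assumed relation

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(acres_surveyed, pop_avg)

acres_unsurveyed = np.array([[10.0], [30.0]])                   # "newdata"
predicted_pop = model.predict(acres_unsurveyed)                 # one estimate per area
```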

Handle null/NaN values in spark mllib classifier

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-23 23:04:05
Question: I have a set of categorical columns (strings) that I'm parsing and converting into Vectors of features to pass to an MLlib classifier (random forest). In my input data, some columns have null values. Say, in one of those columns, I have p values plus a null value: how should I build my feature Vectors and the categoricalFeaturesInfo map of the classifier? Option 1: I declare p values in categoricalFeaturesInfo and use Double.NaN in my input Vectors? Side question: how are NaNs handled by …
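MLlib's decision trees do not give NaN any special treatment, so a common alternative to option 1 is to encode the null as one extra category, declaring p + 1 values in categoricalFeaturesInfo. Plain Python stands in for the Spark pipeline in this sketch:

```python
# Sketch: index p string values 0..p-1 and map null to an extra index p,
# so categoricalFeaturesInfo can declare p + 1 categories for the column.
raw_column = ["red", "green", None, "blue", "green", None]

categories = sorted({v for v in raw_column if v is not None})
index = {v: i for i, v in enumerate(categories)}
null_index = len(categories)                     # the extra "missing" category

encoded = [index.get(v, null_index) for v in raw_column]
num_categories = len(categories) + 1             # value for categoricalFeaturesInfo
```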

Python Scikit Random Forest Regressor Error

≡放荡痞女 submitted on 2019-12-23 10:25:34
Question: I am trying to load training and test data from a CSV, run the random forest regressor in scikit-learn, and then predict the output from the test file. The TrainLoanData.csv file contains 5 columns; the first column is the output and the next 4 columns are the features. The TestLoanData.csv contains 4 columns - the features. When I run the code, I get the error: predicted_probs = ["%f" % x[1] for x in predicted_probs] IndexError: invalid index to scalar variable. What does this mean? Here is my …
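The error itself has a likely explanation: a regressor's predict() returns a 1-D array of scalar floats, so indexing each element with x[1] fails; x[1] would only make sense on per-class probability rows from a classifier's predict_proba(). A minimal sketch on tiny made-up data:

```python
# Sketch: RandomForestRegressor.predict returns shape (n_samples,) scalars,
# so each x is a float and "%f" % x works without x[1].
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
X_test = np.array([[2.5, 3.5]])

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_train, y_train)
predicted = model.predict(X_test)           # 1-D array of floats
formatted = ["%f" % x for x in predicted]   # no x[1]: each x is a scalar
```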

Manual tree fitting memory consumption in sklearn

。_饼干妹妹 submitted on 2019-12-23 05:35:24
Question: I'm using sklearn's RandomForestClassifier for a classification problem. I would like to train the trees of the forest individually, as I am grabbing subsets of a (VERY) large set for each tree. However, when I fit trees manually, memory consumption bloats. Here's a line-by-line memory profile, using memory_profiler, of a custom fit vs. using the RandomForestClassifier's fit function. As far as I can tell, the source fit function performs the same steps as the custom fit. So what gives with all …
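One way to get per-tree control without hand-assembling estimators (and a possible workaround for the memory issue, though that is an assumption without the asker's profile) is sklearn's warm_start: each fit() call adds only the newly requested trees, trained on whatever data is passed in, so a different subset can be fed per call.

```python
# Sketch: grow a forest one tree at a time, each tree on its own subset,
# via warm_start instead of manually fitting DecisionTree objects.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)
rng = np.random.default_rng(0)

forest = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
for i in range(1, 6):
    subset = rng.choice(len(X), size=200, replace=False)
    forest.set_params(n_estimators=i)   # request exactly one more tree
    forest.fit(X[subset], y[subset])    # only the new tree sees this subset

n_trees = len(forest.estimators_)       # 5 trees, each from its own subset
```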

caret: using random forest and including cross-validation

为君一笑 submitted on 2019-12-22 15:14:04
Question: I used the caret package to train a random forest, including repeated cross-validation. I'd like to know whether the OOB, as in the original RF by Breiman, is used or whether it is replaced by the cross-validation. If it is replaced, do I have the same advantages as described in Breiman 2001, like increased accuracy from reducing the correlation between input data? As OOB is drawn with replacement and CV is drawn without replacement, are the two procedures comparable? What is the OOB estimate of …
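The two error estimates can be put side by side; this sklearn sketch is a Python analog of the caret setup (synthetic data, and the comparison itself is the point, not caret's internals). The OOB estimate comes free from bagging, CV refits on held-out folds, and on the same data they typically land close together.

```python
# Sketch: out-of-bag accuracy vs. cross-validated accuracy for one forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
oob_accuracy = clf.fit(X, y).oob_score_                 # out-of-bag estimate
cv_accuracy = cross_val_score(clf, X, y, cv=5).mean()   # cross-validated estimate
```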

Neural Network - Working with an imbalanced dataset

拟墨画扇 submitted on 2019-12-22 10:57:04
Question: I am working on a classification problem with 2 labels: 0 and 1. My training dataset is very imbalanced (and so will be the test set, considering my problem). The imbalance ratio is 1000:4, with label '0' appearing 250 times more often than label '1'. However, I have a lot of training samples: around 23 million, so I should get around 100,000 samples for label '1'. Considering the large number of training samples I have, I didn't consider SVM. I also read about …
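For imbalance of this kind, one standard first step, applicable to neural networks (e.g. per-class loss weights) and to forests alike, is weighting classes inversely to their frequency. The sklearn utility below shows the 'balanced' computation on the question's 1000:4 ratio; using it for a network's loss weighting is the assumption here.

```python
# Sketch: 'balanced' class weights, weight_k = n_samples / (n_classes * count_k).
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 1000 + [1] * 4)      # 250:1 imbalance, as in the question
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
weight_map = dict(zip([0, 1], weights)) # minority class gets the large weight
```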