random-forest

How do I replace the bootstrap step in the R package randomForest?

╄→尐↘猪︶ㄣ submitted on 2019-12-05 16:53:13
First some background, which is probably more interesting on stats.stackexchange: in my data analysis I try to compare the performance of different machine learning methods on time-series data (regression, not classification). For example, I train a boosting model and compare it with a random forest model (R package randomForest). I use time-series data where the explanatory variables are lagged values of other series and of the dependent variable. For some reason the random forest severely underperforms. One of the problems I could think of is that the random forest
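randomForest does not expose its bootstrap step, so one workaround is to build the ensemble by hand and swap in a resampling scheme that respects serial dependence, such as a block bootstrap. A minimal sketch in Python (my own helper names; this illustrates the idea, not the randomForest package's internals):

```python
# Hand-rolled "random forest" whose bootstrap step is replaced by a block
# bootstrap, which keeps contiguous runs of a time series together.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def block_bootstrap_indices(n, block_size, rng):
    """Draw contiguous blocks (with replacement) until n indices are collected."""
    idx = []
    while len(idx) < n:
        start = rng.integers(0, n - block_size + 1)
        idx.extend(range(start, start + block_size))
    return np.array(idx[:n])

def fit_block_bootstrap_forest(X, y, n_trees=100, block_size=10, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = block_bootstrap_indices(len(y), block_size, rng)
        # max_features="sqrt" mimics the per-split feature sampling of a forest
        tree = DecisionTreeRegressor(max_features="sqrt",
                                     random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # Average the individual tree predictions, as a regression forest does.
    return np.mean([t.predict(X) for t in trees], axis=0)
```

The block size controls how much autocorrelation structure each resample preserves; with block_size=1 this degenerates to the ordinary i.i.d. bootstrap.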

How to install bigmemory and bigrf on Windows

天大地大妈咪最大 submitted on 2019-12-05 11:06:18
I have been trying to install bigmemory in my R installation. My OS is Windows 7 64-bit and I have tried R v2.15.1, v2.15.2 and v3.0.1 64-bit, but I can't get it to work. I have tried several options: download the current source and run install.packages("D:/Downloads/bigmemory_4.4.3.tar.gz", repos = NULL, type = "source") in R v3.0.1, which gives the error "ERROR: Unix-only package"; download older sources and run similar commands in the various R v2/v3 installations, which gives "ERROR: configuration failed for package 'bigmemory'". Any ideas? I am actually

randomForest does not work when the training set has more factor levels than the test set

孤者浪人 submitted on 2019-12-05 10:44:49
When trying to test my trained model on new test data that has fewer factor levels than my training data, predict() returns: "Type of predictors in new data do not match that of the training data." My training data has a variable with 7 factor levels and my test data has the same variable with 6 factor levels (all 6 ARE in the training data). When I add an observation containing the "missing" 7th level, the model runs, so I'm not sure why this happens or what the logic behind it is. I could see randomForest choking if the test set had more/different factor levels, but why
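The usual R-side fix is to re-encode the test column with the training levels, e.g. test$x <- factor(test$x, levels = levels(train$x)), so both factors carry the same level set even when a level has no observations. The same alignment problem exists in Python when categories are one-hot encoded; a sketch with made-up day-of-week data:

```python
# Align test-set categories to the training-set ones so that encodings built
# from each have identical columns, even when a category is absent from test.
import pandas as pd

train = pd.DataFrame({"day": pd.Categorical(
    ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])})
test = pd.DataFrame({"day": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]})

# Re-encode the test column with the full set of training categories;
# the "missing" Sun level is kept with zero observations.
test["day"] = pd.Categorical(test["day"],
                             categories=train["day"].cat.categories)

train_dummies = pd.get_dummies(train["day"])
test_dummies = pd.get_dummies(test["day"])
# Same columns in the same order -> a fitted model sees identical predictors.
```

This also explains the observed behavior: the check is about the factor's declared level set (which determines the predictor encoding), not about which levels actually occur in the rows.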

Workaround for 32-/64-bit serialization exception on sklearn RandomForest model

时光毁灭记忆、已成空白 submitted on 2019-12-05 09:44:20
If we serialize a random forest model using joblib on a 64-bit machine and then unpack it on a 32-bit machine, there is an exception: ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'long long'. This question has been asked before: Scikits-Learn RandomForrest trained on 64bit python wont open on 32bit python, but it has gone unanswered since 2014. Sample code to learn the model (on a 64-bit machine): modelPath="../" featureVec=... labelVec = ... forest = RandomForestClassifier() randomSearch = RandomizedSearchCV(forest, param_distributions=param_dict, cv=10, scoring=
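The mismatch comes from pickling the fitted trees, whose internal index arrays are platform-dependent. One portable workaround (a sketch, not a fix for the pickle itself) is to ship only the tuned hyper-parameters as plain JSON and refit on the target machine, assuming the training data is available there:

```python
# Workaround sketch: move hyper-parameters (JSON) between machines instead of
# the pickled estimator, then refit. Data here is synthetic for illustration.
import json
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# --- on the 64-bit machine: tune/fit, then export only the parameters ---
tuned = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)
tuned.fit(X, y)
params_json = json.dumps(tuned.get_params())   # ship this file/string across

# --- on the 32-bit machine: rebuild from the parameters and refit ---
params = json.loads(params_json)
clone = RandomForestClassifier(**params).fit(X, y)
```

This trades storage for a retraining run on the 32-bit side; it avoids the dtype issue entirely because nothing binary crosses the architecture boundary.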

R randomForest: data (x) has 0 rows

大兔子大兔子 submitted on 2019-12-05 05:55:29
I am using the randomForest function from the randomForest package to find the most important variables: my data frame is called urban and my response variable is revenue, which is numeric. urban.random.forest <- randomForest(revenue ~ ., y = urban$revenue, data = urban, ntree = 500, keep.forest = FALSE, importance = TRUE, na.action = na.omit) I get the following error: Error in randomForest.default(m, y, ...) : data (x) has 0 rows. In the source code the check is on the x argument: n <- nrow(x); p <- ncol(x); if (n == 0) stop("data (x) has 0 rows") but I could not understand what x is. I solved it. I had some columns
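The excerpt hints at the resolution: if some column is entirely missing, na.action = na.omit performs row-wise deletion and discards every row before the forest sees any data, so x (the predictor matrix) ends up empty. The same effect, reproduced with pandas as a stand-in:

```python
# An all-missing column makes row-wise NA deletion (R's na.omit, pandas'
# dropna) remove every single row -- hence "data (x) has 0 rows".
import numpy as np
import pandas as pd

urban = pd.DataFrame({
    "revenue": [10.0, 12.5, 9.8],
    "pop":     [100,  250,  175],
    "notes":   [np.nan, np.nan, np.nan],   # never recorded -> all missing
})

complete = urban.dropna()        # row-wise deletion, like na.omit: 0 rows left
fixed = urban.drop(columns=["notes"]).dropna()   # drop the empty column first
```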

Split data set and pass the subsets in parallel to function then recombine the results

随声附和 submitted on 2019-12-05 03:32:36
Question: Here is what I am trying to do using the foreach package. I have a data set with 600 rows and 58000 columns, with lots of missing values. I need to impute the missing values using the package "missForest", which is not parallel, and it takes too much time to run on this data at once. So I am thinking of dividing the data into 7 data sets (I have 7 cores) with the same number of rows (my lines) and different numbers of columns (markers), then using %dopar% to pass the data sets in parallel to
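The split / parallel-impute / recombine pattern can be sketched compactly; here in Python with a trivial mean imputer standing in for missForest (the structure, not the algorithm, is the point), and a thread pool standing in for R's %dopar% workers:

```python
# Split columns into chunks, impute each chunk in parallel, then glue the
# results back together in the original column order.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def impute_chunk(chunk):
    """Fill NaNs in each column with that column's mean (missForest stand-in)."""
    col_means = np.nanmean(chunk, axis=0)
    return np.where(np.isnan(chunk), col_means, chunk)

def parallel_impute(X, n_chunks=7):
    chunks = np.array_split(X, n_chunks, axis=1)   # same rows, column subsets
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        filled = list(pool.map(impute_chunk, chunks))
    return np.hstack(filled)                       # recombine, order preserved
```

One caveat worth keeping in mind: missForest models columns jointly, so imputing disjoint column blocks independently changes its answer; the split trades some imputation quality for wall-clock time.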

caret train rf model - inexplicably long execution

不羁的心 submitted on 2019-12-05 01:46:19
Question: While trying to train a random forest model with the caret package, I noticed that the execution time is inexplicably long:

> set.seed = 1;
> n = 500;
> m = 30;
> x = matrix(rnorm(n * m), nrow = n);
> y = factor(sample.int(2, n, replace = T), labels = c("yes", "no"))
> require(caret);
> require(randomForest);
> print(system.time({rf <- randomForest(x, y);}));
   user  system elapsed
   0.99    0.00    0.98
> print(system.time({rfmod <- train(x = x, y = y,
+   method = "rf",
+   metric = "Accuracy",
+   trControl =
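The slowdown is not mysterious: with its defaults, caret's train() fits one forest per resample per tuning candidate (25 bootstrap resamples × 3 mtry values ≈ 75 forests, plus the final refit), not a single forest. The same multiplication is easy to see in scikit-learn's GridSearchCV, used here purely as an analogue:

```python
# Tuning multiplies the number of model fits: candidates x resamples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=30, random_state=1)
grid = {"max_features": [3, 5, 10]}          # 3 candidates (like 3 mtry values)
search = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=1),
                      grid, cv=5)            # 5 resamples per candidate
search.fit(X, y)

n_fits = len(grid["max_features"]) * 5       # 15 forests before the final refit
```

So a roughly 75x slowdown over a bare randomForest() call is expected; shrinking trControl's resampling or fixing mtry via tuneGrid brings it back down.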

Use of randomForest() for classification in R?

牧云@^-^@ submitted on 2019-12-05 01:33:45
I originally had a data frame composed of 12 columns and N rows. The last column is my class (0 or 1). I had to convert the entire data frame to numeric with training <- sapply(training.temp, as.numeric). But then I thought I needed the class column to be a factor to use randomForest() as a classifier, so I did training[,"Class"] <- factor(training[,ncol(training)]). I proceed to creating the forest with training_rf <- randomForest(Class ~ ., data = trainData, importance = TRUE, do.trace = 100). But I'm getting two errors: 1: In Ops.factor(training[, "Status"], factor(training[, ncol
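In R, randomForest() does classification precisely when y is a factor, so the conversion matters (and note the excerpt mixes a matrix from sapply(), the name trainData vs training, and "Class" vs "Status", any of which can trigger such errors). In scikit-learn the regression/classification split is instead explicit in the class you choose; a sketch of making 0/1 labels unambiguously discrete:

```python
# scikit-learn analogue: pick RandomForestClassifier explicitly and give it
# discrete labels, rather than relying on the dtype of y as R does.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 11))                 # 11 predictor columns
y_numeric = (X[:, 0] > 0).astype(float)        # 0.0/1.0 -- looks continuous

y_class = y_numeric.astype(int).astype(str)    # explicit "0"/"1" class labels
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_class)
```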

Error with Sklearn Random Forest Regressor

点点圈 submitted on 2019-12-05 00:46:41
When trying to fit a Random Forest Regressor model with y data that looks like this: [ 0.00000000e+00 1.36094276e+02 4.46608221e+03 8.72660888e+03 1.31375786e+04 1.73580193e+04 2.29420671e+04 3.12216341e+04 4.11395711e+04 5.07972062e+04 6.14904935e+04 7.34275322e+04 7.87333933e+04 8.46302456e+04 9.71074959e+04 1.07146672e+05 1.17187952e+05 1.26953374e+05 1.37736003e+05 1.47239359e+05 1.53943242e+05 1.78806710e+05 1.92657725e+05 2.08912711e+05 2.22855152e+05 2.34532982e+05 2.41391255e+05 2.48699216e+05 2.62421197e+05 2.79544300e+05 2.95550971e+05 3.13524275e+05 3.23365158e+05 3.24069067e+05 3
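The excerpt cuts off before the actual traceback, but a frequent stumble when fitting a regressor to a single series like the y above is the shape of the inputs: fit() expects a 2-D X of shape (n_samples, n_features) and a 1-D y. A sketch under that assumption, using a few of the values above:

```python
# RandomForestRegressor input shapes: X must be 2-D, y 1-D.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

y = np.array([0.0, 136.09, 4466.08, 8726.61, 13137.58, 17358.02])
t = np.arange(len(y), dtype=float)   # a single time-index feature, 1-D

model = RandomForestRegressor(n_estimators=20, random_state=0)
# model.fit(t, y) would raise "Expected 2D array, got 1D array instead"
model.fit(t.reshape(-1, 1), y)       # reshape the lone feature into a column
```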

How can the scikit-learn random forest sub-sample size be equal to the original training data size?

戏子无情 submitted on 2019-12-04 22:29:28
Question: In the documentation of the scikit-learn random forest classifier, it is stated that: "The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default)." What I don't understand is: if the sample size is always the same as the input sample size, then how can we talk about a random selection? There is no selection here, because we use all the (and naturally the same) samples at each training. Am I missing something here?
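The key phrase is "with replacement": drawing n rows with replacement from n rows is still a genuine random selection, because some rows are drawn several times and others not at all. On average each bootstrap sample contains only about 63.2% (1 − 1/e) of the distinct rows, and each tree sees a different subset. A quick numerical check:

```python
# Fraction of distinct rows in one size-n bootstrap sample: ~ 1 - 1/e = 0.632.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
idx = rng.integers(0, n, size=n)           # one bootstrap sample of size n
unique_fraction = len(np.unique(idx)) / n  # close to 0.632
```

The ~36.8% of rows left out of each tree's sample are what scikit-learn and randomForest use for the out-of-bag (OOB) error estimate.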