random-forest

Python: How can we smooth a noisy signal using a moving average?

落花浮王杯 · submitted on 2019-12-02 02:37:13
To evaluate a random forest regression, I am trying to improve the result by applying a moving-average filter after fitting a model with RandomForestRegressor on the dataset found in this link.

```python
import pandas as pd
import math
from math import sqrt
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
from sklearn.cross_validation …  # truncated; note sklearn.cross_validation was removed in favor of sklearn.model_selection
```
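A centered moving average can be computed with NumPy's convolution; the sketch below is a minimal illustration, not taken from the question, and the `window` width is an assumed parameter to tune against the validation error.

```python
import numpy as np

def moving_average(signal, window=5):
    """Smooth a 1-D signal with a moving average of the given window size."""
    kernel = np.ones(window) / window
    # mode="same" keeps the output the same length as the input;
    # values near the edges are averaged over fewer real samples.
    return np.convolve(signal, kernel, mode="same")

# Example: smooth noisy predictions before scoring them.
noisy = np.sin(np.linspace(0, 10, 200)) + np.random.normal(0, 0.3, 200)
smoothed = moving_average(noisy, window=7)
```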

Exporting a caret random forest model to PMML: error

旧巷老猫 · submitted on 2019-12-02 02:29:33
I would like to export a caret random forest model using the pmml library so I can use it for predictions in Java. Here is a reproduction of the error I am getting:

```r
data(iris)
require(caret)
require(pmml)

rfGrid2 <- expand.grid(.mtry = c(1, 2))
fitControl2 <- trainControl(method = "repeatedcv",
                            number = NUMBER_OF_CV,
                            repeats = REPEATES)
model.Test <- train(Species ~ ., data = iris,
                    method = "rf",
                    trControl = fitControl2,
                    ntree = NUMBER_OF_TREES,
                    importance = TRUE,
                    tuneGrid = rfGrid2)
print(model.Test)
pmml(model.Test)
```

Error in UseMethod("pmml") : no applicable method for 'pmml' applied to an …
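(For reference, the same Java-deployment goal is reachable from Python via the sklearn2pmml package, a different technique from the question's R pmml library; a minimal sketch on scikit-learn's bundled iris data, which needs a Java runtime installed to run the converter:)

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)

# Wrap the estimator in a PMMLPipeline so the converter can serialize it.
pipeline = PMMLPipeline([("rf", RandomForestClassifier(n_estimators=100))])
pipeline.fit(X, y)

# Writes a PMML file that a Java PMML engine (e.g. JPMML) can evaluate.
sklearn2pmml(pipeline, "rf.pmml")
```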

maxCategories not working as expected in VectorIndexer when using RandomForestClassifier in pyspark.ml

被刻印的时光 ゝ · submitted on 2019-12-01 17:56:28
Background: I'm doing a simple binary classification with RandomForestClassifier from pyspark.ml. Before feeding the data to training, I used VectorIndexer to decide whether features should be treated as numerical or categorical by setting its maxCategories argument. Problem: even with maxCategories set to 30 in the VectorIndexer, I still get an error during the training pipeline:

An error occurred while calling o15371.fit. : java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) to be at least as large as the number of values in each …
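The error concerns the tree's maxBins parameter, which is separate from VectorIndexer's maxCategories; a minimal sketch of raising it, where the column names and `train_df` are assumptions rather than details from the question:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.classification import RandomForestClassifier

# Column names here are placeholders, not from the question.
indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                        maxCategories=30)

# maxBins must be >= the largest number of distinct values in any
# categorical feature, otherwise DecisionTree raises the error above.
rf = RandomForestClassifier(labelCol="label", featuresCol="indexedFeatures",
                            maxBins=64)

pipeline = Pipeline(stages=[indexer, rf])
# model = pipeline.fit(train_df)  # train_df: an assumed training DataFrame
```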

Variable importance using the caret package (error); RandomForest algorithm

回眸只為那壹抹淺笑 · submitted on 2019-12-01 15:06:54
I am trying to obtain the variable importance of an rf model in any way I can. This is the approach I have tried so far, but alternative suggestions are very welcome. I have trained a model in R:

```r
require(caret)
require(randomForest)

myControl = trainControl(method = 'cv', number = 5, repeats = 2,
                         returnResamp = 'none')
model2 = train(increaseInAssessedLevel ~ ., data = trainData,
               method = 'rf', trControl = myControl)
```

The dataset is fairly large, but the model runs fine. I can access its parts and run commands such as:

```r
> model2[3]
$results
  mtry      RMSE  Rsquared      RMSESD RsquaredSD
1    2 0.1901304 0.3342449 0.004586902          0…
```
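(For comparison, outside of R the same quantity is exposed directly by scikit-learn's random forest as feature_importances_; a minimal Python sketch on synthetic data, not taken from the question:)

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# One impurity-based importance score per feature, summing to 1.
for i in np.argsort(rf.feature_importances_)[::-1]:
    print(f"feature {i}: {rf.feature_importances_[i]:.3f}")
```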

How to pass a character vector to the train function in caret (R)

心已入冬 · submitted on 2019-12-01 14:08:27
I want to reduce the number of variables used when I train my model. I have a total of 784 features that I want to cut down to, say, 500. I can build a long string of the selected features with paste, collapsed with +. For example, say this is my vector:

```r
val <- "pixel40+pixel46+pixel48+pixel65+pixel66+pixel67"
```

Then I would like to pass it to the train function like so:

```r
Rf_model <- train(label ~ val, data = training, method = "rf",
                  ntree = 200, na.action = na.omit)
```

but I get an error from model.frame.default(form = label ~ val, data = training, na.action = na.omit). Thanks!
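(For what it's worth, in the Python world the equivalent feature-subset step needs no formula string at all; a short self-contained sketch where the `training` DataFrame is a synthetic stand-in for the question's 784-column data:)

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# A stand-in for the question's 784-column training DataFrame.
rng = np.random.default_rng(0)
training = pd.DataFrame(rng.random((100, 784)),
                        columns=[f"pixel{i}" for i in range(784)])
training["label"] = rng.integers(0, 2, 100)

# Feature selection here is plain column selection.
selected = ["pixel40", "pixel46", "pixel48", "pixel65", "pixel66", "pixel67"]
rf = RandomForestClassifier(n_estimators=200)
rf.fit(training[selected], training["label"])
```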

Improve h2o DRF runtime on a multi-node cluster

白昼怎懂夜的黑 · submitted on 2019-12-01 12:48:11
I am currently running h2o's DRF algorithm on a 3-node EC2 cluster (the h2o server spans all 3 nodes). My dataset has 1M rows and 41 columns (40 predictors and 1 response). I use the R bindings to control the cluster, and the RF call is as follows:

```r
model = h2o.randomForest(x = x,
                         y = y,
                         ignore_const_cols = TRUE,
                         training_frame = train_data,
                         seed = 1234,
                         mtries = 7,
                         ntrees = 2000,
                         max_depth = 15,
                         min_rows = 50,
                         stopping_rounds = 3,
                         stopping_metric = "MSE",
                         stopping_tolerance = 2e-5)
```

On the 3-node cluster (c4.8xlarge, enhanced networking turned on), this takes about 240 seconds; CPU utilization is between 10-20%; …
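(h2o exposes the same DRF parameters through its Python API; a rough Python equivalent of the call above, where `x`, `y`, and `train_data` are assumed to be a predictor list, a response column name, and an H2OFrame, mirroring the R call:)

```python
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()  # connect to (or start) an h2o cluster

model = H2ORandomForestEstimator(ntrees=2000,
                                 max_depth=15,
                                 min_rows=50,
                                 mtries=7,
                                 seed=1234,
                                 ignore_const_cols=True,
                                 stopping_rounds=3,
                                 stopping_metric="MSE",
                                 stopping_tolerance=2e-5)
# model.train(x=x, y=y, training_frame=train_data)  # x, y, train_data assumed
```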

Retrieve the list of training feature names from a classifier

生来就可爱ヽ(ⅴ<●) · submitted on 2019-12-01 11:06:21
Is there a way to retrieve the list of feature names used to train a classifier, once it has been trained with the fit method? I would like to get this information before applying the model to unseen data. The data used for training is a pandas DataFrame and, in my case, the classifier is a RandomForestClassifier.

An answer begins: Based on the documentation and previous experience, there is no way to get a list of the features considered in at least one of the splits. Is your concern that you do not want to use all your features for prediction, just the ones actually used for training? In that case I suggest …
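(In recent scikit-learn versions (1.0+), an estimator fitted on a DataFrame records its column names itself; a minimal sketch with made-up columns:)

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({"age": [25, 32, 47, 51],
                   "income": [40, 55, 80, 62],
                   "label": [0, 1, 1, 0]})

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(df[["age", "income"]], df["label"])

# Available in scikit-learn >= 1.0 when fit on a DataFrame:
print(clf.feature_names_in_)   # ['age' 'income']
print(clf.n_features_in_)      # 2
```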

Scikit Learn - ValueError: Array contains NaN or infinity

好久不见. · submitted on 2019-12-01 09:34:46
There are no NaNs in my dataset; I have checked thoroughly. Any reason why I keep getting this error when trying to fit my classifier? Some of the numbers in the dataset are rather large, and some decimal values go out to 10 decimal places, but I wouldn't think that should cause an error. I have included some of my pandas DataFrame info below, as well as the error itself. Any ideas?

```
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6244 entries, 1985-02-06 00:00:00 to 2009-11-05 00:00:00
Data columns (total 86 columns):
dtypes: float64(86)
```

```python
clf = RandomForestClassifier(n_estimators=100, min_samples…
```
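(The error message covers infinity as well as NaN, and very large float64 values can overflow to inf during preprocessing; a short diagnostic sketch, with `df` standing in for the question's DataFrame:)

```python
import numpy as np
import pandas as pd

# df stands in for the question's 86-column float64 DataFrame.
df = pd.DataFrame({"a": [1.0, np.inf, 3.0], "b": [4.0, 5.0, 1e308 * 10]})

# Count non-finite entries per column: this catches inf/-inf,
# which df.isnull() alone does not report.
print((~np.isfinite(df)).sum())

# One common fix: turn infinities into NaN, then impute or drop.
clean = df.replace([np.inf, -np.inf], np.nan).dropna()
```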