random-forest

How to get the n_estimators and max_features of the minimum OOB error and use them for optimization?

ε祈祈猫儿з submitted 2019-12-13 17:19:12

Question: I want to optimize a Random Forest classifier, so I plotted the OOB error (the code is available in scikit-learn). From this plot I want to pick the two values (n_estimators and max_features) that give the lowest OOB error, and then use them to optimize the classifier (a clf.fit). From the curve it can be seen that with n_estimators = 170 and max_features = 5 I get the lowest OOB error. But how can I pass these two values to the RandomForest for clf.fit? I want to use this technique instead of…
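The two values read off the OOB curve are constructor arguments, not arguments to fit(). A minimal sketch (the synthetic X, y stand in for the question's data, which is not shown; 170 and 5 are the values from the question's plot):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the question's training set (hypothetical).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The values read off the OOB curve (170 trees, 5 candidate features per
# split) go to the constructor; fit() then only receives the data.
clf = RandomForestClassifier(
    n_estimators=170,
    max_features=5,
    oob_score=True,   # keep tracking the out-of-bag estimate
    random_state=0,
)
clf.fit(X, y)
print(clf.oob_score_)
```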

RandomForestClassifier import

你离开我真会死。 submitted 2019-12-13 16:36:17

Question: I've installed the Anaconda Python distribution with scikit-learn. While importing RandomForestClassifier with from sklearn.ensemble import RandomForestClassifier I get the following error: File "C:\Anaconda\lib\site-packages\sklearn\tree\tree.py", line 36, in <module> from . import _tree ImportError: cannot import name _tree What could the problem be? Answer 1: In sklearn version '0.18.1': from sklearn.ensemble.forest import RandomForestClassifier Answer 2: The problem was that I had the 64-bit version of…

SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?

懵懂的女人 submitted 2019-12-13 15:40:27

Question: I've got a LabeledPoint on which I want to run a decision tree (and later a random forest): scala> transformedData.collect res8: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064…
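MLlib's tree trainers take categoricalFeaturesInfo as a map from feature index to the number of categories, assuming each categorical feature is encoded as integers 0..k-1. A small pure-Python sketch of building that map (the helper name and the toy rows are hypothetical):

```python
# Hypothetical helper: build the {featureIndex: arity} map that MLlib's
# DecisionTree.trainClassifier expects, assuming each categorical feature
# is encoded as integer codes 0..k-1.
def categorical_features_info(rows, categorical_indices):
    info = {}
    for idx in categorical_indices:
        # MLlib requires values in 0..k-1, so max value + 1 is the arity.
        info[idx] = int(max(row[idx] for row in rows)) + 1
    return info

rows = [
    [0.0, 2.0, 1.5],
    [1.0, 0.0, 3.2],
    [2.0, 1.0, 0.7],
]
# Features 0 and 1 are categorical; feature 2 is continuous and is omitted.
print(categorical_features_info(rows, [0, 1]))  # {0: 3, 1: 3}
```

In PySpark the resulting dict would then be passed as DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo=info, ...); in Scala it is the analogous Map[Int, Int] argument.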

Error with Sklearn Random Forest Regressor

无人久伴 submitted 2019-12-13 12:24:20

Question: When trying to fit a Random Forest Regressor model with y data that looks like this: [ 0.00000000e+00 1.36094276e+02 4.46608221e+03 8.72660888e+03 1.31375786e+04 1.73580193e+04 2.29420671e+04 3.12216341e+04 4.11395711e+04 5.07972062e+04 6.14904935e+04 7.34275322e+04 7.87333933e+04 8.46302456e+04 9.71074959e+04 1.07146672e+05 1.17187952e+05 1.26953374e+05 1.37736003e+05 1.47239359e+05 1.53943242e+05 1.78806710e+05 1.92657725e+05 2.08912711e+05 2.22855152e+05 2.34532982e+05 2.41391255e+05 2…
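The actual traceback is cut off above, so the cause is unknown; one frequent source of errors with targets like these is passing y with the wrong shape. A minimal sketch of a well-formed fit (the single feature X is hypothetical; only the first few y values are copied from the question):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Targets like the ones quoted in the question (widely scaled floats).
y = np.array([0.0, 136.09, 4466.08, 8726.61, 13137.58, 17358.02, 22942.07])
X = np.arange(len(y)).reshape(-1, 1)   # hypothetical single feature

reg = RandomForestRegressor(n_estimators=50, random_state=0)
# fit() expects X with shape (n_samples, n_features) and y with shape
# (n_samples,); a column-vector y such as y.reshape(-1, 1) can trigger
# warnings or errors in some versions, so ravel() it first if needed.
reg.fit(X, y.ravel())
print(reg.predict([[3]]))
```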

How to construct dataframe for time series data using ensemble learning methods

白昼怎懂夜的黑 submitted 2019-12-13 09:16:18

Question: I am trying to predict the Bitcoin price at t+5, i.e. 5 minutes ahead, using 11 technical indicators up to time t, all of which can be calculated from the open, high, low, close and volume values of the Bitcoin time series (see my full data set here). As far as I know, it is not necessary to manipulate the data frame when using algorithms like regression trees, support vector machines or artificial neural networks, but when using ensemble methods like random forests (RF) and boosting, I heard…
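The usual way to frame this for a random forest is to turn the series into a supervised table: shift the target 5 steps back and add lagged features, so each row is a self-contained (features, label) pair. A sketch with pandas (the prices and the choice of three lags are made up for illustration):

```python
import pandas as pd

# Hypothetical minute-level close prices standing in for the Bitcoin series.
df = pd.DataFrame({"close": [100.0, 101.5, 101.2, 102.0, 103.1,
                             102.8, 104.0, 104.5, 103.9, 105.2]})

# Target: the close 5 minutes ahead. Features: current value plus lags.
df["target_t+5"] = df["close"].shift(-5)
for lag in (1, 2, 3):
    df[f"lag_{lag}"] = df["close"].shift(lag)

# Rows with NaN (the edges of the series) cannot be used for training.
supervised = df.dropna().reset_index(drop=True)
print(supervised)
```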

RandomForest classification in Weka

ぃ、小莉子 submitted 2019-12-13 04:35:12

Question: The attributes are saved in 11 columns of a CSV file. If the order of the columns changes, could RandomForest and RandomTree give a different accuracy each time? Answer 1: The ordering of the features does not affect any classifier I know of (except those specially designed to do so, like specialized classifiers for time series and other temporal features), no matter whether it is a Neural Network, SVM, RandomForest, RandomTree or NaiveBayes; it is just a numerical simplification, as arrays…
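The claim is easy to check empirically. A small sketch using GaussianNB, one of the classifiers the answer names (synthetic data; naive Bayes treats features independently, so permuting the columns permutes the per-feature parameters without changing any prediction):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)

# The same data with the columns in a different order.
perm = [2, 0, 3, 1]
X_perm = X[:, perm]

a = GaussianNB().fit(X, y).predict(X)
b = GaussianNB().fit(X_perm, y).predict(X_perm)
# The two models make identical predictions: column order is irrelevant.
print((a == b).all())
```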

ValueError when doing validation with random forests

☆樱花仙子☆ submitted 2019-12-13 03:47:31

Question: I'm trying to build a model that predicts the caco-2 coefficient of a molecule given its SMILES string representation. My solution is based on this example. Since I need to predict a real value, I use a RandomForestRegressor. With some molecules added to the code manually, everything works (although the predictions themselves are wildly wrong): from rdkit import Chem, DataStructs # all the nice chemical stuff, ConvertToNumpyArray from rdkit.Chem import AllChem from sklearn.ensemble…
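The code above is cut off before the error, but the validation step it leads to typically requires a 2-D feature matrix and a 1-D float target. A sketch of the shape contract, with random bit vectors standing in for the rdkit fingerprints (hypothetical: the real X would come from AllChem.GetMorganFingerprintAsBitVect plus ConvertToNumpyArray):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
# Random 0/1 vectors standing in for Morgan fingerprints (hypothetical).
X = rng.integers(0, 2, size=(40, 128))
# Fake caco-2-like log values; the real targets would come from the dataset.
y = rng.normal(loc=-5.0, scale=1.0, size=40)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
# cross_val_score needs X of shape (n_samples, n_features) and y of shape
# (n_samples,); mismatched lengths between X and y are a common ValueError.
scores = cross_val_score(reg, X, y, cv=3, scoring="neg_mean_squared_error")
print(scores.shape)
```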

AWS SageMaker RandomCutForest (RCF) vs scikit-learn RandomForest (RF)?

丶灬走出姿态 submitted 2019-12-13 03:10:10

Question: Is there a difference between the two, or are they different names for the same algorithm? Answer 1: RandomCutForest (RCF) is an unsupervised method primarily used for anomaly detection, while RandomForest (RF) is a supervised method that can be used for regression or classification. For RCF, see the documentation (here) and a notebook example (here). Source: https://stackoverflow.com/questions/56728230/aws-sagemaker-randomcutforest-rcf-vs-scikit-lean-randomforest-rf
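The supervised/unsupervised distinction can be illustrated in scikit-learn itself, using IsolationForest as the closest sklearn analogue to SageMaker's RandomCutForest (a sketch on synthetic data; RCF itself is not in sklearn):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

# Supervised: RandomForest learns a mapping from X to the provided labels y.
rf = RandomForestClassifier(random_state=0).fit(X, y)

# Unsupervised: IsolationForest (conceptually similar to RandomCutForest)
# sees no labels and instead scores points by how anomalous they are,
# returning -1 for outliers and 1 for inliers.
iso = IsolationForest(random_state=0).fit(X)
print(rf.predict([[2.0, 0.0]]), iso.predict([[8.0, 8.0]]))
```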

Add separate vlines to ggplot for each factor group (dotplot of variable importance from a random forest)

人盡茶涼 submitted 2019-12-12 21:35:00

Question: I am using ggplot2 to make a dotplot of six related variable-importance results from a random forest. My data (which I have already converted to long format using reshape2) look like this (my real dataset is a bit bigger):

Factor     Group  Value
Gender     A      0.000127
Age        A      0.000383
Informant  A     -0.000191
Gender     B     -0.000255
Age        B      0.000389
Informant  B     -0.000312
Gender     C     -0.000285
Age        C      0.000389
Informant  C     -0.000282

I can make the dotplot like this: ggplot(mydata, aes(x = Value, y = Factor, colour =…

Oversampling or SMOTE in Pyspark

瘦欲@ submitted 2019-12-12 17:14:51

Question: I have 7 classes and a total of 115 records, and I want to run a Random Forest model on this data. But the data are not enough to get high accuracy, so I want to apply oversampling over all the classes in such a way that the majority class itself gets a higher count and then the minority classes accordingly. Is this possible in PySpark?

+---------+-----+
| SubTribe|count|
+---------+-----+
|    Chill|   10|
|     Cool|   18|
|Adventure|   18|
|    Quirk|   13|
|  Mystery|   25|
|    Party|   18|
|Glamorous|   13|
+-----…
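One simple variant of the idea is plain random oversampling: duplicate records in each class until every class reaches a target size. A pure-Python sketch using the class counts from the question (upsampling everything to the largest class; integer stand-ins replace the real records):

```python
import random

random.seed(0)
# Class counts from the question's table.
counts = {"Chill": 10, "Cool": 18, "Adventure": 18, "Quirk": 13,
          "Mystery": 25, "Party": 18, "Glamorous": 13}

# Naive random oversampling: duplicate records in every class until each
# class reaches the size of the largest one (here, Mystery with 25).
target = max(counts.values())
oversampled = {}
for label, n in counts.items():
    rows = list(range(n))                       # stand-ins for real records
    extra = random.choices(rows, k=target - n)  # duplicates, with replacement
    oversampled[label] = rows + extra

print({label: len(rows) for label, rows in oversampled.items()})
```

In PySpark the same idea can be expressed by filtering each minority class, calling df.sample(withReplacement=True, fraction=ratio) on it (fractions above 1.0 are allowed with replacement), and union-ing the samples back onto the original DataFrame. SMOTE proper, which synthesizes new points rather than duplicating rows, would need the feature vectors and a third-party implementation.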