random-forest

memory efficient prediction with randomForest in R

佐手、 Submitted on 2019-12-10 19:17:26
Question: TL;DR: I want to know memory-efficient ways of performing a batch prediction with randomForest models built on large datasets (hundreds of features, tens of thousands of rows). Details: I'm working with a large dataset (over 3 GB in memory) and want to do a simple binary classification using randomForest. Since my data is proprietary, I cannot share it, but let's say the following code runs: library(randomForest) library(data.table) myData <- fread("largeDataset.tsv") myFeatures <- myData[,
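The question above is about R's randomForest, but the usual memory-saving idea — score the data in fixed-size chunks rather than all at once — can be sketched in Python with scikit-learn. The data, model, and chunk size below are invented stand-ins for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the proprietary dataset; sizes are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def predict_in_chunks(model, X, chunk_size=200):
    """Predict in fixed-size slices so only one chunk's worth of
    intermediate results is held in memory at a time."""
    parts = [model.predict(X[i:i + chunk_size])
             for i in range(0, X.shape[0], chunk_size)]
    return np.concatenate(parts)

preds = predict_in_chunks(model, X)
```

Because random-forest prediction is row-independent, chunked prediction returns exactly the same labels as a single full-matrix call; the same pattern applies to R by looping `predict()` over row slices.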

Getting the distribution of values at the leaf node for a DecisionTreeRegressor in scikit-learn

和自甴很熟 Submitted on 2019-12-10 18:29:48
Question: By default, a scikit-learn DecisionTreeRegressor returns the mean of all target values from the training set in a given leaf node. However, I am interested in getting back the list of target values from my training set that fell into the predicted leaf node. This would allow me to quantify the distribution and also calculate other metrics, such as the standard deviation. Is this possible using scikit-learn? Answer 1: I think what you're looking for is the apply method of the tree object. See here for the
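The `apply` method mentioned in the answer maps each sample to the index of the leaf it lands in, so the training targets sharing a leaf with a new sample can be recovered by masking. A minimal sketch with invented data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.array([0.0] * 10 + [5.0, 6.0, 7.0, 8.0, 9.0] * 2)

tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

# apply() returns the leaf index each sample falls into.
train_leaves = tree.apply(X)

# For a new sample, gather the training targets that share its leaf,
# then summarize the distribution however you like.
new_leaf = tree.apply(np.array([[15.0]]))[0]
targets_in_leaf = y[train_leaves == new_leaf]
leaf_std = targets_in_leaf.std()
```

`tree.predict` would return only the leaf mean; `targets_in_leaf` gives the full per-leaf sample for quantiles, standard deviation, or histograms.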

scikit-learn: How to calculate root-mean-square error (RMSE) in percentage?

血红的双手。 Submitted on 2019-12-10 16:37:02
Question: I have a dataset (found at this link: https://drive.google.com/open?id=0B2Iv8dfU4fTUY2ltNGVkMG05V00) in the following format:

time X Y
0.000543 0 10
0.000575 0 10
0.041324 1 10
0.041331 2 10
0.041336 3 10
0.04134 4 10
...
9.987735 55 239
9.987739 56 239
9.987744 57 239
9.987749 58 239
9.987938 59 239

The third column (Y) in my dataset is the true value; that's what I want to predict (estimate). I want to do a prediction of Y (i.e. predict the current value of Y according to the previous 100
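There is no single definition of "RMSE in percentage"; a sketch of the two normalizations most often used, with invented numbers standing in for the question's data:

```python
import numpy as np

y_true = np.array([10.0, 10.0, 239.0, 239.0, 120.0])  # illustrative values
y_pred = np.array([12.0, 9.0, 230.0, 245.0, 118.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Two common normalizations; state explicitly which one you report.
rmse_pct_of_mean = 100.0 * rmse / np.mean(y_true)
rmse_pct_of_range = 100.0 * rmse / (y_true.max() - y_true.min())
```

Dividing by the mean gives a coefficient-of-variation-style figure; dividing by the range bounds the value regardless of where the data is centered.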

Random Forest not working in opencv python (cv2)

孤街醉人 Submitted on 2019-12-10 14:54:25
Question: I can't seem to correctly pass the parameters to train a Random Forest classifier in OpenCV from Python. I wrote an implementation in C++ which worked correctly, but I do not get the same results in Python. I found some sample code here: http://fossies.org/linux/misc/opencv-2.4.7.tar.gz:a/opencv-2.4.7/samples/python2/letter_recog.py which seems to indicate that you should pass the parameters in a dict. Here is the code I am using: rtree_params = dict(max_depth=11, min_sample_count=5, use

Difference between random forest implementation

ぃ、小莉子 Submitted on 2019-12-10 12:21:20
Question: Is there a performance difference between the Random Forest implementation in H2O and the standard Random Forest library? Has anybody performed an analysis of these two implementations? Answer 1: Here is an open benchmark you can start with: https://github.com/szilard/benchm-ml Answer 2: I suppose you are looking for this: http://www.wise.io/tech/benchmarking-random-forest-part-1 Source: https://stackoverflow.com/questions/45190787/difference-between-random-forest-implementation

Python vectorization for classification [duplicate]

别说谁变了你拦得住时间么 Submitted on 2019-12-10 09:41:02
Question: This question already has an answer here: Scikit learn - fit_transform on the test set (1 answer). Closed 5 years ago. I am currently trying to build a text classification model (document classification) with roughly 80 classes. When I build and train the model using random forest (after vectorizing the text into a TF-IDF matrix), the model works well. However, when I introduce new data, the words in it aren't necessarily identical to those in the training set. This is a
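The linked duplicate boils down to one rule: fit the vectorizer on the training corpus only, then reuse it (via `transform`, not `fit_transform`) on new data, so both matrices share the same columns and out-of-vocabulary words are simply dropped. A minimal sketch with made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat", "dogs bark loudly"]  # stand-in corpus
test_docs = ["the cat barked"]                    # "barked" is unseen

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)  # learn the vocabulary on training data only
X_test = vec.transform(test_docs)        # reuse it; unknown words are ignored
```

Calling `fit_transform` on the new data instead would build a different vocabulary and produce a matrix the trained random forest cannot consume.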

Error in running randomForest : object not found

吃可爱长大的小学妹 Submitted on 2019-12-10 05:14:52
Question: So I am trying to fit a random forest classifier to my dataset. I am very new to R, and I imagine this is a simple formatting issue. I read in a text file and transform my dataset so it is of this format (confidential info taken out): >head(df.train,2) GOLGA8A ITPR3 GPR174 SNORA63 GIMAP8 LEF1 PDE4B LOC100507043 TGFB1I1 SPINT1 Sample1 3.726046 3.4013711 3.794364 4.265287 -1.514573 7.725775 2.162616 -1.514573 -1.5145732 -1.514573 Sample2 4.262779 0.9261892 4.744096 7.276971 -1.514573 4.694769

randomForest does not work when training set has more different factor levels than test set

岁酱吖の Submitted on 2019-12-10 05:08:36
Question: When trying to test my trained model on new test data that has fewer factor levels than my training data, predict() returns the following: "Type of predictors in new data do not match that of the training data." My training data has a variable with 7 factor levels, and my test data has the same variable with only 6 of those levels (all 6 ARE in the training data). When I add an observation containing the "missing" 7th level, the model runs, so I'm not sure why this happens or even the logic behind it
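The underlying fix is to make the test column carry the full set of training-time levels, even if some are unobserved. The question is about R factors, but the same idea can be sketched with pandas categoricals (the column values and levels below are invented):

```python
import pandas as pd

levels = list("abcdefg")  # the 7 training-time levels (illustrative)
train_col = pd.Series(list("abcdefg"), dtype=pd.CategoricalDtype(levels))
test_col = pd.Series(list("abc"))  # only 3 of the 7 appear at test time

# Cast the test column to the training dtype so both carry all 7 levels,
# regardless of which ones actually occur in the test rows.
test_col = test_col.astype(train_col.dtype)
```

After the cast, the two columns have identical category sets, so any encoder or model that keys off the dtype sees matching predictors.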

ROC for random forest

有些话、适合烂在心里 Submitted on 2019-12-09 23:59:14
Question: I understand that the ROC curve is drawn between tpr and fpr, but I am having difficulty determining which parameter I should vary to get different tpr/fpr pairs. Answer 1: I wrote this answer on a similar question. Basically, you can increase the weighting of certain classes, and/or downsample other classes, and/or change the vote-aggregation rule. [[EDITED 13.15PM CEST 1st July 2015]] @ "the two classes are very balanced – Suryavansh" In that case your data is balanced and you should mainly go with option 3
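For a random forest specifically, the most common varied quantity is the probability threshold applied to the forest's vote fraction: each cutoff on `predict_proba` yields one (fpr, tpr) point. A minimal scikit-learn sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve

# Synthetic two-class data for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Sweep the decision threshold over the positive-class probabilities:
# roc_curve computes one (fpr, tpr) pair per distinct cutoff.
scores = clf.predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, scores)
```

Plotting `fpr` against `tpr` gives the ROC curve; class weighting and resampling (the answer's options 1 and 2) instead shift the forest itself rather than the threshold.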

Why connection is terminating

做~自己de王妃 Submitted on 2019-12-09 18:13:39
Question: I'm trying to build a random forest classification model using the H2O library from R on a training set with 70 million rows and 25 numeric features. The total file size is 5.6 GB, and the validation file's size is 1 GB. I have 16 GB of RAM and an 8-core CPU on my system. The system is able to read both files into H2O objects. Then I give the command below to build the model: model <- h2o.randomForest(x = c(1:18,20:25), y = 19, training_frame = traindata, validation_frame = testdata, ntrees =