random-forest

Fitting sklearn GridSearchCV model

南楼画角 · submitted on 2019-12-12 12:07:18
Question: I am trying to solve a regression problem on the Boston dataset with the help of a random forest regressor. I was using GridSearchCV to select the best hyperparameters. Problem 1: Should I fit the GridSearchCV on some X_train, y_train and then get the best parameters, or should I fit it on X, y to get the best parameters (X, y = entire dataset)? Problem 2: Say I fit it on X, y, get the best parameters, and then build a new model with these best parameters. What should I now train this new model on?
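The usual practice is to search on a training split only, so a held-out test split remains untouched for the final evaluation. A minimal sketch (using a synthetic dataset and a hypothetical small parameter grid in place of the asker's setup):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Search on the training split only, so the test split stays unseen
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X_train, y_train)

# With the default refit=True, grid.best_estimator_ has already been
# retrained on all of X_train with the best parameters -- no separate
# "new model" step is needed
print(grid.best_params_)
print(grid.best_estimator_.score(X_test, y_test))
```

This also answers Problem 2: `GridSearchCV` refits the best configuration on whatever data was passed to `fit`, so there is no need to rebuild the model by hand.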

sklearn random forest: .oob_score_ too low?

百般思念 · submitted on 2019-12-12 11:29:35
Question: I was searching for applications of random forests and found the following knowledge competition on Kaggle: https://www.kaggle.com/c/forest-cover-type-prediction. Following the advice at https://www.kaggle.com/c/forest-cover-type-prediction/forums/t/8182/first-try-with-random-forests-scikit-learn, I used sklearn to build a random forest with 500 trees. The .oob_score_ was ~2%, but the score on the holdout set was ~75%. There are only seven classes to classify, so 2% is really low. I also
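For reference, `.oob_score_` is only computed when the estimator is built with `oob_score=True`, and it scores each sample using only the trees that did not draw that sample in their bootstrap. A minimal sketch on synthetic data (shapes and accuracy are illustrative, not the asker's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

# oob_score=True asks each tree to be evaluated on the samples it did
# not see in its bootstrap draw -- a built-in validation estimate
clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                             bootstrap=True, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)
```

With enough trees the OOB estimate is normally close to hold-out accuracy, so a 2% vs. 75% gap suggests something else is wrong with the setup rather than with the metric itself.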

Does sklearn support a cost matrix?

你离开我真会死。 · submitted on 2019-12-12 10:35:35
Question: Is it possible to train classifiers in sklearn with a cost matrix that assigns different costs to different mistakes? For example, in a two-class problem the cost matrix would be a 2 by 2 square matrix, with A_ij = the cost of classifying i as j. The main classifier I am using is a Random Forest. Thanks. Answer 1: The cost-sensitive framework you describe is not supported in scikit-learn, in any of the classifiers we have. Answer 2: One way to circumvent this limitation is to use under- or oversampling. E.g.,
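Besides resampling, scikit-learn's `class_weight` parameter offers a partial workaround: it reweights errors per true class, which is equivalent to a cost matrix whose off-diagonal costs depend only on the row. A sketch (the weights here are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

# class_weight penalizes mistakes on each class differently -- an
# approximation of a cost matrix where misclassifying class 1 is
# ten times as costly as misclassifying class 0
clf = RandomForestClassifier(class_weight={0: 1, 1: 10}, random_state=0)
clf.fit(X, y)
preds = clf.predict(X)
```

This cannot express a full cost matrix (where the cost also depends on which wrong class was predicted), but it covers the common asymmetric-binary case.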

Extracting the terminal nodes of each tree associated with a new observation

寵の児 · submitted on 2019-12-12 10:20:12
Question: I would like to extract the terminal nodes of the random forest R implementation. As I understand random forests, you have a sequence of orthogonal trees. When you predict a new observation (in regression), it enters all these trees and you then average the prediction of each individual tree. If I wanted not to average, but instead, say, run a linear regression on these corresponding observations, I would need a list of the observations that are "associated" with this new observation. I have
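In R, the randomForest package exposes this through `predict(forest, newdata, nodes = TRUE)`, which returns the terminal-node index of each observation in each tree. The scikit-learn analog, sketched below, is the `.apply()` method:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# .apply() returns, for each sample, the index of the leaf it falls
# into in every tree: shape (n_samples, n_trees)
leaves = forest.apply(X[:3])
print(leaves.shape)  # (3, 10)
```

Matching a new observation's leaf indices against the training set's leaf indices then yields exactly the "associated" training observations the question asks for.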

random forest package prediction, newdata argument?

北战南征 · submitted on 2019-12-12 10:12:50
Question: I've just recently started playing around with the randomForest package in R. After growing my forest, I tried predicting the response using the same dataset (i.e. the training dataset), which gave me a confusion matrix different from the one printed with the forest object itself. I thought there might be something wrong with the newdata argument, but I followed the example given in the documentation to a T and it got the same problem. Here's an example using the iris Species data.
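The discrepancy is expected: the confusion matrix printed with the forest object is out-of-bag, while `predict(forest, newdata = train)` resubstitutes the training data through all trees, which is far more optimistic. The same contrast can be sketched in sklearn terms:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                             random_state=0).fit(X, y)

# Resubstitution accuracy: every tree votes, including trees that
# saw the sample during training -- near-perfect by construction
train_acc = clf.score(X, y)

# OOB accuracy: each sample is scored only by trees that never saw it
oob_acc = clf.oob_score_

print(train_acc, oob_acc)  # train_acc is typically the higher of the two
```

In R, calling `predict(forest)` with no `newdata` at all returns the OOB predictions, which is why it agrees with the printed confusion matrix while `predict(forest, train)` does not.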

What is the difference between fit_transform and transform in sklearn CountVectorizer?

烈酒焚心 · submitted on 2019-12-12 09:41:16
Question: I have just started learning random forests, so I am very sorry if this sounds stupid. I was recently practicing the bag-of-words introduction on Kaggle, and I want to clear up a few things: using vectorizer.fit_transform(*on the list of cleaned reviews*). Now, when we were preparing the bag-of-words array on the train reviews we used fit_transform on the list of train reviews. I know that fit_transform does two things: first it fits on the data and learns the vocabulary, and then it makes
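The distinction in a minimal sketch (the toy reviews below are illustrative): `fit_transform` learns the vocabulary and encodes in one step; `transform` reuses a vocabulary that was already learned.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad movie", "good plot"]
test_docs = ["good acting"]

vec = CountVectorizer()

# fit_transform: learn the vocabulary from the training reviews,
# then encode them as word counts
X_train = vec.fit_transform(train_docs)

# transform: reuse the already-learned vocabulary; words unseen
# during fit ("acting") are silently dropped
X_test = vec.transform(test_docs)

print(sorted(vec.vocabulary_))  # ['bad', 'good', 'movie', 'plot']
print(X_test.toarray())         # [[0 1 0 0]]
```

Calling `fit_transform` on the test reviews would be a mistake: it would relearn the vocabulary from the test set, so the train and test matrices would no longer share columns.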

Scikit - changing the threshold to create multiple confusion matrixes

旧街凉风 · submitted on 2019-12-12 07:52:53
Question: I'm building a classifier that goes through Lending Club data and selects the best X loans. I've trained a Random Forest and created the usual ROC curves, confusion matrices, etc. The confusion matrix takes as an argument the predictions of the classifier (the majority prediction of the trees in the forest). However, I wish to print multiple confusion matrices at different thresholds, to know what happens if I choose the 10% best loans, the 20% best loans, etc. I know from reading other
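The standard approach is to work with `predict_proba` instead of the hard majority vote, then threshold the probabilities yourself. A sketch on synthetic data (the threshold values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=300, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Class probabilities (fraction of tree votes) instead of hard labels
proba = clf.predict_proba(X)[:, 1]

# One confusion matrix per threshold: raising the threshold keeps
# only the loans the forest is most confident about
matrices = {}
for thresh in (0.5, 0.7, 0.9):
    preds = (proba >= thresh).astype(int)
    matrices[thresh] = confusion_matrix(y, preds)
    print(thresh, matrices[thresh].ravel())
```

To get exactly the "top 10% of loans", sort by `proba` and take `np.quantile(proba, 0.9)` as the threshold rather than a fixed value.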

How to set cutoff while training the data in Random Forest in Spark

放肆的年华 · submitted on 2019-12-12 06:01:09
Question: I am using Spark MLlib to train data for classification using the Random Forest algorithm. MLlib provides a RandomForest class whose trainClassifier method does what is required. Can I set a threshold value while training the data set, similar to the cutoff option provided in R's randomForest package? http://cran.r-project.org/web/packages/randomForest/randomForest.pdf I found that MLlib's RandomForest class provides options only to pass the number of trees, impurity, number of classes, etc.
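R's `cutoff` rule picks the class that maximizes (vote fraction / cutoff) rather than the plain majority. Where the library does not expose such an option, the same rule can be applied after the fact to the per-class vote fractions. A sketch in sklearn terms (the cutoff values are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_classes=3, n_informative=5,
                           random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# R's randomForest cutoff rule: the winning class is the one with the
# largest ratio of vote fraction to its cutoff. Applying it post hoc
# to the vote fractions reproduces the behavior without library support.
cutoff = np.array([0.5, 0.3, 0.2])  # hypothetical per-class cutoffs
votes = clf.predict_proba(X)
preds = np.argmax(votes / cutoff, axis=1)
```

In Spark the same idea applies if per-tree predictions (or class probabilities, in the newer ML API) are aggregated manually instead of calling the model's hard `predict`.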

Predicting the “no class” / unrecognised class in Weka Machine Learning

白昼怎懂夜的黑 · submitted on 2019-12-12 03:27:34
Question: I am using Weka 3.7 to classify text documents based on their content. I have a set of text files in folders, and each folder belongs to a certain category. Category A: 100 txt files. Category B: 100 txt files. ... Category X: 100 txt files. I want to predict whether a document falls into one of the categories A-X, OR whether it falls into the category UNRECOGNISED (for all other documents). I am getting the total set of Instances programmatically like this: private Instances getTotalSet(){ ArrayList<Attribute>
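One common approach to the UNRECOGNISED case is a reject option: if no known class reaches a confidence floor, refuse to assign any of them. The sketch below shows the idea on a generic probabilistic classifier in Python (the 0.6 floor and the -1 reject label are hypothetical choices, not Weka API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Reject option: if no class reaches the confidence floor, report
# UNRECOGNISED (-1) instead of forcing one of the known categories
proba = clf.predict_proba(X)
preds = np.where(proba.max(axis=1) >= 0.6, proba.argmax(axis=1), -1)
```

In Weka the same logic can be layered on top of `distributionForInstance`, which returns the per-class probabilities that the threshold is applied to.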

randomForest in R object not found error

江枫思渺然 · submitted on 2019-12-12 03:24:27
Question:
# init
libs <- c("tm", "plyr", "class", "RTextTools", "randomForest")
lapply(libs, require, character.only = TRUE)
# set options
options(stringsAsFactors = FALSE)
# set parameters
labels <- read.table('labels.txt')
path <- paste(getwd(), "/data", sep="")
# clean text
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, removeNumbers)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, content