cross-validation

How to cross validate RandomForest model?

房东的猫 submitted on 2019-11-28 08:16:14
I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same, or do I have to perform cross-validation manually?

zero323: Spark ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
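The excerpt's answer continues in Scala; below is a rough PySpark sketch of the same idea, a RandomForestClassifier wrapped in a CrossValidator over a small parameter grid. The column names, grid values, and the train DataFrame are assumptions, not part of the original answer.

# Hedged PySpark sketch mirroring the Scala answer above.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[rf])

paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [20, 50])
             .addGrid(rf.maxDepth, [5, 10])
             .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=MulticlassClassificationEvaluator(metricName="f1"),
                    numFolds=3)

# `train` is assumed to be a preprocessed DataFrame with `features` and `label` columns.
cvModel = cv.fit(train)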

(Python - sklearn) How to pass parameters to the customize ModelTransformer class by gridsearchcv

末鹿安然 submitted on 2019-11-28 04:00:57
Below is my pipeline, and it seems that I can't pass parameters to my models via the ModelTransformer class, which I took from this post: http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html. The error message makes sense to me, but I don't know how to fix this. Any idea how to fix it? Thanks.

# define a pipeline
pipeline = Pipeline([
    ('vect', DictVectorizer(sparse=False)),
    ('scale', preprocessing.MinMaxScaler()),
    ('ess', FeatureUnion(n_jobs=-1, transformer_list=[
        ('rfc', ModelTransformer(RandomForestClassifier(n_jobs=-1, random_state=1, n_estimators=100)
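The usual culprit (an assumption, since the excerpt cuts off before the parameter grid) is that ModelTransformer does not expose the wrapped model as a constructor parameter, so GridSearchCV's step__param syntax cannot reach it. A minimal sketch of a version that does, with illustrative step names, data, and values:

# Sketch of a ModelTransformer that exposes `model` so GridSearchCV can tune it.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

class ModelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, model=None):
        self.model = model                     # constructor param -> reachable as "model__..."

    def fit(self, X, y=None):
        self.model.fit(X, y)
        return self

    def transform(self, X):
        # use the wrapped model's predictions as a feature for downstream steps
        return np.asarray(self.model.predict(X)).reshape(-1, 1)

X, y = make_classification(n_samples=200, random_state=0)   # toy stand-in data

pipe = Pipeline([
    ('rfc_features', ModelTransformer(RandomForestClassifier(random_state=1))),
    ('clf', LogisticRegression()),
])

# Because `model` is a parameter, nested params are addressable with double underscores:
params = {'rfc_features__model__n_estimators': [50, 100]}
search = GridSearchCV(pipe, params, cv=3)
search.fit(X, y)
print(search.best_params_)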

Topic models: cross validation with loglikelihood or perplexity

只愿长相守 submitted on 2019-11-28 03:04:49
I'm clustering documents using topic modeling. I need to come up with the optimal number of topics, so I decided to do ten-fold cross-validation with 10, 20, ..., 60 topics. I have divided my corpus into ten batches and set aside one batch as a holdout set. I have run latent Dirichlet allocation (LDA) using nine batches (180 documents in total) with 10 to 60 topics. Now I have to calculate the perplexity or log-likelihood for the holdout set. I found this code in one of CV's discussion sessions. I really don't understand several of the lines of code below. I have a dtm matrix using the holdout set (20
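The question's code is R (topicmodels); as a rough scikit-learn analogue of the same idea — fit LDA on the training batches and score the held-out batch — LatentDirichletAllocation exposes score() (approximate held-out log-likelihood) and perplexity(). The toy corpus below is only a stand-in for the 180 training and 20 holdout documents:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus standing in for the training and holdout batches
train_docs = ["cats and dogs", "dogs chase cats", "stocks and bonds", "bonds yield returns"]
holdout_docs = ["cats chase dogs", "stocks and returns"]

vect = CountVectorizer()
train_dtm = vect.fit_transform(train_docs)
holdout_dtm = vect.transform(holdout_docs)

for k in (2, 3):   # the question sweeps 10..60 topics on a real corpus
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(train_dtm)
    print(k,
          "held-out log-likelihood:", lda.score(holdout_dtm),
          "perplexity:", lda.perplexity(holdout_dtm))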

Model help using Scikit-learn when using GridSearch

纵饮孤独 submitted on 2019-11-27 22:38:06
As part of the Enron project, I built the attached model. Below is a summary of the steps. The model below gives near-perfect scores:

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)   # ---> with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
    gcv.best_estimator_.predict(x_test)

The model below gives more reasonable but lower scores:

cv = StratifiedShuffleSplit(n_splits = 100,
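A likely reason for the near-perfect scores (an assumption; the excerpt does not show the full setup) is that the grid search was fit on the complete dataset and then evaluated on folds drawn from that same data. A minimal sketch of the usual pattern — tune on the training portion only, then score on data the search never saw — with toy data standing in for the Enron features and labels:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, train_test_split

features, labels = make_classification(n_samples=300, random_state=42)   # toy stand-in

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
gcv = GridSearchCV(RandomForestClassifier(random_state=42),
                   {'n_estimators': [50, 100]}, cv=cv)
gcv.fit(x_train, y_train)                          # tune on the training data only

print(gcv.best_params_)
print(gcv.best_estimator_.score(x_test, y_test))   # evaluate on data the search never saw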

How to extract model hyper-parameters from spark.ml in PySpark?

别来无恙 submitted on 2019-11-27 20:32:52
Question: I'm tinkering with some cross-validation code from the PySpark documentation and trying to get PySpark to tell me which model was selected:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
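One way to inspect the winning model is sketched below, continuing from the excerpt's setup; it assumes the `dataset`, `grid`, `evaluator`, and `cv = CrossValidator(...)` objects from the PySpark docs example exist, and the exact accessors vary a little across Spark versions, so treat this as an assumption rather than the documented answer:

import numpy as np

cvModel = cv.fit(dataset)

best = cvModel.bestModel                 # the selected LogisticRegressionModel
print(best.extractParamMap())            # params of the chosen model (recent Spark versions)

# avgMetrics is ordered like the parameter grid; for a higher-is-better metric
# such as areaUnderROC, the winning combination is:
best_idx = int(np.argmax(cvModel.avgMetrics))
print(grid[best_idx])                    # the ParamMap built by ParamGridBuilder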

Applying k-fold Cross Validation model using caret package

霸气de小男生 submitted on 2019-11-27 20:10:44
Question: Let me start by saying that I have read many posts on cross-validation, and it seems there is much confusion out there. My understanding is simply this: perform k-fold cross-validation (e.g. 10 folds) to understand the average error across the folds; if that is acceptable, then train the model on the complete data set. I am attempting to build a decision tree using rpart in R, taking advantage of the caret package. Below is the code I am using.

# load libraries
library(caret)
library
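The question's code is R (caret/rpart); a rough scikit-learn analogue of the same workflow — 10-fold cross-validation to estimate the error, then, if acceptable, a final fit on the complete data set — is sketched below with stand-in data rather than the asker's:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)        # stand-in data set

tree = DecisionTreeClassifier(random_state=0)

# step 1: 10-fold cross-validation to estimate the generalization error
scores = cross_val_score(tree, X, y, cv=10)
print("mean CV accuracy:", scores.mean())

# step 2: if the estimate is acceptable, train the final model on the complete data
final_model = tree.fit(X, y)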

Spark CrossValidatorModel access other models than the bestModel?

南笙酒味 submitted on 2019-11-27 16:04:55
I am using Spark 1.6.1. Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the model that performed best during cross-validation. Are the other models of the cross-validation automatically discarded, or can I select a model that performed worse than the bestModel? I am asking because I am using the F1 score metric for cross-validation, but I am also interested in the weightedRecall of all of the models, not just the model that performed best during
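In Spark 1.6.1 the per-fold models are not retained; only bestModel and the averaged metrics are kept. If upgrading is an option, newer Spark releases add a collectSubModels flag that retains them. A hedged PySpark sketch, assuming the pipeline, parameter grid, evaluator, and training DataFrame from the question's setup and a Spark version recent enough to support the flag in the Python API:

from pyspark.ml.tuning import CrossValidator

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=3,
                    collectSubModels=True)   # keep every fitted model, not just the best

cvModel = cv.fit(train)

# subModels[fold][paramIndex] holds each fitted model, so metrics such as
# weightedRecall can be computed for models other than bestModel.
all_models = cvModel.subModels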

Difference between cross_val_score and cross_val_predict

落爺英雄遲暮 submitted on 2019-11-27 13:09:37
Question: I want to evaluate a regression model built with scikit-learn using cross-validation, and I am getting confused about which of the two functions, cross_val_score and cross_val_predict, I should use. One option would be:

cvs = DecisionTreeRegressor(max_depth=depth)
scores = cross_val_score(cvs, predictors, target, cv=cvfolds, scoring='r2')
print("R2-Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Another one, to use the CV predictions with the standard r2_score:

cvp =
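A compact way to see the difference (a sketch with toy data, not the asker's setup): cross_val_score returns one score per fold, while cross_val_predict returns one out-of-fold prediction per sample, which can then be passed to r2_score once; the two aggregations are generally not identical.

from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeRegressor

predictors, target = make_regression(n_samples=200, noise=10, random_state=0)
model = DecisionTreeRegressor(max_depth=3)

# per-fold R2 scores, then averaged
scores = cross_val_score(model, predictors, target, cv=5, scoring='r2')
print("mean of fold scores:", scores.mean())

# one out-of-fold prediction per sample, scored once on the pooled predictions
preds = cross_val_predict(model, predictors, target, cv=5)
print("R2 on pooled predictions:", r2_score(target, preds))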

Using explicit (predefined) validation set for grid search with sklearn

点点圈 submitted on 2019-11-27 10:32:32
Question: I have a dataset which has previously been split into three sets: train, validation and test. These sets have to be used as given in order to compare performance across different algorithms. I would now like to optimize the parameters of my SVM using the validation set. However, I cannot find how to pass the validation set explicitly into sklearn.grid_search.GridSearchCV(). Below is some code I've previously used for doing k-fold cross-validation on the training set. However, for this
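One standard approach (sketched with stand-in arrays, using the current sklearn.model_selection module rather than the older sklearn.grid_search) is PredefinedSplit: mark training rows with -1 and validation rows with 0 in a test_fold list, concatenate the two sets, and pass the split as cv=, so GridSearchCV evaluates every parameter combination on exactly the given validation set.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

# toy stand-ins for the predefined train/validation sets
X, y = make_classification(n_samples=200, random_state=0)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# -1 = always in training, 0 = belongs to the single validation fold
test_fold = [-1] * len(X_train) + [0] * len(X_val)
ps = PredefinedSplit(test_fold)

X_both = np.concatenate([X_train, X_val])
y_both = np.concatenate([y_train, y_val])

search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=ps)
search.fit(X_both, y_both)
print(search.best_params_)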

How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)

跟風遠走 submitted on 2019-11-27 10:16:43
Question: I'm running GridSearchCV to optimize the parameters of a classifier in scikit-learn. Once I'm done, I'd like to know which parameters were chosen as the best. Whenever I do so I get AttributeError: 'RandomForestClassifier' object has no attribute 'best_estimator_', and I can't tell why, as it seems to be a legitimate attribute according to the documentation.

from sklearn.grid_search import GridSearchCV

X = data[usable_columns]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test
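best_estimator_ lives on the fitted GridSearchCV object, not on the RandomForestClassifier passed into it, and it only exists after fit() has run (with refit enabled, the default). A minimal sketch with stand-in data, using the current sklearn.model_selection import rather than the deprecated sklearn.grid_search:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=200, random_state=0)   # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfc = RandomForestClassifier(random_state=0)
grid = GridSearchCV(rfc, {'n_estimators': [50, 100]}, cv=3)
grid.fit(X_train, y_train)

# best_estimator_ / best_params_ are attributes of the fitted search object, not of rfc
print(grid.best_params_)
print(grid.best_estimator_.score(X_test, y_test))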