cross-validation

How to cross validate RandomForest model?

房东的猫 submitted on 2019-11-28 08:16:14
I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same, or do I have to perform cross-validation manually?

zero323: Spark ML provides the CrossValidator class, which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross-validation as follows:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
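The excerpt's answer continues in Scala; below is a rough PySpark sketch of the same idea, a RandomForestClassifier wrapped in a CrossValidator over a small parameter grid. The column names, grid values, and the train DataFrame are assumptions, not part of the original answer.

# Hedged PySpark sketch mirroring the Scala answer above.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[rf])

paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [20, 50])
             .addGrid(rf.maxDepth, [5, 10])
             .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=MulticlassClassificationEvaluator(metricName="f1"),
                    numFolds=3)

# `train` is assumed to be a preprocessed DataFrame with `features` and `label` columns.
cvModel = cv.fit(train)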

(Python - sklearn) How to pass parameters to the customize ModelTransformer class by gridsearchcv

末鹿安然 submitted on 2019-11-28 04:00:57
Below is my pipeline, and it seems that I can't pass parameters to my models via the ModelTransformer class, which I took from this post: http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html. The error message makes sense to me, but I don't know how to fix this. Any idea how to fix it? Thanks.

# define a pipeline
pipeline = Pipeline([
    ('vect', DictVectorizer(sparse=False)),
    ('scale', preprocessing.MinMaxScaler()),
    ('ess', FeatureUnion(n_jobs=-1, transformer_list=[
        ('rfc', ModelTransformer(RandomForestClassifier(n_jobs=-1, random_state=1, n_estimators=100)
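The usual culprit (an assumption, since the excerpt cuts off before the parameter grid) is that ModelTransformer does not expose the wrapped model as a constructor parameter, so GridSearchCV's step__param syntax cannot reach it. A minimal sketch of a version that does, with illustrative step names, data, and values:

# Sketch of a ModelTransformer that exposes `model` so GridSearchCV can tune it.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

class ModelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, model=None):
        self.model = model                     # constructor param -> reachable as "model__..."

    def fit(self, X, y=None):
        self.model.fit(X, y)
        return self

    def transform(self, X):
        # use the wrapped model's predictions as a feature for downstream steps
        return np.asarray(self.model.predict(X)).reshape(-1, 1)

X, y = make_classification(n_samples=200, random_state=0)   # toy stand-in data

pipe = Pipeline([
    ('rfc_features', ModelTransformer(RandomForestClassifier(random_state=1))),
    ('clf', LogisticRegression()),
])

# Because `model` is a parameter, nested params are addressable with double underscores:
params = {'rfc_features__model__n_estimators': [50, 100]}
search = GridSearchCV(pipe, params, cv=3)
search.fit(X, y)
print(search.best_params_)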

Topic models: cross validation with loglikelihood or perplexity

只愿长相守 submitted on 2019-11-28 03:04:49
I'm clustering documents using topic modeling. I need to come up with the optimal number of topics, so I decided to do ten-fold cross-validation with 10, 20, ..., 60 topics. I have divided my corpus into ten batches and set aside one batch as a holdout set. I have run latent Dirichlet allocation (LDA) using nine batches (180 documents in total) with 10 to 60 topics. Now I have to calculate the perplexity or log-likelihood for the holdout set. I found this code in one of CV's discussion sessions. I really don't understand several of the lines of code below. I have a dtm matrix using the holdout set (20
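The question's code is R (topicmodels); as a rough scikit-learn analogue of the same idea — fit LDA on the training batches and score the held-out batch — LatentDirichletAllocation exposes score() (approximate held-out log-likelihood) and perplexity(). The toy corpus below is only a stand-in for the 180 training and 20 holdout documents:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus standing in for the training and holdout batches
train_docs = ["cats and dogs", "dogs chase cats", "stocks and bonds", "bonds yield returns"]
holdout_docs = ["cats chase dogs", "stocks and returns"]

vect = CountVectorizer()
train_dtm = vect.fit_transform(train_docs)
holdout_dtm = vect.transform(holdout_docs)

for k in (2, 3):   # the question sweeps 10..60 topics on a real corpus
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(train_dtm)
    print(k,
          "held-out log-likelihood:", lda.score(holdout_dtm),
          "perplexity:", lda.perplexity(holdout_dtm))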

Model help using Scikit-learn when using GridSearch

纵饮孤独 submitted on 2019-11-27 22:38:06
As part of the Enron project, I built the attached model. Below is a summary of the steps. The model below gives near-perfect scores:

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)   # ---> with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
    gcv.best_estimator_.predict(x_test)

The model below gives more reasonable but lower scores:

cv = StratifiedShuffleSplit(n_splits = 100,
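A likely reason for the near-perfect scores (an assumption; the excerpt does not show the full setup) is that the grid search was fit on the complete dataset and then evaluated on folds drawn from that same data. A minimal sketch of the usual pattern — tune on the training portion only, then score on data the search never saw — with toy data standing in for the Enron features and labels:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, train_test_split

features, labels = make_classification(n_samples=300, random_state=42)   # toy stand-in

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
gcv = GridSearchCV(RandomForestClassifier(random_state=42),
                   {'n_estimators': [50, 100]}, cv=cv)
gcv.fit(x_train, y_train)                          # tune on the training data only

print(gcv.best_params_)
print(gcv.best_estimator_.score(x_test, y_test))   # evaluate on data the search never saw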

How to extract model hyper-parameters from spark.ml in PySpark?

别来无恙 submitted on 2019-11-27 20:32:52
Question: I'm tinkering with some cross-validation code from the PySpark documentation and trying to get PySpark to tell me which model was selected:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
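One way to inspect the winning model is sketched below, continuing from the excerpt's setup; it assumes the `dataset`, `grid`, `evaluator`, and `cv = CrossValidator(...)` objects from the PySpark docs example exist, and the exact accessors vary a little across Spark versions, so treat this as an assumption rather than the documented answer:

import numpy as np

cvModel = cv.fit(dataset)

best = cvModel.bestModel                 # the selected LogisticRegressionModel
print(best.extractParamMap())            # params of the chosen model (recent Spark versions)

# avgMetrics is ordered like the parameter grid; for a higher-is-better metric
# such as areaUnderROC, the winning combination is:
best_idx = int(np.argmax(cvModel.avgMetrics))
print(grid[best_idx])                    # the ParamMap built by ParamGridBuilder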

Applying k-fold Cross Validation model using caret package

霸气de小男生 submitted on 2019-11-27 20:10:44
Question: Let me start by saying that I have read many posts on cross-validation, and it seems there is much confusion out there. My understanding is simply this: perform k-fold cross-validation (e.g. 10 folds) to understand the average error across the folds; if that is acceptable, then train the model on the complete data set. I am attempting to build a decision tree using rpart in R, taking advantage of the caret package. Below is the code I am using.

# load libraries
library(caret)
library
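The question's code is R (caret/rpart); a rough scikit-learn analogue of the same workflow — 10-fold cross-validation to estimate the error, then, if acceptable, a final fit on the complete data set — is sketched below with stand-in data rather than the asker's:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)        # stand-in data set

tree = DecisionTreeClassifier(random_state=0)

# step 1: 10-fold cross-validation to estimate the generalization error
scores = cross_val_score(tree, X, y, cv=10)
print("mean CV accuracy:", scores.mean())

# step 2: if the estimate is acceptable, train the final model on the complete data
final_model = tree.fit(X, y)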

Spark CrossValidatorModel access other models than the bestModel?

南笙酒味 submitted on 2019-11-27 16:04:55
I am using Spark 1.6.1. Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the model that performed best during cross-validation. Are the other models of the cross-validation automatically discarded, or can I select a model that performed worse than the bestModel? I am asking because I am using the F1 score metric for cross-validation, but I am also interested in the weightedRecall of all of the models, not just the model that performed best during
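In Spark 1.6.1 the per-fold models are not retained; only bestModel and the averaged metrics are kept. If upgrading is an option, newer Spark releases add a collectSubModels flag that retains them. A hedged PySpark sketch, assuming the pipeline, parameter grid, evaluator, and training DataFrame from the question's setup and a Spark version recent enough to support the flag in the Python API:

from pyspark.ml.tuning import CrossValidator

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=3,
                    collectSubModels=True)   # keep every fitted model, not just the best

cvModel = cv.fit(train)

# subModels[fold][paramIndex] holds each fitted model, so metrics such as
# weightedRecall can be computed for models other than bestModel.
all_models = cvModel.subModels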

Difference between cross_val_score and cross_val_predict

落爺英雄遲暮 submitted on 2019-11-27 13:09:37
Question: I want to evaluate a regression model built with scikit-learn using cross-validation, and I am getting confused about which of the two functions, cross_val_score and cross_val_predict, I should use. One option would be:

cvs = DecisionTreeRegressor(max_depth=depth)
scores = cross_val_score(cvs, predictors, target, cv=cvfolds, scoring='r2')
print("R2-Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Another one, to use the CV predictions with the standard r2_score:

cvp =
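A compact way to see the difference (a sketch with toy data, not the asker's setup): cross_val_score returns one score per fold, while cross_val_predict returns one out-of-fold prediction per sample, which can then be passed to r2_score once; the two aggregations are generally not identical.

from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeRegressor

predictors, target = make_regression(n_samples=200, noise=10, random_state=0)
model = DecisionTreeRegressor(max_depth=3)

# per-fold R2 scores, then averaged
scores = cross_val_score(model, predictors, target, cv=5, scoring='r2')
print("mean of fold scores:", scores.mean())

# one out-of-fold prediction per sample, scored once on the pooled predictions
preds = cross_val_predict(model, predictors, target, cv=5)
print("R2 on pooled predictions:", r2_score(target, preds))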

Using explicit (predefined) validation set for grid search with sklearn

点点圈 submitted on 2019-11-27 10:32:32
Question: I have a dataset which has previously been split into three sets: train, validation and test. These sets have to be used as given in order to compare performance across different algorithms. I would now like to optimize the parameters of my SVM using the validation set. However, I cannot find how to pass the validation set explicitly into sklearn.grid_search.GridSearchCV(). Below is some code I've previously used for doing k-fold cross-validation on the training set. However, for this
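One standard approach (sketched with stand-in arrays, using the current sklearn.model_selection module rather than the older sklearn.grid_search) is PredefinedSplit: mark training rows with -1 and validation rows with 0 in a test_fold list, concatenate the two sets, and pass the split as cv=, so GridSearchCV evaluates every parameter combination on exactly the given validation set.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

# toy stand-ins for the predefined train/validation sets
X, y = make_classification(n_samples=200, random_state=0)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# -1 = always in training, 0 = belongs to the single validation fold
test_fold = [-1] * len(X_train) + [0] * len(X_val)
ps = PredefinedSplit(test_fold)

X_both = np.concatenate([X_train, X_val])
y_both = np.concatenate([y_train, y_val])

search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=ps)
search.fit(X_both, y_both)
print(search.best_params_)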

How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)

跟風遠走 submitted on 2019-11-27 10:16:43
Question: I'm running GridSearchCV to optimize the parameters of a classifier in scikit-learn. Once I'm done, I'd like to know which parameters were chosen as the best. Whenever I do so I get AttributeError: 'RandomForestClassifier' object has no attribute 'best_estimator_', and I can't tell why, as it seems to be a legitimate attribute according to the documentation.

from sklearn.grid_search import GridSearchCV

X = data[usable_columns]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test
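best_estimator_ lives on the fitted GridSearchCV object, not on the RandomForestClassifier passed into it, and it only exists after fit() has run (with refit enabled, the default). A minimal sketch with stand-in data, using the current sklearn.model_selection import rather than the deprecated sklearn.grid_search:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=200, random_state=0)   # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfc = RandomForestClassifier(random_state=0)
grid = GridSearchCV(rfc, {'n_estimators': [50, 100]}, cv=3)
grid.fit(X_train, y_train)

# best_estimator_ / best_params_ are attributes of the fitted search object, not of rfc
print(grid.best_params_)
print(grid.best_estimator_.score(X_test, y_test))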