grid-search

Make grid search functions in sklearn ignore empty models

人盡茶涼 submitted on 2019-12-10 18:24:07
Question: Using Python and scikit-learn, I'd like to do a grid search, but some of my models end up being empty. How can I make the grid search function ignore those models? I guess I could use a scoring function that returns 0 if the model is empty, but I'm not sure how.
    predictor = sklearn.svm.LinearSVC(penalty='l1', dual=False, class_weight='auto')
    param_dist = {'C': pow(2.0, np.arange(-10, 11))}
    learner = sklearn.grid_search.GridSearchCV(estimator=predictor, param_grid=param_dist, n_jobs=self.n
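
One possible approach, sketched below on the assumption that an "empty" model is one whose L1-regularised coefficients are all zero: pass GridSearchCV a callable scorer that gives such models the worst possible score. The scorer name, the accuracy fallback, and the modern sklearn.model_selection import path (plus class_weight='balanced' in place of the deprecated 'auto') are illustrative choices, not from the original question.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import accuracy_score

    def score_nonempty(estimator, X, y):
        # Treat a model whose coefficients are all zero as "empty" and return
        # the worst score so the grid search never selects it.
        if not np.any(estimator.coef_):
            return 0.0
        return accuracy_score(y, estimator.predict(X))

    predictor = LinearSVC(penalty='l1', dual=False, class_weight='balanced')
    param_dist = {'C': pow(2.0, np.arange(-10, 11))}
    learner = GridSearchCV(estimator=predictor, param_grid=param_dist,
                           scoring=score_nonempty)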

Avoid certain parameter combinations in GridSearchCV

笑着哭i submitted on 2019-12-10 14:42:27
Question: I'm using scikit-learn's GridSearchCV to iterate over a parameter space to tune a model. Specifically, I'm using it to test different hyperparameters in a neural network. The grid is as follows:
    params = {'num_hidden_layers': [0,1,2],
              'hidden_layer_size': [64,128,256],
              'activation': ['sigmoid', 'relu', 'tanh']}
The problem is that I end up running redundant models when num_hidden_layers is set to 0: it will run a model with 0 hidden layers and 64 units, another with 128 units, and
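
A common way to prune such redundant combinations, sketched here, is to pass GridSearchCV a list of grids rather than a single dict: each dict is expanded independently, so hidden_layer_size is never varied when num_hidden_layers is 0 (the neural-network estimator itself is omitted because the question's wrapper isn't shown).

    # Each dict in the list becomes its own sub-grid, so the 0-hidden-layer
    # configuration is run only once per activation.
    params = [
        {'num_hidden_layers': [0],
         'activation': ['sigmoid', 'relu', 'tanh']},
        {'num_hidden_layers': [1, 2],
         'hidden_layer_size': [64, 128, 256],
         'activation': ['sigmoid', 'relu', 'tanh']},
    ]
    # GridSearchCV(estimator, param_grid=params, ...) accepts this list directly.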

pyspark: the best model's parameters after a grid search come back as a blank {}

拈花ヽ惹草 submitted on 2019-12-10 11:17:45
Question: Could someone help me extract the best-performing model's parameters from my grid search? It's a blank dictionary for some reason.
    from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    train, test = df.randomSplit([0.66, 0.34], seed=12345)
    paramGrid = (ParamGridBuilder()
                 .addGrid(lr.regParam, [0.01,0.1])
                 .addGrid(lr.elasticNetParam, [1.0,])
                 .addGrid(lr.maxIter, [3,])
                 .build())
    evaluator =
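
One workaround, sketched below, is to read the winning combination off the grid itself rather than from the fitted pipeline: pair each entry of paramGrid with its averaged metric. It assumes the fitted result is a CrossValidatorModel named cvModel and that the evaluator's metric is larger-is-better.

    # cvModel = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid,
    #                          evaluator=evaluator).fit(train)
    best_idx = max(range(len(cvModel.avgMetrics)),
                   key=lambda i: cvModel.avgMetrics[i])
    best_params = paramGrid[best_idx]
    print({param.name: value for param, value in best_params.items()})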

Scikit - Combining scale and grid search

走远了吗. submitted on 2019-12-09 13:34:42
Question: I am new to scikit and have two small issues with combining data scaling and grid search. Efficient scaler: considering cross-validation with K folds, I would like the data scaler (preprocessing.StandardScaler(), for instance) to be fit each time only on the K-1 training folds and then applied to the remaining fold. My impression is that the following code will fit the scaler on the entire dataset, and I would therefore like to modify it to behave as described
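
The usual fix, sketched below, is to make the scaler a Pipeline step: cross-validation then re-fits it on the K-1 training folds of every split and only transforms the held-out fold. The SVC estimator and the C grid are placeholders, since the question's model isn't shown.

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, KFold

    # The scaler lives inside the pipeline, so each CV split fits it on the
    # training folds only and merely transforms the validation fold.
    pipe = Pipeline([('scale', StandardScaler()), ('svm', SVC())])
    grid = GridSearchCV(pipe, param_grid={'svm__C': [0.1, 1, 10]},
                        cv=KFold(n_splits=5))
    # grid.fit(X, y)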

Parallel error with GridSearchCV, works fine with other methods

为君一笑 submitted on 2019-12-09 06:44:56
Question: I am encountering the following problem with GridSearchCV: it gives me a parallel error when using n_jobs > 1, while n_jobs > 1 works fine with single models like RandomForestClassifier. Below is a simple working example showing the error:
    train = np.random.rand(100,10)
    targ = np.random.randint(0,2,100)
    clf = ensemble.RandomForestClassifier(n_jobs = 2)
    clf.fit(train,targ)
    train = np.random.rand(100,10)
    targ = np.random.randint(0,2,100)
    clf = ensemble
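
A frequent cause on platforms that spawn worker processes (notably Windows) is running the parallel grid search outside an if __name__ == '__main__' guard; whether that is the poster's exact issue depends on their setup. A minimal guarded sketch, with parallelism left to the grid search rather than nested inside the forest as well:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    if __name__ == '__main__':
        train = np.random.rand(100, 10)
        targ = np.random.randint(0, 2, 100)
        clf = RandomForestClassifier()          # no inner n_jobs
        gs = GridSearchCV(clf, {'n_estimators': [10, 50]}, n_jobs=2)
        gs.fit(train, targ)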

Use a metric after a classifier in a Pipeline

不打扰是莪最后的温柔 submitted on 2019-12-08 15:28:43
Question: I am continuing to investigate pipelines. My aim is to execute every machine learning step with the pipeline alone, which will make it more flexible and easier to adapt to another use case. So this is what I do: Step 1: fill NaN values; Step 2: transform categorical values into numbers; Step 3: classifier; Step 4: GridSearch; Step 5: add a metric (failed). Here is my code:
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.feature_selection import SelectKBest
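
A metric is not a transformer, so it cannot be appended as a pipeline step; it is normally supplied through GridSearchCV's scoring argument instead. A rough sketch along the lines of the question's steps follows; the imputer, encoder, classifier, and f1 metric are illustrative stand-ins, since the question's code is truncated.

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    pipe = Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),  # Step 1: fill NaN values
        ('encode', OrdinalEncoder()),                         # Step 2: categories -> numbers
        ('clf', RandomForestClassifier()),                    # Step 3: classifier
    ])
    grid = GridSearchCV(pipe, {'clf__n_estimators': [100, 300]},
                        scoring='f1', cv=5)                   # Steps 4-5: search + metric
    # grid.fit(X, y)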

Scikit learn GridSearchCV AUC performance

巧了我就是萌 submitted on 2019-12-08 05:15:56
Question: I'm using GridSearchCV to identify the best set of parameters for a random forest classifier.
    PARAMS = {'max_depth': [8,None], 'n_estimators': [500,1000]}
    rf = RandomForestClassifier()
    clf = grid_search.GridSearchCV(estimator=rf, param_grid=PARAMS, scoring='roc_auc', cv=5, n_jobs=4)
    clf.fit(data, labels)
where data and labels are, respectively, the full dataset and the corresponding labels. Now I compared the performance returned by GridSearchCV (from clf.grid_scores_) with a "manual"
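
One detail worth checking, sketched below: the grid search's reported score is the mean of per-fold AUCs, which usually differs from an AUC computed on predictions pooled across folds, so a fair "manual" comparison should also average per-fold scores. Here data and labels are the question's arrays, and cv_results_ is the modern replacement for grid_scores_.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Mean of per-fold AUCs: comparable to the grid search's mean test score
    # for the same parameter setting.
    rf = RandomForestClassifier(max_depth=8, n_estimators=500)
    fold_aucs = cross_val_score(rf, data, labels, scoring='roc_auc', cv=5)
    print(fold_aucs.mean())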

Alternate different models in Pipeline for GridSearchCV

懵懂的女人 submitted on 2019-12-07 09:28:29
Question: I want to build a Pipeline in sklearn and test different models using GridSearchCV. Just an example (please don't pay attention to which particular models are chosen):
    reg = LogisticRegression()
    proj1 = PCA(n_components=2)
    proj2 = MDS()
    proj3 = TSNE()
    pipe = [('proj', proj1), ('reg' , reg)]
    pipe = Pipeline(pipe)
    param_grid = {
        'reg__c': [0.01, 0.1, 1],
    }
    clf = GridSearchCV(pipe, param_grid = param_grid)
Here, if I want to try different models for dimensionality reduction, I need to code
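
One way to avoid duplicating the pipeline, sketched here, is to make the step itself a grid parameter: listing estimator instances under the step name lets GridSearchCV swap them in turn. KernelPCA stands in for MDS/TSNE because the latter lack a transform method usable at prediction time, and note that the LogisticRegression parameter is reg__C (capital C).

    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV

    pipe = Pipeline([('proj', PCA(n_components=2)), ('reg', LogisticRegression())])
    param_grid = [
        # The 'proj' step is itself a parameter: each listed estimator is tried.
        {'proj': [PCA(n_components=2), KernelPCA(n_components=2)],
         'reg__C': [0.01, 0.1, 1]},
    ]
    clf = GridSearchCV(pipe, param_grid=param_grid)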

Pipeline and GridSearch for Doc2Vec

六月ゝ 毕业季﹏ submitted on 2019-12-07 03:05:39
Question: I currently have the following script that helps to find the best doc2vec model. It works like this: first train a few models based on given parameters, then test them against a classifier. Finally, it outputs the best model and classifier (I hope). Data: example data (data.csv) can be downloaded here: https://pastebin.com/takYp6T8 Note that the data has a structure that should let an ideal classifier reach 1.0 accuracy. Script:
    import sys
    import os
    from time import time
    from operator
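
To make Doc2Vec searchable with GridSearchCV it can be wrapped as an sklearn transformer; below is a rough, hypothetical wrapper (Doc2VecTransformer is not part of gensim, and it assumes X is a list of token lists).

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.base import BaseEstimator, TransformerMixin

    class Doc2VecTransformer(BaseEstimator, TransformerMixin):
        """Wrap gensim's Doc2Vec so it can sit in a Pipeline and have its
        vector_size / epochs tuned by GridSearchCV."""
        def __init__(self, vector_size=100, epochs=20):
            self.vector_size = vector_size
            self.epochs = epochs

        def fit(self, X, y=None):
            docs = [TaggedDocument(words, [i]) for i, words in enumerate(X)]
            self.model_ = Doc2Vec(docs, vector_size=self.vector_size,
                                  epochs=self.epochs, min_count=1)
            return self

        def transform(self, X):
            return [self.model_.infer_vector(words) for words in X]

    # A pipeline such as Pipeline([('d2v', Doc2VecTransformer()), ('clf', LogisticRegression())])
    # can then be searched over d2v__vector_size, d2v__epochs, clf__C, ...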

Specify scoring metric in GridSearch function with hypopt package in Python

孤人 submitted on 2019-12-06 08:06:16
I'm using the GridSearch function from the hypopt package to do hyperparameter searching with a specified validation set. The default metric for classification seems to be accuracy (I'm not quite sure). Here I want to use the F1 score as the metric, but I do not know where I should specify it. I looked at the documentation but was rather confused. Does anyone familiar with the hypopt package know how I can do this? Thanks a lot in advance.
    from hypopt import GridSearch
    log_reg_params = {"penalty": ['l1'], 'C': [0.001, 0.01]}
    opt = GridSearch(model=LogisticRegression())
    opt.fit(X_train, y_train, log_reg
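
If the installed hypopt release accepts a scoring argument on fit (this is an assumption; check your version's docstring), the metric can be requested there with an sklearn-style scorer name. A sketch following the question's call shape, where X_val and y_val stand for the poster's held-out validation arrays:

    from hypopt import GridSearch
    from sklearn.linear_model import LogisticRegression

    log_reg_params = {"penalty": ['l1'], 'C': [0.001, 0.01]}
    opt = GridSearch(model=LogisticRegression())
    # Assumption: this hypopt version's fit() accepts a `scoring` keyword that
    # takes sklearn scorer names such as 'f1'.
    opt.fit(X_train, y_train, log_reg_params, X_val, y_val, scoring='f1')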