Pipeline: Multiple classifiers?

爱一瞬间的悲伤 2020-12-09 12:14

I read the following example on Pipelines and GridSearchCV in Python: http://www.davidsbatista.net/blog/2017/04/01/document_classification/

Logistic Regression:

3 answers
  •  借酒劲吻你
    2020-12-09 12:52

    Here is a simple way to search over several classifiers and, for each classifier, over its own parameter settings.

    Create a switcher class that works for any estimator

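    from sklearn.base import BaseEstimator
    from sklearn.linear_model import SGDClassifier  # needed for the default argument below


    class ClfSwitcher(BaseEstimator):
        """A custom BaseEstimator that can switch between classifiers."""

        def __init__(self, estimator=SGDClassifier()):
            # estimator: sklearn object - the classifier to delegate to.
            # GridSearchCV replaces it through the 'clf__estimator' parameter.
            self.estimator = estimator

        def fit(self, X, y=None, **kwargs):
            # Forward extra fit parameters (e.g. sample_weight) to the wrapped classifier.
            self.estimator.fit(X, y, **kwargs)
            return self

        def predict(self, X, y=None):
            return self.estimator.predict(X)

        def predict_proba(self, X):
            # Only works when the wrapped classifier implements predict_proba.
            return self.estimator.predict_proba(X)

        def score(self, X, y):
            return self.estimator.score(X, y)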
    

    Now you can pass any classifier as the estimator parameter, and you can tune any parameter of whichever estimator you pass in, as follows:
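    To make the swapping concrete before the full grid search, here is a minimal sketch of using the switcher on its own. The class is a compact copy of the one above, and the toy feature matrix `X` and labels `y` are made up for illustration only.

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB


class ClfSwitcher(BaseEstimator):
    """Compact copy of the switcher class from this answer."""

    def __init__(self, estimator=SGDClassifier()):
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y, **kwargs)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)


# Hypothetical toy count features: 4 samples, 2 features.
X = [[3, 0], [2, 1], [0, 3], [1, 2]]
y = [0, 0, 1, 1]

switcher = ClfSwitcher()

# Swap in a different classifier and tune one of its parameters in one call.
# "estimator__alpha" means: the alpha parameter of whatever `estimator` holds.
switcher.set_params(estimator=MultinomialNB(), estimator__alpha=0.1)

switcher.fit(X, y)
predictions = switcher.predict(X)

# The nested parameter names that a grid search is allowed to address.
nested_names = sorted(switcher.get_params(deep=True))
print(nested_names)
```

    This is the same `set_params` mechanism that GridSearchCV relies on when it tries each entry of a parameter grid.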

    Perform hyper-parameter optimization

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV
    
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', ClfSwitcher()),
    ])
    
    parameters = [
        {
            'clf__estimator': [SGDClassifier()], # SVM if hinge loss / logreg if log loss
            'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
            'tfidf__stop_words': ['english', None],
            'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
            'clf__estimator__max_iter': [50, 80],
            'clf__estimator__tol': [1e-4],
            'clf__estimator__loss': ['hinge', 'log', 'modified_huber'],  # 'log' was renamed 'log_loss' in scikit-learn 1.1
        },
        {
            'clf__estimator': [MultinomialNB()],
            'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
            'tfidf__stop_words': [None],
            'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
        },
    ]
    
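    # train_data: iterable of raw text documents; train_labels: their class labels.
    # n_jobs=12 runs 12 parallel workers; use n_jobs=-1 to use all available cores.
    gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
    gscv.fit(train_data, train_labels)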
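    Once the search has finished, the winning classifier and its settings can be read back from the fitted GridSearchCV object. The following is a rough, self-contained sketch under some assumptions: the tiny corpus and labels are invented for illustration, the grids are cut down so it runs quickly, and `cv=2` is used only because the toy data is so small.

```python
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ClfSwitcher(BaseEstimator):
    """Compact copy of the switcher class from this answer."""

    def __init__(self, estimator=SGDClassifier()):
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y, **kwargs)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def score(self, X, y):
        return self.estimator.score(X, y)


# Hypothetical toy corpus (1 = spam, 0 = not spam), labels interleaved so that
# every cross-validation split sees both classes.
train_data = [
    "cheap pills buy now", "meeting agenda for monday",
    "win cash prize now", "project report attached",
    "buy cheap watches online", "lunch meeting tomorrow",
    "claim your free prize", "quarterly report and agenda",
]
train_labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", ClfSwitcher()),
])

# Reduced grids: 2 x 2 = 4 SGD candidates plus 2 naive Bayes candidates.
parameters = [
    {
        "clf__estimator": [SGDClassifier(random_state=0)],
        "clf__estimator__loss": ["hinge", "modified_huber"],
        "tfidf__stop_words": ["english", None],
    },
    {
        "clf__estimator": [MultinomialNB()],
        "clf__estimator__alpha": [1e-2, 1e-1],
    },
]

gscv = GridSearchCV(pipeline, parameters, cv=2)
gscv.fit(train_data, train_labels)

# The refitted pipeline holds the winning classifier inside the switcher.
best_clf = gscv.best_estimator_.named_steps["clf"].estimator
n_candidates = len(gscv.cv_results_["params"])

print(type(best_clf).__name__, gscv.best_score_)
print(gscv.best_params_)
```

    `gscv.best_params_` contains the `clf__estimator` entry itself, so it shows which classifier won as well as its tuned parameters.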
    

    How to interpret clf__estimator__loss

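    clf__estimator__loss is read from left to right, with each double underscore descending one level: clf is the pipeline step, which is a ClfSwitcher object; estimator is a parameter of that ClfSwitcher; and loss is a parameter of whatever estimator currently holds, which is SGDClassifier() in the first parameter grid above.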
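    The name resolution can be checked directly on a pipeline, without running a search. This is a small sketch, assuming a compact copy of the switcher class; the choice of `"modified_huber"` is arbitrary and only serves to show where the value ends up.

```python
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


class ClfSwitcher(BaseEstimator):
    """Compact copy of the switcher class from this answer."""

    def __init__(self, estimator=SGDClassifier()):
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y, **kwargs)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)


pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", ClfSwitcher()),
])

# Each "__" descends one level:
#   clf        -> the pipeline step named 'clf' (a ClfSwitcher)
#   estimator  -> the ClfSwitcher parameter holding the wrapped classifier
#   loss       -> the 'loss' parameter of that wrapped classifier
pipeline.set_params(
    clf__estimator=SGDClassifier(),
    clf__estimator__loss="modified_huber",
)

wrapped = pipeline.named_steps["clf"].estimator
resolved_loss = wrapped.loss

# Every addressable name, including the nested ones, is listed by get_params.
known_names = pipeline.get_params(deep=True)
print(resolved_loss, "clf__estimator__loss" in known_names)
```

    A name that does not exist on the current estimator, such as clf__estimator__alpha while an SGDClassifier with no matching parameter is in place, is rejected by `set_params`, which is why each classifier gets its own dictionary in the parameter grid.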
