How to subclass a vectorizer in scikit-learn without repeating all parameters in the constructor

Submitted by 别来无恙 on 2020-02-02 16:06:50

Question


I am trying to create a custom vectorizer by subclassing CountVectorizer. The vectorizer stems all the words in a document before counting word frequencies. I then use this vectorizer in a pipeline, which works fine when I call pipeline.fit(X, y).

However, when I try to set a parameter with pipeline.set_params(rf__verbose=1).fit(X,y), I get the following error:

RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class 'features.extraction.labels.StemmedCountVectorizer'> with constructor (self, *args, **kwargs) doesn't follow this convention.

Here is my custom vectorizer:

from nltk import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer


class StemmedCountVectorizer(CountVectorizer):
    def __init__(self, *args, **kwargs):
        self.stemmer = SnowballStemmer("english", ignore_stopwords=True)
        super().__init__(*args, **kwargs)

    def build_analyzer(self):
        # Wrap the stock analyzer so every token is stemmed before counting.
        analyzer = super().build_analyzer()
        return lambda doc: [' '.join(self.stemmer.stem(w) for w in word_tokenize(token))
                            for token in analyzer(doc)]
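
For reference, the pipeline is set up roughly like this (a simplified sketch; the RandomForestClassifier behind the 'rf' step name is an assumption inferred from the rf__ prefix above, and any final estimator reproduces the error):

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

X = ["the cats are running", "a cat ran"]  # toy data for illustration
y = [0, 1]

pipeline = Pipeline([
    ('vectorizer', StemmedCountVectorizer()),
    ('rf', RandomForestClassifier()),
])

pipeline.fit(X, y)                            # works
pipeline.set_params(rf__verbose=1).fit(X, y)  # raises the RuntimeError:
# set_params() calls get_params(deep=True), which inspects the __init__
# signature of every step in the pipeline.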

I understand that I could re-declare every single parameter of the CountVectorizer class explicitly, but that doesn't seem to follow the DRY principle.
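
For reference, repeating the parameters explicitly would look roughly like this (an abbreviated sketch with a hypothetical class name; only three of CountVectorizer's many parameters are shown, with their actual defaults):

class ExplicitStemmedCountVectorizer(CountVectorizer):
    # Abbreviated sketch only: a real version would have to re-declare
    # and forward every CountVectorizer parameter, not just these three.
    def __init__(self, input='content', encoding='utf-8', lowercase=True):
        self.stemmer = SnowballStemmer("english", ignore_stopwords=True)
        super().__init__(input=input, encoding=encoding, lowercase=lowercase)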

Thanks for your help!


Answer 1:


I have no experience with vectorizers in sklearn, but I ran into a similar problem. I implemented a custom estimator, let's call it MyBaseEstimator, extending sklearn.base.BaseEstimator, and then implemented a few other custom sub-estimators extending MyBaseEstimator. MyBaseEstimator defined multiple arguments in its __init__, and I didn't want to repeat the same arguments in the __init__ methods of each of the sub-estimators.

However, without re-declaring the arguments in the subclasses, much of sklearn's functionality didn't work, specifically setting these parameters for cross-validation. It seems that sklearn expects all the relevant parameters of an estimator to be retrievable and modifiable through the BaseEstimator.get_params() and BaseEstimator.set_params() methods, and when invoked on one of the subclasses, these methods do not return any of the parameters defined in the base class's __init__.
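
To make the mechanism concrete, here is a minimal sketch (not from my actual code): BaseEstimator derives the parameter names by inspecting the __init__ signature of the instance's concrete class, which is why parameters declared only in a base class's __init__ become invisible:

from sklearn.base import BaseEstimator

class Base(BaseEstimator):
    def __init__(self, alpha=1.0):
        self.alpha = alpha

class Sub(Base):
    def __init__(self, beta=2.0):  # alpha is not re-declared here
        super().__init__()
        self.beta = beta

print(Sub().get_params())  # {'beta': 2.0} -- alpha is missing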

To work around this, I implemented an overriding get_params() in MyBaseEstimator that uses an ugly hack to merge the parameters of the dynamic type (one of its subclasses) with the parameters defined by its own __init__.

Here's the same ugly hack applied to your CountVectorizer...

import copy
from sklearn.feature_extraction.text import CountVectorizer


class SubCountVectorizer(CountVectorizer):
    def __init__(self, p1=1, p2=2, **kwargs):
        # Per sklearn convention, store each named parameter on self so
        # that get_params()/set_params() can read and write it.
        self.p1 = p1
        self.p2 = p2
        super().__init__(**kwargs)

    def get_params(self, deep=True):
        params = super().get_params(deep)
        # Hack to make get_params() return the base class's params too:
        # view a shallow copy of this instance as a plain CountVectorizer.
        cp = copy.copy(self)
        cp.__class__ = CountVectorizer
        params.update(CountVectorizer.get_params(cp, deep))
        return params


if __name__ == '__main__':
    scv = SubCountVectorizer(p1='foo', input='bar', encoding='baz')
    scv.set_params(**{'p2': 'foo2', 'analyzer': 'bar2'})
    print(scv.get_params())

Running the above code prints the following:

{'p1': 'foo', 'p2': 'foo2',
'analyzer': 'bar2', 'binary': False,
'decode_error': 'strict', 'dtype': <class 'numpy.int64'>,
'encoding': 'baz', 'input': 'bar',
'lowercase': True, 'max_df': 1.0, 'max_features': None,
'min_df': 1, 'ngram_range': (1, 1), 'preprocessor': None,
'stop_words': None, 'strip_accents': None,
'token_pattern': '(?u)\\b\\w\\w+\\b',
'tokenizer': None, 'vocabulary': None}

which shows that sklearn's get_params() and set_params() both work, and that keyword arguments of both the subclass and the base class can be passed to the subclass's __init__.

Not sure how robust this is or whether it solves your exact issue, but it may be of use to someone.




Answer 2:


Assuming you are subclassing to add some additional methods (or override existing ones), I would expect you to only be instantiating the subclass, and therefore to have to provide the initialization arguments.

If you are creating several instances that all share mostly the same initialization data, with one or two instance-specific changes, then one solution could be to "freeze" the common data using functools.partial. For example (a generic, contrived example follows):

from functools import partial

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

class Bob(Person):
    def __init__(self, name, age, weight):
        super().__init__(name, age)
        self.weight = weight

    def refer_to_thyself(self):
        print('My name is {} and I am {} years old and weigh {} lbs.'.format(
            self.name, self.age, self.weight))

# Freeze the common arguments (name and age)...
Bob_cloner = partial(Bob, 'Bob', 20)
# ...then supply only the instance-specific weight.
Bob1 = Bob_cloner(175)
Bob2 = Bob_cloner(185)
Bob1.refer_to_thyself()
Bob2.refer_to_thyself()

Here we freeze the name and age using partial, and then just let the weight vary amongst the Bobs.
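
Applied to the question's vectorizer, the same idea might look like the sketch below. Note that the partial itself is not an estimator, so this only helps at construction time; inside a Pipeline you still need a real estimator instance:

from functools import partial
from sklearn.feature_extraction.text import CountVectorizer

# Freeze the options shared by every instance...
make_vectorizer = partial(CountVectorizer, stop_words='english', lowercase=False)

# ...and vary only what differs per instance.
unigram_vec = make_vectorizer(ngram_range=(1, 1))
bigram_vec = make_vectorizer(ngram_range=(1, 2))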



Source: https://stackoverflow.com/questions/51430484/how-to-subclass-a-vectorizer-in-scikit-learn-without-repeating-all-parameters-in
