Customized TransformerMixin with data labels in sklearn

野趣味 2021-01-05 18:39

I'm working on a small project where I'm trying to apply SMOTE "Synthetic Minority Over-sampling Technique", where my data is imbalanced ..

I created a customize

1 answer
  • 2021-01-05 19:12

    The fit() method should return self, not the transformed values. If you need the resampling to apply only to the training data and not to the test data, implement the fit_transform() method.

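    from imblearn.over_sampling import SMOTE
    from sklearn.base import BaseEstimator, TransformerMixin

    class smote(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            print(X.shape, ' ', type(X)) # (57, 28)   <class 'numpy.ndarray'>
            print(len(y), ' ', type(y))  #    57      <class 'list'>
            # kind= and .sample() come from an older imbalanced-learn API
            self.smote = SMOTE(kind='regular', n_jobs=-1).fit(X, y)

            return self

        def fit_transform(self, X, y=None):
            # Called by Pipeline.fit(): resample the training data
            self.fit(X, y)
            return self.smote.sample(X, y)

        def transform(self, X):
            # Called on test data: pass it through unchanged
            return X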
    

    Explanation: On the training data (i.e. when pipeline.fit() is called), Pipeline first tries to call fit_transform() on each intermediate step. If a step has no fit_transform(), it calls fit() and then transform() instead.

    On the test data, only transform() is called on each intermediate step, so the test data you supply is left unchanged here.
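
    The call order described above can be checked with a small sketch. CallRecorder is a hypothetical pass-through transformer invented for this illustration, and the toy data and LogisticRegression step are assumptions, not part of the question:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

calls = []  # records which transformer methods the pipeline invokes


class CallRecorder(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through step that only logs how it is called."""

    def fit(self, X, y=None):
        calls.append("fit")
        return self

    def fit_transform(self, X, y=None):
        calls.append("fit_transform")
        return X

    def transform(self, X):
        calls.append("transform")
        return X


# Toy data, just enough to fit a classifier
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([("rec", CallRecorder()), ("clf", LogisticRegression())])

pipe.fit(X, y)
calls_after_fit = list(calls)

pipe.predict(X)
calls_after_predict = list(calls)

print(calls_after_fit)      # ['fit_transform']
print(calls_after_predict)  # ['fit_transform', 'transform']
```

    Because TransformerMixin already supplies a default fit_transform(), a custom step has to override it explicitly if training and test data should be treated differently.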

    Update: The code above will still raise an error. When you oversample the supplied data, the number of samples changes in both X and y, but a scikit-learn Pipeline only passes the transformed X along; it never changes y. So even with the other problems fixed, you would get an error about the number of samples not matching the number of labels. And if the resampled X happened to have the same number of rows as before, the y values would still not correspond to the new samples.
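
    A rough sketch of that mismatch, using a hypothetical RowDuplicator step as a stand-in for the oversampler (the class, the toy data and the LogisticRegression step are assumptions for illustration only):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class RowDuplicator(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for an oversampler: doubles the rows of X only."""

    def fit(self, X, y=None):
        return self

    def fit_transform(self, X, y=None):
        # Only X can be returned from here; the pipeline keeps the old y.
        return np.vstack([X, X])

    def transform(self, X):
        return X


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([("dup", RowDuplicator()), ("clf", LogisticRegression())])

try:
    pipe.fit(X, y)  # the classifier receives 8 rows of X but only 4 labels
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

    The final estimator rejects the input because X has grown while y has not, which is the error described above.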

    Working solution: Silly me.

    You can just use the Pipeline from the imblearn package in place of the scikit-learn Pipeline. It automatically re-samples when fit() is called on the pipeline, and does not re-sample the test data (when transform() or predict() is called).

    Actually, I knew that imblearn's Pipeline handles the sample() method, but I was thrown off when you implemented a custom class and said that the test data must not change. It did not occur to me that that's the default behaviour.

    Just replace

    from sklearn.pipeline import Pipeline
    

    with

    from imblearn.pipeline import Pipeline
    

    and you are all set. There is no need for a custom class like yours; just use the original SMOTE. Something like:

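    from imblearn.over_sampling import SMOTE
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler

    random_state = 38
    model = Pipeline([
        # featureVECTOR: the asker's own feature-extraction step (not defined here)
        ('posFeat1', featureVECTOR()),
        ('sca1', StandardScaler()),

        # Original SMOTE class
        ('smote', SMOTE(random_state=random_state)),
        ('classification', SGDClassifier(loss='hinge', max_iter=1, random_state=random_state, tol=None))
    ])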
    