customized transformerMixin with data labels in sklearn

后端未结

关注

 1  644

I\'m working on a small project where I\'m trying to apply SMOTE \"Synthetic Minority Over-sampling Technique\", where my data is imbalanced ..

I created a customize

相关标签:

1条回答

情书的邮戳

2021-01-05 19:12
fit() mehtod should return self, not the transformed values. If you need the functioning only for train data and not test, then implement the fit_transform() method.
```
class smote(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print(X.shape, ' ', type(X)) # (57, 28)   <class 'numpy.ndarray'>
        print(len(y), ' ', type)     #    57      <class 'list'>
        self.smote = SMOTE(kind='regular', n_jobs=-1).fit(X, y)

        return self

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.smote.sample(X, y)

    def transform(self, X):
        return X
```
Explanation: On the train data (i.e. when pipeline.fit() is called) Pipeline will first try to call fit_transform() on the internal objects. If not found, then it will call fit() and transform() separately.

On the test data, only the transform() is called for each internal object, so here your supplied test data should not be changed.

Update: The above code will still throw error. You see, when you oversample the supplied data, the number of samples in X and y both change. But the pipeline will only work on the X data. It will not change the y. So either you will get error about unmatched samples to labels if I correct the above error. If by chance, the generated samples are equal to previous samples, then also the y values will not correspond to the new samples.

Working solution: Silly me.

You can just use the Pipeline from the imblearn package in place of scikit-learn Pipeline. It takes care automatically to re-sample when called fit() on the pipeline, and does not re-sample test data (when called transform() or predict()).

Actually I knew that imblearn.Pipeline handles sample() method, but was thrown off when you implemented a custom class and said that test data must not change. It did not come to my mind that thats the default behaviour.

Just replace
```
from sklearn.pipeline import Pipeline
```
with
```
from imblearn.pipeline import Pipeline
```
and you are all set. No need to make a custom class as you did. Just use original SMOTE. Something like:
```
random_state = 38
model = Pipeline([
        ('posFeat1', featureVECTOR()),
        ('sca1', StandardScaler()),

        # Original SMOTE class
        ('smote', SMOTE(random_state=random_state)),
        ('classification', SGDClassifier(loss='hinge', max_iter=1, random_state=random_state, tol=None))
    ])
```
0 讨论(0)
发布评论:

提交评论
- 加载中...