I\'m relatively new to Python. Can you help me improve my implementation of SMOTE to a proper pipeline? What I want is to apply the over and under sampling on the training set o
This looks like it would fit the bill http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html
You'll want to create your own transformer (http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) that upon calling fit returns a balanced data set (presumably the one gotten from StratifiedKFold), but upon calling predict, which is that is going to happen for the test data, calls into SMOTE.
You need to look at the pipeline object. imbalanced-learn has a Pipeline which extends the scikit-learn Pipeline, to adapt for the fit_sample() and sample() methods in addition to fit_predict(), fit_transform() and predict() methods of scikit-learn.
Have a look at this example here:
For your code, you would want to do this:
from imblearn.pipeline import make_pipeline, Pipeline
smote_enn = SMOTEENN(smote = sm)
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
pipeline = make_pipeline(smote_enn, clf_rf)
OR
pipeline = Pipeline([('smote_enn', smote_enn),
('clf_rf', clf_rf)])
Then you can pass this pipeline object to GridSearchCV, RandomizedSearchCV or other cross validation tools in the scikit-learn as a regular object.
kf = StratifiedKFold(n_splits=n_splits)
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
n_iter=1000,
cv = kf)