Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset that consists of n>2 features via cross-validation and by u
I didn't find the accepted answer very clear, so here is my solution for others.
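Basically, the idea is to write a new class that inherits from BaseEstimator and TransformerMixin, so that it can be used as a step in a pipeline.
The following is a feature selector based on the fraction of NAs within a column. The perc value is the threshold on that fraction (0.1 means 10%): only columns with a smaller share of NAs are kept.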
from sklearn.base import TransformerMixin, BaseEstimator

class NonNAselector(BaseEstimator, TransformerMixin):
    """Extract columns whose fraction of NAs is below perc, to impute
    them further down the line.

    Class to use in the pipeline
    -----
    attributes
    fit : identify the columns to keep, using the training set
    transform : keep only those columns
    """
    def __init__(self, perc=0.1):
        self.perc = perc
        self.columns_with_less_than_x_na_id = None

    def fit(self, X, y=None):
        # X is expected to be a pandas DataFrame.
        # Fraction of missing values in each column.
        na_fraction = X.isna().sum() / X.shape[0]
        # Keep only the column names for which the condition is True.
        self.columns_with_less_than_x_na_id = na_fraction[na_fraction < self.perc].index.tolist()
        return self

    def transform(self, X, y=None, **kwargs):
        return X[self.columns_with_less_than_x_na_id]

    # Optional: BaseEstimator already derives this from __init__'s signature.
    def get_params(self, deep=True):
        return {"perc": self.perc}
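As a rough usage sketch, here is how such a selector might sit in front of an imputer and a classifier in a Pipeline. The toy DataFrame, the step names ("select", "impute", "clf"), and the choice of SimpleImputer and LogisticRegression are assumptions made for illustration only. The class is repeated in compact form so the snippet runs on its own, and it assumes X is a pandas DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class NonNAselector(BaseEstimator, TransformerMixin):
    """Keep columns whose fraction of missing values is below perc."""

    def __init__(self, perc=0.1):
        self.perc = perc

    def fit(self, X, y=None):
        na_fraction = X.isna().mean()  # share of NAs per column
        self.columns_with_less_than_x_na_id = (
            na_fraction[na_fraction < self.perc].index.tolist()
        )
        return self

    def transform(self, X, y=None):
        return X[self.columns_with_less_than_x_na_id]


# Toy data, made up for illustration: "b" is 60% missing, "c" is 10% missing.
X = pd.DataFrame({
    "a": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "b": [np.nan] * 6 + [1.0, 2.0, 3.0, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

pipe = Pipeline([
    ("select", NonNAselector(perc=0.5)),         # drops "b"
    ("impute", SimpleImputer(strategy="mean")),  # fills the gap left in "c"
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

kept = pipe.named_steps["select"].columns_with_less_than_x_na_id
predictions = pipe.predict(X)
print(kept)
```

Because perc is an ordinary constructor parameter, it should also be reachable as select__perc, for example through pipe.set_params(select__perc=0.05) or in a grid-search parameter grid.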