How can I use a custom feature selection function in scikit-learn's `pipeline`?

闹比i 2021-01-30 18:47

Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset that consists of n>2 features via cross-validation and by u

5 answers
  •  别跟我提以往
    2021-01-30 19:19

    I didn't find the accepted answer very clear, so here is my solution for others. Basically, the idea is to create a new class that inherits from BaseEstimator and TransformerMixin.

    The following is a feature selector based on the proportion of NAs within a column. The perc value is the threshold, expressed as a fraction (0.1 means 10%): columns whose share of NAs is below it are kept.

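    from sklearn.base import TransformerMixin, BaseEstimator
    
    class NonNAselector(BaseEstimator, TransformerMixin):
    
        """Keep only the columns whose fraction of NA values is below perc,
        so that they can be imputed further down the line.
        Class to use in the pipeline
        -----
        methods
        fit : identify the columns to keep, using the training set
        transform : return only those columns
        """
    
        def __init__(self, perc=0.1):
            self.perc = perc
            self.columns_with_less_than_x_na_id = None
    
        def fit(self, X, y=None):
            # Fraction of NA values per column (X is expected to be a pandas DataFrame).
            na_fraction = X.isna().sum() / X.shape[0]
            # Filter with the boolean mask; calling .index on the mask itself
            # would return every column name, not just the selected ones.
            self.columns_with_less_than_x_na_id = na_fraction[na_fraction < self.perc].index.tolist()
            return self
    
        def transform(self, X, y=None, **kwargs):
            return X[self.columns_with_less_than_x_na_id]
    
        # BaseEstimator already derives this from __init__; kept explicit here.
        def get_params(self, deep=False):
            return {"perc": self.perc}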
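
    To show how such a selector might sit inside a pipeline, here is a minimal sketch. The class name NASelectorSketch, the attribute kept_columns_, the toy data, the threshold of 0.5 and the choice of SimpleImputer and LogisticRegression are all assumptions made for illustration, not part of the answer above. The class is a condensed restatement of the same idea so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class NASelectorSketch(BaseEstimator, TransformerMixin):
    """Condensed, illustrative version of the NA-based column selector."""

    def __init__(self, perc=0.1):
        self.perc = perc

    def fit(self, X, y=None):
        # Fraction of missing values per column of the training DataFrame.
        na_fraction = X.isna().mean()
        self.kept_columns_ = na_fraction[na_fraction < self.perc].index.tolist()
        return self

    def transform(self, X):
        return X[self.kept_columns_]


# Toy data, made up for this example: column "b" is 75% missing.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "a": rng.normal(size=40),
    "b": [np.nan] * 30 + list(rng.normal(size=10)),
    "c": rng.normal(size=40),
})
X.loc[0, "c"] = np.nan            # a single gap, left for the imputer
y = (X["a"] > 0).astype(int)

pipe = Pipeline([
    ("select", NASelectorSketch(perc=0.5)),     # drop mostly-empty columns
    ("impute", SimpleImputer(strategy="median")),
    ("clf", LogisticRegression()),
])

# The selector is refit on each training fold, so the column choice
# never sees the held-out data.
scores = cross_val_score(pipe, X, y, cv=5)

pipe.fit(X, y)
kept = pipe.named_steps["select"].kept_columns_
print(kept)
```

    Because the selector is a pipeline step, perc can also be tuned like any other hyperparameter, for example through the select__perc key in a GridSearchCV parameter grid.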
    
