How can I use a custom feature selection function in scikit-learn's `pipeline`?

闹比i 2021-01-30 18:47

Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset that consists of n>2 features via cross-validation and by u

5 Answers

别跟我提以往 (OP)

2021-01-30 19:19

I didn't find the accepted answer very clear, so here is my solution for others. Basically, the idea is to create a new class based on `BaseEstimator` and `TransformerMixin`.

The following is a feature selector based on the fraction of NAs within a column. The `perc` value is the threshold: columns whose fraction of NAs is below `perc` are kept (the default of 0.1 means 10%).
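
To make the threshold concrete, here is a small sketch of how the per-column NA fraction can be computed with pandas and compared against `perc`. The toy DataFrame and its column names are made up for illustration, and `perc` is assumed to be a fraction between 0 and 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" has no NAs, "b" has 1 of 4, "c" has 3 of 4.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

perc = 0.5  # keep columns whose NA fraction is below 50%

# isna() gives a boolean frame; mean() of booleans is the fraction of True.
na_fraction = df.isna().mean()

# Filter the Series by the boolean mask, then read the surviving labels.
kept_columns = na_fraction[na_fraction < perc].index.tolist()

print(na_fraction.to_dict())  # {'a': 0.0, 'b': 0.25, 'c': 0.75}
print(kept_columns)           # ['a', 'b']
```

Note that calling `.index` directly on the boolean mask `(na_fraction < perc)` would return every column name, because the mask has one entry per column whether it is True or False. The mask has to be applied as a filter first.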

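from sklearn.base import BaseEstimator, TransformerMixin

# Mixins go on the left and BaseEstimator on the right, as scikit-learn recommends.
class NonNAselector(TransformerMixin, BaseEstimator):

    """Select columns whose fraction of NAs is below a threshold, so that
    they can be imputed further down the line.
    Class to use in the pipeline
    -----
    parameters
    perc : maximum fraction of NAs (exclusive) for a column to be kept
    methods
    fit : identify the columns to keep, using the training set
    transform : return only those columns
    """

    def __init__(self, perc=0.1):
        # Only store constructor parameters here; fitted state is set in fit().
        self.perc = perc

    def fit(self, X, y=None):
        # X is expected to be a pandas DataFrame.
        na_fraction = X.isna().mean()  # per-column fraction of NAs
        # Filter by the boolean mask; calling .index on the mask itself
        # would return every column, kept or not.
        self.columns_ = na_fraction[na_fraction < self.perc].index.tolist()
        return self

    def transform(self, X, y=None):
        return X[self.columns_]

    # get_params / set_params are inherited from BaseEstimator, so no override is needed.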
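
For completeness, here is a sketch of how such a selector might be plugged into a `Pipeline` and tuned by cross-validation. A compact version of the selector is repeated so the snippet runs on its own. The synthetic data, the column names `f0` to `f3`, the step names, and the choice of `SimpleImputer` and `LogisticRegression` are all assumptions made for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class NonNAselector(TransformerMixin, BaseEstimator):
    """Keep DataFrame columns whose fraction of NAs is below `perc`."""

    def __init__(self, perc=0.1):
        self.perc = perc

    def fit(self, X, y=None):
        na_fraction = X.isna().mean()
        self.columns_ = na_fraction[na_fraction < self.perc].index.tolist()
        return self

    def transform(self, X):
        return X[self.columns_]


# Synthetic data, made up for this example.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f0", "f1", "f2", "f3"])
y = (X["f0"] + X["f1"] > 0).astype(int)  # target computed before adding NAs
X.loc[X.index[:5], "f1"] = np.nan        # 5% missing
X.loc[X.index[:60], "f3"] = np.nan       # 60% missing

pipe = Pipeline([
    ("selector", NonNAselector(perc=0.1)),       # drop sparse columns
    ("imputer", SimpleImputer(strategy="mean")), # fill the remaining NAs
    ("clf", LogisticRegression()),
])

# The threshold is addressed as <step name>__<parameter name>.
search = GridSearchCV(pipe, {"selector__perc": [0.01, 0.1, 0.9]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the selector is fitted inside each cross-validation split, the set of kept columns is decided from the training portion only. The grid search can set `perc` because `BaseEstimator` derives `get_params` and `set_params` from the `__init__` signature.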
