fit_transform() takes 2 positional arguments but 3 were given with LabelBinarizer

时光取名叫无心 2020-12-07 16:35

I am totally new to machine learning and have been working with unsupervised learning techniques.

The image shows my sample data (after all cleaning). Screenshot: Sample

13 answers
  • 2020-12-07 16:54

    I faced the same issue. The following link helped me fix it: https://github.com/ageron/handson-ml/issues/75

    Summary of the changes to make:

    1) Define the following class in your notebook:

    from sklearn.preprocessing import LabelBinarizer

    class SupervisionFriendlyLabelBinarizer(LabelBinarizer):
        # Accept (and ignore) the y argument that Pipeline passes along.
        def fit_transform(self, X, y=None):
            return super().fit_transform(X)
    
    

    2) Modify the following piece of code:

    cat_pipeline = Pipeline([('selector', DataFrameSelector(cat_attribs)),
                             ('label_binarizer', SupervisionFriendlyLabelBinarizer()),])
    

    3) Re-run the notebook. The pipeline should now run.

  • 2020-12-07 16:56

    To one-hot encode multiple categorical features, we can write a custom binarizer class and plug it into the categorical pipeline as follows.

    Suppose CAT_FEATURES = ['cat_feature1', 'cat_feature2'] is a list of categorical features. The following script should resolve the issue and produce the desired output.

    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.base import BaseEstimator, TransformerMixin
    
    class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
        """Perform one-hot encoding to categorical features."""
        def __init__(self, cat_features):
            self.cat_features = cat_features
    
        def fit(self, X_cat, y=None):
            return self
    
        def transform(self, X_cat):
            X_cat_df = pd.DataFrame(X_cat, columns=self.cat_features)
            X_onehot_df = pd.get_dummies(X_cat_df, columns=self.cat_features)
            return X_onehot_df.values
    
    # Pipeline for categorical features.
    cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(CAT_FEATURES)),
        ('onehot_encoder', CustomLabelBinarizer(CAT_FEATURES))
    ])
    
  • 2020-12-07 17:00

    The Problem:

    The pipeline assumes that LabelBinarizer's fit_transform method takes three positional arguments:

    def fit_transform(self, x, y):
        ...rest of the code
    

    while it is defined to take only two:

    def fit_transform(self, x):
        ...rest of the code
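
    The mismatch can be reproduced without scikit-learn. The sketch below is only an illustration: ToyBinarizer and pipeline_style_call are hypothetical stand-ins (not scikit-learn code) for a transformer whose fit_transform accepts a single data argument and for a caller that always passes both X and y.

```python
class ToyBinarizer:
    """Hypothetical stand-in: fit_transform accepts only one data argument."""

    def fit_transform(self, y):
        classes = sorted(set(y))
        # One row per value, one column per class.
        return [[int(value == cls) for cls in classes] for value in y]


def pipeline_style_call(step, X, y=None):
    """Hypothetical caller that, like a pipeline, always passes X and y."""
    return step.fit_transform(X, y)


try:
    pipeline_style_call(ToyBinarizer(), ["a", "b", "a"])
    error_message = None
except TypeError as exc:
    # self + X + y = 3 positional arguments, but only 2 are accepted.
    error_message = str(exc)

print(error_message)
```

    Python counts self as the first positional argument, which is why the message reports 3 arguments even though the caller passes only X and y.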
    

    Possible Solution:

    This can be solved by making a custom transformer that can handle 3 positional arguments:

    1. Import and make a new class:

      from sklearn.base import TransformerMixin  # gives fit_transform method for free
      from sklearn.preprocessing import LabelBinarizer

      class MyLabelBinarizer(TransformerMixin):
          def __init__(self, *args, **kwargs):
              self.encoder = LabelBinarizer(*args, **kwargs)
          def fit(self, x, y=None):
              self.encoder.fit(x)
              return self
          def transform(self, x, y=None):
              return self.encoder.transform(x)
      
    2. Keep the rest of your code the same, but use the class we created, MyLabelBinarizer(), instead of LabelBinarizer().


    Note: If you want access to LabelBinarizer attributes (e.g. classes_), add the following line to the fit method, after the call to self.encoder.fit(x):

        self.classes_, self.y_type_, self.sparse_input_ = self.encoder.classes_, self.encoder.y_type_, self.encoder.sparse_input_
    
  • 2020-12-07 17:00

    I ran into the same problem and got it working by applying the workaround specified in the book's GitHub repo.

    Warning: earlier versions of the book used the LabelBinarizer class at this point. Again, this was incorrect: just like the LabelEncoder class, the LabelBinarizer class was designed to preprocess labels, not input features. A better solution is to use Scikit-Learn's upcoming CategoricalEncoder class: it will soon be added to Scikit-Learn, and in the meantime you can use the code below (copied from Pull Request #9151).

    To save you some grepping, here's the workaround. Just paste it into an earlier cell and run it:

    # Definition of the CategoricalEncoder class, copied from PR #9151.
    # Just run this cell, or copy it to your code, do not try to understand it (yet).
    
    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.utils import check_array
    from sklearn.preprocessing import LabelEncoder
    from scipy import sparse
    
    class CategoricalEncoder(BaseEstimator, TransformerMixin):
        def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                     handle_unknown='error'):
            self.encoding = encoding
            self.categories = categories
            self.dtype = dtype
            self.handle_unknown = handle_unknown
    
        def fit(self, X, y=None):
            """Fit the CategoricalEncoder to X.
            Parameters
            ----------
            X : array-like, shape [n_samples, n_feature]
                The data to determine the categories of each feature.
            Returns
            -------
            self
            """
    
            if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
                template = ("encoding should be either 'onehot', 'onehot-dense' "
                            "or 'ordinal', got %s")
                raise ValueError(template % self.encoding)
    
            if self.handle_unknown not in ['error', 'ignore']:
                template = ("handle_unknown should be either 'error' or "
                            "'ignore', got %s")
                raise ValueError(template % self.handle_unknown)
    
            if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
                raise ValueError("handle_unknown='ignore' is not supported for"
                                 " encoding='ordinal'")
    
            X = check_array(X, dtype=object, accept_sparse='csc', copy=True)
            n_samples, n_features = X.shape
    
            self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]
    
            for i in range(n_features):
                le = self._label_encoders_[i]
                Xi = X[:, i]
                if self.categories == 'auto':
                    le.fit(Xi)
                else:
                    valid_mask = np.in1d(Xi, self.categories[i])
                    if not np.all(valid_mask):
                        if self.handle_unknown == 'error':
                            diff = np.unique(Xi[~valid_mask])
                            msg = ("Found unknown categories {0} in column {1}"
                                   " during fit".format(diff, i))
                            raise ValueError(msg)
                    le.classes_ = np.array(np.sort(self.categories[i]))
    
            self.categories_ = [le.classes_ for le in self._label_encoders_]
    
            return self
    
        def transform(self, X):
            """Transform X using one-hot encoding.
            Parameters
            ----------
            X : array-like, shape [n_samples, n_features]
                The data to encode.
            Returns
            -------
            X_out : sparse matrix or a 2-d array
                Transformed input.
            """
            X = check_array(X, accept_sparse='csc', dtype=object, copy=True)
            n_samples, n_features = X.shape
            X_int = np.zeros_like(X, dtype=int)
            X_mask = np.ones_like(X, dtype=bool)
    
            for i in range(n_features):
                valid_mask = np.in1d(X[:, i], self.categories_[i])
    
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(X[~valid_mask, i])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during transform".format(diff, i))
                        raise ValueError(msg)
                    else:
                        # Set the problematic rows to an acceptable value and
                        # continue. The rows are marked in `X_mask` and will be
                        # removed later.
                        X_mask[:, i] = valid_mask
                        X[:, i][~valid_mask] = self.categories_[i][0]
                X_int[:, i] = self._label_encoders_[i].transform(X[:, i])
    
            if self.encoding == 'ordinal':
                return X_int.astype(self.dtype, copy=False)
    
            mask = X_mask.ravel()
            n_values = [cats.shape[0] for cats in self.categories_]
            n_values = np.array([0] + n_values)
            indices = np.cumsum(n_values)
    
            column_indices = (X_int + indices[:-1]).ravel()[mask]
            row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                    n_features)[mask]
            data = np.ones(n_samples * n_features)[mask]
    
            out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                    shape=(n_samples, indices[-1]),
                                    dtype=self.dtype).tocsr()
            if self.encoding == 'onehot-dense':
                return out.toarray()
            else:
                return out
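
    For readers on scikit-learn 0.20 or later: as far as I know, CategoricalEncoder was never released under that name, and its functionality is available through OneHotEncoder and OrdinalEncoder, which accept string categories directly. A minimal sketch, assuming such a version is installed; the sample values below are made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up sample column; the encoder expects 2-D input (n_samples, n_features).
ocean_proximity = [["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]]

encoder = OneHotEncoder(handle_unknown="ignore")
# The result is a sparse matrix by default, so convert it for inspection.
encoded = encoder.fit_transform(ocean_proximity).toarray()

# Categories learned during fit, one list per input column.
categories = [list(column) for column in encoder.categories_]

# With handle_unknown="ignore", an unseen category becomes an all-zero row.
unseen = encoder.transform([["<1H OCEAN"]]).toarray()

print(categories)
print(encoded)
print(unseen)
```

    OneHotEncoder's fit and fit_transform accept an optional y argument, so it can be used as a pipeline step without a wrapper class.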
    
  • 2020-12-07 17:02

    Since LabelBinarizer's fit_transform doesn't accept more than 2 positional arguments, you should create a custom binarizer like this:

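    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import LabelBinarizer

    class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
        def __init__(self, sparse_output=False):
            self.sparse_output = sparse_output
        def fit(self, X, y=None):
            # Learn the classes at fit time so transform() reuses them.
            self.enc_ = LabelBinarizer(sparse_output=self.sparse_output)
            self.enc_.fit(X)
            return self
        def transform(self, X, y=None):
            return self.enc_.transform(X)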
    
    num_attribs = list(housing_num)
    cat_attribs = ['ocean_proximity']
    
    num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy='median')),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scalar', StandardScaler())
    ])
    
    cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer', CustomLabelBinarizer())
    ])
    
    full_pipeline = FeatureUnion(transformer_list=[
        ('num_pipeline', num_pipeline),
        ('cat_pipeline', cat_pipeline)
    ])
    
    housing_prepared = full_pipeline.fit_transform(new_housing)
    
  • 2020-12-07 17:04

    You can create one more Custom Transformer which does the encoding for you.

    class CustomLabelEncode(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return LabelEncoder().fit_transform(X)
    

    In this example we used LabelEncoder, but you can use LabelBinarizer as well.
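
    One caveat with calling fit_transform inside transform: the encoder is refit on whatever data it receives, so the same category can get a different code on new data. The sketch below illustrates this with ToyLabelEncoder, a hypothetical pure-Python stand-in that assigns codes by sorted order; it is not scikit-learn's implementation, and the sample values are made up.

```python
class ToyLabelEncoder:
    """Hypothetical stand-in: maps each distinct value to its sorted position."""

    def fit(self, values):
        self.classes_ = sorted(set(values))
        return self

    def transform(self, values):
        index = {cls: i for i, cls in enumerate(self.classes_)}
        return [index[value] for value in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train = ["INLAND", "ISLAND", "NEAR BAY"]
test = ["NEAR BAY", "INLAND"]

# Refitting on the test data: codes depend only on what the test data contains.
refit_codes = ToyLabelEncoder().fit_transform(test)

# Fitting once on the training data and reusing the mapping.
encoder = ToyLabelEncoder().fit(train)
reused_codes = encoder.transform(test)

print(refit_codes)   # [1, 0]
print(reused_codes)  # [2, 0]
```

    Fitting the encoder in fit and only calling its transform in transform keeps the mapping consistent between training and later data.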
