Custom transformer mixin with FeatureUnion in scikit-learn

蹲街弑〆低调 submitted on 2021-01-29 10:46:47

Question


I am writing custom transformers in scikit-learn to perform specific operations on an array, by subclassing TransformerMixin. This works fine with a single transformer. However, when I combine several of them with FeatureUnion (or make_union), the original columns are replicated once per transformer. How can I avoid that? Am I using scikit-learn the way it is intended?

import numpy as np
from sklearn.base import TransformerMixin
from sklearn.pipeline import FeatureUnion

# creation of array
s1 = np.array(['foo', 'bar', 'baz'])
s2 = np.array(['a', 'b', 'c'])
X = np.column_stack([s1, s2])
print('base array: \n', X, '\n')

# A fake example that appends a column (Could be a score, ...) calculated on specific columns from X
class DummyTransformer(TransformerMixin):
    def __init__(self, value=None):
        TransformerMixin.__init__(self)
        self.value = value

    def fit(self, *_):
        return self

    def transform(self, X):
        # appends a column (in this case, a constant) to X
        s = np.full(X.shape[0], self.value)
        X = np.column_stack([X, s])
        return X

# as such, the transformer gives what I need first
transfo = DummyTransformer(value=1)
print('single transformer: \n', transfo.fit_transform(X), '\n')

# but when I try to chain them and create a pipeline I run into the replication of existing columns
stages = []
for i in range(2):
    transfo = DummyTransformer(value=i+1)
    stages.append(('step'+str(i+1),transfo))
pipeunion = FeatureUnion(stages)
print('Given result of the Feature union pipeline: \n', pipeunion.fit_transform(X), '\n')
# columns 1&2 from X are replicated

# I would expect:
expected = np.column_stack([X, np.full(X.shape[0], 1), np.full(X.shape[0], 2) ])
print('Expected result of the Feature Union pipeline: \n', expected, '\n')

Output:

base array: 
 [['foo' 'a']
 ['bar' 'b']
 ['baz' 'c']] 

single transformer: 
 [['foo' 'a' '1']
 ['bar' 'b' '1']
 ['baz' 'c' '1']] 

Given result of the Feature union pipeline: 
 [['foo' 'a' '1' 'foo' 'a' '2']
 ['bar' 'b' '1' 'bar' 'b' '2']
 ['baz' 'c' '1' 'baz' 'c' '2']] 

Expected result of the Feature Union pipeline: 
 [['foo' 'a' '1' '2']
 ['bar' 'b' '1' '2']
 ['baz' 'c' '1' '2']] 

Many thanks


Answer 1:


FeatureUnion simply concatenates whatever it gets from its internal transformers. In your case, each internal transformer returns the original columns along with its new one, so those columns appear once per transformer. It is up to the transformers to send only the correct data forward.
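To see why the columns get duplicated, here is a rough NumPy-only sketch. It assumes that, for dense outputs, FeatureUnion amounts to running every transformer on the same X and stacking the results side by side; append_constant is a hypothetical stand-in for the question's transform method, not part of scikit-learn.

```python
import numpy as np

X = np.column_stack([np.array(['foo', 'bar', 'baz']),
                     np.array(['a', 'b', 'c'])])

def append_constant(X, value):
    # Hypothetical helper mirroring the question's transform():
    # it returns all of X plus one new constant column.
    return np.column_stack([X, np.full(X.shape[0], value)])

# Approximation of the union step: each transformer sees the same X,
# and the outputs are concatenated horizontally.
outputs = [append_constant(X, 1), append_constant(X, 2)]
combined = np.hstack(outputs)

# Six columns: the two original columns are carried by both outputs.
print(combined.shape)
```

Since each output already contains X, the concatenation necessarily contains X once per transformer.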

I would advise returning only the new data from the internal transformers, and then concatenating the remaining columns either outside or inside the FeatureUnion.
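As a minimal sketch of the "outside" variant, the union can produce only the new columns, which are then stacked next to X by hand. ConstantColumn is a hypothetical name for a transformer that returns just its new column; inheriting from BaseEstimator as well is my own choice here, not something the question requires.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion

class ConstantColumn(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that returns ONLY its new column."""

    def __init__(self, value=None):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One column with as many rows as X, filled with the constant.
        return np.full((X.shape[0], 1), self.value)

X = np.column_stack([np.array(['foo', 'bar', 'baz']),
                     np.array(['a', 'b', 'c'])])

union = FeatureUnion([('step1', ConstantColumn(value=1)),
                      ('step2', ConstantColumn(value=2))])
new_cols = union.fit_transform(X)      # only the two new columns

# Concatenate the original data outside the FeatureUnion.
result = np.hstack([X, new_cols])
print(result)
```

The trade-off is that the original columns are no longer part of the union itself, so this form is less convenient if the union has to sit inside a larger Pipeline.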

Look at this example if you haven't already:

  • http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

You can do this for example:

# This doesn't change anything; it just passes the data through as it is
class DataPasser(TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

# Your transformer
class DummyTransformer(TransformerMixin):
    def __init__(self, value=None):
        TransformerMixin.__init__(self)
        self.value = value

    def fit(self, *_):
        return self

    # Changed this to only return new column after some operation on X
    def transform(self, X):
        s = np.full(X.shape[0], self.value)
        return s.reshape(-1,1)

Then, further down in your code, build the stages like this:

stages = []    

# Append our DataPasser here, so original data is at the beginning
stages.append(('no_change', DataPasser()))


for i in range(2):
    transfo = DummyTransformer(value=i+1)
    stages.append(('step'+str(i+1),transfo))

pipeunion = FeatureUnion(stages)

Running this new code gives the following (printed under Python 2, hence the tuple form and the byte-string dtype):

('Given result of the Feature union pipeline: \n', 
array([['foo', 'a', '1', '2'],
       ['bar', 'b', '1', '2'],
       ['baz', 'c', '1', '2']], dtype='|S21'), '\n')
('Expected result of the Feature Union pipeline: \n', 
array([['foo', 'a', '1', '2'],
       ['bar', 'b', '1', '2'],
       ['baz', 'c', '1', '2']], dtype='|S21'), '\n')
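As a possible variation on DataPasser, scikit-learn's FunctionTransformer behaves as an identity transformer when no function is given, so it could fill the same role without a custom class. The sketch below assumes a reasonably recent scikit-learn, where FunctionTransformer does not validate its input by default; ConstantColumn is again a hypothetical name for a transformer that returns only its new column.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

class ConstantColumn(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that returns ONLY its new column."""

    def __init__(self, value=None):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.full((X.shape[0], 1), self.value)

X = np.column_stack([np.array(['foo', 'bar', 'baz']),
                     np.array(['a', 'b', 'c'])])

# FunctionTransformer() with func=None passes X through unchanged,
# so the original columns come first in the union's output.
stages = [('no_change', FunctionTransformer())]
for i in range(2):
    stages.append(('step' + str(i + 1), ConstantColumn(value=i + 1)))

result = FeatureUnion(stages).fit_transform(X)
print(result)
```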


Source: https://stackoverflow.com/questions/52116786/custom-transformer-mixin-with-featureunion-in-scikit-learn
