Consistent ColumnTransformer for intersecting lists of columns

ε祈祈猫儿з 提交于 2021-01-24 08:17:31

问题


I want to use sklearn.compose.ColumnTransformer consistently (not parallel, so, the second transformer should be executed only after the first) for intersecting lists of columns in this way:

log_transformer = p.FunctionTransformer(lambda x: np.log(x))
df = pd.DataFrame({'a': [1,2, np.NaN, 4], 'b': [1,np.NaN, 3, 4], 'c': [1 ,2, 3, 4]})
compose.ColumnTransformer(n_jobs=1,
                         transformers=[
                             ('num', impute.SimpleImputer() , ['a', 'b']),
                             ('log', log_transformer, ['b', 'c']),
                             ('scale', p.StandardScaler(), ['a', 'b', 'c'])
                         ]).fit_transform(df)

So, I want to use SimpleImputer for 'a', 'b', then log for 'b', 'c', and then StandardScaler for 'a', 'b', 'c'.

But:

  1. I get array of (4, 7) shape.
  2. I still get Nan in a and b columns.

So, how can I use ColumnTransformer for different columns in the manner of Pipeline?

UPD:

pipe_1 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=42)),
])

pipe_2 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=24)),
])

pipe_3 = pipeline.Pipeline(steps=[
    ('scl', p.StandardScaler()),
])

# in the real situation I don't know exactly what cols these arrays contain, so they are not static: 
cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']

proc = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('1', pipe_1, cols_1),
    ('2', pipe_2, cols_2),
    ('3', pipe_3, cols_3),
])
proc.fit_transform(df).T

Output:

array([[ 1.        ,  2.        , 42.        ,  4.        ],
       [ 1.        , 24.        ,  3.        ,  4.        ],
       [-1.06904497, -0.26726124,         nan,  1.33630621],
       [-1.33630621,         nan,  0.26726124,  1.06904497],
       [-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079]])

I understood why I have cols duplicates, nans and not scaled values, but how can I fix this in the correct way when cols are not static?

UPD2:

A problem may arise when the columns change their order. So, I want to use FunctionTransformer for columns selection:

def select_col(X, cols=None):
    return X[cols]

ct1 = compose.make_column_transformer(
    (p.OneHotEncoder(), p.FunctionTransformer(select_col, kw_args=dict(cols=['a', 'b']))),
    remainder='passthrough'
)

ct1.fit(df)

But get this output:

ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed

How can I fix it?


回答1:


The intended usage of ColumnTransformer is that the different transformers are applied in parallel, not sequentially. To accomplish your desired outcome, three approaches come to mind:

First approach:

pipe_a = Pipeline(steps=[('imp', SimpleImputer()),
                         ('scale', StandardScaler())])
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
                         ('log', log_transformer),
                         ('scale', StandardScaler())])
pipe_c = Pipeline(steps=[('log', log_transformer),
                         ('scale', StandardScaler())])
proc = ColumnTransformer(transformers=[
    ('a', pipe_a, ['a']),
    ('b', pipe_b, ['b']),
    ('c', pipe_c, ['c'])]
)

This second one actually won't work, because the ColumnTransformer will rearrange the columns and forget the names*, so that the later ones will fail or apply to the wrong columns. When sklearn finalizes how to pass along dataframes or feature names, this may be salvaged, or you may be able to tweak it for your specific usecase now. (* ColumnTransformer does already have a get_feature_names, but the actual data passed through the pipeline doesn't have that information.)

imp_tfm = ColumnTransformer(
    transformers=[('num', impute.SimpleImputer() , ['a', 'b'])],
    remainder='passthrough'
    )
log_tfm = ColumnTransformer(
    transformers=[('log', log_transformer, ['b', 'c'])],
    remainder='passthrough'
    )
scl_tfm = ColumnTransformer(
    transformers=[('scale', StandardScaler(), ['a', 'b', 'c'])
    )
proc = Pipeline(steps=[
    ('imp', imp_tfm),
    ('log', log_tfm),
    ('scale', scl_tfm)]
)

Third, there may be a way to use the Pipeline slicing feature to have one "master" pipeline that you cut down for each feature... this would work mostly like the first approach, might save some coding in the case of larger pipelines, but seems a little hacky. For example, here you can:

pipe_a = clone(pipe_b)[1:]
pipe_c = clone(pipe_b)
pipe_c.steps[1] = ('nolog', 'passthrough')

(Without cloning or otherwise deep-copying pipe_b, the last line would change both pipe_c and pipe_b. The slicing mechanism returns a copy, so pipe_a doesn't strictly need to be cloned, but I've left it in to feel safer. Unfortunately you can't provide a discontinuous slice, so pipe_c = pipe_b[0,2] doesn't work, but you can set the individual slices as I've done above to "passthrough" to disable them.)




回答2:


We can use little columns_name_to_index hack to convert column names to index and then we can pass the dataframe to the pipeline like this:

def columns_name_to_index(arr_of_names, df):
    return [df.columns.get_loc(c) for c in arr_of_names if c in df]

cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']

ct1 = compose.ColumnTransformer(remainder='passthrough', transformers=[
    (impute.SimpleImputer(strategy='constant', fill_value=42), columns_name_to_index(cols_1, df)),
    (impute.SimpleImputer(strategy='constant', fill_value=24), columns_name_to_index(cols_2, df)),
])

ct2 = compose.ColumnTransformer(remainder='passthrough', transformers=[
    (p.StandardScaler(), columns_name_to_index(cols_3, df)),
])

pipe = pipeline.Pipeline(steps=[
    ('ct1', ct1),
    ('ct2', ct2),
])

pipe.fit_transform(df).T


来源:https://stackoverflow.com/questions/62225230/consistent-columntransformer-for-intersecting-lists-of-columns

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!