Getting feature names from within a FeatureUnion + Pipeline

五迷三道 提交于 2019-11-29 16:56:33

问题


I am using a FeatureUnion to join features found from the title and description of events:

union = FeatureUnion(
    transformer_list=[
    # Pipeline for pulling features from the event's title
        ('title', Pipeline([
            ('selector', TextSelector(key='title')),
            ('count', CountVectorizer(stop_words='english')),
        ])),

        # Pipeline for standard bag-of-words model for description
        ('description', Pipeline([
            ('selector', TextSelector(key='description_snippet')),
            ('count', TfidfVectorizer(stop_words='english')),
        ])),
    ],

    transformer_weights ={
            'title': 1.0,
            'description': 0.2
    },
)

However, calling union.get_feature_names() gives me an error: "Transformer title (type Pipeline) does not provide get_feature_names." I'd like to see some of the features that are generated by my different Vectorizers. How do I do this?


回答1:


Its because you are using a custom transfomer called TextSelector. Did you implement get_feature_names in TextSelector?

You are going to have to implement this method within your custom transform if you want this to work.

Here is a concrete example for you:

from sklearn.datasets import load_boston
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import TransformerMixin
import pandas as pd

dat = load_boston()
X = pd.DataFrame(dat['data'], columns=dat['feature_names'])
y = dat['target']

# define first custom transformer
class first_transform(TransformerMixin):
    def transform(self, df):
        return df

    def get_feature_names(self):
        return df.columns.tolist()


class second_transform(TransformerMixin):
    def transform(self, df):
        return df

    def get_feature_names(self):
        return df.columns.tolist()



pipe = Pipeline([
       ('features', FeatureUnion([
                    ('custom_transform_first', first_transform()),
                    ('custom_transform_second', second_transform())
                ])
        )])

>>> pipe.named_steps['features']_.get_feature_names()
['custom_transform_first__CRIM',
 'custom_transform_first__ZN',
 'custom_transform_first__INDUS',
 'custom_transform_first__CHAS',
 'custom_transform_first__NOX',
 'custom_transform_first__RM',
 'custom_transform_first__AGE',
 'custom_transform_first__DIS',
 'custom_transform_first__RAD',
 'custom_transform_first__TAX',
 'custom_transform_first__PTRATIO',
 'custom_transform_first__B',
 'custom_transform_first__LSTAT',
 'custom_transform_second__CRIM',
 'custom_transform_second__ZN',
 'custom_transform_second__INDUS',
 'custom_transform_second__CHAS',
 'custom_transform_second__NOX',
 'custom_transform_second__RM',
 'custom_transform_second__AGE',
 'custom_transform_second__DIS',
 'custom_transform_second__RAD',
 'custom_transform_second__TAX',
 'custom_transform_second__PTRATIO',
 'custom_transform_second__B',
 'custom_transform_second__LSTAT']

Keep in mind that Feature Union is going to concatenate the two lists emitted from the respective get_feature_names from each of your transformers. this is why you are getting an error when one or more of your transformers do not have this method.

However, I can see that this alone will not fix your problem, as Pipeline objects don't have a get_feature_names method in them, and you have nested pipelines (pipelines within Feature Unions.). So you have two options:

  1. Subclass Pipeline and add it get_feature_names method yourself, which gets the feature names from the last transformer in the chain.

  2. Extract the feature names yourself from each of the transformers, which will require you to grab those transformers out of the pipeline yourself and call get_feature_names on them.

Also, keep in mind that many sklearn built in transformers don't operate on DataFrame but pass numpy arrays around, so just watch out for it if you are going to be chaining lots of transformers together. But I think this gives you enough information to give you an idea of what is happening.

One more thing, have a look at sklearn-pandas. I haven't used it myself but it might provide a solution for you.




回答2:


You can call your different Vectorizers as a nested feature by this (thanks edesz):

pipevect= dict(pipeline.named_steps['union'].transformer_list).get('title').named_steps['count']

And then you got the TfidfVectorizer() instance to pass in another function:

Show_most_informative_features(pipevect,
       pipeline.named_steps['classifier'], n=MostIF)
# or direct   
print(pipevect.get_feature_names())


来源:https://stackoverflow.com/questions/42479370/getting-feature-names-from-within-a-featureunion-pipeline

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!