How to make pipeline for multiple dataframe columns?

日久生厌 2020-12-21 08:29

I have a DataFrame which can be simplified to this:

import pandas as pd

df = pd.DataFrame([{
'title': 'batman',
'text': 'man bat man bat',
\'url\': \         


        
3 Answers
  •  独厮守ぢ
    2020-12-21 09:02

    Take a look at the following link: http://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html

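    from sklearn.base import BaseEstimator, TransformerMixin

    class ItemSelector(BaseEstimator, TransformerMixin):
        """Select one column (or dict entry) by its key."""
        def __init__(self, key):
            self.key = key

        def fit(self, x, y=None):
            # Stateless: there is nothing to learn from the data.
            return self

        def transform(self, data_dict):
            # For a DataFrame, this returns the column labelled self.key.
            return data_dict[self.key]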
    

    The key argument accepts a pandas DataFrame column label. In your pipeline, it can be used like this:

    ('tfidf_word', Pipeline([
        ('selector', ItemSelector(key='column_name')),
        ('tfidf', TfidfVectorizer()),
    ]))
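To show how the selector step above could cover several columns at once, here is a minimal sketch that wraps one such pipeline per column in a FeatureUnion. Only the 'title' and 'text' column names come from the question. The second data row and the step names tfidf_title and tfidf_text are made up for illustration, so treat this as one possible arrangement and not as the asker's actual setup.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single DataFrame column by its label."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


# Hypothetical toy data: the second row is invented for this example.
df = pd.DataFrame([
    {'title': 'batman', 'text': 'man bat man bat'},
    {'title': 'superman', 'text': 'man super man'},
])

# One sub-pipeline per text column; FeatureUnion stacks their outputs side by side.
features = FeatureUnion([
    ('tfidf_title', Pipeline([
        ('selector', ItemSelector(key='title')),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('tfidf_text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('tfidf', TfidfVectorizer()),
    ])),
])

X = features.fit_transform(df)
print(X.shape)  # one row per DataFrame row, one column per vocabulary term
```

Each TfidfVectorizer learns its own vocabulary, so the width of X is the sum of the per-column vocabulary sizes. The features object can then be placed as the first step of a larger Pipeline in front of a classifier.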
    
