ColumnTransformer with TfidfVectorizer produces “empty vocabulary” error

灰色年华 2020-12-06 08:19

I am running a very simple experiment with ColumnTransformer, intending to transform a list of columns, ["a"] in this example:

from skle         


        
2 answers
  •  旧巷少年郎
    2020-12-06 08:34

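    We can create a custom TF-IDF transformer that takes several columns and joins them row by row before calling .fit() or .transform(). TfidfVectorizer expects a one-dimensional sequence of strings, so the selected columns have to be merged into one string per row first.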

    Try this!

    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer

    class custom_tfidf(BaseEstimator, TransformerMixin):
        """Join the selected text columns row-wise, then apply one TF-IDF vectorizer."""

        def __init__(self, tfidf):
            self.tfidf = tfidf

        def fit(self, X, y=None):
            # X is the DataFrame slice selected by ColumnTransformer;
            # merge its columns into a single string per row.
            joined_X = X.apply(lambda row: ' '.join(row), axis=1)
            self.tfidf.fit(joined_X)
            return self

        def transform(self, X):
            joined_X = X.apply(lambda row: ' '.join(row), axis=1)
            return self.tfidf.transform(joined_X)

    dataset = pd.DataFrame({"a": ["word gone wild", "word gone with wind"],
                            "b": [" gone fhgf wild", "gone with wind"],
                            "c": [1, 2]})
    # min_df=1 is the default and keeps every term; recent scikit-learn
    # versions reject the integer 0 used originally.
    tfidf = TfidfVectorizer(min_df=1)

    clmn = ColumnTransformer([("tfidf", custom_tfidf(tfidf), ['a', 'b'])],
                             remainder="passthrough")
    clmn.fit_transform(dataset)
    
    # Output:
    array([[0.36439074, 0.51853403, 0.72878149, 0.        , 0.        ,
            0.25926702, 1.        ],
           [0.        , 0.438501  , 0.        , 0.61629785, 0.61629785,
            0.2192505 , 2.        ]])
    

    P.S.: You might instead want to create a separate TF-IDF vectorizer for each column and store them in a dictionary, with the column name as the key and the fitted vectorizer as the value. That dictionary can then be used to transform the corresponding columns.
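A minimal sketch of the per-column idea from the P.S., offered as one possible way to do it rather than a definitive implementation. The class name `PerColumnTfidf` and the attribute `vectorizers_` are hypothetical names chosen for illustration; they are not part of scikit-learn. It assumes the input is a pandas DataFrame whose columns all contain text.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer


class PerColumnTfidf(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fits one TF-IDF vectorizer per text column."""

    def __init__(self, tfidf=None):
        self.tfidf = tfidf

    def fit(self, X, y=None):
        base = self.tfidf if self.tfidf is not None else TfidfVectorizer()
        # Dictionary: column name -> vectorizer fitted on that column only.
        self.vectorizers_ = {col: clone(base).fit(X[col]) for col in X.columns}
        return self

    def transform(self, X):
        # Transform each column with its own vectorizer,
        # then stack the resulting sparse blocks side by side.
        blocks = [vec.transform(X[col]) for col, vec in self.vectorizers_.items()]
        return hstack(blocks).tocsr()


dataset = pd.DataFrame({"a": ["word gone wild", "word gone with wind"],
                        "b": [" gone fhgf wild", "gone with wind"],
                        "c": [1, 2]})

per_column = PerColumnTfidf()
features = per_column.fit_transform(dataset[["a", "b"]])
print(features.shape)
```

Unlike the joined version, each column keeps its own vocabulary here, so the same word in "a" and "b" ends up in two separate features.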
