ColumnTransformer with TfidfVectorizer produces “empty vocabulary” error

灰色年华 2020-12-06 08:19

I am running a very simple experiment with ColumnTransformer, intending to transform a list of columns (["a"] in this example):

from skle         


        
2 Answers
  •  粉色の甜心
    2020-12-06 08:26

    That's because you are providing ["a"] instead of "a" in ColumnTransformer. According to the documentation:

    A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.

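    TfidfVectorizer expects a single iterable of strings as input, that is, a 1-d sequence of documents. Because you pass a list of column names to ColumnTransformer (even though the list contains only one column), a 2-d object is handed to TfidfVectorizer instead. With a pandas DataFrame, iterating over that 2-d object yields the column labels rather than the rows, so the vectorizer sees the single document "a". The default token pattern only keeps tokens of two or more characters, so nothing survives and the vocabulary ends up empty. Hence the error.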

    Change that to:

    clmn = ColumnTransformer([("tfidf", tfidf, "a")],
                             remainder="passthrough")
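
    The difference can be reproduced end to end with a small sketch. The toy DataFrame and the helper name fit_with are assumptions made for illustration only; the original dataset is not shown in the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the question's dataset.
dataset = pd.DataFrame({
    "a": ["word1 word2", "word2 word3", "word3 word4"],
    "b": [1, 2, 3],
})


def fit_with(columns):
    """Fit a ColumnTransformer that applies TF-IDF to `columns` (illustrative helper)."""
    clmn = ColumnTransformer(
        [("tfidf", TfidfVectorizer(), columns)],
        remainder="passthrough",
    )
    return clmn.fit_transform(dataset)


# A list selects a 2-d, one-column DataFrame; the vectorizer then iterates
# over the column labels instead of the documents and finds no usable tokens.
try:
    fit_with(["a"])
    list_error = None
except ValueError as exc:
    list_error = str(exc)

# A scalar string selects a 1-d Series of documents, which is what
# TfidfVectorizer expects.
features = fit_with("a")
n_rows, n_cols = features.shape

print("error with ['a']:", list_error)
print("shape with 'a':", (n_rows, n_cols))
```

    Wrapping the failing call in try/except keeps both cases in one script, so the two selectors can be compared side by side.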
    

    To see the difference for yourself, select the same column from a pandas DataFrame both ways and check the type and shape of what comes back:

    dataset['a']
    
    vs 
    
    dataset[['a']]
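
    A quick way to run that check is sketched below. The two-row DataFrame is a made-up stand-in for the real data; only the selection syntax comes from the answer.

```python
import pandas as pd

# Hypothetical stand-in for the question's dataset.
dataset = pd.DataFrame({"a": ["word1 word2", "word2 word3"], "b": [1, 2]})

series = dataset["a"]    # scalar key: 1-d Series
frame = dataset[["a"]]   # list key: 2-d DataFrame with one column

series_info = (type(series).__name__, series.shape)
frame_info = (type(frame).__name__, frame.shape)

# Iterating shows what a vectorizer would receive as "documents":
# the Series yields the cell values, the DataFrame yields column labels.
docs_from_series = list(series)
docs_from_frame = list(frame)

print(series_info, docs_from_series)
print(frame_info, docs_from_frame)
```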
    

    Update: @SergeyBushmanov, regarding your comment on the other answer, I think you are misinterpreting the documentation. To apply tfidf to two columns, you need to pass two transformers, one per column. Something like this:

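    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer

    # One vectorizer per text column, each selected with a scalar column name.
    # min_df=1 (the default) keeps every term that occurs in at least one document.
    tfidf_1 = TfidfVectorizer(min_df=1)
    tfidf_2 = TfidfVectorizer(min_df=1)
    clmn = ColumnTransformer([("tfidf_1", tfidf_1, "a"),
                              ("tfidf_2", tfidf_2, "b")],
                             remainder="passthrough")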
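
    As a rough illustration of the two-transformer setup, the sketch below fits it on invented data and inspects the vocabulary learned for each column. The column contents and the extra numeric column "c" are assumptions, not part of the original question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data: two text columns plus one numeric column.
dataset = pd.DataFrame({
    "a": ["red apple", "green apple", "red cherry"],
    "b": ["small round", "large round", "small oval"],
    "c": [1, 2, 3],
})

clmn = ColumnTransformer(
    [
        ("tfidf_a", TfidfVectorizer(), "a"),  # scalar selector: 1-d input
        ("tfidf_b", TfidfVectorizer(), "b"),
    ],
    remainder="passthrough",  # column "c" is appended unchanged
)
features = clmn.fit_transform(dataset)

# Each vectorizer learns its own vocabulary from its own column.
vocab_a = sorted(clmn.named_transformers_["tfidf_a"].vocabulary_)
vocab_b = sorted(clmn.named_transformers_["tfidf_b"].vocabulary_)

print(features.shape)
print(vocab_a)
print(vocab_b)
```

    Keeping a separate vectorizer per column means the vocabularies stay independent, so a word that appears in both columns would get two distinct features.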
    
