Pass tokens to CountVectorizer

天涯浪人 2021-02-13 22:27

I have a text classification problem with two types of features:

  • features which are n-grams (extracted by CountVectorizer)
  • other textual features
3 answers
  •  没有蜡笔的小新
    2021-02-13 22:45

    Summarizing the answers of @user126350 and @miroli and this link:

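    from sklearn.feature_extraction.text import CountVectorizer
    
    def dummy(doc):
        # Identity function: the documents are already lists of tokens.
        return doc
    
    cv = CountVectorizer(
            tokenizer=dummy,
            preprocessor=dummy,
            token_pattern=None,  # unused with a custom tokenizer; None avoids a warning
        )
    
    docs = [
        ['hello', 'world', '.'],
        ['hello', 'world'],
        ['again', 'hello', 'world']
    ]
    
    cv.fit(docs)
    # get_feature_names() was removed in scikit-learn 1.2;
    # on older versions, call cv.get_feature_names() instead.
    list(cv.get_feature_names_out())
    # ['.', 'again', 'hello', 'world']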
    

    One thing to keep in mind: wrap a new tokenized document in a list before calling transform(), so that it is handled as a single document rather than each token being interpreted as a separate document:

    new_doc = ['again', 'hello', 'world', '.']
    v_1 = cv.transform(new_doc)    # wrong: each token is treated as its own document
    v_2 = cv.transform([new_doc])  # right: one document made of four tokens
    
    v_1.shape
    # (4, 4)
    
    v_2.shape
    # (1, 4)
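
    Since the question mentions n-gram features, here is a minimal sketch of the same pass-through idea combined with ngram_range. It assumes that CountVectorizer builds its n-grams from whatever token list the tokenizer returns, so bigrams come out as space-joined pairs of the supplied tokens. The helper name identity is only illustrative, and the vocabulary is read from cv.vocabulary_ so the snippet does not depend on which feature-name method your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer


def identity(doc):
    # Illustrative helper: return the pre-tokenized document unchanged.
    return doc


cv = CountVectorizer(
    tokenizer=identity,
    preprocessor=identity,
    token_pattern=None,   # not used when a tokenizer is supplied
    ngram_range=(1, 2),   # unigrams and bigrams built from the given tokens
)

docs = [
    ['hello', 'world', '.'],
    ['hello', 'world'],
    ['again', 'hello', 'world'],
]

X = cv.fit_transform(docs)

# vocabulary_ maps each n-gram to its column index.
vocab = sorted(cv.vocabulary_)
print(vocab)
print(X.shape)
```

    The bigrams are formed only within a document, so no n-gram spans two documents.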
    
