Use sklearn TfidfVectorizer with already tokenized inputs?

闹比i 2021-02-05 14:29

I have a list of tokenized sentences and would like to fit a tfidf Vectorizer. I tried the following:

tokenized_list_of_sentences = [['this', 'is', 'one'],

3 Answers
  •  旧时难觅i
    2021-02-05 14:46

    As @Jarad said, just use a "passthrough" function as your analyzer, but make it filter out stop words. You can get stop words from sklearn:

    >>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    >>> stop_words = set(ENGLISH_STOP_WORDS)
    

    or from nltk:

    >>> import nltk
    >>> nltk.download('stopwords')
    >>> from nltk.corpus import stopwords
    >>> stop_words = set(stopwords.words('english'))
    

    or combine both sets:

    >>> stop_words = stop_words.union(ENGLISH_STOP_WORDS)
    

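    Note, however, that your examples contain only stop words: every token in them is in the sklearn.feature_extraction.text.ENGLISH_STOP_WORDS set, so nothing would be left after filtering.

    Nonetheless, @Jarad's approach works once the sentences contain some non-stop words: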

    >>> tokenized_list_of_sentences =  [
    ...     ['this', 'is', 'one', 'cat', 'or', 'dog'],
    ...     ['this', 'is', 'another', 'dog']]
    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> tfidf = TfidfVectorizer(analyzer=lambda x:[w for w in x if w not in stop_words])
    >>> tfidf_vectors = tfidf.fit_transform(tokenized_list_of_sentences)
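
The lambda above can also be written as a named function, which is easier to read and, unlike a lambda, can be pickled together with the fitted vectorizer. The sketch below is one possible way to do that, not the only one; the name `passthrough_analyzer` is made up for illustration, and only sklearn's own stop-word list is used.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer


def passthrough_analyzer(tokens):
    """Hypothetical helper: keep the given tokens as-is, minus English stop words."""
    return [tok for tok in tokens if tok not in ENGLISH_STOP_WORDS]


tokenized_list_of_sentences = [
    ["this", "is", "one", "cat", "or", "dog"],
    ["this", "is", "another", "dog"],
]

# A callable analyzer receives each raw document (here, a list of tokens)
# and must return the features to count, so no further tokenizing happens.
tfidf = TfidfVectorizer(analyzer=passthrough_analyzer)
tfidf_vectors = tfidf.fit_transform(tokenized_list_of_sentences)

# vocabulary_ maps each term to its column index; sort terms by that index
columns = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
print(columns)
print(tfidf_vectors.shape)
```

Because the analyzer is a callable, the vectorizer's own lowercasing and tokenizing are not applied; any normalization has to happen before the token lists are passed in.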
    

    I like pd.DataFrames for browsing TF-IDF vectors:

    >>> import pandas as pd
    >>> columns = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)  # terms ordered by column index
    >>> pd.DataFrame(tfidf_vectors.toarray(), columns=columns)
            cat       dog 
    0  0.814802  0.579739
    1  0.000000  1.000000
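
To see where these numbers come from, here is a small pure-Python sketch that recomputes them by hand. It assumes the vectorizer's default settings (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), under which the inverse document frequency is ln((1 + n) / (1 + df)) + 1 and each row is scaled to unit Euclidean length. The two documents are what remains of the example sentences after stop-word removal; the helper name `tfidf_row` is made up for illustration.

```python
import math

# The example sentences after stop words have been filtered out
docs = [["cat", "dog"], ["dog"]]

vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed idf, assuming smooth_idf=True: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(doc):
    """Hypothetical helper: raw term count times idf, then L2-normalized."""
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print([round(v, 6) for v in row])
```

"dog" appears in both documents, so its idf is exactly 1, while "cat" appears in only one and gets a higher weight. That is why "cat" dominates the first row, and why the second row, which contains only "dog", normalizes to 1.0.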
    
