I have a list of tokenized sentences and would like to fit a TfidfVectorizer on it. I tried the following:
tokenized_list_of_sentences = [['this', 'is', 'one'],
As @Jarad said, just use a "passthrough" function for your analyzer, but it also needs to filter out stop words. You can get stop words from sklearn:
>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
or from nltk:
>>> import nltk
>>> nltk.download('stopwords')
>>> from nltk.corpus import stopwords
>>> stop_words = set(stopwords.words('english'))
or combine both sets:
>>> stop_words = stop_words.union(ENGLISH_STOP_WORDS)
But then your examples contain only stop words, because all of your words are in sklearn's ENGLISH_STOP_WORDS set.
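A quick way to check this is to filter the tokens by hand. The sketch below uses only the three tokens visible in the question and sklearn's built-in list (no NLTK). It also assumes that a vectorizer whose analyzer drops every token fails with a ValueError about an empty vocabulary, which is how scikit-learn behaves in the versions I have used:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Only the tokens visible in the question; the rest of that list is not shown.
question_tokens = ['this', 'is', 'one']

# Every token is a stop word, so nothing survives the filter.
remaining = [w for w in question_tokens if w not in ENGLISH_STOP_WORDS]
print(remaining)

# With no surviving tokens the vocabulary is empty and fitting fails.
tfidf = TfidfVectorizer(
    analyzer=lambda doc: [w for w in doc if w not in ENGLISH_STOP_WORDS])
try:
    tfidf.fit_transform([question_tokens])
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print(err)
```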
Nonetheless, @Jarad's examples work:
>>> tokenized_list_of_sentences = [
... ['this', 'is', 'one', 'cat', 'or', 'dog'],
... ['this', 'is', 'another', 'dog']]
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer(analyzer=lambda x:[w for w in x if w not in stop_words])
>>> tfidf_vectors = tfidf.fit_transform(tokenized_list_of_sentences)
I like pd.DataFrames for browsing TF-IDF vectors:
>>> import pandas as pd
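>>> # vocabulary_ is a term -> column-index dict whose key order is not guaranteed
>>> # to match the columns; get_feature_names_out() (scikit-learn >= 1.0) is.
>>> pd.DataFrame(tfidf_vectors.todense(), columns=tfidf.get_feature_names_out())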
        cat       dog
0  0.814802  0.579739
1  0.000000  1.000000
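For use outside the REPL, the same steps can be collected into one script. This is a sketch rather than @Jarad's exact code: `drop_stop_words` is a hypothetical name, and it uses only sklearn's stop word list so that no NLTK download is needed. A named module-level function replaces the lambda, which should make the fitted vectorizer picklable with the standard pickle module. Column labels are ordered by the indices stored in vocabulary_, which avoids depending on a particular scikit-learn version:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer


def drop_stop_words(tokens):
    """Pass an already-tokenized document through, minus stop words."""
    return [w for w in tokens if w not in ENGLISH_STOP_WORDS]


tokenized_list_of_sentences = [
    ['this', 'is', 'one', 'cat', 'or', 'dog'],
    ['this', 'is', 'another', 'dog'],
]

# A callable analyzer receives each document as-is, so no re-tokenizing happens.
tfidf = TfidfVectorizer(analyzer=drop_stop_words)
tfidf_vectors = tfidf.fit_transform(tokenized_list_of_sentences)

# Order the terms by their column index so the labels line up with the matrix.
feature_names = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
dense = tfidf_vectors.toarray()

for row in dense:
    print(dict(zip(feature_names, row.round(6))))
```

Swapping in the NLTK list, or the union of both sets, only changes the set that `drop_stop_words` checks against.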