tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer

粉色の甜心 2020-12-07 15:37

This page, http://scikit-learn.org/stable/modules/feature_extraction.html, mentions:

As tf–idf is very often used for text features, there is also an

2 answers
  • 2020-12-07 16:18

    Here is how to get the TF-IDF values of the terms in a given document; change doc to inspect the other documents:

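    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["This is very strange", "This is very nice"]  # same corpus as in the other answer
    tf = TfidfVectorizer()
    X = tf.fit_transform(corpus)  # sparse matrix, one row per document
    feature_names = tf.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
    doc = 0
    feature_index = X[doc, :].nonzero()[1]  # column indices of the terms present in doc
    for i in feature_index:
        print(feature_names[i], X[doc, i])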
    
    this 0.448320873199
    is 0.448320873199
    very 0.448320873199
    strange 0.630099344518
    
    #and for doc=1
    this 0.448320873199
    is 0.448320873199
    very 0.448320873199
    nice 0.630099344518
    

    The results are normalized per document (L2 norm, the TfidfVectorizer default), so the squared scores of a document sum to 1:

    >>> 0.448320873199**2 + 0.448320873199**2 + 0.448320873199**2 + 0.630099344518**2
    0.9999999999997548
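The same check can be run on the whole matrix instead of on hand-copied numbers. This is a small sketch that assumes the default settings of TfidfVectorizer and converts the matrix to a dense array, which is only reasonable for a tiny corpus like this one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange", "This is very nice"]
vectorizer = TfidfVectorizer()  # norm="l2" is the default
X = vectorizer.fit_transform(corpus)

# Euclidean (L2) norm of each document's row; dense conversion is fine for a toy corpus
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)
```

Passing norm=None to TfidfVectorizer turns the normalization off, in which case the rows hold the unscaled tf * idf products.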

  • 2020-12-07 16:35

    Since version 0.15, the idf weight of each feature (the global part of the tf-idf score, not the per-document score itself) can be retrieved via the attribute idf_ of the TfidfVectorizer object:

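    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["This is very strange",
              "This is very nice"]
    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(corpus)
    idf = vectorizer.idf_  # one idf weight per vocabulary term
    # get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
    # .tolist() converts NumPy scalars to plain Python values for readable output
    print(dict(zip(vectorizer.get_feature_names_out().tolist(), idf.tolist())))
    

    Output (wrapped for readability):

    {'is': 1.0,
     'nice': 1.4054651081081644,
     'strange': 1.4054651081081644,
     'this': 1.0,
     'very': 1.0}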
    

    As discussed in the comments, prior to version 0.15, a workaround is to access the attribute idf_ via the supposedly hidden _tfidf (an instance of TfidfTransformer) of the vectorizer:

    # _tfidf is a private attribute, so this may break between releases
    idf = vectorizer._tfidf.idf_
    print(dict(zip(vectorizer.get_feature_names(), idf)))
    

    which should give the same output as above.
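To see how the idf_ weights relate to the tf-idf scores in X, the sketch below rebuilds X by hand from raw term counts. It assumes the default TfidfVectorizer settings (raw term counts, smoothed idf, L2 normalization); the variable names raw and manual are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["This is very strange", "This is very nice"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus).toarray()

# Raw term counts; CountVectorizer builds the same sorted vocabulary by default
counts = CountVectorizer().fit_transform(corpus).toarray()

# tf * idf for every (document, term) pair, then L2-normalize each row
raw = counts * vectorizer.idf_
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(np.allclose(manual, X))
```

So idf_ holds one weight per vocabulary term, shared by all documents, while each row of X holds the per-document scores after multiplying by the term counts and normalizing.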
