How to see top n entries of term-document matrix after tfidf in scikit-learn

走了就别回头了 2020-12-22 20:05

I am new to scikit-learn and am using TfidfVectorizer to find the tf-idf values of terms in a set of documents. I used the following code to obtain them.

1 Answer
  • 2020-12-22 20:42

    Since version 0.15, the global term weighting (the inverse document frequency) learned by a fitted TfidfVectorizer can be accessed through the attribute idf_, an array whose length equals the number of features. Note that this weight reflects how rare a term is across the whole corpus, not its tf-idf score in any single document. Sort the features by this weighting to get the top weighted features:

    from sklearn.feature_extraction.text import TfidfVectorizer
    import numpy as np
    
    lectures = ["this is some food", "this is some drink"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(lectures)
    # Feature indices from highest to lowest idf; a stable sort keeps tie order reproducible
    indices = np.argsort(vectorizer.idf_, kind="stable")[::-1]
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    features = vectorizer.get_feature_names_out()
    top_n = 2
    top_features = [features[i] for i in indices[:top_n]]
    print(top_features)
    

    Output:

    ['food', 'drink']
    

    The second problem, getting the top features for each n-gram size, can be solved with the same idea, plus an extra step that groups the features by the number of words they contain:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from collections import defaultdict
    
    lectures = ["this is some food", "this is some drink"]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(lectures)
    # Group (feature, idf) pairs by the number of words in the feature
    features_by_gram = defaultdict(list)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    for f, w in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
        features_by_gram[len(f.split(' '))].append((f, w))
    top_n = 2
    for gram, features in features_by_gram.items():
        # sorted() is stable, so features with equal idf keep alphabetical order
        top_features = sorted(features, key=lambda x: x[1], reverse=True)[:top_n]
        top_features = [f[0] for f in top_features]
        print('{}-gram top:'.format(gram), top_features)
    

    Output:

    1-gram top: ['drink', 'food']
    2-gram top: ['some drink', 'some food']
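
    To see why "food" and "drink" come out on top in both outputs, the idf values can be reproduced by hand. The sketch below is plain Python, not scikit-learn code, and rests on two assumptions: the default smooth_idf=True formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and whitespace splitting as a stand-in for the default tokenizer (which happens to give the same unigrams for this corpus):

```python
import math
from collections import Counter

lectures = ["this is some food", "this is some drink"]
n_docs = len(lectures)

# Document frequency: number of documents in which each term appears at least once
doc_freq = Counter(term for doc in lectures for term in set(doc.split()))

# Smoothed idf (assumes smooth_idf=True): ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Highest idf first, alphabetical among ties
for term in sorted(idf, key=lambda t: (-idf[t], t)):
    print(term, round(idf[term], 4))
```

    Terms that occur in both documents ("this", "is", "some") get an idf of exactly 1.0, while "food" and "drink", each found in only one document, get ln(3/2) + 1, roughly 1.4055. Since the two top terms are tied, their relative order depends only on how the sort breaks ties.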
    