Scikit-learn TfidfVectorizer: How to get the top n terms with the highest tf-idf score

独厮守ぢ 2020-12-07 18:46

I am working on a keyword extraction problem. Consider the very general case:

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')

t = """Two

2 Answers
  •  盖世英雄少女心
    2020-12-07 19:46

    A solution that works on the sparse matrix directly (without calling .toarray()):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    tfidf = TfidfVectorizer(stop_words='english')
    corpus = [
        'I would like to check this document',
        'How about one more document',
        'Aim is to capture the key words from the corpus',
        'frequency of words in a document is called term frequency'
    ]
    
    X = tfidf.fit_transform(corpus)
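    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
    feature_names = np.array(tfidf.get_feature_names_out(), dtype=str)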
    
    
    new_doc = ['can key words in this new document be identified?',
               'idf is the inverse document frequency calculated for each of the words']
    responses = tfidf.transform(new_doc)
    
    
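    def get_top_tf_idf_words(response, top_n=2):
        # response is a single sparse row: .data holds its non-zero tf-idf
        # scores and .indices holds the matching vocabulary column indices.
        # argsort is ascending, so the reversed slice picks the top_n largest.
        sorted_nzs = np.argsort(response.data)[:-(top_n+1):-1]
        return feature_names[response.indices[sorted_nzs]]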
      
    print([get_top_tf_idf_words(response,2) for response in responses])
    
    #[array(['key', 'words'], dtype='
