List the words in a vocabulary according to occurrence in a text corpus, Scikit-Learn

Submitted by 笑着哭i on 2019-12-29 04:21:19

Question


I have fitted a CountVectorizer to some documents in scikit-learn. I would like to see all the terms and their corresponding frequencies in the text corpus, in order to select stop-words. For example:

'and' 123 times, 'to' 100 times, 'for' 90 times, ... and so on

Is there any built-in function for this?


Answer 1:


If cv is your CountVectorizer and X is the vectorized corpus, then

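import numpy as np
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
list(zip(cv.get_feature_names_out(),
         np.asarray(X.sum(axis=0)).ravel()))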

produces the (term, frequency) pairs, one for each distinct term that the CountVectorizer extracted from the corpus.

(The little asarray + ravel dance is needed to work around some quirks in scipy.sparse.)
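To make that quirk concrete, here is a minimal sketch using a small hand-built sparse matrix as a stand-in for the vectorized corpus X (the values are made up for illustration). It assumes X is a scipy.sparse csr_matrix, which is what CountVectorizer returns; summing over axis 0 then gives a 2-D result with one row rather than a flat vector.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2-document x 3-term count matrix standing in for X.
X = csr_matrix(np.array([[1, 0, 2],
                         [0, 3, 1]]))

# Column sums come back as a 2-D object with a single row.
col_sums = X.sum(axis=0)

# Convert to a plain ndarray and flatten to one count per term.
counts = np.asarray(col_sums).ravel()

print(col_sums.shape)  # (1, 3)
print(counts.shape)    # (3,)
```

Without the flattening step, zipping the terms against the sums would pair the first term with the whole row instead of pairing each term with its own count.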




Answer 2:


There is no built-in. I have found a faster way to do it based on Ando Saabas's answer:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["Hello world", "Python makes a better world"]
vec = CountVectorizer().fit(texts)
bag_of_words = vec.transform(texts)
# Sum each column to get the total count of every term across the corpus.
sum_words = bag_of_words.sum(axis=0)
# vocabulary_ maps each term to its column index in the count matrix.
words_freq = [(word, int(sum_words[0, idx])) for word, idx in vec.vocabulary_.items()]
# Most frequent terms first.
sorted(words_freq, key=lambda x: x[1], reverse=True)

Output (terms with equal counts may appear in a different order):

[('world', 2), ('python', 1), ('hello', 1), ('better', 1), ('makes', 1)]
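Since the goal in the question is to pick stop-words, a sorted list like this can be cut off at the top. The following is only a sketch: the helper name top_terms, the cutoff n, and the sample pairs (taken from the counts mentioned in the question plus two made-up entries) are all assumptions for illustration.

```python
# Hypothetical (term, frequency) pairs in the same shape as words_freq.
words_freq = [("python", 4), ("and", 123), ("for", 90), ("to", 100), ("corpus", 2)]

def top_terms(pairs, n):
    """Return the n most frequent terms, highest count first."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]

# Candidate stop-words: the three most frequent terms.
stop_word_candidates = top_terms(words_freq, 3)
print(stop_word_candidates)  # ['and', 'to', 'for']
```

A list like this can then be passed to a new CountVectorizer through its stop_words parameter, after checking by eye that the candidates really are uninformative for the task at hand.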


Source: https://stackoverflow.com/questions/16078015/list-the-words-in-a-vocabulary-according-to-occurrence-in-a-text-corpus-scikit
