Using CountVectorizer to compute word occurrence for my own vocabulary in Python

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-05 18:30:24

You can build a vocabulary of all possible bi-grams and tri-grams by specifying the ngram_range parameter in CountVectorizer. After fit_transform, you can inspect the vocabulary and frequencies with the get_feature_names() and toarray() methods (in recent scikit-learn versions, get_feature_names() has been replaced by get_feature_names_out()). The latter returns a term-frequency matrix with one row per document. Further information: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

from sklearn.feature_extraction.text import CountVectorizer

Doc1 = 'And that was the fallacy. Once I was free to talk with staff members'
Doc2 = 'In the new, stripped-down, every-job-counts business climate, these human'
Doc3 = 'Another reality makes emotional intelligence ever more crucial'
Doc4 = 'The globalization of the workforce puts a particular premium on emotional'
Doc5 = 'As business changes, so do the traits needed to excel. Data tracking'
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]

# Extract every bi-gram and tri-gram found in the documents
vectorizer = CountVectorizer(ngram_range=(2, 3))
tf = vectorizer.fit_transform(doc_set)
vectorizer.vocabulary_          # mapping of n-gram -> column index
vectorizer.get_feature_names()  # n-grams in column order
tf.toarray()                    # counts, one row per document

As for what you have tried to do: it works if you first fit CountVectorizer on your vocabulary and then transform the documents.

my_vocabulary = ['was the fallacy', 'more crucial', 'particular premium', 'to excel', 'data tracking', 'another reality']

vectorizer = CountVectorizer(ngram_range=(2, 3))
vectorizer.fit_transform(my_vocabulary)
tf = vectorizer.transform(doc_set)

vectorizer.vocabulary_
Out[26]: 
{'another reality': 0,
 'data tracking': 1,
 'more crucial': 2,
 'particular premium': 3,
 'the fallacy': 4,
 'to excel': 5,
 'was the': 6,
 'was the fallacy': 7}

tf.toarray()
Out[25]: 
array([[0, 0, 0, 0, 1, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0]], dtype=int64)