Using CountVectorizer to compute word occurrences for my own vocabulary in Python


Question


This is a sample of my document set:

Doc1: ['And that was the fallacy. Once I was free to talk with staff members']

Doc2: ['In the new, stripped-down, every-job-counts business climate, these human']

Doc3 : ['Another reality makes emotional intelligence ever more crucial']

Doc4: ['The globalization of the workforce puts a particular premium on emotional']

Doc5: ['As business changes, so do the traits needed to excel. Data tracking']

and this is a sample of my vocabulary:

my_vocabulary = ['was the fallacy', 'free to', 'stripped-down', 'ever more', 'of the workforce', 'the traits needed']

The point is that every entry in my vocabulary is a bigram or trigram. My vocabulary includes all possible bigrams and trigrams in my document set; I just gave a sample here. Given the application, this is how my vocabulary has to be. I am trying to use CountVectorizer as follows:

from sklearn.feature_extraction.text import CountVectorizer
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]
vectorizer = CountVectorizer( vocabulary=my_vocabulary)
tf = vectorizer.fit_transform(doc_set) 

I am expecting to get something like this :

print tf:
(0, 126)    1
(0, 6804)   1
(0, 5619)   1
(0, 5019)   2
(0, 5012)   1
(0, 999)    1
(0, 996)    1
(0, 4756)   4

where the first column is the document ID, the second column is the term's ID in the vocabulary, and the third column is that term's occurrence count in the document. But tf is empty. I know that, at the end of the day, I could write code that loops over all the words in the vocabulary, computes the occurrences and builds the matrix, but can I use CountVectorizer for this input and save time? Am I doing something wrong here? If CountVectorizer is not the right way to do it, any recommendation would be appreciated.


Answer 1:


You can build a vocabulary of all possible bigrams and trigrams by specifying the ngram_range parameter in CountVectorizer. After fit_transform you can inspect the vocabulary with the vectorizer's get_feature_names() method and the counts with the result matrix's toarray() method; the latter returns one row of term frequencies per document. Further information: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

from sklearn.feature_extraction.text import CountVectorizer

Doc1 = 'And that was the fallacy. Once I was free to talk with staff members'
Doc2 = 'In the new, stripped-down, every-job-counts business climate, these human'
Doc3 = 'Another reality makes emotional intelligence ever more crucial'
Doc4 = 'The globalization of the workforce puts a particular premium on emotional'
Doc5 = 'As business changes, so do the traits needed to excel. Data tracking'
doc_set = [Doc1, Doc2, Doc3, Doc4, Doc5]

# extract every bigram and trigram in the documents as a feature
vectorizer = CountVectorizer(ngram_range=(2, 3))
tf = vectorizer.fit_transform(doc_set)
vectorizer.vocabulary_          # dict mapping each n-gram to its column index
vectorizer.get_feature_names()  # the n-grams, in column order
tf.toarray()                    # one row of counts per document
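
Incidentally, the (document ID, word ID, count) triples shown in the question are simply how SciPy prints a sparse matrix, so printing tf directly (instead of converting it with toarray()) gives exactly that representation:

print(tf)  # one '(row, column)  count' line per stored non-zero entry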

As for what you have tried: tf came out empty because CountVectorizer's default ngram_range is (1, 1), so the analyzer only extracts single words, which can never match your multi-word vocabulary entries. It would work if you train CountVectorizer on your vocabulary with a matching ngram_range and then transform the documents.

my_vocabulary = ['was the fallacy', 'more crucial', 'particular premium', 'to excel', 'data tracking', 'another reality']

vectorizer = CountVectorizer(ngram_range=(2, 3))
vectorizer.fit_transform(my_vocabulary)  # learn the vocabulary from the phrases themselves
tf = vectorizer.transform(doc_set)       # then count those n-grams in the documents

vectorizer.vocabulary_
Out[26]: 
{'another reality': 0,
 'data tracking': 1,
 'more crucial': 2,
 'particular premium': 3,
 'the fallacy': 4,
 'to excel': 5,
 'was the': 6,
 'was the fallacy': 7}

tf.toarray()
Out[25]: 
array([[0, 0, 0, 0, 1, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0]], dtype=int64)
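
Note that fitting on the phrases also records all of their sub-n-grams ('the fallacy' and 'was the' above). If you want columns for exactly your vocabulary entries and nothing else, a minimal sketch of your original approach that should also work is to keep the vocabulary parameter but add a matching ngram_range; just make sure each entry is written the way the default analyzer would produce it (lowercased, punctuation stripped), so 'stripped-down' becomes 'stripped down':

from sklearn.feature_extraction.text import CountVectorizer

# doc_set as defined above; hyphenated phrases rewritten the way the
# default tokenizer emits them (punctuation is split away)
my_vocabulary = ['was the fallacy', 'free to', 'stripped down', 'ever more',
                 'of the workforce', 'the traits needed']

# a fixed vocabulary plus an ngram_range wide enough to generate its entries
vectorizer = CountVectorizer(vocabulary=my_vocabulary, ngram_range=(2, 3))
tf = vectorizer.fit_transform(doc_set)  # with a fixed vocabulary, fit learns nothing new
tf.toarray()  # each document now shows a count of 1 for each phrase it contains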


Source: https://stackoverflow.com/questions/49618950/using-countvectorizer-to-compute-word-occurrence-for-my-own-vocabulary-in-python
