nltk function to count occurrences of certain words

牧云@^-^@ 提交于 2020-01-02 03:45:08

问题


In the nltk book there is the question "Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?"

I thought I could use a function like state_union('1945-Truman.txt').count('men') However, there are over 60 texts in this State Union corpa and I feel like there has to be an easier way to see the count of these words for each one instead of repeating this function over and over for each text.


回答1:


You can use the .words() function in the corpus to returns a list of strings (i.e. tokens/words):

>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]

Then use the Counter() object to count the instances, see https://docs.python.org/2/library/collections.html#collections.Counter:

>>> wordcounts = Counter(brown.words())

But do note that the Counter is case-sensitive, see:

>>> from nltk.corpus import brown
>>> from collections import Counter
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> wordcounts = Counter(brown.words())
>>> wordcounts['the']
62713
>>> wordcounts['The']
7258
>>> wordcounts_lower = Counter(i.lower() for i in brown.words())
>>> wordcounts_lower['The']
0
>>> wordcounts_lower['the']
69971


来源:https://stackoverflow.com/questions/22762893/nltk-function-to-count-occurrences-of-certain-words

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!