I'd like to count frequencies of all words in a text file.
>>> countInFile('test.txt')
should return {'aaa':1, 'bbb':
Combining everyone else's views and some of my own :) Here is what I have for you:
from collections import Counter
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
text='''Note that if you use RegexpTokenizer option, you lose
natural language features special to word_tokenize
like splitting apart contractions. You can naively
split on the regex \w+ without any need for the NLTK.
'''
# tokenize
raw = ' '.join(word_tokenize(text.lower()))
tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)
# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]
# count word frequencies and take the 20 most common
counter = Counter(words)
most_common = counter.most_common(20)
most_common
(All ones here, since no non-stopword repeats in the sample text.)
[('note', 1), ('use', 1), ('regexptokenizer', 1), ('option', 1), ('lose', 1), ('natural', 1), ('language', 1), ('features', 1), ('special', 1), ('word', 1), ('tokenize', 1), ('like', 1), ('splitting', 1), ('apart', 1), ('contractions', 1), ('naively', 1), ('split', 1), ('regex', 1), ('without', 1), ('need', 1)]
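As the sample text itself notes, if you don't need word_tokenize's natural-language features, you can naively split on the regex \w+ with just the standard library. A minimal sketch (the function name word_frequencies is mine; it skips stopword filtering, which would still need NLTK's list or your own):

import re
from collections import Counter

def word_frequencies(text):
    # grab runs of word characters, lowercased
    return Counter(re.findall(r'\w+', text.lower()))

word_frequencies('aaa bbb aaa')
# Counter({'aaa': 2, 'bbb': 1})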
You can do better than this in terms of efficiency, but if you are not too worried about that, this code will do the job.
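If you want the countInFile interface from the question, here is one way to wrap the same pipeline up, as a sketch: it returns every word's count as a plain dict rather than just the top 20, and assumes the file fits in memory and that the NLTK punkt and stopwords data are downloaded.

from collections import Counter
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords

def countInFile(filename):
    # read and tokenize the whole file
    with open(filename) as f:
        raw = ' '.join(word_tokenize(f.read().lower()))
    words = RegexpTokenizer(r'[A-Za-z]{2,}').tokenize(raw)
    # drop stopwords, then count everything that remains
    stop_words = set(stopwords.words('english'))
    return dict(Counter(w for w in words if w not in stop_words))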