I'd like to count the frequencies of all words in a text file.
>>> countInFile('test.txt')
should return {'aaa': 1, 'bbb': 2, ...}
Combining everyone else's views and some of my own :) Here is what I have for you:
from collections import Counter
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords
text = '''Note that if you use RegexpTokenizer option, you lose
natural language features special to word_tokenize
like splitting apart contractions. You can naively
split on the regex \w+ without any need for the NLTK.
'''
# tokenize
raw = ' '.join(word_tokenize(text.lower()))
tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)
# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]
# count word frequencies and take the 20 most common
counter = Counter(words)
most_common = counter.most_common(20)
most_common
(All counts are 1, since no word in the sample text repeats.)
[('note', 1),
('use', 1),
('regexptokenizer', 1),
('option', 1),
('lose', 1),
('natural', 1),
('language', 1),
('features', 1),
('special', 1),
('word', 1),
('tokenize', 1),
('like', 1),
('splitting', 1),
('apart', 1),
('contractions', 1),
('naively', 1),
('split', 1),
('regex', 1),
('without', 1),
('need', 1)]
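To match the countInFile('test.txt') signature from the question, here is a minimal sketch that wraps the steps above into a function returning a plain dict; reading the whole file at once and the utf-8 encoding are my assumptions:

from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer, word_tokenize

def countInFile(filename):
    # Read the whole file at once (assumes it fits in memory, utf-8 encoded)
    with open(filename, encoding='utf-8') as f:
        text = f.read()
    # Same pipeline as above: lowercase, tokenize, re-tokenize, drop stopwords
    raw = ' '.join(word_tokenize(text.lower()))
    words = RegexpTokenizer(r'[A-Za-z]{2,}').tokenize(raw)
    stop_words = set(stopwords.words('english'))
    # Plain dict of word -> count, as the question's example shows
    return dict(Counter(w for w in words if w not in stop_words))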
You can do better than this in terms of efficiency, but if you are not too worried about that, this code does the job.
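As the sample text itself notes, you can naively split on the regex \w+ without any need for NLTK. Here is a rough sketch of that cheaper approach; the small STOP_WORDS set is just a placeholder for NLTK's much larger stopword list:

import re
from collections import Counter

# Placeholder; NLTK's stopwords.words('english') is far more complete.
STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'if', 'to', 'you', 'on', 'for'}

def count_in_file(filename, top=20):
    # Lowercase and split on \w+ -- no tokenizer, no contraction handling
    with open(filename, encoding='utf-8') as f:
        words = re.findall(r'\w+', f.read().lower())
    return Counter(w for w in words if w not in STOP_WORDS).most_common(top)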