I've been playing around with different ways (in Python 2.7) to extract a list of (word, frequency) tuples from a corpus, or list of strings, and comp
One source of inefficiency in the OP's code (which several answers fixed without commenting on) is the over-reliance on intermediate lists. There is no reason to create a temporary list of millions of words just to iterate over them, when a generator will do.
So instead of
cnt = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
    cnt[word] += 1
it should be just
cnt = Counter(token.lower().strip(drop) for token in corpus)
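As a minimal, self-contained sketch of this point: the `corpus` and `drop` values below are made-up stand-ins, not the OP's data. Both forms build the same counts, but the second never materializes the intermediate list.

```python
from collections import Counter
import string

# Hypothetical stand-ins for the OP's data: a tiny token list and the
# characters to strip from the ends of each token.
corpus = ["The", "cat,", "the", "hat.", "Cat"]
drop = string.punctuation

# List version: builds a full temporary list before counting.
cnt_list = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
    cnt_list[word] += 1

# Generator version: each token is normalized and counted as it is produced.
cnt_gen = Counter(token.lower().strip(drop) for token in corpus)
```

With a five-token corpus the difference is invisible; it only starts to matter once the temporary list would hold millions of words.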
And if you really want to sort the word counts alphabetically (what on earth for?), replace this
wordfreqs = sorted([(word, cnt[word]) for word in cnt])
with this:
wordfreqs = sorted(cnt.items()) # In Python 2: cnt.iteritems()
This should remove much of the inefficiency around the use of Counter (or any dictionary class used in a similar way).
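To illustrate the sorting step, here is a small sketch using made-up counts. Sorting the (word, count) pairs from `items()` gives the same result as the list-comprehension version, without a separate dictionary lookup for every word.

```python
from collections import Counter

# Made-up counts, purely for illustration.
cnt = Counter({"the": 2, "cat": 2, "hat": 1})

# Original form: iterates the keys, then looks each one up again.
wordfreqs_old = sorted([(word, cnt[word]) for word in cnt])

# Simplified form: sort the (word, count) pairs directly.
wordfreqs = sorted(cnt.items())

print(wordfreqs)  # [('cat', 2), ('hat', 1), ('the', 2)]
```

If what you actually need is frequency order rather than alphabetical order, `cnt.most_common()` returns the same pairs sorted by count, highest first.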