I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written co
The answer from @hellpander above is correct, but not efficient for a very large corpus (I ran into difficulties with ~650K documents). The code slows down considerably every time the frequencies are updated, because dictionary lookups get more expensive as the Counter grows. So you need an additional buffer variable to cache the frequencies Counter from @hellpander's answer. Instead of doing a key lookup in the very large frequencies Counter every time a new document is processed, you add the counts to a small temporary Counter. Then, every so many iterations, the buffer is merged into the global frequencies and cleared. This is much faster because the expensive lookups into the huge dictionary happen far less often.
import os
import nltk
from collections import Counter
from nltk.util import ngrams

# Read every .txt file in the current directory into the corpus.
corpus = []
path = '.'
for i in next(os.walk(path))[2]:
    if i.endswith('.txt'):
        with open(os.path.join(path, i)) as fh:
            corpus.append(fh.read())

frequencies = Counter()  # global counter, updated only occasionally
buffer = Counter()       # small temporary counter, cheap to update
for i in range(len(corpus)):
    token = nltk.word_tokenize(corpus[i])
    bigrams = ngrams(token, 2)
    buffer += Counter(bigrams)
    if i % 10000 == 0 and i > 0:
        # Merge into the global counter and clear the buffer every 10000 docs.
        frequencies += buffer
        buffer = Counter()
frequencies += buffer  # flush whatever remains after the last batch
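
If you need unigrams through fivegrams as the question asks, the same buffering idea extends to all orders at once. Here is a minimal sketch; the dict of one Counter per order and the flush interval of 10000 are my own choices, not from the original code, and it assumes the corpus list built above:

import nltk
from collections import Counter
from nltk.util import ngrams

# One global Counter and one buffer Counter per n-gram order (1 through 5).
orders = range(1, 6)
frequencies = {n: Counter() for n in orders}
buffers = {n: Counter() for n in orders}

for i, doc in enumerate(corpus):  # corpus built as in the snippet above
    tokens = nltk.word_tokenize(doc)
    for n in orders:
        buffers[n].update(ngrams(tokens, n))
    if i % 10000 == 0 and i > 0:
        # Periodically merge the small buffers into the global counters.
        for n in orders:
            frequencies[n] += buffers[n]
            buffers[n] = Counter()

# Final flush after the loop so the last partial batch is not lost.
for n in orders:
    frequencies[n] += buffers[n]

Counter.update with the ngrams generator avoids materialising an intermediate Counter per document, which keeps the per-document cost low; only the periodic merges touch the large global dictionaries.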