I needed to compute the unigrams, bigrams, and trigrams for a text file containing text like:
"Cystic fibrosis affects 30,000 children and young adults in the US a
Assuming the input is a string of space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):
def ngrams(input, n):
    # Split on single spaces, then slide a window of n words over the list.
    input = input.split(' ')
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output
ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]
If you want those joined back into strings, you might call something like:
[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']
Lastly, that doesn't count how often each n-gram occurs, so if your input was 'a a a a', you need to tally them up in a dict:
text = 'a a a a'
grams = {}  # maps each joined n-gram to its count
for g in (' '.join(x) for x in ngrams(text, 2)):
    grams.setdefault(g, 0)
    grams[g] += 1
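As an aside, the same tally can probably be written more compactly with collections.Counter from the standard library. This is only a sketch of that alternative, not part of the approach above; the helper name ngram_strings is made up for illustration.

```python
from collections import Counter

def ngram_strings(text, n):
    # Hypothetical helper: returns the n-grams of a space-separated
    # string as joined strings, e.g. 'a b', 'b c', ...
    words = text.split(' ')
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Counter tallies the items of any iterable, replacing the setdefault loop.
grams = Counter(ngram_strings('a a a a', 2))
```

A Counter is a dict subclass, so grams can be indexed and iterated like the plain dict built by the loop.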
Putting that all together into one final function gives:
def ngrams(input, n):
    # Count each n-gram: keys are space-joined n-grams, values are occurrences.
    input = input.split(' ')
    output = {}
    for i in range(len(input)-n+1):
        g = ' '.join(input[i:i+n])
        output.setdefault(g, 0)
        output[g] += 1
    return output
ngrams('a a a a', 2) # {'a a': 3}
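To get back to the original question (unigrams, bigrams, and trigrams at once), something like the following might work. It is a rough sketch under a few assumptions: the name count_ngrams and the sample sentence are invented for illustration, and it uses split() with no argument rather than split(' '), because text read from a file usually contains newlines and repeated spaces, and split(' ') would turn those into empty "words".

```python
from collections import Counter

def count_ngrams(text, n):
    # split() with no argument splits on any run of whitespace,
    # so newlines and double spaces do not create empty tokens.
    words = text.split()
    # Zipping n shifted copies of the word list yields each window of n words.
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(' '.join(w) for w in windows)

sample = "to be  or not\nto be"  # made-up sample with messy whitespace
unigrams = count_ngrams(sample, 1)
bigrams = count_ngrams(sample, 2)
trigrams = count_ngrams(sample, 3)
```

For a real file you would pass the file's contents (for example the result of reading it into a string) instead of sample. If n is larger than the number of words, the result is simply an empty Counter.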