Computing N-grams using Python

情歌与酒 2020-11-28 06:02

I need to compute the unigrams, bigrams, and trigrams for a text file containing text like:

"Cystic fibrosis affects 30,000 children and young adults in the US a

8 answers
  •  广开言路
    2020-11-28 06:29

    Assuming the input is a string containing space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):

    def ngrams(text, n):
        # Slide a window of n words across the text, one word at a time
        words = text.split(' ')
        output = []
        for i in range(len(words) - n + 1):
            output.append(words[i:i+n])
        return output
    
    ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]
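
As a side note, the same sliding window can be written with zip. This is only a sketch of an alternative, not a replacement for the function above: the name ngrams_zip is made up for illustration, it returns tuples rather than lists, and it uses split() with no argument.

```python
def ngrams_zip(text, n):
    """Return the n-grams of a whitespace-separated string as tuples."""
    words = text.split()
    # Build n views of the word list, each starting one word later,
    # and zip them together; zip stops when the shortest view runs out.
    return list(zip(*(words[i:] for i in range(n))))

print(ngrams_zip('a b c d', 2))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

Tuples are hashable, so the results can be used directly as dict keys if you want to count them later.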
    

    If you want those joined back into strings, you might call something like:

    [' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']
    

    Lastly, that returns every occurrence separately rather than totals, so for an input like 'a a a a' you need to count the n-grams up in a dict:

    text = 'a a a a'
    grams = {}  # maps each joined n-gram to its count
    for g in (' '.join(x) for x in ngrams(text, 2)):
        grams.setdefault(g, 0)
        grams[g] += 1
    # grams is now {'a a': 3}
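
If you would rather not manage the dict by hand, the standard library's collections.Counter can do the tallying. A minimal sketch, assuming the same space-separated input; the function name count_ngrams is hypothetical:

```python
from collections import Counter

def count_ngrams(text, n):
    """Tally the n-grams of a space-separated string."""
    words = text.split(' ')
    # Counter consumes the generator and counts each joined n-gram
    return Counter(' '.join(words[i:i + n]) for i in range(len(words) - n + 1))

counts = count_ngrams('a a a a', 2)
print(counts)                 # Counter({'a a': 3})
print(counts.most_common(1))  # [('a a', 3)]
```

Counter is a dict subclass, so it can be used anywhere the plain dict version would be, and most_common gives you the frequent n-grams in sorted order.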
    

    Putting that all together into one final function gives:

    def ngrams(text, n):
        # Count each n-gram; keys are the n-grams joined back into strings
        words = text.split(' ')
        output = {}
        for i in range(len(words) - n + 1):
            g = ' '.join(words[i:i+n])
            output.setdefault(g, 0)
            output[g] += 1
        return output
    
    ngrams('a a a a', 2) # {'a a': 3}
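
One caveat for text read from a file: split(' ') splits on single spaces only, so a double space produces an empty "word" and newlines stay glued to the words around them. Below is a sketch of a variant that splits on any run of whitespace and is then run for unigrams, bigrams and trigrams. The name ngram_counts and the sample string are made up for illustration.

```python
def ngram_counts(text, n):
    """Count n-grams in text, splitting on any run of whitespace."""
    # split() with no argument treats spaces, tabs and newlines alike
    # and never yields empty strings
    words = text.split()
    counts = {}
    for i in range(len(words) - n + 1):
        gram = ' '.join(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return counts

# Hypothetical sample containing a double space and a line break
sample = 'the cat  sat\non the cat'
for n in (1, 2, 3):
    print(n, ngram_counts(sample, n))
```

For a real file you would pass in the file's contents, for example the string returned by reading it in full, in place of sample.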
    
