Fast n-gram calculation

Backend · Open · 3 answers · 1729 views
谎友^ 2020-12-02 12:47

I'm using NLTK to search for n-grams in a corpus, but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages.

3 Answers
  • 2020-12-02 13:16

    You might find a Pythonic, elegant and fast n-gram generation function using `zip` and the splat (`*`) operator:

    def find_ngrams(input_list, n):
      return zip(*[input_list[i:] for i in range(n)])
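    A quick sanity check of the function above, as a self-contained copy run on a made-up token list:

```python
def find_ngrams(input_list, n):
    # zip over n progressively-shifted copies of the list;
    # zip stops at the shortest copy, which trims the tail automatically
    return zip(*[input_list[i:] for i in range(n)])

tokens = ["the", "quick", "brown", "fox"]
bigrams = list(find_ngrams(tokens, 2))
print(bigrams)  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

    Note that in Python 3 `zip` returns a lazy iterator, so wrap it in `list()` if you need to iterate more than once.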
    
  • 2020-12-02 13:17

    Since you didn't indicate whether you want word or character-level n-grams, I'm just going to assume the former, without loss of generality.

    I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.

    def ngrams(tokens, MIN_N, MAX_N):
        n_tokens = len(tokens)
        for i in range(n_tokens):  # range, not Python 2's xrange
            for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
                yield tokens[i:j]
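    As a quick check, here is a self-contained copy of the generator (using Python 3's `range`) on a toy input:

```python
def ngrams(tokens, MIN_N, MAX_N):
    # yield every slice of length MIN_N..MAX_N starting at each position
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]

print(list(ngrams(["a", "b", "c"], 1, 2)))
# [['a'], ['a', 'b'], ['b'], ['b', 'c'], ['c']]
```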
    

    Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
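    In plain Python, that replacement might look like the following sketch (the function name `count_ngrams` and lowercase parameter names are mine; counts accumulate in a `defaultdict` instead of being yielded):

```python
from collections import defaultdict

def count_ngrams(tokens, min_n, max_n):
    # same double loop as the generator above, but accumulating counts
    counts = defaultdict(int)
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + min_n, min(n_tokens, i + max_n) + 1):
            counts[" ".join(tokens[i:j])] += 1
    return counts

counts = count_ngrams(["to", "be", "or", "not", "to", "be"], 1, 2)
print(counts["to be"])  # 2
```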

    Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:

    # Cython version -- save as e.g. ngrams.pyx and compile
    from collections import defaultdict

    def ngrams(tokens, int MIN_N, int MAX_N):
        cdef Py_ssize_t i, j, n_tokens

        count = defaultdict(int)

        join_spaces = " ".join

        n_tokens = len(tokens)
        for i in range(n_tokens):  # range over typed ints compiles to a C loop
            for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
                count[join_spaces(tokens[i:j])] += 1

        return count
    
  • 2020-12-02 13:19

    For character-level n-grams you could use the following function:

    def ngrams(text, n):
        # shift n so that text[i-n:i+1] is the n-character window ending at i
        n -= 1
        # the first n windows are incomplete (empty via negative slicing), so drop them
        return [text[i-n:i+1] for i, char in enumerate(text)][n:]
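    To see the negative-slice trick in action, here is a self-contained copy with a small example:

```python
def ngrams(text, n):
    # each window text[i-n:i+1] ends at position i; the first n-1 windows
    # come out empty because of negative slicing, and [n:] discards them
    n -= 1
    return [text[i-n:i+1] for i, char in enumerate(text)][n:]

print(ngrams("hello", 3))  # ['hel', 'ell', 'llo']
```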
    