Fast/Optimize N-gram implementations in Python

Asked by 情深已故 on 2020-11-29 08:48

Which ngram implementation is fastest in python?

I've tried to profile nltk's vs scott's zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation- …).

3 Answers
  •  北海茫月
    2020-11-29 09:18

    Extending M4rtini's code, I made three additional versions with a hardcoded n=2 parameter:

    from itertools import islice, izip  # Python 2: izip is the lazy counterpart of zip

    def bigram1(text):
        # Manual generator: walk the word iterator once, yielding adjacent pairs.
        words = iter(text.split())
        last = words.next()
        for piece in words:
            yield (last, piece)
            last = piece

    def bigram2(text):
        # Eager pairing: in Python 2, zip builds the full list of pairs up front.
        words = text.split()
        return zip(words, islice(words, 1, None))

    def bigram3(text):
        # Lazy pairing: izip yields the pairs without materialising a list.
        words = text.split()
        return izip(words, islice(words, 1, None))
    

    Using timeit, I get these results:

    zipngram(s, 2):        3.854871988296509
    list(zipngram2(s, 2)): 2.0733611583709717
    zipngram3(s, 2):       2.6574149131774902
    list(zipngram4(s, 2)): 4.668303966522217
    list(bigram1(s)):      2.2748169898986816
    bigram2(s):            1.979405164718628
    list(bigram3(s)):      1.891601800918579
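
    For reference, here is a minimal sketch of the kind of timeit harness that could produce comparisons like these (the corpus s and the repetition count are assumptions, not taken from the thread):

    import timeit

    # Hypothetical corpus; the actual text s benchmarked above is not shown here.
    s = "the quick brown fox jumps over the lazy dog " * 10000

    setup = "from __main__ import bigram1, bigram2, bigram3, s"
    for stmt in ("list(bigram1(s))", "bigram2(s)", "list(bigram3(s))"):
        # number=10 keeps the run short; the original repetition count is unknown.
        print("%-22s %f" % (stmt, timeit.timeit(stmt, setup=setup, number=10)))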
    

    bigram3 is the fastest in my tests. There does seem to be a slight benefit from hardcoding n and from using iterators throughout, at least for this parameter value; the larger gap between zipngram2 and zipngram3 at n=2 also points to the benefit of using iterators end to end.
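
    As a rough sketch of how the same islice/izip idea generalises beyond the hardcoded n=2 case (this helper is my own illustration, not code from M4rtini's answer):

    from itertools import islice, izip

    def ngrams_lazy(text, n):
        # Generic lazy n-gram generator: one islice view per offset, zipped together.
        words = text.split()
        return izip(*(islice(words, i, None) for i in range(n)))

    # list(ngrams_lazy("a b c d", 2)) -> [('a', 'b'), ('b', 'c'), ('c', 'd')]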

    I also tried getting a boost from PyPy, but it actually seemed to make things slower here (this included attempts to warm up the JIT by calling the functions 10k times before the timing run). Still, I'm very new to PyPy, so I may be doing something wrong. Possibly using Pyrex or Cython would enable greater speedups.
