Which n-gram implementation is fastest in Python?
I've tried to profile NLTK's implementation vs Scott's zip-based approach (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-
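For context, the zip-based generator from that post is usually written along these lines (reproduced from memory, so treat it as an approximation of the linked code):

```python
def zipngram(text, n=2):
    # slice the word list at each of the n offsets and zip the slices:
    # words[0:], words[1:], ..., words[n-1:] line up to give the n-grams
    words = text.split()
    return zip(*[words[i:] for i in range(n)])
```

On Python 2 this returns a list; on Python 3, `zip` is lazy, so wrap it in `list()` to materialize the n-grams.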
Extending M4rtini's code, I made three additional versions with a hardcoded n=2 parameter:

```python
from itertools import islice, izip  # Python 2; on Python 3, izip is plain zip

def bigram1(text):
    words = iter(text.split())
    last = words.next()  # next(words) on Python 3
    for piece in words:
        yield (last, piece)
        last = piece

def bigram2(text):
    words = text.split()
    return zip(words, islice(words, 1, None))

def bigram3(text):
    words = text.split()
    return izip(words, islice(words, 1, None))
```
Using `timeit`, I get these results:

```
zipngram(s, 2): 3.854871988296509
list(zipngram2(s, 2)): 2.0733611583709717
zipngram3(s, 2): 2.6574149131774902
list(zipngram4(s, 2)): 4.668303966522217
list(bigram1(s)): 2.2748169898986816
bigram2(s): 1.979405164718628
list(bigram3(s)): 1.891601800918579
```
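For reference, numbers like these come from a harness along the following lines; the sample text and the repeat count are assumptions on my part, since the post does not say what `s` was:

```python
import timeit
from itertools import islice

def bigram2(text):
    words = text.split()
    return zip(words, islice(words, 1, None))

# made-up sample string standing in for the post's s
s = "the quick brown fox jumps over the lazy dog " * 1000

# time 100 full passes; on Python 3, zip is lazy, so force it with list()
elapsed = timeit.timeit(lambda: list(bigram2(s)), number=100)
print("list(bigram2(s)):", elapsed)
```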
bigram3 is the fastest in my tests. There does seem to be a slight benefit from hardcoding n, and from using iterators throughout (at least for this parameter value). The benefit of using iterators throughout shows up in the larger gap between zipngram2 and zipngram3 for n=2.
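The hardcoded trick also generalizes: here is my own sketch (not from the thread) that keeps the `islice` pattern but accepts an arbitrary n:

```python
from itertools import islice

def zipngram_islice(text, n):
    # one islice view of the word list per offset, zipped together;
    # on Python 2 you would use izip here to keep it lazy
    words = text.split()
    return zip(*(islice(words, i, None) for i in range(n)))

# for n=2 this is exactly bigram2/bigram3 from above
assert list(zipngram_islice("a b c d", 3)) == [("a", "b", "c"), ("b", "c", "d")]
```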
I also tried getting a boost from PyPy, but it actually seemed to make things slower here (this included attempts to warm up the JIT by calling the functions 10k times before doing the timing test). Still, I'm very new to PyPy, so I may be doing something wrong. Using Pyrex or Cython might enable greater speedups.