Which n-gram implementation is fastest in Python?
I've tried to profile NLTK's implementation vs Scott's zip-based approach (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-
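For context, the zip-based generator from that post is usually written along these lines (reproduced from memory, so treat it as an approximation of the linked code):

```python
def zipngram(text, n=2):
    # slice the word list at each of the n offsets and zip the slices:
    # words[0:], words[1:], ..., words[n-1:] line up to give the n-grams
    words = text.split()
    return zip(*[words[i:] for i in range(n)])
```

On Python 2 this returns a list; on Python 3, `zip` is lazy, so wrap it in `list()` to materialize the n-grams.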
Extending M4rtini's code, I made three additional versions with a hardcoded n=2 parameter:

```python
from itertools import islice, izip  # Python 2; on Python 3, izip is plain zip

def bigram1(text):
    words = iter(text.split())
    last = words.next()  # next(words) on Python 3
    for piece in words:
        yield (last, piece)
        last = piece

def bigram2(text):
    words = text.split()
    return zip(words, islice(words, 1, None))

def bigram3(text):
    words = text.split()
    return izip(words, islice(words, 1, None))
```
Using `timeit`, I get these results:

```
zipngram(s, 2): 3.854871988296509
list(zipngram2(s, 2)): 2.0733611583709717
zipngram3(s, 2): 2.6574149131774902
list(zipngram4(s, 2)): 4.668303966522217
list(bigram1(s)): 2.2748169898986816
bigram2(s): 1.979405164718628
list(bigram3(s)): 1.891601800918579
```
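For reference, numbers like these come from a harness along the following lines; the sample text and the repeat count are assumptions on my part, since the post does not say what `s` was:

```python
import timeit
from itertools import islice

def bigram2(text):
    words = text.split()
    return zip(words, islice(words, 1, None))

# made-up sample string standing in for the post's s
s = "the quick brown fox jumps over the lazy dog " * 1000

# time 100 full passes; on Python 3, zip is lazy, so force it with list()
elapsed = timeit.timeit(lambda: list(bigram2(s)), number=100)
print("list(bigram2(s)):", elapsed)
```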
bigram3 is the fastest in my tests. There does seem to be a slight benefit from hardcoding n, and from using iterators throughout (at least for this parameter value). The benefit of using iterators throughout shows up in the larger gap between zipngram2 and zipngram3 for n=2.
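The hardcoded trick also generalizes: here is my own sketch (not from the thread) that keeps the `islice` pattern but accepts an arbitrary n:

```python
from itertools import islice

def zipngram_islice(text, n):
    # one islice view of the word list per offset, zipped together;
    # on Python 2 you would use izip here to keep it lazy
    words = text.split()
    return zip(*(islice(words, i, None) for i in range(n)))

# for n=2 this is exactly bigram2/bigram3 from above
assert list(zipngram_islice("a b c d", 3)) == [("a", "b", "c"), ("b", "c", "d")]
```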
I also tried getting a boost from PyPy, but it actually seemed to make things slower here (this included attempts to warm up the JIT by calling the functions 10k times before doing the timing test). Still, I'm very new to PyPy, so I may be doing something wrong. Using Pyrex or Cython might enable greater speedups.