Fast/Optimize N-gram implementations in Python

情深已故 2020-11-29 08:48

Which n-gram implementation is fastest in Python?

I've tried to profile nltk's vs scott's zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-

3 Answers
  • 2020-11-29 09:18

    Extending M4rtini's code, I made three additional versions with a hardcoded n=2 parameter:

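    from itertools import islice, izip  # Python 2; izip was removed in Python 3

    def bigram1(text):
        # Generator: carry the previous word along while iterating.
        words = iter(text.split())
        last = next(words)
        for piece in words:
            yield (last, piece)
            last = piece

    def bigram2(text):
        # Eager: on Python 2, zip() builds the full list of pairs.
        words = text.split()
        return zip(words, islice(words, 1, None))

    def bigram3(text):
        # Lazy: izip() yields the pairs one at a time.
        words = text.split()
        return izip(words, islice(words, 1, None))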

    Using timeit, I get these results (in seconds):

    zipngram(s, 2):        3.854871988296509
    list(zipngram2(s, 2)): 2.0733611583709717
    zipngram3(s, 2):       2.6574149131774902
    list(zipngram4(s, 2)): 4.668303966522217
    list(bigram1(s)):      2.2748169898986816
    bigram2(s):            1.979405164718628
    list(bigram3(s)):      1.891601800918579
    

    bigram3 is the fastest in my tests. There seems to be a slight benefit from hardcoding n and from using iterators, provided they are used throughout (at least for this parameter value). The benefit of using iterators throughout shows in the larger gap between zipngram2 and zipngram3 for n=2.

    I also tried getting a boost from PyPy, but it actually seemed to make things slower here (this included attempts to warm up the JIT by calling the functions 10k times before running the timing test). Still, I'm very new to PyPy, so I may be doing something wrong. Pyrex or Cython might enable greater speedups.
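For readers on Python 3, the three variants above need small changes: `izip` no longer exists, the built-in `zip` is already lazy, and a bare `next()` inside a generator turns an empty input into a `RuntimeError` (PEP 479). The following is a minimal sketch of such a port, not the answerer's code; the `sample` and `expected` values are made up for illustration, and the timings quoted above do not carry over to it.

```python
from itertools import islice


def bigram1(text):
    """Generator that carries the previous word along while iterating."""
    words = iter(text.split())
    try:
        last = next(words)
    except StopIteration:
        return  # empty text: there are no bigrams to yield
    for piece in words:
        yield (last, piece)
        last = piece


def bigram2(text):
    """Eager variant: materialise every bigram in a list."""
    words = text.split()
    return list(zip(words, islice(words, 1, None)))


def bigram3(text):
    """Lazy variant: on Python 3, zip behaves like Python 2's izip."""
    words = text.split()
    return zip(words, islice(words, 1, None))


# Illustrative input; all three variants should agree on it.
sample = "the quick brown fox"
expected = [("the", "quick"), ("quick", "brown"), ("brown", "fox")]
assert list(bigram1(sample)) == expected
assert bigram2(sample) == expected
assert list(bigram3(sample)) == expected
```

Note that on Python 3 the only difference between `bigram2` and `bigram3` is the explicit `list()` call, so any speed gap between them would come from building the list.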

  • 2020-11-29 09:25

    Here are some attempts, with profiling. I expected generators to improve the speed here, but the improvement was not noticeable compared to a slight modification of the original. If you don't need the full list at once, though, the generator functions should be faster.

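    import timeit
    from itertools import tee, izip, islice  # Python 2; izip was removed in Python 3

    def isplit(source, sep):
        # Lazy str.split(sep): yield one piece at a time instead of building a list.
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

    def pairwise(iterable, n=2):
        # tee() makes n copies of the iterator; copy number pos is advanced by
        # pos items so that izip() yields a sliding window of width n.
        return izip(*(islice(it, pos, None) for pos, it in enumerate(tee(iterable, n))))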
    
    def zipngram(text, n=2):
        return zip(*[text.split()[i:] for i in range(n)])
    
    def zipngram2(text, n=2):
        words = text.split()
        return pairwise(words, n)
    
    
    def zipngram3(text, n=2):
        words = text.split()
        return zip(*[words[i:] for i in range(n)])
    
    def zipngram4(text, n=2):
        words = isplit(text, ' ')
        return pairwise(words, n)
    
    
    s = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
    s = s * 10 ** 3
    
    res = []
    for n in range(15):
    
        a = timeit.timeit('zipngram(s, n)', 'from __main__ import zipngram, s, n', number=100)
        b = timeit.timeit('list(zipngram2(s, n))', 'from __main__ import zipngram2, s, n', number=100)
        c = timeit.timeit('zipngram3(s, n)', 'from __main__ import zipngram3, s, n', number=100)
        d = timeit.timeit('list(zipngram4(s, n))', 'from __main__ import zipngram4, s, n', number=100)
    
        res.append((a, b, c, d))
    
    a, b, c, d = zip(*res)
    
    import matplotlib.pyplot as plt
    
    plt.plot(a, label="zipngram")
    plt.plot(b, label="zipngram2")
    plt.plot(c, label="zipngram3")
    plt.plot(d, label="zipngram4")
    plt.legend(loc=0)
    plt.show()
    

    For this test data, zipngram2 and zipngram3 seem to be the fastest by a good margin.

    [Figure: timeit results for zipngram, zipngram2, zipngram3 and zipngram4, plotted against n]
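The remark that generators help when the full list is not needed can be illustrated with a small Python 3 sketch. It reuses the `tee`/`islice` idea from `pairwise` above; the names `lazy_ngrams` and `first_ngrams` and the sample sentence are hypothetical, introduced only for this example. Note that `text.split()` still builds the whole word list, so only the n-gram construction is lazy here.

```python
from itertools import islice, tee


def lazy_ngrams(words, n=2):
    """Return an iterator of n-grams over any iterable of words."""
    # tee() makes n independent iterators; the one at position pos is
    # advanced by pos items so that zip() lines them up as a sliding window.
    iterators = tee(words, n)
    shifted = (islice(it, pos, None) for pos, it in enumerate(iterators))
    return zip(*shifted)


def first_ngrams(text, n=2, limit=3):
    """Take only the first `limit` n-grams; later ones are never built."""
    return list(islice(lazy_ngrams(text.split(), n), limit))


text = "to be or not to be that is the question"
print(first_ngrams(text, n=3, limit=2))
# prints [('to', 'be', 'or'), ('be', 'or', 'not')]
```

Wrapping the result in `list()` without a limit gives the same output as the eager versions, at which point the laziness no longer saves work.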

  • 2020-11-29 09:26

    Extending M4rtini's Code

    Using Python 3.6.5 and nltk == 3.3:

    from nltk import ngrams

    def get_n_gramlist(text, n=2):
        # Collect every n-gram of the whitespace-split text into a list.
        return list(ngrams(text.split(), n=n))
    

    Timeit results
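One way to collect such timings on Python 3 is sketched below. It is an assumption-laden illustration, not the answerer's benchmark: `sliding_ngrams` is a hypothetical pure-Python stand-in so the snippet runs without nltk, and `time_ngram_function`, the sample text, and the `repeat`/`number` values are likewise made up. No measured numbers are implied.

```python
import timeit


def sliding_ngrams(text, n=2):
    """Stand-in for the function under test (hypothetical, not part of nltk)."""
    words = text.split()
    return list(zip(*[words[i:] for i in range(n)]))


def time_ngram_function(func, text, n=2, repeat=3, number=10):
    """Return the best total time in seconds for `number` calls of func."""
    # timeit.Timer accepts a zero-argument callable as the statement.
    timer = timeit.Timer(lambda: func(text, n))
    return min(timer.repeat(repeat=repeat, number=number))


sample = "lorem ipsum dolor sit amet " * 1000
best = time_ngram_function(sliding_ngrams, sample, n=2)
print("best of 3 runs: {:.4f} s for 10 calls".format(best))
```

With nltk installed, passing `get_n_gramlist` in place of `sliding_ngrams` should time the function above in the same way.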
