Fastest way to check if a string contains specific characters in any of the items in a list

隐瞒了意图╮ 2021-01-02 04:33

What is the fastest way to check if a string contains some characters from any items of a list?

Currently, I'm using this method:

lestring = "Text1

4 answers
  •  轻奢々 (original poster)
    2021-01-02 05:02

    The esmre library does the trick. In your case, the simpler esm module (part of esmre) is what you want.

    https://pypi.python.org/pypi/esmre/

    https://code.google.com/p/esmre/

    They have good documentation and examples. This one is taken from their examples:

    >>> import esm
    >>> index = esm.Index()
    >>> index.enter("he")
    >>> index.enter("she")
    >>> index.enter("his")
    >>> index.enter("hers")
    >>> index.fix()
    >>> index.query("this here is history")
    [((1, 4), 'his'), ((5, 7), 'he'), ((13, 16), 'his')]
    >>> index.query("Those are his sheep!")
    [((10, 13), 'his'), ((14, 17), 'she'), ((15, 17), 'he')]
    >>> 
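
If installing esmre is not an option, the yes/no version of the question can be roughed out with the standard library alone. This is only a sketch of two stand-ins, not part of esm: `contains_any` and `compile_any` are hypothetical helper names, and the regex variant assumes the item list is non-empty.

```python
import re

def contains_any(items, text):
    """Return True as soon as one item occurs in text (short-circuits)."""
    return any(item in text for item in items)

def compile_any(items):
    """Build one regex that matches any of the literal items.

    Assumes items is non-empty: an empty alternation would match everything.
    """
    # Longest first, so longer items are tried before their own prefixes.
    ordered = sorted(set(items), key=len, reverse=True)
    return re.compile("|".join(re.escape(item) for item in ordered))

words = ["he", "she", "his", "hers"]
pattern = compile_any(words)  # compile once, reuse for many strings

print(contains_any(words, "Those are his sheep!"))          # True
print(pattern.search("Those are his sheep!") is not None)   # True
print(contains_any(words, "xyz 123"))                       # False
```

Neither variant reports every match position the way `index.query` does; they only answer whether at least one item is present.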
    

    I ran some performance tests:

    import random, timeit, string, esm
    
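    def uz(lelist, lestring):
        # Stop at the first item that occurs in lestring; returns None if none do.
        for x in lelist:
            if lestring.count(x):
                return 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)
    
    
    def ab(lelist, lestring):
        # Collect every item that occurs in lestring as a substring.
        return [e for e in lelist if e in lestring]
    
    
    def use_esm(index, lestring):
        # Query a prebuilt esm index; returns ((start, end), keyword) tuples.
        return index.query(lestring)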
    
    for TEXT_LEN in [5, 50, 1000]:
        for SEARCH_LEN in [5, 20]:
            for N in [5, 50, 1000, 10000]:
                if TEXT_LEN < SEARCH_LEN:
                    continue
    
                print('TEXT_LEN: %d SEARCH_LEN: %d N: %d' % (TEXT_LEN, SEARCH_LEN, N))
    
                lestring = ''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(TEXT_LEN)))
                lelist = [''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(SEARCH_LEN))) for _
                          in range(N)]
    
                index = esm.Index()
                for i in lelist:
                    index.enter(i)
                index.fix()
    
                t_ab = timeit.Timer("ab(lelist, lestring)", setup="from __main__ import lelist, lestring, ab")
                t_uz = timeit.Timer("uz(lelist, lestring)", setup="from __main__ import lelist, lestring, uz")
                t_esm = timeit.Timer("use_esm(index, lestring)", setup="from __main__ import index, lestring, use_esm")
    
                ab_time = t_ab.timeit(1000)
                uz_time = t_uz.timeit(1000)
                esm_time = t_esm.timeit(1000)
    
                min_time = min(ab_time, uz_time, esm_time)
                print('  ab%s: %f' % ('*' if ab_time == min_time else '', ab_time))
                print('  uz%s: %f' % ('*' if uz_time == min_time else '', uz_time))
                print('  esm%s: %f' % ('*' if esm_time == min_time else '', esm_time))
    

    The results depend mostly on the number of items being searched for (N in my case). With a handful of items the plain `in` loop (ab) is fastest; by N = 1000, esm wins clearly. The fastest method in each run is marked with *:

    TEXT_LEN: 1000 SEARCH_LEN: 20 N: 5
      ab*: 0.001733
      uz: 0.002512
      esm: 0.126853
    
    TEXT_LEN: 1000 SEARCH_LEN: 20 N: 50
      ab*: 0.017564
      uz: 0.023701
      esm: 0.079925
    
    TEXT_LEN: 1000 SEARCH_LEN: 20 N: 1000
      ab: 0.370371
      uz: 0.489523
      esm*: 0.133783
    
    TEXT_LEN: 1000 SEARCH_LEN: 20 N: 10000
      ab: 3.678790
      uz: 4.883575
      esm*: 0.259605
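
One plausible reading of these numbers: ab and uz scan lestring once per item, so their cost grows roughly linearly with N, while an index built up front lets the text be scanned a single time regardless of N. As far as I know, esm belongs to the Aho-Corasick family of matchers. The pure-Python sketch below illustrates that technique only; it is not esm's actual code, and `build_automaton` and `find_all` are hypothetical names. It reports matches in the same `((start, end), keyword)` shape as the esm examples above.

```python
from collections import deque

def build_automaton(keywords):
    """Build trie transitions, failure links and output lists."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> keywords that end at this node
    for word in keywords:
        node = 0
        for ch in word:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(word)
    # Breadth-first pass: depth-1 nodes keep fail = 0, deeper nodes are derived.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit keywords that end at the suffix node as well.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_all(automaton, text):
    """Scan text once and return ((start, end), keyword) for every match."""
    goto, fail, out = automaton
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for word in out[node]:
            hits.append(((i + 1 - len(word), i + 1), word))
    return hits

automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all(automaton, "Those are his sheep!"))
# [((10, 13), 'his'), ((14, 17), 'she'), ((15, 17), 'he')]
```

A pure-Python loop like this will not match the speed of a C extension, so treat it as an explanation of why the esm timings stay nearly flat as N grows rather than as a replacement.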
    
