Fastest way to check if a string contains specific characters in any of the items in a list

隐瞒了意图╮ 2021-01-02 04:33

What is the fastest way to check whether a string contains any of the items in a list?

Currently, I'm using this method:

lestring = "Text1


      
      
        
4 answers

轻奢々 (OP) 2021-01-02 05:02

The esmre library does the trick. In your case, the simpler esm module (part of esmre) is what you want.

https://pypi.python.org/pypi/esmre/

https://code.google.com/p/esmre/

The documentation is good and includes examples. This one is taken from there:

>>> import esm
>>> index = esm.Index()
>>> index.enter("he")
>>> index.enter("she")
>>> index.enter("his")
>>> index.enter("hers")
>>> index.fix()
>>> index.query("this here is history")
[((1, 4), 'his'), ((5, 7), 'he'), ((13, 16), 'his')]
>>> index.query("Those are his sheep!")
[((10, 13), 'his'), ((14, 17), 'she'), ((15, 17), 'he')]
>>> 
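
To show what an index like this does under the hood, here is a minimal pure-Python sketch of an Aho-Corasick matcher. It assumes esm works along these lines; I have not checked its source. The class name KeywordIndex and its methods are hypothetical and are not part of esm's API. The keywords are stored in a trie with failure links, so the text is scanned once no matter how many keywords there are.

```python
from collections import deque


class KeywordIndex:
    """Toy Aho-Corasick matcher; a hypothetical stand-in, not esm's API."""

    def __init__(self, keywords):
        # Per trie node: outgoing edges, failure link, keywords ending here.
        self.edges = [{}]
        self.fail = [0]
        self.out = [[]]
        for word in keywords:
            self._add(word)
        self._link()

    def _add(self, word):
        node = 0
        for ch in word:
            if ch not in self.edges[node]:
                self.edges.append({})
                self.fail.append(0)
                self.out.append([])
                self.edges[node][ch] = len(self.edges) - 1
            node = self.edges[node][ch]
        self.out[node].append(word)

    def _link(self):
        # Breadth-first pass: a node's failure link points at the longest
        # proper suffix of its path that is also a path in the trie.
        queue = deque(self.edges[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.edges[node].items():
                f = self.fail[node]
                while f and ch not in self.edges[f]:
                    f = self.fail[f]
                self.fail[child] = self.edges[f].get(ch, 0)
                # Inherit keywords that end at the suffix node as well.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def query(self, text):
        """Return ((start, end), keyword) for every match in one pass."""
        hits = []
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.edges[node]:
                node = self.fail[node]
            node = self.edges[node].get(ch, 0)
            for word in self.out[node]:
                hits.append(((i + 1 - len(word), i + 1), word))
        return hits


index = KeywordIndex(["he", "she", "his", "hers"])
print(index.query("Those are his sheep!"))
# [((10, 13), 'his'), ((14, 17), 'she'), ((15, 17), 'he')]
```

A pure-Python loop like this will be far slower than a C extension, so treat it as an illustration of the idea rather than a replacement.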


I ran some performance tests:

import random, timeit, string, esm

def uz(lelist, lestring):
    # Stop at the first item that occurs in the string.
    for x in lelist:
        if lestring.count(x):
            return 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)


def ab(lelist, lestring):
    # Collect every item that occurs in the string.
    return [e for e in lelist if e in lestring]


def use_esm(index, lestring):
    # Query a prebuilt esm index; building the index is not timed.
    return index.query(lestring)

for TEXT_LEN in [5, 50, 1000]:
    for SEARCH_LEN in [5, 20]:
        for N in [5, 50, 1000, 10000]:
            # A search item longer than the text can never match.
            if TEXT_LEN < SEARCH_LEN:
                continue

            # Single-argument print() behaves the same on Python 2 and 3.
            print('TEXT_LEN: %d SEARCH_LEN: %d N: %d' % (TEXT_LEN, SEARCH_LEN, N))

            lestring = ''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(TEXT_LEN)))
            lelist = [''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(SEARCH_LEN))) for _
                      in range(N)]

            index = esm.Index()
            for i in lelist:
                index.enter(i)
            index.fix()

            t_ab = timeit.Timer("ab(lelist, lestring)", setup="from __main__ import lelist, lestring, ab")
            t_uz = timeit.Timer("uz(lelist, lestring)", setup="from __main__ import lelist, lestring, uz")
            t_esm = timeit.Timer("use_esm(index, lestring)", setup="from __main__ import index, lestring, use_esm")

            # Total time for 1000 calls of each approach.
            ab_time = t_ab.timeit(1000)
            uz_time = t_uz.timeit(1000)
            esm_time = t_esm.timeit(1000)

            # Mark the fastest approach with '*'.
            min_time = min(ab_time, uz_time, esm_time)
            print('  ab%s: %f' % ('*' if ab_time == min_time else '', ab_time))
            print('  uz%s: %f' % ('*' if uz_time == min_time else '', uz_time))
            print('  esm%s: %f' % ('*' if esm_time == min_time else '', esm_time))


I found that the results depend mostly on the number of items being searched for (in my case, 'N'):

TEXT_LEN: 1000 SEARCH_LEN: 20 N: 5
  ab*: 0.001733
  uz: 0.002512
  esm: 0.126853

TEXT_LEN: 1000 SEARCH_LEN: 20 N: 50
  ab*: 0.017564
  uz: 0.023701
  esm: 0.079925

TEXT_LEN: 1000 SEARCH_LEN: 20 N: 1000
  ab: 0.370371
  uz: 0.489523
  esm*: 0.133783

TEXT_LEN: 1000 SEARCH_LEN: 20 N: 10000
  ab: 3.678790
  uz: 4.883575
  esm*: 0.259605
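
If you only need a yes/no answer and would rather stay in the standard library, two other options are worth timing on your own data. This is a sketch, not something I benchmarked against the numbers above. The helper names contains_any and compile_any are mine. The first stops at the first hit; the second folds all items into one precompiled regular expression so the text is scanned by the regex engine in a single call.

```python
import re


def contains_any(items, text):
    """True as soon as one item is found as a substring of text."""
    return any(item in text for item in items)


def compile_any(items):
    """Build one regex that matches any of the items literally."""
    if not items:
        # An empty alternation would match everywhere; "(?!)" never matches.
        return re.compile(r"(?!)")
    # Longest first (ties broken alphabetically) so a longer item is tried
    # before any shorter item that is a prefix of it.
    ordered = sorted(set(items), key=lambda s: (-len(s), s))
    return re.compile("|".join(re.escape(item) for item in ordered))


items = ["he", "she", "his", "hers"]
pattern = compile_any(items)  # compile once, reuse for many strings

print(contains_any(items, "Those are his sheep!"))     # True
print(pattern.search("Those are his sheep!").group())  # his
print(pattern.search("XYZ") is None)                   # True
```

As with the esm index, the regex only pays off if you build it once and query it many times.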

    
             
                                                        
            
            
              
                