Python: optimal search for substring in list of strings

失恋的感觉 2020-12-06 16:10

I have a particular problem where I want to search for many substrings in a list of many strings. The following is the gist of what I am trying to do:

listSt         


        
5 answers
  •  南笙 (original poster)
    2020-12-06 16:44

    You can speed up the inner loop significantly by joining listString into one long string (or by reading the strings from a file without splitting it on line breaks).

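    with open('./testStrings.txt') as f:
        longString = f.read()               # all strings, separated by \n
    
    with open('./testSubstrings.txt') as f:
        # drop the trailing '\n' (otherwise a substring only matches at the
        # end of a line) and skip blank lines
        listSubstrings = [line.strip() for line in f if line.strip()]
    
    def search(longString, listSubstrings):
        """Yield (substring, offset) for every occurrence in longString."""
        for substring in listSubstrings:
            offset = longString.find(substring)
            while offset >= 0:
                yield (substring, offset)
                # resume one character later so overlapping matches are found too
                offset = longString.find(substring, offset + 1)
    
    matches = list(search(longString, listSubstrings))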
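    The same search also works when the strings are already in memory: join them with "\n" instead of reading a file. The following is a minimal sketch on made-up sample data; the names list_strings and list_substrings and their contents are assumptions for illustration, not taken from the question.

```python
def search(long_string, substrings):
    """Yield (substring, offset) for every occurrence, overlapping ones included."""
    for substring in substrings:
        offset = long_string.find(substring)
        while offset >= 0:
            yield (substring, offset)
            offset = long_string.find(substring, offset + 1)

# Hypothetical sample data (assumed for this sketch).
list_strings = ["ACGTAC", "TTACGG", "GGGTTT"]
list_substrings = ["AC", "GG", "CAT"]

# "\n" never occurs in the substrings, so no match can span two strings.
long_string = "\n".join(list_strings)

matches = list(search(long_string, list_substrings))
print(matches)
# [('AC', 0), ('AC', 4), ('AC', 9), ('GG', 11), ('GG', 14), ('GG', 15)]
```

    Note that "CAT" is not reported even though the first string ends in "C" and the second contains "A" and "T": the separator keeps the strings apart.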
    

    The offsets can be mapped back to the string index.

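    from bisect import bisect_left
    # offsets of every line break in longString, in ascending order
    breaks = [n for n, c in enumerate(longString) if c == '\n']
    
    for substring, offset in matches:
        # number of line breaks before the match == index of the containing string
        stringindex = bisect_left(breaks, offset)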
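    To make the mapping concrete, here is a small worked sketch that turns an offset in the joined string into both the index of the containing string and the position inside that string. The helper name locate and the sample data are hypothetical and only serve the illustration.

```python
from bisect import bisect_left

# Hypothetical sample data (assumed for this sketch).
list_strings = ["ACGTAC", "TTACGG", "GGGTTT"]
long_string = "\n".join(list_strings)

# Offsets of the separators, in ascending order: [6, 13]
breaks = [i for i, c in enumerate(long_string) if c == "\n"]

def locate(offset):
    """Map an offset in long_string to (string_index, position_in_string)."""
    # The number of separators before the offset is the index of the string.
    string_index = bisect_left(breaks, offset)
    # The string starts right after the previous separator (or at 0).
    start = breaks[string_index - 1] + 1 if string_index > 0 else 0
    return string_index, offset - start

print(locate(9))   # (1, 2): list_strings[1][2:4] == "AC"
print(locate(15))  # (2, 1): list_strings[2][1:3] == "GG"
```

    Because breaks is sorted, each lookup is a binary search, so mapping many matches stays cheap even for a long list of strings.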
    

    In my test this gave a 7x speedup over the nested for loops (11 sec vs. 77 sec).
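    If you want to check the difference on your own data, a rough comparison could look like the sketch below. The nested-loop baseline is an assumption about what the question's code does (a plain `in` test per string/substring pair), the random data is synthetic, and the timings you get will depend on input sizes and hardware.

```python
import random
import timeit
from bisect import bisect_left

random.seed(0)

def random_seq(n):
    """Build a random string of length n over a small alphabet."""
    return "".join(random.choice("ACGT") for _ in range(n))

# Synthetic inputs (assumed sizes, chosen only to keep the run short).
strings = [random_seq(50) for _ in range(200)]
substrings = [random_seq(5) for _ in range(200)]

def nested(strings, substrings):
    """Assumed baseline: test every substring against every string."""
    return {(sub, i) for sub in substrings
            for i, s in enumerate(strings) if sub in s}

def joined(strings, substrings):
    """Search one joined string, then map offsets back to string indexes."""
    long_string = "\n".join(strings)
    breaks = [i for i, c in enumerate(long_string) if c == "\n"]
    found = set()
    for sub in substrings:
        offset = long_string.find(sub)
        while offset >= 0:
            found.add((sub, bisect_left(breaks, offset)))
            offset = long_string.find(sub, offset + 1)
    return found

# Both approaches should report the same (substring, string index) pairs.
same_result = nested(strings, substrings) == joined(strings, substrings)

t_nested = timeit.timeit(lambda: nested(strings, substrings), number=5)
t_joined = timeit.timeit(lambda: joined(strings, substrings), number=5)
print(f"same result: {same_result}")
print(f"nested: {t_nested:.3f}s  joined: {t_joined:.3f}s")
```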
