I have a particular problem where I want to search for many substrings in a list of many strings. The following is the gist of what I am trying to do:
listSt
You can speed up the inner loop significantly by joining listString into one long string (or by reading the strings from a file without splitting it on line breaks).
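with open('./testStrings.txt') as f:
    longString = f.read()  # string with seqs separated by \n
with open('./testSubstrings.txt') as f:
    # strip the trailing newline, otherwise a substring only matches at the end of a line
    listSubstrings = [line.rstrip('\n') for line in f if line.strip()]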
def search(longString, listSubstrings):
    # yield (substring, offset) for every occurrence, overlapping ones included
    for substring in listSubstrings:
        offset = longString.find(substring)
        while offset >= 0:
            yield (substring, offset)
            # resume one character later so overlapping matches are found too
            offset = longString.find(substring, offset + 1)

matches = list(search(longString, listSubstrings))
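If the strings are already in memory rather than in a file, the same idea can be sketched as below. This is only a minimal example under assumptions: `list_strings` and `list_substrings` are made-up sample data, `find_all` is a hypothetical helper name, and none of the strings is assumed to contain a newline, so `"\n"` is safe as a separator.

```python
def find_all(long_string, substrings):
    """Yield (substring, offset) for every occurrence, overlaps included."""
    for substring in substrings:
        if not substring:
            continue  # an empty substring would match at every offset
        offset = long_string.find(substring)
        while offset >= 0:
            yield substring, offset
            offset = long_string.find(substring, offset + 1)

# Made-up sample data, for illustration only.
list_strings = ["GATTACA", "TTAGGG", "ACGTACGT"]
list_substrings = ["TTA", "ACG", "CCC"]

# Join once; "\n" is assumed not to occur inside any string.
long_string = "\n".join(list_strings)

matches = list(find_all(long_string, list_substrings))
print(matches)
```

Because the separator never occurs inside a substring, a match cannot span two neighbouring strings.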
The offsets can be mapped back to the string index by counting the line breaks that come before each offset.
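from bisect import bisect_left
# offsets of every line break in the long string, in increasing order
breaks = [n for n, c in enumerate(longString) if c == '\n']
for substring, offset in matches:
    # number of line breaks before the offset == index of the matching string
    stringindex = bisect_left(breaks, offset)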
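As a small worked example of that mapping, the sketch below also recovers the position of the match inside its own string. The `locate` helper and the sample data are hypothetical, and the column calculation is an addition of mine rather than part of the code above.

```python
from bisect import bisect_left

def locate(long_string, offsets):
    """Map offsets in a newline-joined string to (string_index, column)."""
    # Positions of every separator, in increasing order.
    breaks = [i for i, c in enumerate(long_string) if c == "\n"]
    located = []
    for offset in offsets:
        # Number of separators before the offset = index of the string.
        string_index = bisect_left(breaks, offset)
        # Start of that string: one past the previous separator, or 0.
        start = breaks[string_index - 1] + 1 if string_index else 0
        located.append((string_index, offset - start))
    return located

long_string = "GATTACA\nTTAGGG\nACGTACGT"  # made-up sample data
print(locate(long_string, [2, 8, 15, 19]))
# prints [(0, 2), (1, 0), (2, 0), (2, 4)]
```

Building `breaks` once costs a single pass over the text; each lookup afterwards is a binary search.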
My test shows a 7x speedup versus the nested for loops (11 sec vs 77 sec).
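To try a comparison like this on your own data, a rough timing harness might look like the following. Treat it as a sketch: `nested_search` is my guess at the nested-loop baseline, `joined_search` is a hypothetical wrapper around the joined-string approach, and the random data are synthetic. The timings it prints depend heavily on the data and the machine.

```python
import random
import time
from bisect import bisect_left

def nested_search(strings, substrings):
    """Assumed baseline: test every substring against every string."""
    return {(sub, i) for i, s in enumerate(strings)
            for sub in substrings if sub in s}

def joined_search(strings, substrings):
    """Joined approach: scan one newline-joined string with str.find."""
    long_string = "\n".join(strings)
    breaks = [i for i, c in enumerate(long_string) if c == "\n"]
    found = set()
    for sub in substrings:
        offset = long_string.find(sub)
        while offset >= 0:
            # Line breaks before the offset give the string index.
            found.add((sub, bisect_left(breaks, offset)))
            offset = long_string.find(sub, offset + 1)
    return found

# Synthetic data, for illustration only.
random.seed(0)
alphabet = "ACGT"
strings = ["".join(random.choices(alphabet, k=50)) for _ in range(2000)]
substrings = ["".join(random.choices(alphabet, k=8)) for _ in range(200)]

for fn in (nested_search, joined_search):
    start = time.perf_counter()
    result = fn(strings, substrings)
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__}: {len(result)} hits in {elapsed:.3f} s")
```

Both functions return the same set of (substring, string index) pairs, which makes it easy to check that the faster version has not changed the result.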