I have a particular problem where I want to search for many substrings in a list of many strings. The following is the gist of what I am trying to do:
listStrings = [...]       # many strings to search in
listSubstrings = [...]    # many substrings to look for
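Conceptually it is just a nested loop that checks every substring against every string and writes each match out on its own line (the names and output format here are only illustrative):

with open('result_file', 'w') as w:
    for i in listStrings:
        for j in listSubstrings:
            if i.find(j) > -1:   # i.e. j occurs somewhere in i
                w.write(j + i + '\n')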
Maybe you can try to chunk one of the two lists (the bigger one? intuitively I would cut up listStrings) into smaller pieces and then run those searches in parallel across several processes (the Pool class of multiprocessing offers a convenient way to do this). I saw a significant speed-up using something like:
from multiprocessing import Pool
from itertools import chain, islice

# The function to be run in parallel:
def my_func(strings):
    return [j + i for i in strings for j in listSubstrings if i.find(j) > -1]

# A small recipe from itertools to chunk an iterable:
def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())
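# (Illustration only, with throwaway values.) chunk() groups an iterable
# into size-length tuples, the last one possibly shorter:
# >>> list(chunk(range(10), 4))
# [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9)]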
# Generating some fake & random values:
from random import randint
listStrings = \
    [''.join([chr(randint(65, 90)) for i in range(randint(1, 500))]) for j in range(10000)]
listSubstrings = \
    [''.join([chr(randint(65, 90)) for i in range(randint(1, 100))]) for j in range(1000)]

# You have to prepare the searches to be performed:
prep = [strings for strings in chunk(listStrings, round(len(listStrings) / 8))]
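# (Assuming the sizes above.) prep is now a list of 8 chunks, each a tuple of
# about len(listStrings) / 8 strings, so the 4 worker processes each end up
# handling roughly two chunks of work.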
with Pool(4) as mp_pool:
    # Pool.map is a parallel version of the built-in map()
    res = mp_pool.map(my_func, prep)

# `res` is a list of lists, so concatenate them
# in order to get a single flat result list
result = list(chain.from_iterable(res))
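For reference (the values here are made up), chain.from_iterable flattens exactly one level of nesting:

list(chain.from_iterable([['ab'], ['cd', 'ef']]))  # -> ['ab', 'cd', 'ef']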
Then you can write the whole result variable in one go (instead of writing it line by line):
with open('result_file', 'w') as f:
    f.write('\n'.join(result))
Edit 01/05/18: flattened the result using itertools.chain.from_iterable instead of an ugly workaround relying on map side effects, following ShadowRanger's advice.