Question
I want to search for a pre-defined list of keywords in a given article and increment the score by 1 each time a keyword is found in the article. I want to use multiprocessing since the pre-defined list of keywords is very large - 10k keywords - and the number of articles is 100k.
I came across this question but it does not address my question.
I tried the following implementation but am getting None as the result.
import multiprocessing as mp

keywords = ["threading", "package", "parallelize"]

def search_worker(keyword):
    score = 0
    article = """
    The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    if keyword in article:
        score += 1
    return score
I tried the two methods below, but I am getting three None values as the result.
Method 1:
pool = mp.Pool(processes=4)
result = [pool.apply(search_worker, args=(keyword,)) for keyword in keywords]
Method 2:
result = pool.map(search_worker, keywords)
print(result)
Actual output: [None, None, None]
Expected output: 3
I am thinking of sending the worker the pre-defined list of keywords and the article together, but I am not sure if I am going in the right direction, as I have no prior experience with multiprocessing.
Thanks in advance.
Answer 1:
Here's a function using Pool. You can pass text and keyword_list to it and it will work. You could use Pool.starmap to pass tuples of (text, keyword), but you would then need to deal with an iterable holding 10k references to text (a sketch of that variant appears after the code below).
from functools import partial
from multiprocessing import Pool

def search_worker(text, keyword):
    return int(keyword in text)

def parallel_search_text(text, keyword_list):
    processes = 4
    chunk_size = 10
    total = 0
    func = partial(search_worker, text)
    with Pool(processes=processes) as pool:
        for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
            total += result
    return total

if __name__ == '__main__':
    texts = []     # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text(text, keywords))
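For reference, the starmap variant mentioned above might look roughly like the sketch below (parallel_search_text_starmap is just an illustrative name; the chunk size and process count mirror the code above). zip with itertools.repeat builds the (text, keyword) pairs lazily rather than as an explicit 10k-element list, though each pair still hands the full text to the pool.

from itertools import repeat
from multiprocessing import Pool

def search_worker(text, keyword):
    return int(keyword in text)

def parallel_search_text_starmap(text, keyword_list):
    # Each task receives a (text, keyword) tuple; zip(repeat(text), ...)
    # generates the pairs lazily instead of materializing a 10k-element list.
    with Pool(processes=4) as pool:
        scores = pool.starmap(search_worker, zip(repeat(text), keyword_list), chunksize=10)
    return sum(scores)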
There is overhead in creating a pool of workers, so it might be worth timeit-testing this against a simple single-process text-search function (a rough comparison sketch appears after the next code block). Repeated calls can be sped up by creating one instance of Pool and passing it into the function:
def parallel_search_text2(text, keyword_list, pool):
    chunk_size = 10
    results = 0
    func = partial(search_worker, text)
    for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
        results += result
    return results

if __name__ == '__main__':
    pool = Pool(processes=4)
    texts = []     # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text2(text, keywords, pool))
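A rough way to do that timeit comparison might look like the following sketch (single_process_search_text is just an illustrative sequential baseline; text and keywords are placeholders, as in the snippets above):

import timeit
from multiprocessing import Pool

def single_process_search_text(text, keyword_list):
    # Plain sequential baseline: no pool, no inter-process communication.
    return sum(int(keyword in text) for keyword in keyword_list)

if __name__ == '__main__':
    text = "some article text"  # a sample article
    keywords = []               # a list of keywords
    pool = Pool(processes=4)
    print(timeit.timeit(lambda: single_process_search_text(text, keywords), number=10))
    print(timeit.timeit(lambda: parallel_search_text2(text, keywords, pool), number=10))
    pool.close()
    pool.join()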
Answer 2:
User e.s resolved the main problem in his comment, but I'm posting a solution to Om Prakash's comment requesting to pass both the article and the pre-defined list of keywords to the worker method.
Here is a simple way to do that. All you need to do is construct a tuple containing the arguments that you want the worker to process:
from multiprocessing import Pool

def search_worker(article_and_keyword):
    # unpack the tuple
    article, keyword = article_and_keyword

    # count occurrences
    score = 0
    if keyword in article:
        score += 1

    return score

if __name__ == "__main__":
    # the article and the keywords
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]

    # construct the arguments for the search_worker; one keyword per worker but the same article
    args = [(article, keyword) for keyword in keywords]

    # construct the pool and map the arguments to the workers
    with Pool(3) as pool:
        result = pool.map(search_worker, args)

    print(result)
If you're on a later version of Python I would recommend trying starmap, as that will make this a bit cleaner.
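For example, a starmap version of the snippet above might look roughly like this (Pool.starmap is available from Python 3.3 on); the worker then takes two plain arguments instead of unpacking a tuple itself:

from multiprocessing import Pool

def search_worker(article, keyword):
    # starmap unpacks each (article, keyword) tuple into two positional arguments
    return int(keyword in article)

if __name__ == "__main__":
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]

    args = [(article, keyword) for keyword in keywords]
    with Pool(3) as pool:
        result = pool.starmap(search_worker, args)

    print(result)  # [1, 1, 1]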
Source: https://stackoverflow.com/questions/48162230/how-to-share-data-between-all-process-in-python-multiprocessing