What does the delayed() function do (when used with joblib in Python)

Submitted by 放肆的年华 on 2019-12-06 16:32:43

Question


I've read through the documentation, but I don't understand what is meant by: "The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax."

I'm using it to iterate over the list I want to operate on (allImages) as follows:

from joblib import Parallel, delayed

def joblib_loop():
    return Parallel(n_jobs=8)(delayed(getHog)(i) for i in allImages)

This returns my HOG features, like I want (and with the speed gain using all my 8 cores), but I'm just not sure what it is actually doing.

My Python knowledge is alright at best, and it's very possible that I'm missing something basic. Any pointers in the right direction would be most appreciated.


Answer 1:


Perhaps things become clearer if we look at what would happen if instead we simply wrote

Parallel(n_jobs=8)(getHog(i) for i in allImages)

The way Python works, getHog(i) for i in allImages is a generator expression whose elements are the return values of getHog. Each getHog(i) call runs in the calling process as soon as Parallel asks for the next element, so Parallel only ever receives finished results and there is nothing left for it to execute in parallel! All the work has been done in the process we're in right now, sequentially.
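To see where the work happens without delayed, here is a small sketch. get_hog and all_images are hypothetical stand-ins for the question's getHog and allImages, and a plain list() call plays the role of the consumer (Parallel in the real code). It records which thread ran each call:

```python
import threading

calls = []  # names of the threads that actually ran get_hog


def get_hog(image):
    # Hypothetical stand-in for the question's getHog.
    calls.append(threading.current_thread().name)
    return image * 2


all_images = [1, 2, 3]

# Without delayed: a generator expression of *calls*.
results = (get_hog(i) for i in all_images)
ran_before_consuming = list(calls)  # still empty, the generator is lazy

# Whoever iterates over the generator runs get_hog itself, one item at a time.
items = list(results)
caller = threading.current_thread().name
```

Every call ran in the consuming thread, and the consumer only ever saw finished results, so there was nothing left to hand to a worker.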

So we have to delay the execution by preserving a) the function we want to call and b) the arguments we want to call it with, without actually executing the function yet.

This is what delayed conveniently does for us with a clear syntax. If we want to "preserve" the call foo(2, g=3) for later, we can simply call delayed(foo)(2, g=3) and the tuple (foo, (2,), {'g': 3}) gets returned, ready to be executed by someone else.
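As a rough illustration, this behaviour can be reproduced in a few lines. simple_delayed below is a simplified stand-in written for this explanation, not joblib's actual source, and foo is a hypothetical function:

```python
import functools


def simple_delayed(function):
    # Simplified stand-in for joblib.delayed (an assumption, not the library code):
    # calling the wrapper captures the call instead of running it.
    @functools.wraps(function)
    def capture(*args, **kwargs):
        return function, args, kwargs
    return capture


def foo(x, g=0):
    # Hypothetical function used only for illustration.
    return x + g


task = simple_delayed(foo)(2, g=3)   # foo has NOT been called yet
func, args, kwargs = task
result = func(*args, **kwargs)       # someone else runs it later
```

The positional arguments arrive as a tuple and the keyword arguments as a dict with string keys, which is exactly what the later func(*args, **kwargs) call needs.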


So in your example, in a nutshell, the following happens.

  1. You create a generator of delayed(getHog)(i) calls, one per image.

  2. Each of those delayed(getHog)(i) calls returns the tuple (function, args, kwargs) (as you read in the docs), which in this case is the tuple (getHog, (i,), {}).

  3. Your previously constructed Parallel object starts a pool of n_jobs workers (8 here; by default these are separate processes, not threads) and distributes the tuples among them.

  4. Each worker executes the tuples it receives: it calls the first element of the tuple with the second and third elements unpacked as arguments, el[0](*el[1], **el[2]) or function(*args, **kwargs), which in this case results in the call getHog(i).
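The four steps above can be sketched with the standard library alone. This is a simplified model, not joblib's implementation: simple_delayed and get_hog are hypothetical stand-ins, and a ThreadPoolExecutor plays the role of Parallel's worker pool purely to keep the example short (for CPU-bound work joblib normally uses worker processes).

```python
from concurrent.futures import ThreadPoolExecutor


def simple_delayed(function):
    # Simplified stand-in for joblib.delayed: capture the call, do not run it.
    def capture(*args, **kwargs):
        return function, args, kwargs
    return capture


def get_hog(image):
    # Hypothetical stand-in for the question's getHog.
    return sum(image)


def run_task(task):
    # Step 4: unpack the tuple and make the real call.
    function, args, kwargs = task
    return function(*args, **kwargs)


all_images = [[1, 2], [3, 4], [5, 6]]

# Steps 1 and 2: one (function, args, kwargs) tuple per image.
tasks = [simple_delayed(get_hog)(i) for i in all_images]

# Step 3: a pool of workers receives the tuples; map keeps the input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    features = list(pool.map(run_task, tasks))
```

Nothing calls get_hog until run_task executes inside a worker, which is the whole point of delaying the call.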



Answer 2:


Suppose we need a loop to test a list of different model configurations. This is the main function that drives the grid search process, and it calls the score_model() function for each model configuration. We can dramatically speed up the grid search by evaluating model configurations in parallel. One way to do that is to use the Joblib library. We can define a Parallel object and set its number of jobs to the number of cores detected on your hardware.

Define the executor:

from joblib import Parallel, delayed, cpu_count
executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')

Then create the tasks to execute in parallel: one call to the score_model() function for each model configuration we have.

Suppose def score_model(data, n_test, cfg): ........................

Define the tasks (a generator of delayed calls):

tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)

We can then use the Parallel object to execute the tasks in parallel:

scores = executor(tasks)
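Put together, a minimal runnable version might look like the following. The body of score_model and the data, n_test and cfg_list values are made up for illustration (the original elides them), and the threading backend is used here only so the sketch runs anywhere; for CPU-bound scoring, a process-based backend is the usual choice.

```python
from joblib import Parallel, delayed


def score_model(data, n_test, cfg):
    # Hypothetical scoring function: a real one would fit and evaluate a model.
    test = data[-n_test:]
    return cfg, sum(abs(x - cfg) for x in test) / n_test


data = [1.0, 2.0, 3.0, 4.0]   # made-up data
n_test = 2
cfg_list = [3, 4]             # made-up configurations

executor = Parallel(n_jobs=2, backend="threading")
tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
scores = executor(tasks)      # one result per configuration, in input order
```

Returning cfg alongside the score keeps each result self-describing, which is convenient when sorting the configurations afterwards.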




Answer 3:


So what you want to be able to do is pile up a set of function calls and their arguments in such a way that you can pass them out efficiently to a scheduler/executor. delayed is a decorator that wraps a function so that calling the wrapped version with arguments does not run it; instead it packages the function and its arguments into an object (a plain tuple in joblib) that can be put in a list and handed out as needed. Dask has a similar delayed function, which it uses in part to feed into its graph scheduler.




Answer 4:


From the reference at https://wiki.python.org/moin/ParallelProcessing: The Parallel object creates a multiprocessing pool that forks the Python interpreter in multiple processes to execute each of the items of the list. The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax.

Another suggestion: instead of hard-coding the number of cores, we can detect it at run time like this:

import multiprocessing
num_core = multiprocessing.cpu_count()
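For example, assuming the goal is simply one worker per core, the detected count can be passed straight through as n_jobs. joblib also documents n_jobs=-1 as meaning "use all CPUs", which avoids the lookup altogether. The func and items names in the comments are hypothetical.

```python
import multiprocessing

# Number of logical CPUs reported by the operating system.
num_core = multiprocessing.cpu_count()

# Hypothetical use with joblib (shown as comments, not executed here):
#   Parallel(n_jobs=num_core)(delayed(func)(x) for x in items)
#   Parallel(n_jobs=-1)(delayed(func)(x) for x in items)   # all CPUs
```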


Source: https://stackoverflow.com/questions/42220458/what-does-the-delayed-function-do-when-used-with-joblib-in-python
