Question
I'm trying to use Numba and Dask to speed up a slow computation that is similar to calculating the kernel density estimate of a huge collection of points. My plan was to write the computationally expensive logic in a jit-compiled function and then split the work among the CPU cores using dask. I wanted to use the nogil feature of numba.jit so that I could use the dask threading backend and avoid unnecessary memory copies of the input data (which is very large).
Unfortunately, Dask doesn't produce a speed-up unless I use the 'processes' scheduler. If I use a ThreadPoolExecutor instead, I see the expected speed-up.
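A minimal sketch of that kind of ThreadPoolExecutor run might look like the following (it reuses the jit_render_internal, args, and CPU_COUNT defined in the example below, and is an illustration rather than my exact benchmark):
# Sketch: run the jitted function across a plain thread pool, without dask.
# Assumes jit_render_internal, args, and CPU_COUNT as defined in the example below.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=CPU_COUNT) as pool:
    # One task per core; these only run in parallel if the jitted
    # function really releases the GIL (nogil=True).
    futures = [pool.submit(jit_render_internal, *args) for _ in range(CPU_COUNT)]
    results = [f.result() for f in futures]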
Here's a simplified example of my problem:
import os
import numpy as np
import numba
import dask
CPU_COUNT = os.cpu_count()
def render_internal(size, mag):
    """mag is the magnification to apply
    generate coordinates internally
    """
    coords = np.random.rand(size, 2)
    img = np.zeros((mag, mag), dtype=np.int64)
    for i in range(len(coords)):
        y0, x0 = coords[i] * mag
        y1, x1 = int(y0), int(x0)
        m = 1
        img[y1, x1] += m
jit_render_internal = numba.jit(render_internal, nogil=True, nopython=True)
args = 10000000, 100
print("Linear time:")
%time linear_compute = [jit_render_internal(*args) for i in range(CPU_COUNT)]
delayed_jit_render_internal = dask.delayed(jit_render_internal)
print()
print("Threads time:")
%time dask_compute_threads = dask.compute(*[delayed_jit_render_internal(*args) for i in range(CPU_COUNT)])
print()
print("Processes time:")
%time dask_compute_processes = dask.compute(*[delayed_jit_render_internal(*args) for i in range(CPU_COUNT)], scheduler="processes")
And here's the output on my machine:
Linear time:
Wall time: 1min 17s
Threads time:
Wall time: 1min 47s
Processes time:
Wall time: 7.79 s
For both the processes and threads backends I see full utilization of all CPU cores, as expected, but no speed-up for the threading backend. I'm pretty sure that the jitted function, jit_render_internal, is not, in fact, releasing the GIL.
My two questions are:
- If the nogil keyword is passed to numba.jit and the GIL cannot be released, why isn't an error raised?
- Why doesn't the code, as I've written it, release the GIL? All the computation is embedded in the function and there's no return value.
Answer 1:
Try the following, which is much faster and seems to fix the thread performance issue:
def render_internal(size, mag):
    """mag is the magnification to apply
    generate coordinates internally
    """
    coords = np.random.rand(size, 2)
    img = np.zeros((mag, mag), dtype=np.int64)
    for i in range(len(coords)):
        # y0, x0 = coords[i] * mag
        y0 = coords[i, 0] * mag
        x0 = coords[i, 1] * mag
        y1, x1 = int(y0), int(x0)
        m = 1
        img[y1, x1] += m
I've split the calculation of x0 and y0 apart in the version above. On my machine, the threads-based solution is actually faster than the processes-based one after this change.
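For completeness, a minimal sketch of wiring the revised function back into the setup from the question (this assumes the same imports, CPU_COUNT, and args as the original example; timings will vary by machine):
# Re-jit the revised render_internal and run it on the dask threads scheduler.
# Assumes numba, dask, CPU_COUNT, and args from the question's example.
jit_render_internal = numba.jit(render_internal, nogil=True, nopython=True)
delayed_jit_render_internal = dask.delayed(jit_render_internal)

tasks = [delayed_jit_render_internal(*args) for _ in range(CPU_COUNT)]
# With the GIL released inside the hot loop, the default threads scheduler
# can now run the tasks in parallel.
dask.compute(*tasks, scheduler="threads")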
Source: https://stackoverflow.com/questions/56855897/numba-nogil-dask-threading-backend-results-in-no-speed-up-computation-is-sl