Question
I'm trying to use Numba and Dask to speed up a slow computation that is similar to calculating the kernel density estimate of a huge collection of points. My plan was to write the computationally expensive logic in a jit-compiled function and then split the work among the CPU cores using dask. I wanted to use the nogil feature of numba.jit so that I could use the dask threading backend and avoid unnecessary memory copies of the input data (which is very large).
Unfortunately, Dask doesn't produce a speed-up unless I use the 'processes' scheduler. If I use a ThreadPoolExecutor instead, I see the expected speed-up.
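A minimal sketch of that kind of ThreadPoolExecutor run might look like the following (it reuses the jit_render_internal, args, and CPU_COUNT defined in the example below, and is an illustration rather than my exact benchmark):
# Sketch: run the jitted function across a plain thread pool, without dask.
# Assumes jit_render_internal, args, and CPU_COUNT as defined in the example below.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=CPU_COUNT) as pool:
    # One task per core; these only run in parallel if the jitted
    # function really releases the GIL (nogil=True).
    futures = [pool.submit(jit_render_internal, *args) for _ in range(CPU_COUNT)]
    results = [f.result() for f in futures]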
Here's a simplified example of my problem:
import os
import numpy as np
import numba
import dask
CPU_COUNT = os.cpu_count()
def render_internal(size, mag):
    """mag is the magnification to apply
    generate coordinates internally
    """
    coords = np.random.rand(size, 2)
    img = np.zeros((mag, mag), dtype=np.int64)
    for i in range(len(coords)):
        y0, x0 = coords[i] * mag
        y1, x1 = int(y0), int(x0)
        m = 1
        img[y1, x1] += m
jit_render_internal = numba.jit(render_internal, nogil=True, nopython=True)
args = 10000000, 100
print("Linear time:")
%time linear_compute = [jit_render_internal(*args) for i in range(CPU_COUNT)]
delayed_jit_render_internal = dask.delayed(jit_render_internal)
print()
print("Threads time:")
%time dask_compute_threads = dask.compute(*[delayed_jit_render_internal(*args) for i in range(CPU_COUNT)])
print()
print("Processes time:")
%time dask_compute_processes = dask.compute(*[delayed_jit_render_internal(*args) for i in range(CPU_COUNT)], scheduler="processes")
And here's the output on my machine:
Linear time:
Wall time: 1min 17s
Threads time:
Wall time: 1min 47s
Processes time:
Wall time: 7.79 s
For both the processes and threads backends I see full utilization of all CPU cores, as expected, but no speed-up for the threading backend. I'm pretty sure that the jitted function, jit_render_internal, is not, in fact, releasing the GIL.
My two questions are:
- If the nogil keyword is passed to numba.jit and the GIL cannot be released, why isn't an error raised?
- Why doesn't the code, as I've written it, release the GIL? All the computation is embedded in the function and there's no return value.
Answer 1:
Try the following, which is much faster and seems to fix the thread performance issue:
def render_internal(size, mag):
    """mag is the magnification to apply
    generate coordinates internally
    """
    coords = np.random.rand(size, 2)
    img = np.zeros((mag, mag), dtype=np.int64)
    for i in range(len(coords)):
        # y0, x0 = coords[i] * mag
        y0 = coords[i, 0] * mag
        x0 = coords[i, 1] * mag
        y1, x1 = int(y0), int(x0)
        m = 1
        img[y1, x1] += m
I've split the calculation of x0 and y0 apart in the version above. On my machine, the threads-based solution is actually faster than the processes-based one after this change.
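For completeness, a minimal sketch of wiring the revised function back into the setup from the question (this assumes the same imports, CPU_COUNT, and args as the original example; timings will vary by machine):
# Re-jit the revised render_internal and run it on the dask threads scheduler.
# Assumes numba, dask, CPU_COUNT, and args from the question's example.
jit_render_internal = numba.jit(render_internal, nogil=True, nopython=True)
delayed_jit_render_internal = dask.delayed(jit_render_internal)

tasks = [delayed_jit_render_internal(*args) for _ in range(CPU_COUNT)]
# With the GIL released inside the hot loop, the default threads scheduler
# can now run the tasks in parallel.
dask.compute(*tasks, scheduler="threads")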
Source: https://stackoverflow.com/questions/56855897/numba-nogil-dask-threading-backend-results-in-no-speed-up-computation-is-sl