I've been attempting to optimize a piece of Python code that involves large multi-dimensional array calculations, and I am getting counterintuitive results with Numba.
There are two issues with your @guvectorize implementations. The first is that you are doing all the looping inside your @guvectorize kernel, so there is actually nothing for the Numba parallel target to parallelize. Both @vectorize and @guvectorize parallelize on the broadcast dimensions in a ufunc/gufunc. Since the signature of your gufunc is 2D and your inputs are 2D, there is only a single call to the inner function, which explains why you only saw 100% CPU usage.
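To make that concrete, here is a sketch of the difference; the kernel bodies are my guess at what your code looks like, not a copy of it. With a fully 2D layout there is exactly one kernel call per array pair, whereas a row-wise layout leaves the leading dimension as a broadcast dimension that the parallel target can split across threads:

import numpy as np
from numba import guvectorize

# Guessing at your kernel: the whole 2D array is handled inside one call,
# so the parallel target has nothing to split across threads.
@guvectorize(['void(float64[:,:], float64[:,:], float64[:,:])'],
             '(m,n),(m,n)->(m,n)', target='parallel')
def add_two_2ds_parallel(a, b, res):
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            res[i, j] = a[i, j] + b[i, j]

# A row-wise signature instead: the kernel only sees one row at a time, and
# the leading dimension of the 2D inputs becomes a broadcast dimension that
# the parallel target can distribute across threads.
@guvectorize(['void(float64[:], float64[:], float64[:])'],
             '(n),(n)->(n)', target='parallel')
def add_rows(a, b, res):
    for j in range(a.shape[0]):
        res[j] = a[j] + b[j]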
The best way to write the function you have above is to use a regular ufunc:
from numba import vectorize

@vectorize(['float64(float64, float64)'], target='parallel')
def add_ufunc(a, b):
    return a + b
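Since @vectorize builds a real NumPy ufunc, the preallocated output array can be passed positionally (or as out=res), which is how it is called in the timings below. A minimal setup sketch; the array shapes here are my assumption, not taken from your post:

import numpy as np

# Hypothetical benchmark arrays -- the shapes are an assumption.
A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)
res = np.empty_like(A)

# The output array goes in as the ufunc's positional 'out' argument.
add_ufunc(A, B, res)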
Then on my system, I see these speeds:
%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.87 ms per loop
%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.81 ms per loop
%timeit add_two_2ds_parallel(A,B,res)
The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 2.43 ms per loop
%timeit add_two_2ds_numexpr(A,B,res)
100 loops, best of 3: 2.79 ms per loop
%timeit add_ufunc(A, B, res)
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 2.03 ms per loop
(This is a very similar OS X system to yours, but with OS X 10.11.)
Although Numba's parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it doesn't beat the simple single-threaded CPU case. I suspect that the bottleneck is due to memory (or cache) bandwidth, but I haven't done the measurements to check that.
Generally speaking, you will see much more benefit from the parallel ufunc target if you are doing more math operations per memory element (like, say, a cosine).
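As an illustration (my own example, not code from your post), a ufunc with several transcendental operations per element is the kind of case where the parallel target usually pulls clearly ahead of the single-threaded one:

import math
from numba import vectorize

# A more arithmetic-heavy ufunc: several math operations per element mean the
# threads spend their time computing rather than waiting on memory, so
# target='parallel' has room to show a speedup.
@vectorize(['float64(float64, float64)'], target='parallel')
def heavy_ufunc(a, b):
    return math.cos(a) * math.sin(b) + math.sqrt(a * a + b * b)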