numba guvectorize target='parallel' slower than target='cpu'

Backend · Open · 1 answer · 2077 views

闹比i, asked 2020-12-15 13:47

I've been attempting to optimize a piece of Python code that involves large multi-dimensional array calculations. I am getting counterintuitive results with numba. I am r

1 answer
  • Answered 2020-12-15 13:58

    There are two issues with your @guvectorize implementations. The first is that you are doing all the looping inside your @guvectorize kernel, so there is actually nothing for the Numba parallel target to parallelize. Both @vectorize and @guvectorize parallelize over the broadcast dimensions of the ufunc/gufunc. Since the signature of your gufunc is 2D and your inputs are 2D, there is only a single call to the inner function, which explains the 100% (single-core) CPU usage you saw.
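    To illustrate the point about broadcast dimensions, here is a minimal sketch (the function name `add_rows` and the array shapes are my own, not from the question): if the gufunc signature only covers the last axis, Numba is free to parallelize over the leading axis of a 2-D input.

    ```python
    import numpy as np
    from numba import guvectorize

    # Signature covers only the last dimension ('(n),(n)->(n)'),
    # so a 2-D input broadcasts over its rows and the parallel
    # target can split that row loop across threads.
    @guvectorize(['(float64[:], float64[:], float64[:])'],
                 '(n),(n)->(n)', target='parallel')
    def add_rows(a, b, res):
        for i in range(a.shape[0]):
            res[i] = a[i] + b[i]

    A = np.random.rand(1024, 512)
    B = np.random.rand(1024, 512)
    out = add_rows(A, B)  # gufunc broadcasts over the 1024 rows
    ```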

    The best way to write the function you have above is to use a regular ufunc:

    @vectorize(['float64(float64, float64)'], target='parallel')
    def add_ufunc(a, b):
        return a + b
    

    Then on my system, I see these speeds:

    %timeit add_two_2ds_jit(A,B,res)
    1000 loops, best of 3: 1.87 ms per loop
    
    %timeit add_two_2ds_cpu(A,B,res)
    1000 loops, best of 3: 1.81 ms per loop
    
    %timeit add_two_2ds_parallel(A,B,res)
    The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached 
    100 loops, best of 3: 2.43 ms per loop
    
    %timeit add_two_2ds_numexpr(A,B,res)
    100 loops, best of 3: 2.79 ms per loop
    
    %timeit add_ufunc(A, B, res)
    The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached 
    1000 loops, best of 3: 2.03 ms per loop
    

    (This is a very similar OS X system to yours, but with OS X 10.11.)

    Although Numba's parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it doesn't beat the simple single-threaded CPU case. I suspect that the bottleneck is due to memory (or cache) bandwidth, but I haven't done the measurements to check that.

    Generally speaking, you will see much more benefit from the parallel ufunc target if you are doing more math operations per memory element (like, say, a cosine).
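    As a rough sketch of that last point (the function name `cos_sum` is illustrative, not from the question): a ufunc that does a transcendental operation per element gives each thread enough work to amortize the threading overhead, so the parallel target has a better chance of winning.

    ```python
    import math
    import numpy as np
    from numba import vectorize

    # Compute-heavier ufunc: more math per memory element than a
    # plain add, so the parallel target is less memory-bound.
    @vectorize(['float64(float64, float64)'], target='parallel')
    def cos_sum(a, b):
        return math.cos(a) + math.cos(b)

    A = np.random.rand(1000, 1000)
    B = np.random.rand(1000, 1000)
    out = cos_sum(A, B)
    ```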
