Matrix multiplication with iterator dependency - NumPy

Asked by 野的像风 on 2020-12-17 04:25

Some time back this question (now deleted, but 10K+ rep users can still view it) was posted. It looked interesting to me and I learnt something new while trying to solve it.

2 Answers
  •  北海茫月
    2020-12-17 05:02

    I'm not sure if you want it to be all-NumPy, but I've always used numba for slow but easy-to-implement loop-based functions; the speedup for loop-intensive tasks is amazing. First I just numba.njitted your all_loopy variant, which already gave me competitive results:
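    (The timed helpers below come from the now-deleted question, so their bodies aren't shown here. As a point of reference, a minimal sketch of what the loopy variant computes, reconstructed from the optimized version further down; the exact loop order in the original is an assumption:)

    ```python
    import numpy as np
    # import numba as nb   # hypothetical: decorate with @nb.njit for the quoted speedups

    def all_loopy(a, b):
        # d[i] = sum over k, q of a[k, q, i] * sum_{j < i} b[q, k, j]
        P, Q, N = a.shape
        d = np.zeros(N)
        for i in range(N):
            for j in range(i):          # iterator dependency: j only runs up to i
                for k in range(P):
                    for q in range(Q):
                        d[i] += a[k, q, i] * b[q, k, j]
        return d
    ```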

    m,n,N = 20,20,20
    a = np.random.rand(m,n,N)
    b = np.random.rand(n,m,N)
    
    %timeit numba_all_loopy(a,b)
    1000 loops, best of 3: 476 µs per loop # 3 times faster than everything else
    %timeit tensordot_twoloop(a,b)
    100 loops, best of 3: 16.1 ms per loop
    %timeit einsum_twoloop(a,b)
    100 loops, best of 3: 4.02 ms per loop
    %timeit einsum_oneloop(a,b)
    1000 loops, best of 3: 1.52 ms per loop
    %timeit fully_vectorized(a,b)
    1000 loops, best of 3: 1.67 ms per loop
    

    Then I tested it against your 100, 100, 100 case:

    m,n,N = 100,100,100
    a = np.random.rand(m,n,N)
    b = np.random.rand(n,m,N)
    
    %timeit numba_all_loopy(a,b)
    1 loop, best of 3: 2.35 s per loop
    %timeit tensordot_twoloop(a,b)
    1 loop, best of 3: 3.54 s per loop
    %timeit einsum_twoloop(a,b)
    1 loop, best of 3: 2.58 s per loop
    %timeit einsum_oneloop(a,b)
    1 loop, best of 3: 2.71 s per loop
    %timeit fully_vectorized(a,b)
    1 loop, best of 3: 1.08 s per loop
    

    Apart from noticing that my computer is much slower than yours, the interesting part is that the numba version is getting slower. What happened?

    NumPy calls into compiled routines and, depending on the compiler options, will probably optimize the looping, while numba's naive translation doesn't. So the next logical step is to optimize the loop order. Assuming C-contiguous arrays, the innermost loop should run over the last axis of the arrays: it's the fastest-changing axis in memory, so the cache locality will be better.
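    The layout claim is easy to check: NumPy exposes both the contiguity flag and the per-axis strides, and for a C-contiguous array the last axis always has the smallest stride:

    ```python
    import numpy as np

    a = np.random.rand(4, 5, 6)         # freshly created arrays are C-contiguous
    print(a.flags['C_CONTIGUOUS'])      # True
    print(a.strides)                    # (240, 48, 8): the last axis has the
                                        # smallest stride (8 bytes per float64),
                                        # so iterating it innermost walks memory
                                        # sequentially
    ```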

    import numpy as np
    import numba as nb

    @nb.njit
    def numba_all_loopy2(a, b):
        P, Q, N = a.shape
        d = np.zeros(N)
        for k in range(P):          # first axis of a, second axis of b
            for n in range(Q):      # first axis of b, second axis of a
                for i in range(N):  # third axis of a
                    A = a[k, n, i]  # hoist the repeated lookup out of the innermost loop
                    for j in range(i):  # third axis of b: the fastest-changing one
                        d[i] += A * b[n, k, j]
        return d
    

    So what are the timings of this "optimized" numba function? Can it compare with the others, or even beat them?

    m = n = N = 20
    %timeit numba_all_loopy(a,b)
    1000 loops, best of 3: 476 µs per loop
    %timeit numba_all_loopy2(a,b)
    1000 loops, best of 3: 379 µs per loop # New one is a bit faster
    

    So it's slightly faster for small matrices. What about big ones?

    m = n = N = 100
    %timeit numba_all_loopy(a,b)
    1 loop, best of 3: 2.34 s per loop
    %timeit numba_all_loopy2(a,b)
    1 loop, best of 3: 213 ms per loop # More than ten times faster now!
    

    So we have an amazing speedup for large arrays. This function is now 4-5 times faster than your vectorized approaches and the result is the same.
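    To convince yourself the results really agree, a hypothetical pure-NumPy cross-check (not one of the timed variants above) can be built from the same recurrence the loops implement, d[i] = sum over k, n of a[k, n, i] * sum_{j < i} b[n, k, j]:

    ```python
    import numpy as np

    def reference_d(a, b):
        # cross-check only: O(N) einsum calls, one per output element
        N = a.shape[2]
        return np.array([np.einsum('kn,nk->', a[:, :, i], b[:, :, :i].sum(axis=-1))
                         for i in range(N)])

    # np.allclose(numba_all_loopy2(a, b), reference_d(a, b)) should then hold
    ```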

    But, amazingly, the ranking seems to depend on the machine: fully_vectorized is fastest for me, whereas the einsum options are faster on @Divakar's machine. So it remains open whether those results are really that much faster.

    Just for fun I tried it with n=m=N=200:

    %timeit numba_all_loopy2(a,b)
    1 loop, best of 3: 3.38 s per loop  # still 5 times faster
    %timeit einsum_oneloop(a,b)
    1 loop, best of 3: 51.4 s per loop
    %timeit fully_vectorized(a,b)
    1 loop, best of 3: 16.7 s per loop
    
