I'm getting some efficiency test results that I can't explain.
I want to assemble a matrix B whose i-th entries B[i,:,:] = A[i,:,:].dot(x), where each A[i,:,:] is a 2-D matrix.
I am not too familiar with numpy's C-API, but numpy.dot is one of those builtin functions that used to live under _dotblas in earlier versions.
Nevertheless, here are my thoughts.
1) numpy.dot takes different paths for 2-dimensional arrays and n-dimensional arrays. From numpy.dot's online documentation:
For 2-D arrays it is equivalent to matrix multiplication, and for 1-D arrays to inner product of vectors (without complex conjugation). For N dimensions it is a sum product over the last axis of a and the second-to-last of b:
dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
So for 2-D arrays you are always guaranteed a single call to BLAS's dgemm. For N-D arrays, however, numpy may choose multiplication axes that do not correspond to the fastest-changing axis in memory (as you can see from the excerpt above), and as a result the full power of dgemm can be missed.
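To make that rule concrete, here is a minimal check (array names arbitrary) that for a 3-D a and a 2-D b, np.dot is equivalent to stacking the 2-D products of the trailing blocks:

import numpy as np

a = np.random.rand(3, 4, 5)
b = np.random.rand(5, 6)

# dot(a, b)[i, j, m] = sum(a[i, j, :] * b[:, m])
nd = np.dot(a, b)                                             # shape (3, 4, 6)
stacked = np.array([np.dot(a[i], b) for i in range(a.shape[0])])
print(np.allclose(nd, stacked))                               # True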
2) Your A array is too large to be loaded into the CPU cache. In your example, you use A with dimensions (10, 1000, 1000), which gives
In [1]: A.nbytes
Out[1]: 80000000

In [2]: 80000000 / 1024
Out[2]: 78125
That is 78125 kB, almost 80 MB (10 × 1000 × 1000 float64 values at 8 bytes each), much larger than your cache size. So again you lose most of dgemm's power right there.
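The same back-of-the-envelope arithmetic (assuming float64, 8 bytes per element) shows why the small arrays used further below fit comfortably:

10 * 1000 * 1000 * 8    # = 80000000 bytes, ~76 MiB -- far beyond any cache
20 * 20 * 20 * 8        # = 64000 bytes, 62.5 kB -- fits easily in a typical L2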
3) You are also timing the functions somewhat imprecisely. Python's time function has limited resolution for micro-benchmarks; use timeit instead.
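For example, with the standard timeit module (the same machinery behind IPython's %timeit) a measurement could look like the following; the setup string and loop counts here are just illustrative:

import timeit

setup = ("import numpy as np;"
         "A = np.random.rand(10, 1000, 1000);"
         "x = np.random.rand(1000, 1000)")
# report the best of 3 repeats, averaged over 10 calls each
best = min(timeit.repeat("np.dot(A, x)", setup=setup, repeat=3, number=10)) / 10
print("%.1f ms per loop" % (best * 1e3))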
So with all the above points in mind, let's try experimenting with arrays that can be loaded into the cache:
import numpy as np

dim1, dim2, dim3 = 20, 20, 20
A = np.random.rand(dim1, dim2, dim2)
x = np.random.rand(dim2, dim3)

def for_dot1(A, x):
    # multiply the trailing 2-D blocks A[i, :, :]
    for i in range(A.shape[0]):
        np.dot(A[i, :, :], x)

def for_dot2(A, x):
    # slice along the middle axis instead
    for i in range(A.shape[1]):
        np.dot(A[:, i, :], x)

def for_dot3(A, x):
    # slice along the last (fastest-changing) axis
    for i in range(A.shape[2]):
        np.dot(A[:, :, i], x)
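Before looking at the numbers, note that the three loops hand np.dot very different memory layouts: only for_dot1 extracts C-contiguous blocks, while the other two produce strided views (which numpy typically has to copy before calling BLAS). A quick check, using the names defined above:

print(A[0, :, :].flags['C_CONTIGUOUS'])   # True  -- rows adjacent in memory
print(A[:, 0, :].flags['C_CONTIGUOUS'])   # False -- strided view
print(A[:, :, 0].flags['C_CONTIGUOUS'])   # False -- strided view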
Here are the timings I get (using numpy 1.9.2 built against OpenBLAS 0.2.14):
In [3]: %timeit np.dot(A,x)
10000 loops, best of 3: 174 µs per loop
In [4]: %timeit np.einsum("ijk, kl -> ijl", A, x)
10000 loops, best of 3: 108 µs per loop
In [5]: %timeit np.einsum("ijk, lk -> ijl", A, x)
10000 loops, best of 3: 97.1 µs per loop
In [6]: %timeit np.einsum("ikj, kl -> ijl", A, x)
1000 loops, best of 3: 238 µs per loop
In [7]: %timeit np.einsum("kij, kl -> ijl", A, x)
10000 loops, best of 3: 113 µs per loop
In [8]: %timeit for_dot1(A,x)
10000 loops, best of 3: 101 µs per loop
In [9]: %timeit for_dot2(A,x)
10000 loops, best of 3: 131 µs per loop
In [10]: %timeit for_dot3(A,x)
10000 loops, best of 3: 133 µs per loop
Notice that there are still time differences, but not orders of magnitude. Also note the importance of choosing the axis of multiplication. Now perhaps a numpy developer can shed some light on what numpy.dot actually does under the hood for N-D arrays.
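As a practical aside: since the 2-D path is the one guaranteed to hit dgemm, a common workaround for the original B[i,:,:] = A[i,:,:].dot(x) problem is to fold the leading axes into one and make a single 2-D call. A sketch, not necessarily what numpy does internally:

# one dgemm call instead of a Python loop over slices
B = np.dot(A.reshape(-1, A.shape[-1]), x).reshape(A.shape[0], A.shape[1], -1)
print(np.allclose(B, np.dot(A, x)))   # True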