I'm getting some efficiency test results that I can't explain.
I want to assemble a matrix B whose i-th entries B[i,:,:] = A[i,:,:].dot(x), where each A[i,:,:] is a 2-D matrix.
I am not too familiar with numpy's C-API, but numpy.dot is one of those builtin functions that used to live under _dotblas in earlier versions.
Nevertheless, here are my thoughts.
1) numpy.dot takes different paths for 2-dimensional arrays and n-dimensional arrays. From numpy.dot's online documentation:
For 2-D arrays it is equivalent to matrix multiplication, and for 1-D arrays to inner product of vectors (without complex conjugation). For N dimensions it is a sum product over the last axis of a and the second-to-last of b:
dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
So for 2-D arrays you are always guaranteed a single call to BLAS's dgemm. For N-D arrays, however, numpy may choose multiplication axes that do not correspond to the fastest-changing axis in memory (as you can see from the excerpt above), and as a result the full power of dgemm can be missed.
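To make that rule concrete, here is a minimal check (array names arbitrary) that for a 3-D a and a 2-D b, np.dot is equivalent to stacking the 2-D products of the trailing blocks:

import numpy as np

a = np.random.rand(3, 4, 5)
b = np.random.rand(5, 6)

# dot(a, b)[i, j, m] = sum(a[i, j, :] * b[:, m])
nd = np.dot(a, b)                                             # shape (3, 4, 6)
stacked = np.array([np.dot(a[i], b) for i in range(a.shape[0])])
print(np.allclose(nd, stacked))                               # True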
2) Your A array is too large to be loaded into the CPU cache. In your example, you use A with dimensions (10, 1000, 1000), which gives
In [1]: A.nbytes
Out[1]: 80000000

In [2]: 80000000 / 1024
Out[2]: 78125
That is 78125 kB, almost 80 MB (10 × 1000 × 1000 float64 values at 8 bytes each), much larger than your cache size. So again you lose most of dgemm's power right there.
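The same back-of-the-envelope arithmetic (assuming float64, 8 bytes per element) shows why the small arrays used further below fit comfortably:

10 * 1000 * 1000 * 8    # = 80000000 bytes, ~76 MiB -- far beyond any cache
20 * 20 * 20 * 8        # = 64000 bytes, 62.5 kB -- fits easily in a typical L2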
3) You are also timing the functions somewhat imprecisely. Python's time function has limited resolution for micro-benchmarks; use timeit instead.
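For example, with the standard timeit module (the same machinery behind IPython's %timeit) a measurement could look like the following; the setup string and loop counts here are just illustrative:

import timeit

setup = ("import numpy as np;"
         "A = np.random.rand(10, 1000, 1000);"
         "x = np.random.rand(1000, 1000)")
# report the best of 3 repeats, averaged over 10 calls each
best = min(timeit.repeat("np.dot(A, x)", setup=setup, repeat=3, number=10)) / 10
print("%.1f ms per loop" % (best * 1e3))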
So with all the above points in mind, let's try experimenting with arrays that can be loaded into the cache:
import numpy as np

dim1, dim2, dim3 = 20, 20, 20
A = np.random.rand(dim1, dim2, dim2)
x = np.random.rand(dim2, dim3)

def for_dot1(A, x):
    # multiply the trailing 2-D blocks A[i, :, :]
    for i in range(A.shape[0]):
        np.dot(A[i, :, :], x)

def for_dot2(A, x):
    # slice along the middle axis instead
    for i in range(A.shape[1]):
        np.dot(A[:, i, :], x)

def for_dot3(A, x):
    # slice along the last (fastest-changing) axis
    for i in range(A.shape[2]):
        np.dot(A[:, :, i], x)
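Before looking at the numbers, note that the three loops hand np.dot very different memory layouts: only for_dot1 extracts C-contiguous blocks, while the other two produce strided views (which numpy typically has to copy before calling BLAS). A quick check, using the names defined above:

print(A[0, :, :].flags['C_CONTIGUOUS'])   # True  -- rows adjacent in memory
print(A[:, 0, :].flags['C_CONTIGUOUS'])   # False -- strided view
print(A[:, :, 0].flags['C_CONTIGUOUS'])   # False -- strided view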
Here are the timings I get (using numpy 1.9.2 built against OpenBLAS 0.2.14):
In [3]: %timeit np.dot(A,x)
10000 loops, best of 3: 174 µs per loop
In [4]: %timeit np.einsum("ijk, kl -> ijl", A, x)
10000 loops, best of 3: 108 µs per loop
In [5]: %timeit np.einsum("ijk, lk -> ijl", A, x)
10000 loops, best of 3: 97.1 µs per loop
In [6]: %timeit np.einsum("ikj, kl -> ijl", A, x)
1000 loops, best of 3: 238 µs per loop
In [7]: %timeit np.einsum("kij, kl -> ijl", A, x)
10000 loops, best of 3: 113 µs per loop
In [8]: %timeit for_dot1(A,x)
10000 loops, best of 3: 101 µs per loop
In [9]: %timeit for_dot2(A,x)
10000 loops, best of 3: 131 µs per loop
In [10]: %timeit for_dot3(A,x)
10000 loops, best of 3: 133 µs per loop
Notice that there are still time differences, but not orders of magnitude. Also note the importance of choosing the axis of multiplication. Now perhaps a numpy developer can shed some light on what numpy.dot actually does under the hood for N-D arrays.
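As a practical aside: since the 2-D path is the one guaranteed to hit dgemm, a common workaround for the original B[i,:,:] = A[i,:,:].dot(x) problem is to fold the leading axes into one and make a single 2-D call. A sketch, not necessarily what numpy does internally:

# one dgemm call instead of a Python loop over slices
B = np.dot(A.reshape(-1, A.shape[-1]), x).reshape(A.shape[0], A.shape[1], -1)
print(np.allclose(B, np.dot(A, x)))   # True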