I've usually gotten good performance out of numpy's einsum function (and I like its syntax). @Ophion's answer to this question shows that - for the cases tested - einsum can consistently outperform the equivalent "built-in" functions.
You can have the best of both worlds:
def func_dot_einsum(C, X):
    Y = X.dot(C)
    return np.einsum('ij,ij->i', Y, X)
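As a quick sanity check (my own, not part of the original answer), the hybrid above should agree with a single three-operand einsum call, since X.dot(C) contracts over k and the 'ij,ij->i' step then contracts over the remaining shared axis:

```python
import numpy as np

def func_dot_einsum(C, X):
    Y = X.dot(C)
    return np.einsum('ij,ij->i', Y, X)

# small random inputs; shapes chosen arbitrarily for the check
X = np.random.rand(8, 5)
C = np.random.rand(5, 5)
print(np.allclose(func_dot_einsum(C, X),
                  np.einsum('ik,km,im->i', X, C, X)))  # True
```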
On my system:
In [7]: %timeit func_dot(C, X)
10 loops, best of 3: 31.1 ms per loop
In [8]: %timeit func_einsum(C, X)
10 loops, best of 3: 105 ms per loop
In [9]: %timeit func_einsum2(C, X)
10 loops, best of 3: 43.5 ms per loop
In [10]: %timeit func_dot_einsum(C, X)
10 loops, best of 3: 21 ms per loop
When available, np.dot uses BLAS, MKL, or whatever linear algebra library you have installed, so the call to np.dot is almost certainly multithreaded. np.einsum has its own loops, so it doesn't benefit from any of those optimizations, apart from its own use of SIMD to speed things up over a vanilla C implementation.
Then there's the multi-input einsum call that runs much slower... The numpy source for einsum is very complex and I don't fully understand it. So be advised that the following is speculative at best, but here's what I think is going on...
When you run something like np.einsum('ij,ij->i', a, b), the benefit over doing np.sum(a * b, axis=1) comes from avoiding the need to instantiate the intermediate array with all the products and looping over it twice. So at the low level what goes on is something like:
for i in range(I):
    out[i] = 0
    for j in range(J):
        out[i] += a[i, j] * b[i, j]
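To make that concrete, here's a runnable version of the loop (my own check, with assumed small shapes), compared against both the einsum call and the sum-of-products form:

```python
import numpy as np

I, J = 4, 5
a = np.random.rand(I, J)
b = np.random.rand(I, J)

# the explicit low-level loop from above
out = np.zeros(I)
for i in range(I):
    for j in range(J):
        out[i] += a[i, j] * b[i, j]

print(np.allclose(out, np.einsum('ij,ij->i', a, b)))  # True
print(np.allclose(out, np.sum(a * b, axis=1)))        # True
```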
Say now that you are after something like:
np.einsum('ij,jk,ik->i', a, b, c)
You could do the same operation as
np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))
And what I think einsum does is to run this last code without having to instantiate the huge intermediate array, which certainly makes things faster:
In [29]: a, b, c = np.random.rand(3, 100, 100)
In [30]: %timeit np.einsum('ij,jk,ik->i', a, b, c)
100 loops, best of 3: 2.41 ms per loop
In [31]: %timeit np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))
100 loops, best of 3: 12.3 ms per loop
But if you look at it carefully, getting rid of intermediate storage can be a terrible thing. This is what I think einsum is doing at the low level:
for i in range(I):
    out[i] = 0
    for j in range(J):
        for k in range(K):
            out[i] += a[i, j] * b[j, k] * c[i, k]
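Again as a runnable check (assumed shapes, not from the original answer), the naive triple loop reproduces the three-operand einsum exactly:

```python
import numpy as np

I, J, K = 4, 5, 6
a = np.random.rand(I, J)
b = np.random.rand(J, K)
c = np.random.rand(I, K)

# the naive triple loop: every (i, j, k) product is computed separately
out = np.zeros(I)
for i in range(I):
    for j in range(J):
        for k in range(K):
            out[i] += a[i, j] * b[j, k] * c[i, k]

print(np.allclose(out, np.einsum('ij,jk,ik->i', a, b, c)))  # True
```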
But you are repeating a ton of operations! If you instead did:
for i in range(I):
    out[i] = 0
    for j in range(J):
        temp = 0
        for k in range(K):
            temp += b[j, k] * c[i, k]
        out[i] += a[i, j] * temp
you would be doing I * J * (K-1) fewer multiplications (and I * J extra additions), and save yourself a ton of time. My guess is that einsum is not smart enough to optimize things at this level. In the source code there is a hint that it only optimizes operations with 1 or 2 operands, not 3. In any case, automating this for general inputs seems like anything but simple...
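For what it's worth, the factored loop above is exactly a matrix product followed by the two-operand reduction, since temp[i, j] = (c @ b.T)[i, j]. Newer numpy versions also expose this kind of factoring through einsum's optimize keyword (added in numpy 1.12), which searches for a cheaper contraction order:

```python
import numpy as np

a, b, c = np.random.rand(3, 100, 100)

naive = np.einsum('ij,jk,ik->i', a, b, c)
# hand-factored: contract b and c over k first, then reduce with a
factored = np.einsum('ij,ij->i', a, c @ b.T)
# let einsum find a factoring itself
auto = np.einsum('ij,jk,ik->i', a, b, c, optimize=True)

print(np.allclose(naive, factored), np.allclose(naive, auto))  # True True
```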
einsum has a specialized case for '2 operands, ndim=2'. In this case there are 3 operands and a total of 3 dimensions, so it has to use a general nditer.
While trying to understand how the string input is parsed, I wrote a pure Python einsum simulator, https://github.com/hpaulj/numpy-einsum/blob/master/einsum_py.py
The (stripped down) einsum and sum-of-products functions are:
def myeinsum(subscripts, *ops, **kwargs):
    # drop-in replacement for np.einsum (more or less)
    <parse subscript strings>
    <prepare op_axes>
    x = sum_of_prod(ops, op_axes, **kwargs)
    return x
def sum_of_prod(ops, op_axes, ...):
    ...
    it = np.nditer(ops, flags, op_flags, op_axes)
    it.operands[nop][...] = 0
    it.reset()
    for (x, y, z, w) in it:
        w[...] += x * y * z
    return it.operands[nop]
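The skeleton above omits the nditer setup. A minimal self-contained version (my reconstruction, with the op_axes for 'ik,km,im->i' hard-coded and a preallocated output instead of letting nditer allocate one) looks like:

```python
import numpy as np

def sum_of_prod3(a, b, c, op_axes):
    # Pure-Python element-by-element reduction over an nditer --
    # slow, but it mirrors the generalized loop einsum falls back to.
    out = np.zeros(a.shape[0])
    it = np.nditer([a, b, c, out], flags=['reduce_ok'],
                   op_flags=[['readonly']] * 3 + [['readwrite']],
                   op_axes=op_axes)
    for x, y, z, w in it:
        w[...] += x * y * z
    return out

# iterator dims are (i, k, m); -1 marks an axis the operand lacks
axes = [[0, 1, -1],    # X: 'ik'
        [-1, 0, 1],    # C: 'km'
        [0, -1, 1],    # X: 'im'
        [0, -1, -1]]   # out: 'i'
X = np.random.rand(10, 5)
C = np.random.rand(5, 5)
print(np.allclose(sum_of_prod3(X, C, X, axes),
                  np.einsum('ik,km,im->i', X, C, X)))  # True
```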
Debugging output for myeinsum('ik,km,im->i', X, C, X, debug=True) with (M,K) = (10,5):
{'max_label': 109,
 'min_label': 105,
 'nop': 3,
 'shapes': [(10, 5), (5, 5), (10, 5)],
 ...}
...
iter labels: [105, 107, 109],'ikm'
op_axes [[0, 1, -1], [-1, 0, 1], [0, -1, 1], [0, -1, -1]]
If you write a sum-of-prod function like this in cython you should get something close to the generalized einsum.
With the full (M,K), this simulated einsum is 6-7x slower.
Some timings building on the other answers:
In [84]: timeit np.dot(X,C)
1 loops, best of 3: 781 ms per loop
In [85]: timeit np.einsum('ik,km->im',X,C)
1 loops, best of 3: 1.28 s per loop
In [86]: timeit np.einsum('im,im->i',A,X)
10 loops, best of 3: 163 ms per loop
This 'im,im->i' step is substantially faster than the other. The sum dimension, m, is only 20. I suspect einsum is treating this as a special case.
In [87]: timeit np.einsum('im,im->i',np.dot(X,C),X)
1 loops, best of 3: 950 ms per loop
In [88]: timeit np.einsum('im,im->i',np.einsum('ik,km->im',X,C),X)
1 loops, best of 3: 1.45 s per loop
The times for these composite calculations are simply sums of the corresponding pieces.