I've usually gotten good performance out of numpy's einsum function (and I like its syntax). @Ophion's answer to this question shows that - for the cases tested - einsum can consistently outperform the equivalent "built-in" functions.
You can have the best of both worlds:
def func_dot_einsum(C, X):
    Y = X.dot(C)
    return np.einsum('ij,ij->i', Y, X)
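As a quick sanity check (my own, not part of the original answer), the hybrid above should agree with a single three-operand einsum call, since X.dot(C) contracts over k and the 'ij,ij->i' step then contracts over the remaining shared axis:

```python
import numpy as np

def func_dot_einsum(C, X):
    Y = X.dot(C)
    return np.einsum('ij,ij->i', Y, X)

# small random inputs; shapes chosen arbitrarily for the check
X = np.random.rand(8, 5)
C = np.random.rand(5, 5)
print(np.allclose(func_dot_einsum(C, X),
                  np.einsum('ik,km,im->i', X, C, X)))  # True
```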
On my system:
In [7]: %timeit func_dot(C, X)
10 loops, best of 3: 31.1 ms per loop
In [8]: %timeit func_einsum(C, X)
10 loops, best of 3: 105 ms per loop
In [9]: %timeit func_einsum2(C, X)
10 loops, best of 3: 43.5 ms per loop
In [10]: %timeit func_dot_einsum(C, X)
10 loops, best of 3: 21 ms per loop
When available, np.dot uses BLAS, MKL, or whatever linear algebra library you have installed, so the call to np.dot is almost certainly multithreaded. np.einsum has its own loops, so it doesn't benefit from any of those optimizations, apart from its own use of SIMD to speed things up over a vanilla C implementation.
Then there's the multi-input einsum call that runs much slower... The numpy source for einsum is very complex and I don't fully understand it. So be advised that the following is speculative at best, but here's what I think is going on...
When you run something like np.einsum('ij,ij->i', a, b), the benefit over doing np.sum(a * b, axis=1) comes from avoiding the need to instantiate the intermediate array with all the products and looping over it twice. So at the low level what goes on is something like:
for i in range(I):
    out[i] = 0
    for j in range(J):
        out[i] += a[i, j] * b[i, j]
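To make that concrete, here's a runnable version of the loop (my own check, with assumed small shapes), compared against both the einsum call and the sum-of-products form:

```python
import numpy as np

I, J = 4, 5
a = np.random.rand(I, J)
b = np.random.rand(I, J)

# the explicit low-level loop from above
out = np.zeros(I)
for i in range(I):
    for j in range(J):
        out[i] += a[i, j] * b[i, j]

print(np.allclose(out, np.einsum('ij,ij->i', a, b)))  # True
print(np.allclose(out, np.sum(a * b, axis=1)))        # True
```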
Say now that you are after something like:
np.einsum('ij,jk,ik->i', a, b, c)
You could do the same operation as
np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))
And what I think einsum does is to run this last code without having to instantiate the huge intermediate array, which certainly makes things faster:
In [29]: a, b, c = np.random.rand(3, 100, 100)
In [30]: %timeit np.einsum('ij,jk,ik->i', a, b, c)
100 loops, best of 3: 2.41 ms per loop
In [31]: %timeit np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))
100 loops, best of 3: 12.3 ms per loop
But if you look at it carefully, getting rid of intermediate storage can be a terrible thing. This is what I think einsum is doing at the low level:
for i in range(I):
    out[i] = 0
    for j in range(J):
        for k in range(K):
            out[i] += a[i, j] * b[j, k] * c[i, k]
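Again as a runnable check (assumed shapes, not from the original answer), the naive triple loop reproduces the three-operand einsum exactly:

```python
import numpy as np

I, J, K = 4, 5, 6
a = np.random.rand(I, J)
b = np.random.rand(J, K)
c = np.random.rand(I, K)

# the naive triple loop: every (i, j, k) product is computed separately
out = np.zeros(I)
for i in range(I):
    for j in range(J):
        for k in range(K):
            out[i] += a[i, j] * b[j, k] * c[i, k]

print(np.allclose(out, np.einsum('ij,jk,ik->i', a, b, c)))  # True
```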
But you are repeating a ton of operations! If you instead did:
for i in range(I):
    out[i] = 0
    for j in range(J):
        temp = 0
        for k in range(K):
            temp += b[j, k] * c[i, k]
        out[i] += a[i, j] * temp
you would be doing I * J * (K-1) fewer multiplications (and I * J extra additions), and save yourself a ton of time. My guess is that einsum is not smart enough to optimize things at this level. In the source code there is a hint that it only optimizes operations with 1 or 2 operands, not 3. In any case, automating this for general inputs seems like anything but simple...
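For what it's worth, the factored loop above is exactly a matrix product followed by the two-operand reduction, since temp[i, j] = (c @ b.T)[i, j]. Newer numpy versions also expose this kind of factoring through einsum's optimize keyword (added in numpy 1.12), which searches for a cheaper contraction order:

```python
import numpy as np

a, b, c = np.random.rand(3, 100, 100)

naive = np.einsum('ij,jk,ik->i', a, b, c)
# hand-factored: contract b and c over k first, then reduce with a
factored = np.einsum('ij,ij->i', a, c @ b.T)
# let einsum find a factoring itself
auto = np.einsum('ij,jk,ik->i', a, b, c, optimize=True)

print(np.allclose(naive, factored), np.allclose(naive, auto))  # True True
```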
einsum has a specialized case for '2 operands, ndim=2'. In this case there are 3 operands and a total of 3 dimensions, so it has to use a general nditer.
While trying to understand how the string input is parsed, I wrote a pure Python einsum simulator, https://github.com/hpaulj/numpy-einsum/blob/master/einsum_py.py
The (stripped down) einsum and sum-of-products functions are:
def myeinsum(subscripts, *ops, **kwargs):
    # drop-in replacement for np.einsum (more or less)
    <parse subscript strings>
    <prepare op_axes>
    x = sum_of_prod(ops, op_axes, **kwargs)
    return x
def sum_of_prod(ops, op_axes, ...):
    ...
    it = np.nditer(ops, flags, op_flags, op_axes)
    it.operands[nop][...] = 0
    it.reset()
    for (x, y, z, w) in it:
        w[...] += x * y * z
    return it.operands[nop]
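The skeleton above omits the nditer setup. A minimal self-contained version (my reconstruction, with the op_axes for 'ik,km,im->i' hard-coded and a preallocated output instead of letting nditer allocate one) looks like:

```python
import numpy as np

def sum_of_prod3(a, b, c, op_axes):
    # Pure-Python element-by-element reduction over an nditer --
    # slow, but it mirrors the generalized loop einsum falls back to.
    out = np.zeros(a.shape[0])
    it = np.nditer([a, b, c, out], flags=['reduce_ok'],
                   op_flags=[['readonly']] * 3 + [['readwrite']],
                   op_axes=op_axes)
    for x, y, z, w in it:
        w[...] += x * y * z
    return out

# iterator dims are (i, k, m); -1 marks an axis the operand lacks
axes = [[0, 1, -1],    # X: 'ik'
        [-1, 0, 1],    # C: 'km'
        [0, -1, 1],    # X: 'im'
        [0, -1, -1]]   # out: 'i'
X = np.random.rand(10, 5)
C = np.random.rand(5, 5)
print(np.allclose(sum_of_prod3(X, C, X, axes),
                  np.einsum('ik,km,im->i', X, C, X)))  # True
```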
Debugging output for myeinsum('ik,km,im->i', X, C, X, debug=True) with (M,K) = (10,5):
{'max_label': 109,
 'min_label': 105,
 'nop': 3,
 'shapes': [(10, 5), (5, 5), (10, 5)],
 ...}
...
iter labels: [105, 107, 109],'ikm'
op_axes [[0, 1, -1], [-1, 0, 1], [0, -1, 1], [0, -1, -1]]
If you write a sum-of-prod function like this in cython you should get something close to the generalized einsum.
With the full (M,K), this simulated einsum is 6-7x slower.
Some timings building on the other answers:
In [84]: timeit np.dot(X,C)
1 loops, best of 3: 781 ms per loop
In [85]: timeit np.einsum('ik,km->im',X,C)
1 loops, best of 3: 1.28 s per loop
In [86]: timeit np.einsum('im,im->i',A,X)
10 loops, best of 3: 163 ms per loop
This 'im,im->i' step is substantially faster than the other. The sum dimension, m, is only 20. I suspect einsum is treating this as a special case.
In [87]: timeit np.einsum('im,im->i',np.dot(X,C),X)
1 loops, best of 3: 950 ms per loop
In [88]: timeit np.einsum('im,im->i',np.einsum('ik,km->im',X,C),X)
1 loops, best of 3: 1.45 s per loop
The times for these composite calculations are simply sums of the corresponding pieces.