Question
How can the following for-loop be sped up with NumPy? I guess some fancy indexing trick can be used here, but I have no idea which one (can einsum be used here?).
a = 0
for i in range(len(b)):
    a += numpy.mean(C[d, e, f + b[i]]) * g[i]
Edit:
C is a NumPy 3-D array of shape comparable to (20, 1600, 500).
d, e, f are indices of the points that are "interesting" (d, e, f all have the same length, around 900).
b and g have the same length (around 50). The mean is taken over all the points in C with the indices d, e, f + b[i].
Answer 1:
Timings
Both sessions were initialized with
In [1]: C = np.random.rand(20,1600,500)
In [2]: d = np.random.randint(0, 20, size=900)
In [3]: e = np.random.randint(1600, size=900)
In [4]: f = np.random.randint(400, size=900)
In [5]: b = np.random.randint(100, size=50)
In [6]: g = np.random.rand(50)
Numpy 1.9.0
In [7]: %timeit C[d,e,f + b[:,np.newaxis]].mean(axis=1).dot(g)
1000 loops, best of 3: 942 µs per loop
In [8]: %timeit C[d[:,np.newaxis],e[:, np.newaxis],f[:, np.newaxis] + b].mean(axis=0).dot(g)
1000 loops, best of 3: 762 µs per loop
In [9]: %%timeit
...: a = 0
...: for i in range(len(b)):
   ...:     a += np.mean(C[d, e, f + b[i]]) * g[i]
...:
100 loops, best of 3: 2.25 ms per loop
In [10]: np.__version__
Out[10]: '1.9.0'
In [11]: %%timeit
(C.ravel()[np.ravel_multi_index((d[:,np.newaxis],
e[:,np.newaxis],
f[:,np.newaxis] + b), dims=C.shape)]
.mean(axis=0).dot(g))
....:
1000 loops, best of 3: 940 µs per loop
Numpy 1.8.2
In [7]: %timeit C[d,e,f + b[:,np.newaxis]].mean(axis=1).dot(g)
100 loops, best of 3: 2.81 ms per loop
In [8]: %timeit C[d[:,np.newaxis],e[:, np.newaxis],f[:, np.newaxis] + b].mean(axis=0).dot(g)
100 loops, best of 3: 2.7 ms per loop
In [9]: %%timeit
...: a = 0
...: for i in range(len(b)):
   ...:     a += np.mean(C[d, e, f + b[i]]) * g[i]
...:
100 loops, best of 3: 4.12 ms per loop
In [10]: np.__version__
Out[10]: '1.8.2'
In [51]: %%timeit
(C.ravel()[np.ravel_multi_index((d[:,np.newaxis],
e[:,np.newaxis],
f[:,np.newaxis] + b), dims=C.shape)]
.mean(axis=0).dot(g))
....:
1000 loops, best of 3: 1.4 ms per loop
Description
You can use a coordinate-broadcasting trick to build the full 50x900 array from the start:
In [158]: C[d,e,f + b[:, np.newaxis]].shape
Out[158]: (50, 900)
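To see the trick itself, here is a minimal sketch with assumed toy sizes (much smaller than the question's arrays): b[:, np.newaxis] turns b into a column, so adding f broadcasts into a full (len(b), len(d)) index array, and one fancy-indexing call gathers every (shift, point) combination at once.
import numpy as np

# Assumed toy sizes, just to illustrate the broadcasting.
C = np.random.rand(4, 6, 10)
d = np.random.randint(0, 4, size=5)
e = np.random.randint(0, 6, size=5)
f = np.random.randint(0, 5, size=5)
b = np.random.randint(0, 5, size=3)

idx = f + b[:, np.newaxis]      # (3, 1) + (5,) broadcasts to (3, 5)
print(idx.shape)                # (3, 5)
print(C[d, e, idx].shape)       # (3, 5) -- row i holds C[d, e, f + b[i]]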
From there, mean and dot take you to the result:
In [159]: C[d,e,f + b[:, np.newaxis]].mean(axis=1).dot(g)
Out[159]: 13.582349962518611
In [160]:
a = 0
for i in range(len(b)):
    a += np.mean(C[d, e, f + b[i]]) * g[i]
print(a)
.....:
13.5823499625
And it's about 3.3x faster than the loop version:
In [161]: %timeit C[d,e,f + b[:, np.newaxis]].mean(axis=1).dot(g)
1000 loops, best of 3: 585 µs per loop
In [162]: %%timeit
a = 0
for i in range(len(b)):
    a += np.mean(C[d, e, f + b[i]]) * g[i]
.....:
1000 loops, best of 3: 1.95 ms per loop
The array is of significant size, so you have to factor in the CPU cache. I can't say I know exactly how np.sum traverses the array, but with 2-D arrays there is always a slightly better way (when the next element you pick is adjacent memory-wise) and a slightly worse way (when the next element lies across the stride). Let's see if we can win something more by transposing the array during indexing:
In [196]: C[d[:,np.newaxis], e[:,np.newaxis], f[:,np.newaxis] + b].mean(axis=0).dot(g)
Out[196]: 13.582349962518608
In [197]: %timeit C[d[:,np.newaxis], e[:,np.newaxis], f[:,np.newaxis] + b].mean(axis=0).dot(g)
1000 loops, best of 3: 461 µs per loop
That's 4.2x faster than the loop.
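For reference, the ravel_multi_index variant that shows up in the timing sessions above does the same gather through flat offsets: it converts the broadcast (d, e, f + b) coordinates into indices into C.ravel(), trading the triple fancy index for a single 1-D lookup. A minimal sketch of the equivalence, with assumed toy sizes:
import numpy as np

# Assumed toy setup, just to show the equivalence with plain fancy indexing.
C = np.random.rand(4, 6, 12)
d = np.random.randint(0, 4, size=5)
e = np.random.randint(0, 6, size=5)
f = np.random.randint(0, 6, size=5)
b = np.random.randint(0, 6, size=3)

# Flat offsets into C.ravel(), one per (point, shift) pair -- shape (5, 3).
flat = np.ravel_multi_index(
    (d[:, np.newaxis], e[:, np.newaxis], f[:, np.newaxis] + b),
    dims=C.shape)

direct = C[d[:, np.newaxis], e[:, np.newaxis], f[:, np.newaxis] + b]
print(np.allclose(C.ravel()[flat], direct))   # True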
Answer 2:
You can do the following trick:
C[d, e][:, np.add.outer(f, b)].dot(g).diagonal().mean()
and improve it even more by taking, up front, only the elements that will form the diagonal:
C[d, e][np.arange(len(d))[:, None], np.add.outer(f, b)].dot(g).mean()
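A small sketch of why the second form helps, with assumed toy sizes: np.add.outer(f, b) builds every f[j] + b[i] shift as a (len(f), len(b)) table, and pairing it with np.arange(len(d))[:, None] selects only the row/column pairs that would otherwise land on the diagonal of a much larger intermediate.
import numpy as np

# Assumed toy sizes, standing in for the question's (20, 1600, 500) array.
C = np.random.rand(4, 6, 12)
d = np.random.randint(0, 4, size=5)
e = np.random.randint(0, 6, size=5)
f = np.random.randint(0, 6, size=5)
b = np.random.randint(0, 6, size=3)
g = np.random.rand(3)

rows = np.arange(len(d))[:, None]   # (5, 1)
cols = np.add.outer(f, b)           # (5, 3): cols[j, i] = f[j] + b[i]

# C[d, e] is (5, 12); indexing it with (rows, cols) yields a (5, 3) array
# directly, instead of the (5, 5, 3) intermediate the first form creates.
result = C[d, e][rows, cols].dot(g).mean()

# Reference loop from the question:
ref = sum(np.mean(C[d, e, f + b[i]]) * g[i] for i in range(len(b)))
print(np.allclose(result, ref))     # True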
Answer 3:
It's quite similar to the loopy version:
np.sum(np.mean(C[d,e,f+b[:,None]], axis=1) * g)
And you can combine the summation and multiplication into a dot product:
C[d,e,f+b[:,None]].mean(1).dot(g)
But for the timing that doesn't seem to matter; the indexing operation is by far the most time-consuming operation of all (at least on NumPy 1.8.0). Compared to that, the loop overhead in the original code is insignificant.
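If you want to check that claim on your own machine, a rough sketch (assumed setup matching the sizes in the question; absolute numbers will vary with NumPy version and hardware):
import numpy as np
import timeit

C = np.random.rand(20, 1600, 500)
d = np.random.randint(0, 20, size=900)
e = np.random.randint(0, 1600, size=900)
f = np.random.randint(0, 400, size=900)
b = np.random.randint(0, 100, size=50)
g = np.random.rand(50)

# Time the fancy indexing alone versus the full expression; the difference
# (the mean + dot part) is small compared to the gather itself.
t_index = timeit.timeit(lambda: C[d, e, f + b[:, None]], number=200)
t_full = timeit.timeit(lambda: C[d, e, f + b[:, None]].mean(1).dot(g), number=200)
print(t_index, t_full)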
Answer 4:
The only structural speedup you could hope for would be with something like the following code:
#Initialize a 2-D array: d, e, f index the same ~900 points, so each
#C[d, e, f + b[i]] is a 1-D array of that length (one column per shift b[i])
aggregated = numpy.zeros((len(d), len(b)))
#Populate it with the shifted copies of C
for i in range(len(b)):
    aggregated[:, i] = C[d, e, f + b[i]]
#Compute the mean over the point axis
means = numpy.mean(aggregated, axis=0)
#Multiply term-by-term by g (be careful that means and g have the same size!) and sum
a = numpy.sum(means * g)
However, this does not guarantee that the calculation will be faster; it may even be slower, for the following reasons:
- Populating the staging array has a non-negligible cost, since it copies memory
- b is very small, so you won't win much anyway. If b were bigger this could become interesting, as long as d, e, f get smaller too
In any case you should benchmark both solutions, as in the sketch below. You could also try using something like Cython for the for-loop, but that seems like overkill.
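A minimal way to run that comparison, under the assumption that the arrays match the sizes given in the question (the helper names staged and one_shot are just illustrative):
import numpy as np
import timeit

C = np.random.rand(20, 1600, 500)
d = np.random.randint(0, 20, size=900)
e = np.random.randint(0, 1600, size=900)
f = np.random.randint(0, 400, size=900)
b = np.random.randint(0, 100, size=50)
g = np.random.rand(50)

def staged():
    # Build the (points, shifts) table first, then reduce.
    aggregated = np.zeros((len(d), len(b)))
    for i in range(len(b)):
        aggregated[:, i] = C[d, e, f + b[i]]
    return np.sum(np.mean(aggregated, axis=0) * g)

def one_shot():
    # The broadcast-indexing one-liner from the other answers.
    return C[d, e, f + b[:, np.newaxis]].mean(axis=1).dot(g)

print(np.allclose(staged(), one_shot()))   # True
print(timeit.timeit(staged, number=100))
print(timeit.timeit(one_shot, number=100))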
Source: https://stackoverflow.com/questions/26403884/speed-up-for-loop-with-numpy