I have an array of vectors and compute the norm of their diffs vs the first one. When using Python broadcasting, the calculation is significantly slower than doing it via a loop.
With the vectorized (broadcasting) approach you clearly get bad cache behaviour. The calculation itself is also slow because it doesn't exploit modern SIMD instructions (AVX2, FMA). Fortunately it isn't really complicated to overcome these issues.
Example
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def norm_loop_improved(M, v):
    n = M.shape[0]
    d = np.empty(n, dtype=M.dtype)

    # making the arrays contiguous enables SIMD-vectorization
    # in case they are not aligned
    M = np.ascontiguousarray(M)
    v = np.ascontiguousarray(v)

    # one parallel chunk of rows per thread; the inner loop is SIMD-vectorized
    for i in nb.prange(n):
        dT = 0.
        for j in range(v.shape[0]):
            dT += (M[i, j] - v[j]) * (M[i, j] - v[j])
        d[i] = np.sqrt(dT)
    return d
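The baselines norm_loop and norm_einsum in the timings below come from the question and are not reproduced in this answer. As a rough sketch of what they might look like (assuming v is the first row of M and that the plain-NumPy version relies on broadcasting), together with a usage example:

    # Hypothetical reference implementations -- the exact code is in the
    # question and may differ; shown only so the timings below have context.
    def norm_loop(M, v):
        # one np.linalg.norm call per row (plain Python loop)
        n = M.shape[0]
        d = np.empty(n, dtype=M.dtype)
        for i in range(n):
            d[i] = np.linalg.norm(M[i] - v)
        return d

    def norm_einsum(M, v):
        # broadcasting + einsum for the row-wise squared sums, sqrt at the end
        diff = M - v
        return np.sqrt(np.einsum('ij,ij->i', diff, diff))

    # usage: compare all rows of M against the first one
    M = np.random.random_sample((1000, 1000))
    d = norm_loop_improved(M, M[0])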
Performance
M = np.random.random_sample((1000, 1000))
norm_loop_improved: 0.11 ms**, 0.28 ms
norm_loop: 6.56 ms
norm_einsum: 3.84 ms
M = np.random.random_sample((10000, 10000))
norm_loop_improved: 34 ms
norm_loop: 223 ms
norm_einsum: 379 ms
** Be careful when measuring performance
The first result (0.11 ms) comes from calling the function repeatedly with the same data. That would require about 77 GB/s of read throughput from RAM, which is far more than my DDR3 dual-channel RAM is capable of, so the data is clearly being served from cache. Since calling a function with the same input parameters over and over isn't realistic at all, the measurement has to be modified.
To avoid this issue, call the function on at least two different arrays (8 MB of L3 cache, 8 MB of data per array) and then divide the result by two, so that each timed call works on data that is no longer in cache.
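A minimal sketch of that measurement scheme (not the exact benchmark used above; array size, repetition count and the use of time.perf_counter instead of %timeit are my own choices):

    import time

    # two independent 8 MB arrays, so the second call cannot reuse
    # cache lines loaded by the first one
    M1 = np.random.random_sample((1000, 1000))
    M2 = np.random.random_sample((1000, 1000))

    norm_loop_improved(M1, M1[0])          # warm-up: triggers Numba compilation

    reps = 100
    start = time.perf_counter()
    for _ in range(reps):
        norm_loop_improved(M1, M1[0])      # alternate between the two arrays ...
        norm_loop_improved(M2, M2[0])
    elapsed = time.perf_counter() - start

    # ... and divide by the two calls per iteration to get a per-call time
    print("per call: %.3f ms" % (elapsed / (2 * reps) * 1e3))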
The relative performance of these methods also depends on the array size (have a look at the einsum results).