Cython prange slower for 4 threads than with range

1) An important feature of prange (like any other parallel for loop) is that it enables out-of-order execution, meaning the loop iterations can run in an arbitrary order. Out-of-order execution really pays off when there is no data dependency between iterations.

I do not know the internals of Cython, but I reckon that if bounds checking is not turned off, the loop cannot be executed in an arbitrary order, since the next iteration depends on whether or not the array went out of bounds in the current iteration, so the problem becomes almost serial: threads have to wait for each other's results. This is one of the issues with your code. In fact, Cython gives me the following warning:

warning: bla.pyx:42:16: Use boundscheck(False) for faster access

So add the following

from cython import boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def c_array_f(double[:] X):
   # Rest of your code

@boundscheck(False)
@wraparound(False)
def c_array_f_multi(double[:] X):
   # Rest of your code
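For context, here is a minimal sketch of what the two functions could then look like. The loop body (the if/else around Y[i] = c_exp(X[i]), with exp cimported from libc.math) and the 0.5 threshold are reconstructed from the question and from the discussion further down, so treat it as illustrative rather than your exact code:

import numpy as np
from cython import boundscheck, wraparound
from cython.parallel import prange
from libc.math cimport exp as c_exp   # assumed: C-level exp used in the loop body

@boundscheck(False)
@wraparound(False)
def c_array_f(double[:] X):
    cdef int N = X.shape[0]
    cdef double[:] Y = np.zeros(N)
    cdef int i
    for i in range(N):
        if X[i] > 0.5:        # assumed threshold
            Y[i] = c_exp(X[i])
        else:
            Y[i] = 0.0
    return Y

@boundscheck(False)
@wraparound(False)
def c_array_f_multi(double[:] X):
    cdef int N = X.shape[0]
    cdef double[:] Y = np.zeros(N)
    cdef int i
    # prange releases the GIL for the loop; num_threads=4 matches the timings below
    for i in prange(N, nogil=True, num_threads=4):
        if X[i] > 0.5:        # assumed threshold
            Y[i] = c_exp(X[i])
        else:
            Y[i] = 0.0
    return Y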

Let's now time them with your data X = -1 + 2*np.random.rand(10000000).

With Bounds Checking:

In [2]:%timeit array_f(X)
10 loops, best of 3: 189 ms per loop
In [4]:%timeit c_array_f(X)
10 loops, best of 3: 93.6 ms per loop
In [5]:%timeit c_array_f_multi(X)
10 loops, best of 3: 103 ms per loop

Without Bounds Checking:

In [9]:%timeit c_array_f(X)
10 loops, best of 3: 84.2 ms per loop
In [10]:%timeit c_array_f_multi(X)
10 loops, best of 3: 42.3 ms per loop

These results are with num_threads=4 (I have 4 logical cores), and the speed-up is around 2x. Before going further, we can shave off a few more milliseconds by declaring our arrays to be contiguous, i.e. declaring X and Y with double[::1].
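As a usage note (an assumption on my side about how the functions are called), with double[::1] contiguity becomes part of the signature, so a non-contiguous view can no longer be passed in:

X = -1 + 2*np.random.rand(10000000)   # freshly allocated, hence C-contiguous
Y = c_array_f_multi(X)                # fine
Y = c_array_f_multi(X[::2])           # ValueError: ndarray is not C-contiguous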

Contiguous Arrays:

In [14]:%timeit c_array_f(X)
10 loops, best of 3: 81.8 ms per loop
In [15]:%timeit c_array_f_multi(X)
10 loops, best of 3: 39.3 ms per loop

2) Even more important is job scheduling, and this is what your benchmark suffers from. By default the chunk sizes are determined at compile time, i.e. schedule=static; however, it is very likely that the environment variables (for instance OMP_SCHEDULE) and the work-load of the two machines (yours and the one from the blog post) are different, and that the jobs get scheduled at runtime, dynamically, guidedly and so on. Let's experiment by replacing your prange with:

for i in prange(N, schedule='static'):
    # static scheduling... 
for i in prange(N, schedule='dynamic'):
    # dynamic scheduling... 
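If you want to keep experimenting without recompiling, prange also accepts schedule='runtime', which defers the decision to the OpenMP runtime and hence to OMP_SCHEDULE; a small sketch, reusing the nogil/num_threads arguments from my version above:

for i in prange(N, nogil=True, num_threads=4, schedule='runtime'):
    # chunking now follows OMP_SCHEDULE, e.g. OMP_SCHEDULE="static,65536"
    # or OMP_SCHEDULE="dynamic,4096" set in the shell before the benchmark
    Y[i] = c_exp(X[i])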

Let's time them now (only the multi-threaded code):

Scheduling Effect:

In [23]:%timeit c_array_f_multi(X) # static
10 loops, best of 3: 39.5 ms per loop
In [28]:%timeit c_array_f_multi(X) # dynamic
1 loops, best of 3: 319 ms per loop

You might be able to replicate this depending on the work-load on your own machine. As a side note, since you are just trying to measure the performance of parallel vs. serial code in a micro-benchmark and not in real code, I suggest you get rid of the if-else condition, i.e. only keep Y[i] = c_exp(X[i]) within the for loop (see the sketch after this paragraph). This is because if-else statements also adversely affect branch prediction and out-of-order execution in parallel code. On my machine I get almost a 2.7x speed-up over the serial code with this change.
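A branch-free version of the multi-threaded benchmark loop (again only a sketch, reusing the assumed names from above) reduces to:

@boundscheck(False)
@wraparound(False)
def c_array_f_multi(double[::1] X):
    cdef int N = X.shape[0]
    cdef double[::1] Y = np.zeros(N)
    cdef int i
    # every iteration now does the same amount of work, which also keeps
    # static chunks well balanced across the 4 threads
    for i in prange(N, nogil=True, num_threads=4, schedule='static'):
        Y[i] = c_exp(X[i])
    return Y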
