Cython prange slower for 4 threads than with range

1) An important feature of prange (like any other parallel for loop) is that it enables out-of-order execution, meaning the loop iterations can run in an arbitrary order. Out-of-order execution really pays off when there is no data dependency between iterations.

I do not know the internals of Cython, but I reckon that if bounds checking is not turned off, the loop cannot be executed in an arbitrary order, since the next iteration depends on whether or not the array went out of bounds in the current iteration, so the problem becomes almost serial: threads have to wait for each other's results. This is one of the issues with your code. In fact, Cython gives me the following warning:

warning: bla.pyx:42:16: Use boundscheck(False) for faster access

So add the following

from cython import boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def c_array_f(double[:] X):
   # Rest of your code

@boundscheck(False)
@wraparound(False)
def c_array_f_multi(double[:] X):
   # Rest of your code
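For context, here is a minimal sketch of what the two functions could then look like. The loop body (the if/else around Y[i] = c_exp(X[i]), with exp cimported from libc.math) and the 0.5 threshold are reconstructed from the question and from the discussion further down, so treat it as illustrative rather than your exact code:

import numpy as np
from cython import boundscheck, wraparound
from cython.parallel import prange
from libc.math cimport exp as c_exp   # assumed: C-level exp used in the loop body

@boundscheck(False)
@wraparound(False)
def c_array_f(double[:] X):
    cdef int N = X.shape[0]
    cdef double[:] Y = np.zeros(N)
    cdef int i
    for i in range(N):
        if X[i] > 0.5:        # assumed threshold
            Y[i] = c_exp(X[i])
        else:
            Y[i] = 0.0
    return Y

@boundscheck(False)
@wraparound(False)
def c_array_f_multi(double[:] X):
    cdef int N = X.shape[0]
    cdef double[:] Y = np.zeros(N)
    cdef int i
    # prange releases the GIL for the loop; num_threads=4 matches the timings below
    for i in prange(N, nogil=True, num_threads=4):
        if X[i] > 0.5:        # assumed threshold
            Y[i] = c_exp(X[i])
        else:
            Y[i] = 0.0
    return Y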

Let's now time them with your data X = -1 + 2*np.random.rand(10000000).

With Bounds Checking:

In [2]:%timeit array_f(X)
10 loops, best of 3: 189 ms per loop
In [4]:%timeit c_array_f(X)
10 loops, best of 3: 93.6 ms per loop
In [5]:%timeit c_array_f_multi(X)
10 loops, best of 3: 103 ms per loop

Without Bounds Checking:

In [9]:%timeit c_array_f(X)
10 loops, best of 3: 84.2 ms per loop
In [10]:%timeit c_array_f_multi(X)
10 loops, best of 3: 42.3 ms per loop

These results are with num_threads=4 (I have 4 logical cores), and the speed-up is around 2x. Before going further, we can shave off a few more milliseconds by declaring our arrays to be contiguous, i.e. declaring X and Y with double[::1].
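As a usage note (an assumption on my side about how the functions are called), with double[::1] contiguity becomes part of the signature, so a non-contiguous view can no longer be passed in:

X = -1 + 2*np.random.rand(10000000)   # freshly allocated, hence C-contiguous
Y = c_array_f_multi(X)                # fine
Y = c_array_f_multi(X[::2])           # ValueError: ndarray is not C-contiguous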

Contiguous Arrays:

In [14]:%timeit c_array_f(X)
10 loops, best of 3: 81.8 ms per loop
In [15]:%timeit c_array_f_multi(X)
10 loops, best of 3: 39.3 ms per loop

2) Even more important is job scheduling, and this is what your benchmark suffers from. By default the chunk sizes are determined at compile time, i.e. schedule=static; however, it is very likely that the environment variables (for instance OMP_SCHEDULE) and the work-load of the two machines (yours and the one from the blog post) are different, and that the jobs get scheduled at runtime, dynamically, guidedly and so on. Let's experiment by replacing your prange with:

for i in prange(N, schedule='static'):
    # static scheduling... 
for i in prange(N, schedule='dynamic'):
    # dynamic scheduling... 
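If you want to keep experimenting without recompiling, prange also accepts schedule='runtime', which defers the decision to the OpenMP runtime and hence to OMP_SCHEDULE; a small sketch, reusing the nogil/num_threads arguments from my version above:

for i in prange(N, nogil=True, num_threads=4, schedule='runtime'):
    # chunking now follows OMP_SCHEDULE, e.g. OMP_SCHEDULE="static,65536"
    # or OMP_SCHEDULE="dynamic,4096" set in the shell before the benchmark
    Y[i] = c_exp(X[i])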

Let's time them now (only the multi-threaded code):

Scheduling Effect:

In [23]:%timeit c_array_f_multi(X) # static
10 loops, best of 3: 39.5 ms per loop
In [28]:%timeit c_array_f_multi(X) # dynamic
1 loops, best of 3: 319 ms per loop

You might be able to replicate this depending on the work-load on your own machine. As a side note, since you are just trying to measure the performance of parallel vs. serial code in a micro-benchmark and not in real code, I suggest you get rid of the if-else condition, i.e. only keep Y[i] = c_exp(X[i]) within the for loop (see the sketch after this paragraph). This is because if-else statements also adversely affect branch prediction and out-of-order execution in parallel code. On my machine I get almost a 2.7x speed-up over the serial code with this change.
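A branch-free version of the multi-threaded benchmark loop (again only a sketch, reusing the assumed names from above) reduces to:

@boundscheck(False)
@wraparound(False)
def c_array_f_multi(double[::1] X):
    cdef int N = X.shape[0]
    cdef double[::1] Y = np.zeros(N)
    cdef int i
    # every iteration now does the same amount of work, which also keeps
    # static chunks well balanced across the 4 threads
    for i in prange(N, nogil=True, num_threads=4, schedule='static'):
        Y[i] = c_exp(X[i])
    return Y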
