Question
After compiling the following Cython code, I get an HTML annotation file in which these lines are highlighted in yellow:
import numpy as np
cimport numpy as np

cpdef my_function(np.ndarray[np.double_t, ndim=1] array_a,
                  np.ndarray[np.double_t, ndim=1] array_b,
                  int n_rows,
                  int n_columns):
    array_a[0:-1:n_columns] = 0
    array_a[n_columns - 1:n_rows * n_columns:n_columns] = 0
    array_a[0:n_columns] = 0
    array_a[n_columns * (n_rows - 1):n_rows * n_columns] = 0
    array_b[array_a == 3] = 0
    return array_a, array_b
My question is: why are those operations in my function still yellow? Does this mean that the code is still not as fast as it could be with Cython?
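(For context: such an annotation file is typically produced with Cython's annotate option, e.g. cython -a my_module.pyx on the command line, where my_module.pyx is just a placeholder name, or with the annotate flag of the IPython cython magic, as in this minimal sketch:)

%%cython --annotate
# The annotated source is rendered below the cell; yellow shading marks
# lines that interact with the Python C-API.
cimport numpy as np

def trivial(np.ndarray[np.double_t] a):
    a[0] = 0.0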
Answer 1:
As you already know, yellow lines mean that some interaction with Python happens, i.e. Python functionality rather than raw C functionality is used. You can look into the produced C code to see what happens and whether it can or should be fixed/avoided.
Not every interaction with Python means a (measurable) slowdown.
Let's take a look at this simplified function:
%%cython
cimport numpy as np
def use_slices(np.ndarray[np.double_t] a):
    a[0:len(a)] = 0.0
When we look into the produced code we see (I kept only the important parts):
__pyx_t_1 = PyObject_Length(((PyObject *)__pyx_v_a));
__pyx_t_2 = PyInt_FromSsize_t(__pyx_t_1);
__pyx_t_3 = PySlice_New(__pyx_int_0, __pyx_t_2, Py_None);
PyObject_SetItem(((PyObject *)__pyx_v_a)
So basically we get a new slice (which is a numpy array) and then use numpy's functionality (PyObject_SetItem) to set all elements to 0.0, which is C code under the hood.
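As a rough, purely illustrative Python-level equivalent of those generated C-API calls (the variable names here are mine):

import numpy as np

a = np.ones(5)
s = slice(0, len(a), None)   # what PySlice_New builds
a.__setitem__(s, 0.0)        # what PyObject_SetItem dispatches to: numpy's C set-item routine
print(a)                     # [0. 0. 0. 0. 0.]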
Let's take a look at a version with a hand-written for loop:
%%cython
cimport numpy as np
def use_for(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i] = 0.0
It still uses PyObject_Length (because of the len call) and bounds checking, but otherwise it is C code. When we compare times:
>>> import numpy as np
>>> a=np.ones((500,))
>>> %timeit use_slices(a)
100000 loops, best of 3: 1.85 µs per loop
>>> %timeit use_for(a)
1000000 loops, best of 3: 1.42 µs per loop
>>> b=np.ones((250000,))
>>> %timeit use_slices(b)
10000 loops, best of 3: 104 µs per loop
>>> %timeit use_for(b)
1000 loops, best of 3: 364 µs per loop
For small sizes you can see the additional overhead of creating a slice, but the additional checks in the for-loop version mean it has more overhead in the long run.
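As an aside, a minimal sketch of how such timings could be reproduced outside IPython, assuming use_slices and use_for were compiled into an importable module (cy_zero is a hypothetical name):

import timeit
import numpy as np
from cy_zero import use_slices, use_for   # hypothetical module name

a = np.ones((500,))
b = np.ones((250000,))

for f in (use_slices, use_for):
    for label, arr in (("small", a), ("large", b)):
        per_call = timeit.timeit(lambda: f(arr), number=10000) / 10000
        print(f"{f.__name__:10s} {label}: {per_call * 1e6:.2f} µs per call")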
Let's disable these checks:
%%cython
cimport cython
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i] = 0.0
In the produced HTML we can see that the access a[i] gets as simple as it gets:
__pyx_t_3 = __pyx_v_i;
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_double_t *, __pyx_pybuffernd_a.rcbuffer->pybuffer.buf, __pyx_t_3, __pyx_pybuffernd_a.diminfo[0].strides) = 0.0;
}
__Pyx_BufPtrStrided1d(type, buf, i0, s0) is defined as (type)((char*)buf + i0 * s0).
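To illustrate that formula from the Python side (purely demonstrative, not generated code): the byte offset i0 * s0 added to the buffer pointer is exactly what numpy's strides describe.

import numpy as np

a = np.arange(6, dtype=np.double)
i0 = 3
s0 = a.strides[0]   # bytes between consecutive elements, the s0 of the macro
# read the value at byte offset i0*s0, i.e. *(double*)((char*)buf + i0*s0)
val = np.frombuffer(a.data, dtype=np.double, count=1, offset=i0 * s0)[0]
print(val == a[i0])   # True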
And now:
>>> %timeit use_for_no_checks(a)
1000000 loops, best of 3: 1.17 µs per loop
>>> %timeit use_for_no_checks(b)
1000 loops, best of 3: 246 µs per loop
We can improve it further by releasing the GIL in the for loop:
%%cython
cimport cython
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks_no_gil(np.ndarray[np.double_t] a):
    cdef int i
    cdef int n = len(a)
    with nogil:
        for i in range(n):
            a[i] = 0.0
and now:
>>> %timeit use_for_no_checks_no_gil(a)
1000000 loops, best of 3: 1.07 µs per loop
>>> %timeit use_for_no_checks_no_gil(b)
10000 loops, best of 3: 166 µs per loop
So it is somewhat faster, but you still cannot beat numpy for larger arrays.
In my opinion, there are two things to take from it:
- Cython will not transform slices into access through a for-loop, and thus Python functionality must be used.
- There is a small overhead for calling into the numpy functionality, but most of the work is done in numpy's C code anyway, and that part cannot be sped up through Cython (see the sketch after this list).
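To make the second point concrete, here is a small sketch (mine, not taken from the measurements above): the compiled slice version and a plain numpy slice assignment end up in the same numpy C routine, so their timings should be close.

import numpy as np

def numpy_only(arr):
    arr[0:len(arr)] = 0.0   # same PyObject_SetItem path, just entered from Python

b = np.ones((250000,))
# In IPython one could compare, for example:
#   %timeit use_slices(b)
#   %timeit numpy_only(b)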
One last try, using the memset function:
%%cython
from libc.string cimport memset
cimport numpy as np
def use_memset(np.ndarray[np.double_t] a):
    memset(&a[0], 0, len(a) * sizeof(np.double_t))
We get:
>>> %timeit use_memset(a)
1000000 loops, best of 3: 821 ns per loop
>>> %timeit use_memset(b)
10000 loops, best of 3: 102 µs per loop
It is also as fast as the numpy-code for large arrays.
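One caveat worth adding (my note, not from the original answer): memset writes raw bytes, so this shortcut relies on the buffer being contiguous and on the fact that the all-zero byte pattern represents 0.0 for IEEE-754 doubles.

import numpy as np

b = np.ones((250000,))
print(b.flags['C_CONTIGUOUS'])      # True  -> safe for the memset version
view = b[::2]
print(view.flags['C_CONTIGUOUS'])   # False -> memset would clear a contiguous
                                    # block of the parent array instead of the
                                    # strided elements we actually meant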
As DavidW suggested, one could try to use memory-views:
%%cython
cimport numpy as np
def use_slices_memview(double[::1] a):
    a[0:len(a)] = 0.0
leads to slightly faster code for small arrays, but similarly fast code for large arrays (compared to the numpy slices):
>>> %timeit use_slices_memview(a)
1000000 loops, best of 3: 1.52 µs per loop
>>> %timeit use_slices_memview(b)
10000 loops, best of 3: 105 µs per loop
That means the memory-view slices have less overhead than the numpy slices. Here is the produced code:
__pyx_t_1 = __Pyx_MemoryView_Len(__pyx_v_a);
__pyx_t_2.data = __pyx_v_a.data;
__pyx_t_2.memview = __pyx_v_a.memview;
__PYX_INC_MEMVIEW(&__pyx_t_2, 0);
__pyx_t_3 = -1;
if (unlikely(__pyx_memoryview_slice_memviewslice(
&__pyx_t_2,
__pyx_v_a.shape[0], __pyx_v_a.strides[0], __pyx_v_a.suboffsets[0],
0,
0,
&__pyx_t_3,
0,
__pyx_t_1,
0,
1,
1,
0,
1) < 0))
{
__PYX_ERR(0, 27, __pyx_L1_error)
}
{
double __pyx_temp_scalar = 0.0;
{
Py_ssize_t __pyx_temp_extent = __pyx_t_2.shape[0];
Py_ssize_t __pyx_temp_idx;
double *__pyx_temp_pointer = (double *) __pyx_t_2.data;
for (__pyx_temp_idx = 0; __pyx_temp_idx < __pyx_temp_extent; __pyx_temp_idx++) {
*((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
__pyx_temp_pointer += 1;
}
}
}
__PYX_XDEC_MEMVIEW(&__pyx_t_2, 1);
__pyx_t_2.memview = NULL;
__pyx_t_2.data = NULL;
I think the most important part: this code doesn't create an additional temporary object - it reuses the existing memory view for the slice.
My compiler produces (at least on my machine) slightly faster code if memory views are used. I'm not sure whether that is worth further investigation. At first sight, the difference in every iteration step is:
# created code for memview-slices:
*((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
__pyx_temp_pointer += 1;
# created code for memview-for-loop:
__pyx_v_i = __pyx_t_3;
__pyx_t_4 = __pyx_v_i;
*((double *) ( /* dim=0 */ ((char *) (((double *) data) + __pyx_t_4)) )) = 0.0;
I would expect different compilers to handle this code differently well. But clearly, the first version is easier to optimize.
As Behzad Jamali pointed out, there is a difference between double[:] a and double[::1] a. The second version, using slices, is about 20% faster on my machine. The difference is that for the double[::1] version it is known at compile time that the memory accesses will be contiguous, and this can be used for optimization. In the version with double[:] we don't know anything about the stride until runtime.
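For completeness, a small sketch contrasting the two declarations (the function names zero_strided and zero_contig are mine):

%%cython
def zero_strided(double[:] a):     # stride only known at runtime
    a[0:len(a)] = 0.0

def zero_contig(double[::1] a):    # contiguity guaranteed at compile time
    a[0:len(a)] = 0.0

Calling zero_contig with a non-contiguous array (e.g. b[::2]) raises a ValueError when the memory view is acquired, while zero_strided accepts it and falls back to strided access.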
Source: https://stackoverflow.com/questions/48272969/cython-understanding-what-the-html-annotation-file-has-to-say