Cython: understanding what the HTML annotation file has to say?


After compiling the following Cython code, I get an HTML annotation file that looks like this:

import numpy as np
cimport numpy as np

cpdef my_function(np.ndarray[np.double_t] arr):
    ...

[screenshot of the annotation HTML: the source lines that interact with Python are shaded yellow]
1 Answer

  • 2021-01-22 14:26

    As you already know, yellow lines mean that some interaction with Python happens, i.e. Python functionality rather than raw C functionality is used. You can look into the produced code to see what happens, and whether it can or should be fixed/avoided.

    Not every interaction with Python means a (measurable) slowdown, though.
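
    Since the question is about the annotation file itself, a short reminder of how it is produced (a minimal sketch; the function name trivial is mine): in a notebook, the --annotate (short: -a) flag of the %%cython magic renders the annotated view inline, and on the command line cython -a module.pyx writes module.html next to the source.

    %%cython --annotate
    # the argument is untyped, so the addition goes through generic
    # Python object protocol and this line shows up yellow
    def trivial(x):
        return x + 1

    Clicking a yellow line in the annotation expands the C code that Cython generated for it.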

    Let's take a look at this simplified function:

    %%cython
    cimport numpy as np
    def use_slices(np.ndarray[np.double_t] a):
        a[0:len(a)]=0.0
    

    When we look into the produced code we see (I kept only the important parts):

      __pyx_t_1 = PyObject_Length(((PyObject *)__pyx_v_a)); 
      __pyx_t_2 = PyInt_FromSsize_t(__pyx_t_1); 
      __pyx_t_3 = PySlice_New(__pyx_int_0, __pyx_t_2, Py_None); 
      PyObject_SetItem(((PyObject *)__pyx_v_a), __pyx_t_3, __pyx_float_0_0); /* a[slice] = 0.0 */
    

    So basically we create a new Python slice object and then use numpy's functionality (PyObject_SetItem, which ends up in numpy's __setitem__) to set all elements to 0.0; that work is C code under the hood.
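
    Expressed in plain Python, the generated code above does roughly the following (my paraphrase, not the literal generated code):

    import numpy as np

    a = np.ones(5)
    # a[0:len(a)] = 0.0 boils down to:
    a.__setitem__(slice(0, len(a)), 0.0)  # PySlice_New + PyObject_SetItem
    print(a)  # [0. 0. 0. 0. 0.]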

    Let's take a look at a version with a hand-written for loop:

    %%cython
    cimport numpy as np
    def use_for(np.ndarray[np.double_t] a):
        cdef int i
        for i in range(len(a)):
            a[i]=0.0
    

    It still uses PyObject_Length (because of the len call) and bounds-checking, but otherwise it is C code. When we compare times:

    >>> import numpy as np
    >>> a=np.ones((500,))
    >>> %timeit use_slices(a)
    100000 loops, best of 3: 1.85 µs per loop
    >>> %timeit use_for(a)
    1000000 loops, best of 3: 1.42 µs per loop
    
    >>> b=np.ones((250000,))
    >>> %timeit use_slices(b)
    10000 loops, best of 3: 104 µs per loop
    >>> %timeit use_for(b)
    1000 loops, best of 3: 364 µs per loop
    

    You can see the additional overhead of creating a slice for small sizes, but the additional checks in the for-version mean it has more overhead in the long run.

    Let's disable these checks:

    %%cython
    cimport cython
    cimport numpy as np
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def use_for_no_checks(np.ndarray[np.double_t] a):
        cdef int i
        for i in range(len(a)):
            a[i]=0.0
    

    In the produced HTML we can see that the access a[i] becomes as simple as it gets:

     __pyx_t_3 = __pyx_v_i;
        *__Pyx_BufPtrStrided1d(__pyx_t_5numpy_double_t *, __pyx_pybuffernd_a.rcbuffer->pybuffer.buf, __pyx_t_3, __pyx_pybuffernd_a.diminfo[0].strides) = 0.0;
      }
    

    __Pyx_BufPtrStrided1d(type, buf, i0, s0) is defined as (type)((char*)buf + i0 * s0), i.e. plain pointer arithmetic into the buffer. And now:

    >>> %timeit use_for_no_checks(a)
    1000000 loops, best of 3: 1.17 µs per loop
    >>> %timeit use_for_no_checks(b)
    1000 loops, best of 3: 246 µs per loop
    

    We can improve it further by releasing the GIL in the for-loop:

    %%cython
    cimport cython
    cimport numpy as np
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def use_for_no_checks_no_gil(np.ndarray[np.double_t] a):
        cdef int i
        cdef int n=len(a)
        with nogil:
          for i in range(n):
            a[i]=0.0
    

    and now:

    >>> %timeit use_for_no_checks_no_gil(a)
    1000000 loops, best of 3: 1.07 µs per loop
    >>> %timeit use_for_no_checks_no_gil(b)
    10000 loops, best of 3: 166 µs per loop
    

    So it is somewhat faster, but it still cannot beat numpy for larger arrays.

    In my opinion, there are two things to take away from this:

    1. Cython does not transform slice assignments on an np.ndarray into a plain for-loop, so Python functionality must be used.
    2. There is a small overhead for calling into the numpy functionality, but most of the work is then done in numpy's own C code, and that part cannot be sped up through Cython.

    One last try, using the memset function:

    %%cython
    from libc.string cimport memset
    cimport numpy as np
    def use_memset(np.ndarray[np.double_t] a):
        memset(&a[0], 0, len(a)*sizeof(np.double_t))
    

    We get:

    >>> %timeit use_memset(a)
    1000000 loops, best of 3: 821 ns per loop
    >>> %timeit use_memset(b)
    10000 loops, best of 3: 102 µs per loop
    

    It is also as fast as the numpy-code for large arrays.
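
    One caveat worth adding (my addition, not part of the timings above): memset writes one raw block of bytes, so it is only correct for C-contiguous data. Declaring the buffer with mode="c" makes Cython enforce that when the function is called; the name use_memset_safe is just for illustration:

    %%cython
    from libc.string cimport memset
    cimport numpy as np
    def use_memset_safe(np.ndarray[np.double_t, ndim=1, mode="c"] a):
        # mode="c" rejects non-contiguous input such as np.ones(10)[::2]
        if len(a) > 0:
            memset(&a[0], 0, len(a) * sizeof(np.double_t))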


    As DavidW suggested, one could try to use memory-views:

    %%cython
    cimport numpy as np
    def use_slices_memview(double[::1] a):
        a[0:len(a)]=0.0
    

    This leads to slightly faster code for small arrays and similarly fast code for large arrays (compared to numpy slices):

    >>> %timeit use_slices_memview(a)
    1000000 loops, best of 3: 1.52 µs per loop
    
    >>> %timeit use_slices_memview(b)
    10000 loops, best of 3: 105 µs per loop
    

    That means that the memoryview slices have less overhead than the numpy slices. Here is the produced code:

     __pyx_t_1 = __Pyx_MemoryView_Len(__pyx_v_a); 
      __pyx_t_2.data = __pyx_v_a.data;
      __pyx_t_2.memview = __pyx_v_a.memview;
      __PYX_INC_MEMVIEW(&__pyx_t_2, 0);
      __pyx_t_3 = -1;
      if (unlikely(__pyx_memoryview_slice_memviewslice(
        &__pyx_t_2,
        __pyx_v_a.shape[0], __pyx_v_a.strides[0], __pyx_v_a.suboffsets[0],
        0,
        0,
        &__pyx_t_3,
        0,
        __pyx_t_1,
        0,
        1,
        1,
        0,
        1) < 0))
    {
        __PYX_ERR(0, 27, __pyx_L1_error)
    }
    
    {
          double __pyx_temp_scalar = 0.0;
          {
              Py_ssize_t __pyx_temp_extent = __pyx_t_2.shape[0];
              Py_ssize_t __pyx_temp_idx;
              double *__pyx_temp_pointer = (double *) __pyx_t_2.data;
              for (__pyx_temp_idx = 0; __pyx_temp_idx < __pyx_temp_extent; __pyx_temp_idx++) {
                *((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
                __pyx_temp_pointer += 1;
              }
          }
      }
      __PYX_XDEC_MEMVIEW(&__pyx_t_2, 1);
      __pyx_t_2.memview = NULL;
      __pyx_t_2.data = NULL;
    

    I think this is the most important part: this code doesn't create an additional temporary object - it reuses the existing memory view for the slice.

    My compiler produces (at least on my machine) slightly faster code if memory views are used. I'm not sure whether it is worth investigating further. At first sight, the difference in every iteration step is:

    # created code for memview-slices:
    *((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
     __pyx_temp_pointer += 1;
    
    # created code for memview-for-loop:
     __pyx_v_i = __pyx_t_3;
     __pyx_t_4 = __pyx_v_i;
     *((double *) ( /* dim=0 */ ((char *) (((double *) data) + __pyx_t_4)) )) = 0.0;
    

    I would expect different compilers to handle this code differently well. But clearly, the first version is easier to optimize.


    As Behzad Jamali pointed out, there is a difference between double[:] a and double[::1] a. The double[::1] version of the slice assignment is about 20% faster on my machine. The difference is that for the double[::1] version it is known at compile time that the memory accesses will be consecutive, and this can be used for optimization. In the version with double[:] we don't know anything about the stride until runtime.
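
    For illustration, here is a minimal side-by-side sketch of the two signatures (the function names are mine):

    %%cython
    def zero_strided(double[:] a):       # stride known only at runtime
        a[:] = 0.0

    def zero_contiguous(double[::1] a):  # unit stride guaranteed at compile time
        a[:] = 0.0

    Passing a non-contiguous view such as np.ones(10)[::2] to zero_contiguous raises a ValueError, while zero_strided accepts it.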
