Cython: understanding what the HTML annotation file has to say?


After compiling the following Cython code, I get an HTML annotation file that looks like this:

import numpy as np
cimport numpy as np

cpdef my_function(np.ndarray[np.double_t] arr):
    ...

[screenshot of the annotation HTML: the source lines that interact with Python are shaded yellow]
1 Answer

  • 2021-01-22 14:26

    As you already know, yellow lines mean that some interaction with Python happens, i.e. Python functionality rather than raw C functionality is used. You can look into the produced code to see what happens, and whether it can or should be fixed/avoided.

    Not every interaction with Python means a (measurable) slowdown, though.
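
    Since the question is about the annotation file itself, a short reminder of how it is produced (a minimal sketch; the function name trivial is mine): in a notebook, the --annotate (short: -a) flag of the %%cython magic renders the annotated view inline, and on the command line cython -a module.pyx writes module.html next to the source.

    %%cython --annotate
    # the argument is untyped, so the addition goes through generic
    # Python object protocol and this line shows up yellow
    def trivial(x):
        return x + 1

    Clicking a yellow line in the annotation expands the C code that Cython generated for it.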

    Let's take a look at this simplified function:

    %%cython
    cimport numpy as np
    def use_slices(np.ndarray[np.double_t] a):
        a[0:len(a)]=0.0
    

    When we look into the produced code we see (I kept only the important parts):

      __pyx_t_1 = PyObject_Length(((PyObject *)__pyx_v_a)); 
      __pyx_t_2 = PyInt_FromSsize_t(__pyx_t_1); 
      __pyx_t_3 = PySlice_New(__pyx_int_0, __pyx_t_2, Py_None); 
      PyObject_SetItem(((PyObject *)__pyx_v_a), __pyx_t_3, __pyx_float_0_0); /* a[slice] = 0.0 */
    

    So basically we create a new Python slice object and then use numpy's functionality (PyObject_SetItem, which ends up in numpy's __setitem__) to set all elements to 0.0; that work is C code under the hood.
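
    Expressed in plain Python, the generated code above does roughly the following (my paraphrase, not the literal generated code):

    import numpy as np

    a = np.ones(5)
    # a[0:len(a)] = 0.0 boils down to:
    a.__setitem__(slice(0, len(a)), 0.0)  # PySlice_New + PyObject_SetItem
    print(a)  # [0. 0. 0. 0. 0.]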

    Let's take a look at a version with a hand-written for loop:

    %%cython
    cimport numpy as np
    def use_for(np.ndarray[np.double_t] a):
        cdef int i
        for i in range(len(a)):
            a[i]=0.0
    

    It still uses PyObject_Length (because of the len call) and bounds-checking, but otherwise it is C code. When we compare times:

    >>> import numpy as np
    >>> a=np.ones((500,))
    >>> %timeit use_slices(a)
    100000 loops, best of 3: 1.85 µs per loop
    >>> %timeit use_for(a)
    1000000 loops, best of 3: 1.42 µs per loop
    
    >>> b=np.ones((250000,))
    >>> %timeit use_slices(b)
    10000 loops, best of 3: 104 µs per loop
    >>> %timeit use_for(b)
    1000 loops, best of 3: 364 µs per loop
    

    You can see the additional overhead of creating a slice for small sizes, but the additional checks in the for-version mean it has more overhead in the long run.

    Let's disable these checks:

    %%cython
    cimport cython
    cimport numpy as np
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def use_for_no_checks(np.ndarray[np.double_t] a):
        cdef int i
        for i in range(len(a)):
            a[i]=0.0
    

    In the produced HTML we can see that the access a[i] becomes as simple as it gets:

     __pyx_t_3 = __pyx_v_i;
        *__Pyx_BufPtrStrided1d(__pyx_t_5numpy_double_t *, __pyx_pybuffernd_a.rcbuffer->pybuffer.buf, __pyx_t_3, __pyx_pybuffernd_a.diminfo[0].strides) = 0.0;
      }
    

    __Pyx_BufPtrStrided1d(type, buf, i0, s0) is defined as (type)((char*)buf + i0 * s0), i.e. plain pointer arithmetic into the buffer. And now:

    >>> %timeit use_for_no_checks(a)
    1000000 loops, best of 3: 1.17 µs per loop
    >>> %timeit use_for_no_checks(b)
    1000 loops, best of 3: 246 µs per loop
    

    We can improve it further by releasing the GIL in the for-loop:

    %%cython
    cimport cython
    cimport numpy as np
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def use_for_no_checks_no_gil(np.ndarray[np.double_t] a):
        cdef int i
        cdef int n=len(a)
        with nogil:
          for i in range(n):
            a[i]=0.0
    

    and now:

    >>> %timeit use_for_no_checks_no_gil(a)
    1000000 loops, best of 3: 1.07 µs per loop
    >>> %timeit use_for_no_checks_no_gil(b)
    10000 loops, best of 3: 166 µs per loop
    

    So it is somewhat faster, but it still cannot beat numpy for larger arrays.

    In my opinion, there are two things to take away from this:

    1. Cython does not transform slice assignments on an np.ndarray into a plain for-loop, so Python functionality must be used.
    2. There is a small overhead for calling into the numpy functionality, but most of the work is then done in numpy's own C code, and that part cannot be sped up through Cython.

    One last try, using the memset function:

    %%cython
    from libc.string cimport memset
    cimport numpy as np
    def use_memset(np.ndarray[np.double_t] a):
        memset(&a[0], 0, len(a)*sizeof(np.double_t))
    

    We get:

    >>> %timeit use_memset(a)
    1000000 loops, best of 3: 821 ns per loop
    >>> %timeit use_memset(b)
    10000 loops, best of 3: 102 µs per loop
    

    It is also as fast as the numpy-code for large arrays.
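
    One caveat worth adding (my addition, not part of the timings above): memset writes one raw block of bytes, so it is only correct for C-contiguous data. Declaring the buffer with mode="c" makes Cython enforce that when the function is called; the name use_memset_safe is just for illustration:

    %%cython
    from libc.string cimport memset
    cimport numpy as np
    def use_memset_safe(np.ndarray[np.double_t, ndim=1, mode="c"] a):
        # mode="c" rejects non-contiguous input such as np.ones(10)[::2]
        if len(a) > 0:
            memset(&a[0], 0, len(a) * sizeof(np.double_t))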


    As DavidW suggested, one could try to use memory-views:

    %%cython
    cimport numpy as np
    def use_slices_memview(double[::1] a):
        a[0:len(a)]=0.0
    

    This leads to slightly faster code for small arrays and similarly fast code for large arrays (compared to numpy slices):

    >>> %timeit use_slices_memview(a)
    1000000 loops, best of 3: 1.52 µs per loop
    
    >>> %timeit use_slices_memview(b)
    10000 loops, best of 3: 105 µs per loop
    

    That means that the memoryview slices have less overhead than the numpy slices. Here is the produced code:

     __pyx_t_1 = __Pyx_MemoryView_Len(__pyx_v_a); 
      __pyx_t_2.data = __pyx_v_a.data;
      __pyx_t_2.memview = __pyx_v_a.memview;
      __PYX_INC_MEMVIEW(&__pyx_t_2, 0);
      __pyx_t_3 = -1;
      if (unlikely(__pyx_memoryview_slice_memviewslice(
        &__pyx_t_2,
        __pyx_v_a.shape[0], __pyx_v_a.strides[0], __pyx_v_a.suboffsets[0],
        0,
        0,
        &__pyx_t_3,
        0,
        __pyx_t_1,
        0,
        1,
        1,
        0,
        1) < 0))
    {
        __PYX_ERR(0, 27, __pyx_L1_error)
    }
    
    {
          double __pyx_temp_scalar = 0.0;
          {
              Py_ssize_t __pyx_temp_extent = __pyx_t_2.shape[0];
              Py_ssize_t __pyx_temp_idx;
              double *__pyx_temp_pointer = (double *) __pyx_t_2.data;
              for (__pyx_temp_idx = 0; __pyx_temp_idx < __pyx_temp_extent; __pyx_temp_idx++) {
                *((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
                __pyx_temp_pointer += 1;
              }
          }
      }
      __PYX_XDEC_MEMVIEW(&__pyx_t_2, 1);
      __pyx_t_2.memview = NULL;
      __pyx_t_2.data = NULL;
    

    I think this is the most important part: this code doesn't create an additional temporary object - it reuses the existing memory view for the slice.

    My compiler produces (at least on my machine) slightly faster code if memory views are used. I'm not sure whether it is worth investigating further. At first sight, the difference in every iteration step is:

    # created code for memview-slices:
    *((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
     __pyx_temp_pointer += 1;
    
    # created code for memview-for-loop:
     __pyx_v_i = __pyx_t_3;
     __pyx_t_4 = __pyx_v_i;
     *((double *) ( /* dim=0 */ ((char *) (((double *) data) + __pyx_t_4)) )) = 0.0;
    

    I would expect different compilers to handle this code differently well. But clearly, the first version is easier to optimize.


    As Behzad Jamali pointed out, there is a difference between double[:] a and double[::1] a. The double[::1] version of the slice assignment is about 20% faster on my machine. The difference is that for the double[::1] version it is known at compile time that the memory accesses will be consecutive, and this can be used for optimization. In the version with double[:] we don't know anything about the stride until runtime.
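
    For illustration, here is a minimal side-by-side sketch of the two signatures (the function names are mine):

    %%cython
    def zero_strided(double[:] a):       # stride known only at runtime
        a[:] = 0.0

    def zero_contiguous(double[::1] a):  # unit stride guaranteed at compile time
        a[:] = 0.0

    Passing a non-contiguous view such as np.ones(10)[::2] to zero_contiguous raises a ValueError, while zero_strided accepts it.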
