Going faster than struct.pack with cython

Submitted by 时光总嘲笑我的痴心妄想 on 2021-02-08 05:15:17

Question


I'm trying to do better than struct.pack.

Taking the specific case of packing integers, via the answer to this question, I have the following code in pack_ints.pyx to pack a list of ints:

# cython: language_level=3, boundscheck=False
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def pack_ints(int_col):

    int_buf = bytearray(4*len(int_col))
    cdef int[::1] buf_view = memoryview(int_buf).cast('i')

    idx: int = 0
    for idx in range(len(int_col)):
        buf_view[idx] = int_col[idx]


    return int_buf

With this test code in ipython:

from struct import pack 
import pyximport; pyximport.install(language_level=3) 
import pack_ints 

amount = 10**7 
ints = list(range(amount)) 

res1 = pack(f'{amount}i', *ints) 
res2 = pack_ints.pack_ints(ints) 
assert(res1 == res2) 

%timeit pack(f'{amount}i', *ints)  
%timeit pack_ints.pack_ints(ints)      

I get:

304 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
212 ms ± 6.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I tried to type int_buf as an array('b'), but didn't see an improvement.

Is there any other way to improve upon this, or use cython in a different way, to make this operation faster?


Answer 1:


This answer tries to estimate how much speed-up a parallelized version can yield. However, because this task is memory-bandwidth bound (Python integer objects take at least 32 bytes each and can be scattered in memory, so there will be many cache misses), we should not expect much.

The first problem is how to handle errors (an element is not an integer, or its value is too large). I will use the following strategy/simplification: when an object

  • isn't an integer,
  • is a negative integer,
  • or is an integer >= 2^30,

it will be cast to a special value (-1) which signals that something went wrong. Allowing only non-negative integers < 2^30 makes my life easier, because otherwise I would have to reimplement PyLong_AsLongAndOverflow without its error raising, and detecting overflows by hand is often cumbersome (however, see the version at the end of the answer for a more sophisticated approach).
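As a quick illustration (my own sketch, not part of the original answer), this rule can be modelled in pure Python, which is handy as a reference when testing the C version below:

# Pure-Python model of the simplified rule: keep non-negative ints < 2**30,
# map everything else to the sentinel -1.
def to_int_py(obj):
    if isinstance(obj, int) and 0 <= obj < 2**30:
        return obj
    return -1

assert to_int_py(42) == 42
assert to_int_py(-5) == -1        # negative -> sentinel
assert to_int_py(2**30) == -1     # too large -> sentinel
assert to_int_py("x") == -1       # not an integer -> sentinel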

The memory layout of Python's integer object looks like this:

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};

The member ob_size (accessible via the macro Py_SIZE) tells us how many 30-bit digits are used in the representation of the integer (ob_size is negative for negative integers).
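The 30-bit-digit layout can be checked from Python itself (a small illustration of mine; bits_per_digit is 30 on typical 64-bit CPython builds, 15 on some 32-bit builds):

import sys
print(sys.int_info)              # e.g. bits_per_digit=30, sizeof_digit=4 on 64-bit CPython

n = (1 << 30) + 7                # needs two 30-bit digits: low digit 7, high digit 1
print(n & (2**30 - 1), n >> 30)  # -> 7 1   (so ob_size of this object is 2)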

My simple rule thus translates to the following C code (I use C rather than Cython, as it is a simpler/more natural way of using Python's C API):

#include <Python.h>

// returns -1 if vv is not an integer,
//            negative, or > 2**30-1
int to_int(PyObject *vv){ 
   if (PyLong_Check(vv)) {
       PyLongObject * v = (PyLongObject *)vv;
       Py_ssize_t i = Py_SIZE(v);
       if(i==0){
           return 0;
       }
       if(i==1){//small enough for a single digit
           return v->ob_digit[0];
       }
       //negative (i<0) or too big (i>1)
       return -1;
   }
   return -1;
}

Now, given a list, we can convert it to an int buffer in parallel with the following C function, which uses OpenMP:

void convert_list(PyListObject *lst, int *output){
    Py_ssize_t n = Py_SIZE(lst);
    PyObject **data = lst->ob_item;
    #pragma omp parallel for
    for(Py_ssize_t i=0; i<n; ++i){
        output[i] = to_int(data[i]);
    }
}

There is not much to say: the PyListObject API is used to access the elements of the list in parallel. This can be done because there is no reference counting in the to_int function and thus no race conditions.

Now, bundling it all together with Cython:

%%cython -c=-fopenmp --link-args=-fopenmp
import cython

cdef extern from *:
    """
    #include <Python.h>

    int to_int(PyObject *vv){ 
       ... code
    }

    void convert_list(PyListObject *lst, int *output){
        ... code
    }
    """
    void convert_list(list lst, int *output)

@cython.boundscheck(False)
@cython.wraparound(False)
def pack_ints_ead(list int_col):
    cdef char[::1] int_buf = bytearray(4*len(int_col))
    convert_list(int_col, <int*>(&int_buf[0]))
    return int_buf.base

One important detail: convert_list must not be declared nogil, i.e. the GIL must not be released around it (and it isn't here)! OpenMP threads and Python threads (which are subject to the GIL) are completely different things.

One can (but does not have to) release the GIL for OpenMP operations when working with objects that support the buffer protocol, because such objects get locked via the buffer protocol and cannot be changed from other Python threads. A list has no such locking mechanism, so if the GIL were released, the list could be changed by another thread and all our pointers could become invalid.
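This buffer-protocol "lock" can be observed from pure Python (my illustration, not from the original answer): as long as a memoryview is exported, a bytearray refuses to be resized.

ba = bytearray(8)
mv = memoryview(ba).cast('i')   # exporting a buffer pins the bytearray's size
try:
    ba.append(0)                # resizing while the buffer is exported ...
except BufferError as e:
    print(e)                    # ... raises BufferError
mv.release()
ba.append(0)                    # fine again once the export is released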

So now to timings (with a slightly bigger list):

amount = 5*10**7 
ints = list(range(amount)) 


%timeit pack(f'{amount}i', *ints)  
# 1.51 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pack_ints_DavidW(ints) 
# 284 ms ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pack_ints_ead(ints) 
# 177 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

By the way, turning off the parallelization for pack_ints_ead leads to a running time of 209 ms.

So given the modest improvement of about 33%, I would opt for DavidW's more robust solution.


Here is an implementation with a slightly different way of signaling wrong values:

  • a non-integer object results in -2147483648 (bit pattern 0x80000000), the smallest value a 32-bit int can store.
  • integers >= 2147483647 (i.e. >= 0x7fffffff) are mapped to/stored as 2147483647, the biggest value a 32-bit int can store.
  • integers <= -2147483647 (bit pattern 0x80000001) are mapped to/stored as -2147483647.
  • all other integers are mapped to their correct values.

The main advantage is that it works correctly for a larger range of integer values. This version runs in almost the same time (maybe 2-3% slower) as the first, simple version:

int to_int(PyObject *vv){ 
   if (PyLong_Check(vv)) {
       PyLongObject * v = (PyLongObject *)vv;
       Py_ssize_t i = Py_SIZE(v);
       int sign = i<0 ? -1 : 1;
       i = abs(i);
       if(i==0){
           return 0;
       }
       if(i==1){//small enough for a single digit
           return sign*v->ob_digit[0];
       }
       if(i==2 && (v->ob_digit[1]>>1)==0){
           int add = (v->ob_digit[1]&1) << 30;
           return sign*(v->ob_digit[0]+add);
       }
       return sign * 0x7fffffff;
   }
   return 0x80000000;
}
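
Again as a pure-Python reference of mine (not part of the original answer), the saturating behaviour of this version can be modelled and used to sanity-check the C function:

INT_MIN, INT_MAX = -2**31, 2**31 - 1

def to_int32_py(obj):
    # mirrors the C rule above: clamp out-of-range ints, flag non-ints with INT_MIN
    if not isinstance(obj, int):
        return INT_MIN                  # "not an integer" sentinel
    if obj > INT_MAX:
        return INT_MAX                  # clamp large positive values
    if obj < -INT_MAX:
        return -INT_MAX                 # clamp large negative values
    return obj

assert to_int32_py("x") == INT_MIN
assert to_int32_py(2**40) == INT_MAX
assert to_int32_py(-2**40) == -INT_MAX
assert to_int32_py(12345) == 12345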



Answer 2:


When I run my code from the original question I get a ~5 times speed-up.

When I run your code here, I see the results you report, plus an important warning at the compile stage that I think you're ignoring:

warning: pack_ints.pyx:13:17: Index should be typed for more efficient access

I'm not sure why it isn't picking up the type correctly, but to fix it you should change the definition of the loop index (idx in your code) back to the C declaration I originally wrote:

cdef int i
# not "i: int"
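
For reference, this is what the question's pack_ints.pyx looks like with that one change applied, using the question's variable name idx (a sketch of the suggested fix only, otherwise unchanged):

# cython: language_level=3, boundscheck=False
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def pack_ints(int_col):
    cdef int idx                 # C-typed loop index, so indexing compiles to plain C
    int_buf = bytearray(4*len(int_col))
    cdef int[::1] buf_view = memoryview(int_buf).cast('i')

    for idx in range(len(int_col)):
        buf_view[idx] = int_col[idx]

    return int_buf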

Hopefully someone else will come along and try something cleverer, because it's obviously a bit ridiculous that this is an answer.



Source: https://stackoverflow.com/questions/59962466/going-faster-than-struct-pack-with-cython
