How to sort an array of integers faster than quicksort?


Question


Sorting an array of integers with numpy's quicksort has become the bottleneck of my algorithm. Unfortunately, numpy does not have radix sort yet. Although counting sort would be a one-liner in numpy:

np.repeat(np.arange(1+x.max()), np.bincount(x))

(see the accepted answer to the question How can I vectorize this python count sort so it is absolutely as fast as it can be?), the integers in my application can run from 0 to 2**32, which makes counting sort impractical here.
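(A rough memory estimate of my own, not part of the original question, to show why the one-liner breaks down here: np.bincount needs one counter per possible value.)

import numpy as np

# with values up to 2**32 - 1, the count array alone would need about 32 GiB
n_values = 2**32
count_bytes = n_values * np.dtype(np.intp).itemsize   # bincount returns intp counters
print(count_bytes / 2**30)                            # ~32.0 GiB on a 64-bit system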

Am I stuck with quicksort?


This post was primarily motivated by the Numpy grouping using itertools.groupby performance question.
Also note that it is not merely OK to ask and answer your own question, it is explicitly encouraged.


Answer 1:


No, you are not stuck with quicksort. You could use, for example, integer_sort from Boost.Sort or u4_sort from usort. When sorting this array:

array(randint(0, high=1<<32, size=10**8), uint32)

I get the following results:

NumPy quicksort:         8.636 s  1.0  (baseline)
Boost.Sort integer_sort: 4.327 s  2.0x speedup
usort u4_sort:           2.065 s  4.2x speedup

I would not jump to conclusions based on this single experiment and use usort blindly. I would test with my actual data and measure what happens. Your mileage will vary depending on your data and on your machine. The integer_sort in Boost.Sort has a rich set of options for tuning, see the documentation.
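For context, a minimal way to reproduce the NumPy baseline above (my own sketch; the original answer does not show the timing harness it used):

import numpy as np
from timeit import default_timer as timer

x = np.random.randint(0, 1 << 32, size=10**8, dtype=np.uint32)

t0 = timer()
np.sort(x, kind='quicksort')          # returns a sorted copy; NumPy's default sort
print('numpy quicksort: %.3f s' % (timer() - t0))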

Below I describe two ways to call a native C or C++ function from Python. Despite the long description, it's fairly easy to do.


Boost.Sort

Put these lines into the spreadsort.cpp file:

#include <cinttypes>
#include <cstddef>   // std::size_t
#include "boost/sort/spreadsort/spreadsort.hpp"
using namespace boost::sort::spreadsort;

extern "C" {
    void spreadsort(std::uint32_t* begin,  std::size_t len) {
        integer_sort(begin, begin + len);
    }
}

It basically instantiates the templated integer_sort for 32 bit unsigned integers; the extern "C" part ensures C linkage by disabling name mangling. Assuming you are using gcc and that the necessary include files of boost are under the /tmp/boost_1_60_0 directory, you can compile it:

g++ -O3 -std=c++11 -march=native -DNDEBUG -shared -fPIC -I/tmp/boost_1_60_0 spreadsort.cpp -o spreadsort.so  

The key flags are -fPIC to generate position-independent code and -shared to generate a shared object .so file. (Read the docs of gcc for further details.)

Then, you wrap the spreadsort() C++ function in Python using ctypes:

from ctypes import cdll, c_size_t, c_uint32
from numpy import uint32
from numpy.ctypeslib import ndpointer

__all__ = ['integer_sort']

# In spreadsort.cpp: void spreadsort(std::uint32_t* begin,  std::size_t len)
lib = cdll.LoadLibrary('./spreadsort.so')
sort = lib.spreadsort
sort.restype = None
sort.argtypes = [ndpointer(c_uint32, flags='C_CONTIGUOUS'), c_size_t]

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    sort(arr, arr.size)

Alternatively, you can use cffi:

from cffi import FFI
from numpy import uint32

__all__ = ['integer_sort']

ffi = FFI()
ffi.cdef('void spreadsort(uint32_t* begin,  size_t len);')
C = ffi.dlopen('./spreadsort.so')

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    begin = ffi.cast('uint32_t*', arr.ctypes.data)
    C.spreadsort(begin, arr.size)

At the cdll.LoadLibrary() and ffi.dlopen() calls I assumed that the path to the spreadsort.so file is ./spreadsort.so. Alternatively, you can write

lib = cdll.LoadLibrary('spreadsort.so')

or

C = ffi.dlopen('spreadsort.so')

if you append the path to spreadsort.so to the LD_LIBRARY_PATH environment variable. See also Shared Libraries.

Usage. In both cases you simply call the above Python wrapper function integer_sort() with your numpy array of 32 bit unsigned integers.
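For example (a sketch of my own; the array is sorted in place):

import numpy as np

x = np.random.randint(0, 1 << 32, size=10**8, dtype=np.uint32)
integer_sort(x)                      # calls into spreadsort.so, sorts x in place
assert np.all(x[:-1] <= x[1:])       # sanity check: non-decreasing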


usort

As for u4_sort, you can compile it as follows:

cc -DBUILDING_u4_sort -I/usr/include -I./ -I../ -I../../ -I../../../ -I../../../../ -std=c99 -fgnu89-inline -O3 -g -fPIC -shared -march=native u4_sort.c -o u4_sort.so

Issue this command in the directory where the u4_sort.c file is located. (Probably there is a less hackish way but I failed to figure that out. I just looked into the deps.mk file in the usort directory to find out the necessary compiler flags and include paths.)

Then, you can wrap the C function as follows:

from cffi import FFI
from numpy import uint32

__all__ = ['integer_sort']

ffi = FFI()
ffi.cdef('void u4_sort(unsigned* a, const long sz);')
C = ffi.dlopen('u4_sort.so')

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    begin = ffi.cast('unsigned*', arr.ctypes.data)
    C.u4_sort(begin, arr.size)

In the above code, I assumed that the path to u4_sort.so has been appended to the LD_LIBRARY_PATH environment variable.

Usage. As before with Boost.Sort, you simply call the above Python wrapper function integer_sort() with your numpy array of 32 bit unsigned integers.
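A quick correctness check against NumPy's own sort (my addition, not part of the original answer):

import numpy as np

x = np.random.randint(0, 1 << 32, size=10**6, dtype=np.uint32)
expected = np.sort(x)                # reference result (sorted copy)
integer_sort(x)                      # u4_sort wrapper, sorts x in place
assert np.array_equal(x, expected)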




Answer 2:


A base-256 (1 byte per digit) radix sort can generate the matrix of counts used to determine bucket sizes in a single pass over the data, then takes 4 passes to do the sort. On my system, an Intel 2600K at 3.4 GHz, using a Visual Studio release build of a C++ program, it takes about 1.15 seconds to sort 10^8 pseudo-random unsigned 32 bit integers.

Looking at the u4_sort code mentioned in Ali's answer, the programs are similar, but u4_sort uses field sizes of {10,11,11}, taking 3 passes to sort the data and 1 pass to copy back, while this example uses field sizes of {8,8,8,8}, taking 4 passes to sort the data. The process is probably memory-bandwidth limited due to the random-access writes, so the optimizations in u4_sort, like macros for shift and one loop with a fixed shift per field, don't help much. My better results are probably due to system and/or compiler differences. (Note that u8_sort is for 64 bit integers.)

Example code:

#include <cstdint>
#include <cstddef>
#include <utility>                      // std::swap

//  a is input array, b is working array; the sorted result ends up in a
void RadixSort(uint32_t * a, uint32_t *b, size_t count)
{
size_t mIndex[4][256] = {0};            // count / index matrix
size_t i,j,m,n;
uint32_t u;
    for(i = 0; i < count; i++){         // generate histograms
        u = a[i];
        for(j = 0; j < 4; j++){
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }       
    }
    for(j = 0; j < 4; j++){             // convert to indices
        m = 0;
        for(i = 0; i < 256; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }       
    }
    for(j = 0; j < 4; j++){             // radix sort
        for(i = 0; i < count; i++){     //  sort by current lsb
            u = a[i];
            m = (size_t)(u>>(j<<3))&0xff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                //  swap ptrs
    }
}
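To call this routine from Python, the same cffi pattern from the first answer works; a sketch of my own, assuming the function above is given extern "C" linkage and compiled into a hypothetical ./radix.so the same way as the earlier shared objects:

from cffi import FFI
import numpy as np

ffi = FFI()
ffi.cdef('void RadixSort(uint32_t* a, uint32_t* b, size_t count);')
C = ffi.dlopen('./radix.so')              # hypothetical shared object built from the code above

def radix_sort(arr):
    assert arr.dtype == np.uint32, 'Expected uint32, got {}'.format(arr.dtype)
    work = np.empty_like(arr)             # the working array b
    C.RadixSort(ffi.cast('uint32_t*', arr.ctypes.data),
                ffi.cast('uint32_t*', work.ctypes.data),
                arr.size)                 # after 4 passes the sorted data is back in arr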



Answer 3:


A radix sort with Python/Numba (0.23), following @rcgldr's C program, multithreaded on a 2-core processor.

First, the radix sort in Numba, using two global arrays for efficiency.

from threading import Thread
from pylab import *                 # array, randint, frombuffer, empty_like, uint16, uint32
import numpy as np
from numba import jit
n=uint32(10**8)
m=n//2
if 'x1'  not in locals() : x1=array(randint(0,1<<16,2*n),uint16); #to avoid regeneration
x2=x1.copy()
x=frombuffer(x2,uint32) # randint doesn't work with 32 bits here :(
y=empty_like(x) 
nbits=8
buffsize=1<<nbits
mask=buffsize-1

@jit(nopython=True,nogil=True)
def radix(x,y):
    xs=x.size
    dec=0
    while dec < 32 :                    # one pass per byte, 4 passes total
        u=np.zeros(buffsize,uint32)
        k=0
        while k<xs:                     # histogram of the current byte
            u[(x[k]>>dec)& mask]+=1
            k+=1
        j=t=0
        for j in range(buffsize):       # exclusive prefix sum: counts -> start offsets
            b=u[j]
            u[j]=t
            t+=b
        k=0
        while k<xs:                     # stable scatter by the current byte
            j=(x[k]>>dec)&mask
            y[u[j]]=x[k]
            u[j]+=1
            k+=1
        x,y=y,x                         # swap roles; after 4 passes the data is back in x
        dec+=nbits

Then the parallelization, made possible by the nogil option:

def para(nthreads=2):
    # radix-sort each chunk of x (into the matching chunk of y) in its own thread;
    # nogil in radix() lets the threads run in parallel
    threads=[Thread(target=radix,
        args=(x[i*n//nthreads:(i+1)*n//nthreads],
              y[i*n//nthreads:(i+1)*n//nthreads]))
        for i in range(nthreads)]
    for t in threads: t.start()
    for t in threads: t.join()

@jit
def fuse(x,y):
    # merge y[:m] (first sorted half) and x[m:] (second sorted half) into x, in place
    kl=0
    kr=n//2
    k=0
    while k<n:
        if kr==n or y[kl]<x[kr] :   # kr==n guard: the second half may be exhausted first
            x[k]=y[kl]
            kl+=1
            if kl==m : break
        else :
            x[k]=x[kr]
            kr+=1
        k+=1

def sort():
    para(2)             # sort the two halves of x in parallel
    y[:m]=x[:m]         # stash the first half so fuse() can merge back into x
    fuse(x,y)
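A quick sanity check of my own (not in the original answer), comparing against NumPy's result:

expected = np.sort(x.copy())      # reference result on an untouched copy
sort()                            # sorts the global x in place
assert np.array_equal(x, expected)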

Benchmarks:

In [24]: %timeit x2=x1.copy();x=frombuffer(x2,uint32) # time offset
1 loops, best of 3: 431 ms per loop

In [25]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);x.sort()
1 loops, best of 3: 37.8 s per loop

In [26]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);para(1)
1 loops, best of 3: 5.7 s per loop

In [27]: %timeit x2=x1.copy();x=frombuffer(x2,uint32);sort()
1 loops, best of 3: 4.02 s per loop      

So, a pure Python solution with roughly a 10x gain (37 s -> 3.5 s once the ~0.4 s copy overhead is subtracted) on my modest 1 GHz machine. It can be enhanced further with more cores and multi-way fusion.



Source: https://stackoverflow.com/questions/35317442/how-to-sort-an-array-of-integers-faster-than-quicksort
