Memory alignment for fast FFT in Python using shared arrays

I write an image processing app that needs to do multiple things and it has to do them as much real-time as possible. Acquisition of the data and their processing runs in separate processes (mainly for performance reasons). The data itself is quite large (2MPix 16-bit grayscale images).

I can share arrays between processes as it is described in this post: How do I pass large numpy arrays between python subprocesses without saving to disk? (I use the shmarray script from the numpy-shared package). I can perform the supplied Numpy FFT on those data without problem, but it is quite slow.

Calling FFTW would probably be much faster, but in order to fully benefit from it, I am supposed to run my operations on arrays that are memory aligned.

The question: Is there a way how to create and share Numpy-like arrays between processes, that are, at the same time, guaranteed to be memory aligned?

The simplest standard trick to get correctly aligned memory is to allocate a bit more than needed and skip the first few bytes if the alignment is wrong. If I remember correctly, NumPy arrays will always be 8-byte aligned, and FFTW requires 16-byte aligment to perform best. So you would simply allocate 8 bytes more than needed, and skip the first 8 bytes if necessary.

Edit: This is rather easy to implement. The pointer to the data is available as an integer in the ctypes.data attribute of a NumPy array. Using the shifted block can be achieved by slicing, viewing as a different data type and reshaping -- all these won't copy the data, but rather reuse the same buf.

To allocate an 16-byte aligned 1000x1000 array of 64-bit floating point numbers, we could use this code:

m = n = 1000
dtype = numpy.dtype(numpy.float64)
nbytes = m * n * dtype.itemsize
buf = numpy.empty(nbytes + 16, dtype=numpy.uint8)
start_index = -buf.ctypes.data % 16
a = buf[start_index:start_index + nbytes].view(dtype).reshape(m, n)

Now, a is an array with the desired properties, as can be verified by checking that a.ctypes.data % 16 is indeed 0.

Generalizing on Sven's answer, this function will return an aligned copy (if needed) of any numpy array:

import numpy as np
def aligned(a, alignment=16):
    if (a.ctypes.data % alignment) == 0:
        return a

    extra = alignment / a.itemsize
    buf = np.empty(a.size + extra, dtype=a.dtype)
    ofs = (-buf.ctypes.data % alignment) / a.itemsize
    aa = buf[ofs:ofs+a.size].reshape(a.shape)
    np.copyto(aa, a)
    assert (aa.ctypes.data % alignment) == 0
    return aa

来源：https://stackoverflow.com/questions/9895787/memory-alignment-for-fast-fft-in-python-using-shared-arrays

标签

python

numpy

multiprocessing

fft

memory-alignment