How do I pass large numpy arrays between python subprocesses without saving to disk?

Backend · Open · 6 answers · 1482 views
Asked by 说谎, 2020-11-29 21:21

Is there a good way to pass a large chunk of data between two python subprocesses without using the disk? Here's a cartoon example of what I'm hoping to accomplish:

6 Answers
  •  我在风中等你
    2020-11-29 22:01

    From the other answers, it seems that numpy-sharedmem is the way to go.

    However, if you need a pure-Python solution, or if installing extensions, Cython, or the like is a (big) hassle, you might want to use the following code, which is a simplified version of Nadav's code:

    import ctypes
    import multiprocessing
    
    import numpy
    
    # Note: some of these mappings are platform-dependent (e.g. c_long is
    # 32-bit on Windows, and c_wchar may be 2 or 4 bytes depending on the
    # platform), so check them for your target system.
    _ctypes_to_numpy = {
        ctypes.c_char:   numpy.dtype(numpy.uint8),
        ctypes.c_wchar:  numpy.dtype(numpy.int16),
        ctypes.c_byte:   numpy.dtype(numpy.int8),
        ctypes.c_ubyte:  numpy.dtype(numpy.uint8),
        ctypes.c_short:  numpy.dtype(numpy.int16),
        ctypes.c_ushort: numpy.dtype(numpy.uint16),
        ctypes.c_int:    numpy.dtype(numpy.int32),
        ctypes.c_uint:   numpy.dtype(numpy.uint32),
        ctypes.c_long:   numpy.dtype(numpy.int64),
        ctypes.c_ulong:  numpy.dtype(numpy.uint64),
        ctypes.c_float:  numpy.dtype(numpy.float32),
        ctypes.c_double: numpy.dtype(numpy.float64)}
    
    _numpy_to_ctypes = {v: k for k, v in _ctypes_to_numpy.items()}
    
    
    def shm_as_ndarray(mp_array, shape=None):
        '''Given a multiprocessing.Array, return an ndarray pointing to
        the same data.  No data is copied.'''
    
        # Unwrap a SynchronizedArray (created with lock=True) to get at
        # the underlying ctypes array:
        if not hasattr(mp_array, '_type_'):
            mp_array = mp_array.get_obj()
    
        dtype = _ctypes_to_numpy[mp_array._type_]
        result = numpy.frombuffer(mp_array, dtype)
    
        if shape is not None:
            result = result.reshape(shape)
    
        return result
    
    
    def ndarray_to_shm(array, lock=False):
        '''Generate a 1D multiprocessing.Array containing the data from
        the passed ndarray.  The data will be *copied* into shared
        memory.'''
    
        array1d = array.ravel(order='A')
    
        try:
            c_type = _numpy_to_ctypes[array1d.dtype]
        except KeyError:
            c_type = _numpy_to_ctypes[numpy.dtype(array1d.dtype)]
    
        result = multiprocessing.Array(c_type, array1d.size, lock=lock)
        shm_as_ndarray(result)[:] = array1d
        return result
    
    

    You would use it like this:

    1. Use sa = ndarray_to_shm(a) to convert the ndarray a into a shared multiprocessing.Array.
    2. Use multiprocessing.Process(target=somefunc, args=(sa,)) (and start, maybe join) to call somefunc in a separate process, passing the shared array.
    3. In somefunc, use a = shm_as_ndarray(sa) to get an ndarray pointing to the shared data. (Actually, you may want to do the same in the original process, immediately after creating sa, so that two ndarrays reference the same data.)

    AFAICS, you don't need to set lock to True, since shm_as_ndarray does not use the lock anyhow. If you need locking, you would set lock to True and call acquire/release on sa.

    Also, if your array is not 1-dimensional, you might want to transfer the shape along with sa (e.g. use args = (sa, a.shape)).

    This solution has the advantage that it does not need additional packages or extension modules, except multiprocessing (which is in the standard library).
