How do I pass large numpy arrays between python subprocesses without saving to disk?

Backend · Open · 6 answers · 1482 views
Asked by 说谎, 2020-11-29 21:21

Is there a good way to pass a large chunk of data between two python subprocesses without using the disk? Here's a cartoon example of what I'm hoping to accomplish:

6 Answers
  •  我在风中等你
    2020-11-29 22:01

    From the other answers, it seems that numpy-sharedmem is the way to go.

    However, if you need a pure-Python solution, or if installing extensions, Cython, or the like is a (big) hassle, you might want to use the following code, which is a simplified version of Nadav's code:

    import ctypes
    import multiprocessing
    
    import numpy
    
    # Note: some of these mappings are platform-dependent (e.g. c_long is
    # 32-bit on Windows, and c_wchar may be 2 or 4 bytes depending on the
    # platform), so check them for your target system.
    _ctypes_to_numpy = {
        ctypes.c_char:   numpy.dtype(numpy.uint8),
        ctypes.c_wchar:  numpy.dtype(numpy.int16),
        ctypes.c_byte:   numpy.dtype(numpy.int8),
        ctypes.c_ubyte:  numpy.dtype(numpy.uint8),
        ctypes.c_short:  numpy.dtype(numpy.int16),
        ctypes.c_ushort: numpy.dtype(numpy.uint16),
        ctypes.c_int:    numpy.dtype(numpy.int32),
        ctypes.c_uint:   numpy.dtype(numpy.uint32),
        ctypes.c_long:   numpy.dtype(numpy.int64),
        ctypes.c_ulong:  numpy.dtype(numpy.uint64),
        ctypes.c_float:  numpy.dtype(numpy.float32),
        ctypes.c_double: numpy.dtype(numpy.float64)}
    
    _numpy_to_ctypes = {v: k for k, v in _ctypes_to_numpy.items()}
    
    
    def shm_as_ndarray(mp_array, shape=None):
        '''Given a multiprocessing.Array, return an ndarray pointing to
        the same data.  No data is copied.'''
    
        # Unwrap a SynchronizedArray (created with lock=True) to get at
        # the underlying ctypes array:
        if not hasattr(mp_array, '_type_'):
            mp_array = mp_array.get_obj()
    
        dtype = _ctypes_to_numpy[mp_array._type_]
        result = numpy.frombuffer(mp_array, dtype)
    
        if shape is not None:
            result = result.reshape(shape)
    
        return result
    
    
    def ndarray_to_shm(array, lock=False):
        '''Generate a 1D multiprocessing.Array containing the data from
        the passed ndarray.  The data will be *copied* into shared
        memory.'''
    
        array1d = array.ravel(order='A')
    
        try:
            c_type = _numpy_to_ctypes[array1d.dtype]
        except KeyError:
            c_type = _numpy_to_ctypes[numpy.dtype(array1d.dtype)]
    
        result = multiprocessing.Array(c_type, array1d.size, lock=lock)
        shm_as_ndarray(result)[:] = array1d
        return result
    
    

    You would use it like this:

    1. Use sa = ndarray_to_shm(a) to convert the ndarray a into a shared multiprocessing.Array.
    2. Use multiprocessing.Process(target=somefunc, args=(sa,)) (and start, maybe join) to call somefunc in a separate process, passing the shared array.
    3. In somefunc, use a = shm_as_ndarray(sa) to get an ndarray pointing to the shared data. (Actually, you may want to do the same in the original process, immediately after creating sa, so that two ndarrays reference the same data.)

    AFAICS, you don't need to set lock to True, since shm_as_ndarray does not use the lock anyhow. If you need locking, you would set lock to True and call acquire/release on sa.

    Also, if your array is not 1-dimensional, you might want to transfer the shape along with sa (e.g. use args = (sa, a.shape)).

    This solution has the advantage that it does not need additional packages or extension modules, except multiprocessing (which is in the standard library).
