Share a variable (data from a file) among multiple Python scripts without loading duplicates


Question


I would like to load a big matrix contained in matrix_file.mtx. The load must happen only once. Once the matrix is in memory, I would like many Python scripts to share it without duplicating it, so that the overall multi-script program (driven from bash, or from Python itself) stays memory efficient. I can imagine some pseudocode like this:

# Loading and sharing script:
import share
matrix = open("matrix_file.mtx","r")
share.send_to_shared_ram(matrix, as_variable('matrix'))

# Shared matrix variable processing script_1
import share
pointer_to_matrix = share.share_variable_from_ram('matrix')
type(pointer_to_matrix)
# output: <type 'numpy.ndarray'>

# Shared matrix variable processing script_2
import share
pointer_to_matrix = share.share_variable_from_ram('matrix')
type(pointer_to_matrix)
# output: <type 'numpy.ndarray'>
...

The idea is for pointer_to_matrix to point to matrix in RAM, which is loaded only once for all n scripts (not n times). The scripts are called separately from a bash script (or, if possible, from a Python main):

$ python Load_and_share.py
$ python script_1.py -args string &
$ python script_2.py -args string &
$ ...
$ python script_n.py -args string &

I'd also be interested in disk-based solutions, i.e. the matrix could stay on disk while the share object accesses it as required, as long as the object in RAM (a kind of pointer) can be treated as the whole matrix.

Thank you for your help.


Answer 1:


Between the mmap module and numpy.frombuffer, this is fairly easy:

import mmap
import os  # needed if you uncomment the posix_fadvise hint below
import numpy as np

with open("matrix_file.mtx", "rb") as matfile:
    mm = mmap.mmap(matfile.fileno(), 0, access=mmap.ACCESS_READ)
    # Optionally, on UNIX-like systems on Python 3.3+, add:
    # os.posix_fadvise(matfile.fileno(), 0, len(mm), os.POSIX_FADV_WILLNEED)
    # to trigger a background read-in of the file to the system cache,
    # minimizing page faults when you use it.

# The mapping remains valid even after the file object is closed.
matrix = np.frombuffer(mm, np.uint8)

Each process would perform this work separately and get a read-only view of the same memory. You'd change the dtype to something other than uint8 as needed. Switching to ACCESS_WRITE would allow modifications to shared data, though that requires synchronization, and possibly explicit calls to mm.flush, to ensure the data is actually reflected in other processes.
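For instance, a processing script that needs write access might look like the following. This is a minimal sketch: the float64 dtype and the 1000x1000 shape are assumptions about how matrix_file.mtx is laid out, not something the question specifies.

import mmap
import numpy as np

with open("matrix_file.mtx", "r+b") as matfile:
    mm = mmap.mmap(matfile.fileno(), 0, access=mmap.ACCESS_WRITE)

# Reinterpret the mapped bytes in place; no copy is made.
matrix = np.frombuffer(mm, dtype=np.float64).reshape(1000, 1000)
matrix[0, 0] = 1.0  # visible to other processes mapping the same file
mm.flush()          # explicitly push the change to the backing file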

A more complex solution, one that follows your initial design more closely, might be to use multiprocessing.SyncManager to create a connectable shared "server" for data, allowing a single common store of data to be registered with the manager and returned to as many users as desired. Creating an Array (based on ctypes types) with the correct type on the manager, then register-ing a function that returns the same shared Array to all callers, would work too (each caller would then convert the returned Array via numpy.frombuffer as before). It's much more involved (it would be easier to have a single Python process initialize an Array, then launch Processes that share it automatically thanks to fork semantics; see the sketch below), but it's the closest to the concept you describe.
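As a rough illustration of that simpler fork-based variant, here is a minimal sketch (the worker function, the uint8 dtype, and the three-process count are illustrative assumptions): a single parent loads the file into a shared ctypes array, and child Processes inherit it without copying.

import multiprocessing as mp
import numpy as np

def worker(shared_arr, i):
    # Wrap the shared buffer without copying; every worker sees the same RAM.
    matrix = np.frombuffer(shared_arr, dtype=np.uint8)
    print("worker", i, "sees", matrix.nbytes, "shared bytes")

if __name__ == "__main__":
    with open("matrix_file.mtx", "rb") as f:
        data = f.read()
    # RawArray allocates shared memory without a lock; 'B' is an unsigned byte.
    shared_arr = mp.RawArray('B', len(data))
    np.frombuffer(shared_arr, dtype=np.uint8)[:] = np.frombuffer(data, dtype=np.uint8)
    procs = [mp.Process(target=worker, args=(shared_arr, i)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

(On Python 3.8+, the multiprocessing.shared_memory module also provides named shared-memory blocks that independently launched scripts can attach to by name, which is even closer to the "load once, attach from anywhere" design in the question.)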



Source: https://stackoverflow.com/questions/34819892/share-variable-data-from-file-among-multiple-python-scripts-with-not-loaded-du
