Python: Multiprocessing on Windows -> Shared Readonly Memory

Posted by 徘徊边缘 on 2021-02-11 14:44:58

Question


Is there a way to share a huge dictionary with multiprocessing subprocesses on Windows without duplicating the whole memory? The subprocesses only need read-only access to it, if that helps.

My program roughly looks like this:

import multiprocessing


def workerFunc(args):
    idx, data_mp, some_more_args = args

    # Do some logic:
    # parse some files on disk, and collect some random keys of data_mp
    # which are only known after parsing those files ...
    some_keys = [some_random_ids...]

    # Read those keys from the shared dict
    do_something = [data_mp[x] for x in some_keys]
    return do_something


if __name__ == "__main__":
    multiprocessing.freeze_support()    # needed because this script is later frozen into an .exe with PyInstaller ...

    DATA = readpickle('my_pickle.pkl')   # my_pickle.pkl is huge, ~1GB
    # DATA looks like this:
    # {1: ['some text', SOME_1D_OR_2D_LIST...[[1,2,3], [123...]]], 
    #  2: ..., 
    #  3: ..., ..., 
    #  1 million keys... }

    # Here I'm doing something with DATA in the main program ...

    # Then I want to spawn N subprocesses, each doing some logic and then reading a few keys of DATA ...

    manager = multiprocessing.Manager()
    data_mp = manager.dict(DATA)    # This copies DATA into the Manager's server process, so it effectively duplicates the required memory ...

    joblist = []
    for idx in range(10000):    # Build the job list; the data_mp proxy is passed to each worker ...
        joblist.append((idx, data_mp, some_more_args))

    # Start the Pool of worker processes ...
    p = multiprocessing.Pool()
    returnNodes = []
    for ret in p.imap_unordered(workerFunc, joblist):
        returnNodes.append(ret)

    # Do some post-processing with DATA and returnNodes ...
    # and generate an overview xls-file from them
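
For completeness: since Windows has no fork(), the closest plain-multiprocessing workaround I can see is to have every worker load the pickle once via a Pool initializer, instead of shipping a Manager proxy around. That avoids the per-access proxy round-trips, but it still costs one full copy of DATA per worker process, so it doesn't really answer the memory question. A rough sketch of what I mean (the init_worker/_DATA names are made up):

import multiprocessing
import pickle

_DATA = None    # per-worker copy of the big dict, filled by the initializer

def init_worker(pickle_path):
    # Runs exactly once in each spawned worker process, so the dict is
    # unpickled once per worker instead of once per job.
    global _DATA
    with open(pickle_path, 'rb') as f:
        _DATA = pickle.load(f)

def worker(args):
    idx, some_keys = args    # the keys would really come from parsing files
    return [_DATA[k] for k in some_keys]

if __name__ == '__main__':
    multiprocessing.freeze_support()
    pool = multiprocessing.Pool(initializer=init_worker,
                                initargs=('my_pickle.pkl',))
    jobs = [(i, [1, 2, 3]) for i in range(10)]    # dummy keys, illustration only
    print(list(pool.imap_unordered(worker, jobs)))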

Unfortunately there's no other way to store my big dictionary... I know a SQL database would be better suited, since each worker only reads a few keys of data_mp within its subprocess, but I don't know in advance which keys each worker will access.

So my question is: is there any other way on Windows to do this, instead of using a Manager.dict(), which (as stated above) effectively duplicates the required memory?
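
The only true-sharing direction I can come up with myself is a file-backed mmap: the OS shares file-backed pages between processes, so N workers mapping the same file should not multiply the physical RAM. That would mean flattening DATA to a file once and giving each worker only a small {key: (offset, length)} index. A sketch of that idea (untested; the file layout and the JSON encoding of the values are illustrative assumptions):

import json
import mmap

def dump_values(data, path):
    # Flatten the dict to disk: one JSON blob per value, and remember
    # where each value starts and how many bytes it occupies.
    index = {}
    with open(path, 'wb') as f:
        for key, value in data.iteritems():    # Python 2.7
            blob = json.dumps(value)
            index[key] = (f.tell(), len(blob))
            f.write(blob)
    return index

_mm = None
_index = None

def init_worker(path, index):
    # Every worker maps the same file read-only; the pages behind the
    # mapping are shared by the OS instead of being copied per process.
    global _mm, _index
    _index = index
    f = open(path, 'rb')
    _mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def get(key):
    offset, length = _index[key]
    return json.loads(_mm[offset:offset + length])

The index itself would still be pickled to every worker via initargs, but plain (offset, length) pairs should be far smaller than the values.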

Thanks!

EDIT: Unfortunately, in my corporate environment there's no way for my tool to use a SQL database, because no dedicated machine is available; I can only work file-based on network drives. I already tried SQLite, but it was seriously slow (even though I don't understand why...). And yes, DATA is a simple key -> value kind of dictionary...
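
Regarding the SQLite attempt: SQLite is known to behave badly on network file systems (its locking is unreliable over SMB/NFS, and every page read crosses the network), which may well be where the slowness came from; on a local disk, a plain key/value table with the key as PRIMARY KEY is usually fast for point lookups. A minimal sketch of that layout (table and file names are illustrative, values JSON-encoded as above):

import json
import sqlite3

def build_db(data, path):
    conn = sqlite3.connect(path)
    # INTEGER PRIMARY KEY lookups go through the table's b-tree directly,
    # so no separate index is needed.
    conn.execute('CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v TEXT)')
    conn.executemany('INSERT OR REPLACE INTO kv VALUES (?, ?)',
                     ((k, json.dumps(v)) for k, v in data.iteritems()))
    conn.commit()
    conn.close()

def lookup(conn, key):
    row = conn.execute('SELECT v FROM kv WHERE k = ?', (key,)).fetchone()
    return json.loads(row[0]) if row else None

Each worker would open its own connection (e.g. in a Pool initializer); SQLite handles many concurrent readers without problems on a local disk.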

And I'm using Python 2.7!

Source: https://stackoverflow.com/questions/60435574/python-multiprocessing-on-windows-shared-readonly-memory
