numpy vs. multiprocessing and mmap

旧街凉风 提交于 2019-11-28 05:56:07

My usual approach (if you can live with extra memory copies) is to do all IO in one process and then send things out to a pool of worker threads. To load a slice of a memmapped array into memory just do x = np.array(data[yourslice]) (data[yourslice].copy() doesn't actually do this, which can lead to some confusion.).

First off, let's generate some test data:

import numpy as np
np.random.random(10000).tofile('data.dat')

You can reproduce your errors with something like this:

import numpy as np
import multiprocessing

def main():
    data = np.memmap('data.dat', dtype=np.float, mode='r')
    pool = multiprocessing.Pool()
    results = pool.imap(calculation, chunks(data))
    results = np.fromiter(results, dtype=np.float)

def chunks(data, chunksize=100):
    """Overly-simple chunker..."""
    intervals = range(0, data.size, chunksize) + [None]
    for start, stop in zip(intervals[:-1], intervals[1:]):
        yield data[start:stop]

def calculation(chunk):
    """Dummy calculation."""
    return chunk.mean() - chunk.std()

if __name__ == '__main__':
    main()

And if you just switch to yielding np.array(data[start:stop]) instead, you'll fix the problem:

import numpy as np
import multiprocessing

def main():
    data = np.memmap('data.dat', dtype=np.float, mode='r')
    pool = multiprocessing.Pool()
    results = pool.imap(calculation, chunks(data))
    results = np.fromiter(results, dtype=np.float)

def chunks(data, chunksize=100):
    """Overly-simple chunker..."""
    intervals = range(0, data.size, chunksize) + [None]
    for start, stop in zip(intervals[:-1], intervals[1:]):
        yield np.array(data[start:stop])

def calculation(chunk):
    """Dummy calculation."""
    return chunk.mean() - chunk.std()

if __name__ == '__main__':
    main()

Of course, this does make an extra in-memory copy of each chunk.

In the long run, you'll probably find that it's easier to switch away from memmapped files and move to something like HDF. This especially true if your data is multidimensional. (I'd reccomend h5py, but pyTables is nice if your data is "table-like".)

Good luck, at any rate!

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!