Optimizing my large data code with little RAM

粉色の甜心 2020-12-18 04:52

I have a 120 GB file saved (in binary via pickle) that contains about 50,000 (600x600) 2d numpy arrays. I need to stack all of these arrays using a median.

2 Answers
  •  眼角桃花
    2020-12-18 05:29

    This is a perfect use case for numpy's memory-mapped arrays. A memory-mapped array lets you treat a .npy file on disk as though it were an in-memory numpy array, without actually loading it. It's as simple as

    import numpy as np

    arr = np.load('filename', mmap_mode='r')  # 'filename' is a .npy file; the data stays on disk
    

    For the most part you can treat this like any other array. Array elements are only loaded into memory as required. Unfortunately, some quick experimentation suggests that median doesn't handle memory-mapped arrays well*: it still seems to load a substantial portion of the data into memory at once. So np.median(arr, 0) may not work.

    However, you can still loop over each pixel and compute its median without running into memory issues:

    median_img = np.array([
        [np.median(arr[:, i, j]) for j in range(600)]
        for i in range(600)
    ])


    where the slice arr[:, i, j] pulls the 50,000 values for a single pixel from across all of the stacked arrays.
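
    If the per-pixel loop is still too slow, a middle ground (a sketch, not part of the original answer, assuming the 50,000 x 600 x 600 array described below) is to take the median one row of pixels at a time. Each iteration reads only a 50,000 x 600 slice (about 120 MB at float32), so memory stays bounded while numpy vectorizes across the row:

    import numpy as np

    arr = np.load('filename', mmap_mode='r')  # memory-mapped; nothing read yet

    median_img = np.empty((600, 600), dtype=np.float64)
    for i in range(600):
        # Reads only the (50000, 600) slice for row i into memory.
        median_img[i] = np.median(arr[:, i, :], axis=0)
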

    Without the overhead of unpickling the entire file just to extract a single pixel, the run time should be much quicker (by a factor of about 360,000, i.e. 600 x 600, since the file no longer has to be re-read once per pixel).

    Of course, that leaves the problem of creating a .npy file containing all of the data. A file can be created as follows:

    arr = np.lib.format.open_memmap(
        'filename',              # File to store in
        mode='w+',               # Create the file and open it for writing
        dtype=np.float32,        # Change this to your data's dtype
        shape=(50000, 600, 600)  # Shape of the resulting array
    )
    

    Then, load the data as before and store it into the array (which just writes it to disk behind the scenes).

    import pickle

    idx = 0
    with open(filename, 'rb') as f:  # 'filename' here is the original 120 GB pickle file, not the .npy
        while True:
            try:
                arr[idx] = pickle.load(f)  # each load yields one 600x600 array
                idx += 1
            except EOFError:
                break
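
    Once the loop finishes, it's worth flushing the memmap so any buffered writes reach the disk before reopening the file read-only for the median step. This is a small addition on top of the code above; flush() is numpy's standard memmap call:

    arr.flush()  # push any pending writes to disk
    del arr      # close the writable map
    arr = np.load('filename', mmap_mode='r')  # reopen read-only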
    

    Give it a couple of hours to run, then head back to the start of this answer to see how to load the file and take the median. Can't be any simpler**.

    *I just tested this on a 7 GB file, taking the median of 1,500 samples of 5,000,000 elements each, and memory usage was around 7 GB, suggesting the entire array may have been loaded into memory. It doesn't hurt to try this way first, though. If anyone else has experience with median on memmapped arrays, feel free to comment.

    ** If you believe strangers on the internet.
