Optimizing my large data code with little RAM

粉色の甜心 · 2020-12-18 04:52

I have a 120 GB file saved (in binary via pickle) that contains about 50,000 (600x600) 2d numpy arrays. I need to stack all of these arrays using a median. The problem is that I don't have enough RAM to load them all at once.

2 Answers
  • 2020-12-18 05:29

    This is a perfect use case for numpy's memory mapped arrays. Memory mapped arrays allow you to treat a .npy file on disk as though it were loaded in memory as a numpy array, without actually loading it. It's as simple as

    arr = np.load('filename', mmap_mode='r')
    

    For the most part you can treat this as any other array; elements are only loaded into memory as required. Unfortunately, some quick experimentation suggests that median doesn't handle memory-mapped arrays well*: it still seems to load a substantial portion of the data into memory at once, so median(arr, 0) may not work.

    However, you can still loop over each index and calculate the median without running into memory issues.

    [[np.median([arr[k][i][j] for k in range(50000)]) for j in range(600)] for i in range(600)]
    

    where 50,000 reflects the total number of arrays.

    Without the overhead of unpickling each file just to extract a single pixel, the run time should be much quicker (by roughly 600x600 = 360,000 times).
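
    If the per-pixel loop above turns out to be too slow (it makes 600x600 separate median calls, each gathering 50,000 scattered elements), a rough middle ground is to process one pixel row at a time: a 50,000 x 600 slice of float32 is only about 115 MB. This is a minimal sketch rather than part of the original answer, and it assumes the stacked data lives in a memmap file named 'stack.npy' (a placeholder; the snippets here just use 'filename') with shape (50000, 600, 600).

    import numpy as np

    # Open the stacked .npy file without loading it into RAM.
    arr = np.load('stack.npy', mmap_mode='r')

    n_images, height, width = arr.shape
    result = np.empty((height, width))

    for i in range(height):
        # Reads a single 50,000 x 600 slice (~115 MB as float32) per iteration.
        result[i] = np.median(arr[:, i, :], axis=0)

    This still reads the whole file exactly once, but in 600 large chunks instead of 360,000 tiny ones.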

    Of course, that leaves the problem of creating a .npy file containing all of the data. A file can be created as follows,

    arr = np.lib.format.open_memmap(
        'filename',              # File to store in
        mode='w+',               # Specify to create the file and write to it
        dtype=np.float32,        # Change this to your data's type
        shape=(50000, 600, 600)  # Shape of resulting array
    )
    

    Then, load the data as before and store it into the array (which just writes it to disk behind the scenes).

    idx = 0
    # 'filename' here is the path to the original 120 GB pickle file
    with open(filename, 'rb') as f:
        while True:
            try:
                arr[idx] = pickle.load(f)
                idx += 1
            except EOFError:
                break
    

    Give it a couple hours to run, then head back to the start of this answer to see how to load it and take the median. Can't be any simpler**.

    *I just tested this on a 7 GB file, taking the median of 1,500 samples of 5,000,000 elements, and memory usage was around 7 GB, suggesting the entire array may have been loaded into memory. It doesn't hurt to try this way first, though. If anyone else has experience with median on memory-mapped arrays, feel free to comment.

    ** If you believe strangers on the internet.

  • 2020-12-18 05:32

    Note: I use Python 2.x; porting this to 3.x shouldn't be difficult.


    My idea is simple: disk space is plentiful, so let's do some preprocessing and turn that big pickle file into something that is easier to process in small chunks.

    Preparation

    In order to test this, I wrote a small script that generates a pickle file resembling yours. I assumed your input images are grayscale with 8-bit depth, and generated 10000 random images using numpy.random.randint.

    This script will act as a benchmark that we can compare the preprocessing and processing stages against.

    import numpy as np
    import pickle
    import time
    
    IMAGE_WIDTH = 600
    IMAGE_HEIGHT = 600
    FILE_COUNT = 10000
    
    t1 = time.time()
    
    with open('data/raw_data.pickle', 'wb') as f:
        for i in range(FILE_COUNT):
            data = np.random.randint(256, size=IMAGE_WIDTH*IMAGE_HEIGHT, dtype=np.uint8)
            data = data.reshape(IMAGE_HEIGHT, IMAGE_WIDTH)
            pickle.dump(data, f)
            print i,
    
    t2 = time.time()
    print '\nDone in %0.3f seconds' % (t2 - t1)
    

    In a test run, this script completed in 372 seconds, generating a ~10 GB file.

    Preprocessing

    Let's split the input images on a row-by-row basis -- we will have 600 files, where file N contains row N from each input image. We can store the row data in binary using numpy.ndarray.tofile (and later load those files using numpy.fromfile).

    import numpy as np
    import pickle
    import time
    
    # Increase open file limit
    # See https://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles
    import win32file
    win32file._setmaxstdio(1024)
    
    IMAGE_WIDTH = 600
    IMAGE_HEIGHT = 600
    FILE_COUNT = 10000
    
    t1 = time.time()
    
    outfiles = []
    for i in range(IMAGE_HEIGHT):
        outfilename = 'data/row_%03d.dat' % i
        outfiles.append(open(outfilename, 'wb'))
    
    
    with open('data/raw_data.pickle', 'rb') as f:
        for i in range(FILE_COUNT):
            data = pickle.load(f)
            for j in range(IMAGE_HEIGHT):
                data[j].tofile(outfiles[j])
            print i,
    
    for i in range(IMAGE_HEIGHT):
        outfiles[i].close()
    
    t2 = time.time()
    print '\nDone in %0.3f seconds' % (t2 - t1)
    

    In a test run, this script completed in 134 seconds, generating 600 files of 6 million bytes each. It used ~30 MB of RAM.

    Processing

    Processing is simple: load each row file using numpy.fromfile, use numpy.median to get per-column medians (reducing each file back to a single row), and accumulate those rows in a list.

    Finally, use numpy.vstack to reassemble a median image.

    import numpy as np
    import time
    
    IMAGE_WIDTH = 600
    IMAGE_HEIGHT = 600
    
    t1 = time.time()
    
    result_rows = []
    
    for i in range(IMAGE_HEIGHT):
        outfilename = 'data/row_%03d.dat' % i
        data = np.fromfile(outfilename, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
        median_row = np.median(data, axis=0)
        result_rows.append(median_row)
        print i,
    
    result = np.vstack(result_rows)
    print result
    
    t2 = time.time()
    print '\nDone in %0.3f seconds' % (t2 - t1)
    

    In a test run, this script completed in 74 seconds. You could even parallelize it quite easily (a rough sketch follows below), but it doesn't seem to be worth it. The script used ~40 MB of RAM.
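
    For what it's worth, here is a rough, untested sketch of how that parallelization could look, using a multiprocessing.Pool over the 600 row files. The file names and IMAGE_WIDTH/IMAGE_HEIGHT constants match the scripts above; the worker count of 4 is just an illustrative guess.

    import numpy as np
    from multiprocessing import Pool

    IMAGE_WIDTH = 600
    IMAGE_HEIGHT = 600

    def median_of_row_file(i):
        # Each worker independently loads one row file and reduces it to a single row.
        data = np.fromfile('data/row_%03d.dat' % i, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
        return np.median(data, axis=0)

    if __name__ == '__main__':
        pool = Pool(processes=4)   # tune to your core count
        result_rows = pool.map(median_of_row_file, range(IMAGE_HEIGHT))
        pool.close()
        pool.join()
        result = np.vstack(result_rows)

    Pool.map preserves order, so the rows come back in the right sequence for vstack. Since the work is largely disk-bound, the speedup is limited, which is why it doesn't seem worth it here.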


    Given that both of those scripts scale linearly with the number of images, the run time should scale linearly as well: for 50000 images (5x the test set), that works out to about 11 minutes for preprocessing and 6 minutes for the final processing. These timings are from an i7-4930K @ 3.4 GHz, using 32-bit Python on purpose.
