Optimizing my large data code with little RAM

粉色の甜心 · 2020-12-18 04:52

I have a 120 GB file saved (in binary via pickle) that contains about 50,000 (600x600) 2d numpy arrays. I need to stack all of these arrays using a median. The problem is that I don't have enough RAM to load them all at once.

2 Answers
  • 2020-12-18 05:29

    This is a perfect use case for numpy's memory mapped arrays. Memory mapped arrays allow you to treat a .npy file on disk as though it were loaded in memory as a numpy array, without actually loading it. It's as simple as

    arr = np.load('filename', mmap_mode='r')
    

    For the most part you can treat this as any other array; elements are only loaded into memory as required. Unfortunately, some quick experimentation suggests that median doesn't handle memory-mapped arrays well*: it still seems to load a substantial portion of the data into memory at once, so median(arr, 0) may not work.

    However, you can still loop over each index and calculate the median without running into memory issues.

    [[np.median([arr[k][i][j] for k in range(50000)]) for j in range(600)] for i in range(600)]
    

    where 50,000 reflects the total number of arrays.

    Without the overhead of unpickling each file just to extract a single pixel, the run time should be much quicker (by roughly 600x600 = 360,000 times).
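
    If the per-pixel loop above turns out to be too slow (it makes 600x600 separate median calls, each gathering 50,000 scattered elements), a rough middle ground is to process one pixel row at a time: a 50,000 x 600 slice of float32 is only about 115 MB. This is a minimal sketch rather than part of the original answer, and it assumes the stacked data lives in a memmap file named 'stack.npy' (a placeholder; the snippets here just use 'filename') with shape (50000, 600, 600).

    import numpy as np

    # Open the stacked .npy file without loading it into RAM.
    arr = np.load('stack.npy', mmap_mode='r')

    n_images, height, width = arr.shape
    result = np.empty((height, width))

    for i in range(height):
        # Reads a single 50,000 x 600 slice (~115 MB as float32) per iteration.
        result[i] = np.median(arr[:, i, :], axis=0)

    This still reads the whole file exactly once, but in 600 large chunks instead of 360,000 tiny ones.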

    Of course, that leaves the problem of creating a .npy file containing all of the data. A file can be created as follows,

    arr = np.lib.format.open_memmap(
        'filename',              # File to store in
        mode='w+',               # Specify to create the file and write to it
        dtype=np.float32,        # Change this to your data's type
        shape=(50000, 600, 600)  # Shape of resulting array
    )
    

    Then, load the data as before and store it into the array (which just writes it to disk behind the scenes).

    idx = 0
    # 'filename' here is the path to the original 120 GB pickle file
    with open(filename, 'rb') as f:
        while True:
            try:
                arr[idx] = pickle.load(f)
                idx += 1
            except EOFError:
                break
    

    Give it a couple hours to run, then head back to the start of this answer to see how to load it and take the median. Can't be any simpler**.

    *I just tested this on a 7 GB file, taking the median of 1,500 samples of 5,000,000 elements, and memory usage was around 7 GB, suggesting the entire array may have been loaded into memory. It doesn't hurt to try this way first, though. If anyone else has experience with median on memory-mapped arrays, feel free to comment.

    ** If you believe strangers on the internet.

  • 2020-12-18 05:32

    Note: I use Python 2.x; porting this to 3.x shouldn't be difficult.


    My idea is simple: disk space is plentiful, so let's do some preprocessing and turn that big pickle file into something that is easier to process in small chunks.

    Preparation

    In order to test this, I wrote a small script that generates a pickle file resembling yours. I assumed your input images are grayscale with 8-bit depth, and generated 10000 random images using numpy.random.randint.

    This script will act as a benchmark that we can compare the preprocessing and processing stages against.

    import numpy as np
    import pickle
    import time
    
    IMAGE_WIDTH = 600
    IMAGE_HEIGHT = 600
    FILE_COUNT = 10000
    
    t1 = time.time()
    
    with open('data/raw_data.pickle', 'wb') as f:
        for i in range(FILE_COUNT):
            data = np.random.randint(256, size=IMAGE_WIDTH*IMAGE_HEIGHT, dtype=np.uint8)
            data = data.reshape(IMAGE_HEIGHT, IMAGE_WIDTH)
            pickle.dump(data, f)
            print i,
    
    t2 = time.time()
    print '\nDone in %0.3f seconds' % (t2 - t1)
    

    In a test run, this script completed in 372 seconds, generating a ~10 GB file.

    Preprocessing

    Let's split the input images on a row-by-row basis -- we will have 600 files, where file N contains row N from each input image. We can store the row data in binary using numpy.ndarray.tofile (and later load those files using numpy.fromfile).

    import numpy as np
    import pickle
    import time
    
    # Increase open file limit
    # See https://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles
    import win32file
    win32file._setmaxstdio(1024)
    
    IMAGE_WIDTH = 600
    IMAGE_HEIGHT = 600
    FILE_COUNT = 10000
    
    t1 = time.time()
    
    outfiles = []
    for i in range(IMAGE_HEIGHT):
        outfilename = 'data/row_%03d.dat' % i
        outfiles.append(open(outfilename, 'wb'))
    
    
    with open('data/raw_data.pickle', 'rb') as f:
        for i in range(FILE_COUNT):
            data = pickle.load(f)
            for j in range(IMAGE_HEIGHT):
                data[j].tofile(outfiles[j])
            print i,
    
    for i in range(IMAGE_HEIGHT):
        outfiles[i].close()
    
    t2 = time.time()
    print '\nDone in %0.3f seconds' % (t2 - t1)
    

    In a test run, this script completed in 134 seconds, generating 600 files of 6 million bytes each. It used ~30 MB of RAM.

    Processing

    Processing is simple: load each row file using numpy.fromfile, use numpy.median to get per-column medians (reducing each file back to a single row), and accumulate those rows in a list.

    Finally, use numpy.vstack to reassemble a median image.

    import numpy as np
    import time
    
    IMAGE_WIDTH = 600
    IMAGE_HEIGHT = 600
    
    t1 = time.time()
    
    result_rows = []
    
    for i in range(IMAGE_HEIGHT):
        outfilename = 'data/row_%03d.dat' % i
        data = np.fromfile(outfilename, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
        median_row = np.median(data, axis=0)
        result_rows.append(median_row)
        print i,
    
    result = np.vstack(result_rows)
    print result
    
    t2 = time.time()
    print '\nDone in %0.3f seconds' % (t2 - t1)
    

    In a test run, this script completed in 74 seconds. You could even parallelize it quite easily (a rough sketch follows below), but it doesn't seem to be worth it. The script used ~40 MB of RAM.
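
    For what it's worth, here is a rough, untested sketch of how that parallelization could look, using a multiprocessing.Pool over the 600 row files. The file names and IMAGE_WIDTH/IMAGE_HEIGHT constants match the scripts above; the worker count of 4 is just an illustrative guess.

    import numpy as np
    from multiprocessing import Pool

    IMAGE_WIDTH = 600
    IMAGE_HEIGHT = 600

    def median_of_row_file(i):
        # Each worker independently loads one row file and reduces it to a single row.
        data = np.fromfile('data/row_%03d.dat' % i, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
        return np.median(data, axis=0)

    if __name__ == '__main__':
        pool = Pool(processes=4)   # tune to your core count
        result_rows = pool.map(median_of_row_file, range(IMAGE_HEIGHT))
        pool.close()
        pool.join()
        result = np.vstack(result_rows)

    Pool.map preserves order, so the rows come back in the right sequence for vstack. Since the work is largely disk-bound, the speedup is limited, which is why it doesn't seem worth it here.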


    Given that both of those scripts scale linearly with the number of images, the run time should scale linearly as well: for 50000 images (5x the test set), that works out to about 11 minutes for preprocessing and 6 minutes for the final processing. These timings are from an i7-4930K @ 3.4 GHz, using 32-bit Python on purpose.
