I have a 120 GB file saved (in binary via pickle
) that contains about 50,000 (600x600) 2d numpy arrays. I need to stack all of these arrays using a median. The
This is a perfect use case for numpy's memory mapped arrays.
Memory mapped arrays allow you to treat a .npy
file on disk as though it were loaded in memory as a numpy array, without actually loading it. It's as simple as
arr = np.load('filename', mmap_mode='r')
For the most part you can treat this as any other array. Array elements are only loaded into memory as required. Unfortunately some quick experimentation suggests that median
doesn't handle memmory mapped arrays well*, it still seems to load a substantial portion of the data into memory at once. So median(arr, 0)
may not work.
However, you can still loop over each index and calculate the median without running into memory issues.
[[np.median([arr[k][i][j] for k in range(50000)]) for i in range(600)] for j in range(600)]
where 50,000 reflects the total number of arrays.
Without the overhead of unpickling each file just to extract a single pixel the run time should be much quicker (by about 360000 times).
Of course, that leaves the problem of creating a .npy
file containing all of the data. A file can be created as follows,
arr = np.lib.format.open_memmap(
'filename', # File to store in
mode='w+', # Specify to create the file and write to it
dtype=float32, # Change this to your data's type
shape=(50000, 600, 600) # Shape of resulting array
)
Then, load the data as before and store it into the array (which just writes it to disk behind the scenes).
idx = 0
with open(filename, 'rb') as f:
while True:
try:
arr[idx] = pickle.load(f)
idx += 1
except EOFError:
break
Give it a couple hours to run, then head back to the start of this answer to see how to load it and take the median. Can't be any simpler**.
*I just tested it on a 7GB file, taking the median of 1,500 samples of 5,000,000 elements and memory usage was around 7GB, suggesting the entire array may have been loaded into memory. It doesn't hurt to try this way first though. If anyone else has experience with median on memmapped arrays feel free to comment.
** If you believe strangers on the internet.
Note: I use Python 2.x, porting this to 3.x shouldn't be difficult.
My idea is simple - disk space is plentiful, so let's do some preprocessing and turn that big pickle file into something that is easier to process in small chunks.
In order to test this, I wrote a small script the generates a pickle file that resembles yours. I assumed your input images are grayscale and have 8bit depth, and generated 10000 random images using numpy.random.randint.
This script will act as a benchmark that we can compare the preprocessing and processing stages against.
import numpy as np
import pickle
import time
IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000
t1 = time.time()
with open('data/raw_data.pickle', 'wb') as f:
for i in range(FILE_COUNT):
data = np.random.randint(256, size=IMAGE_WIDTH*IMAGE_HEIGHT, dtype=np.uint8)
data = data.reshape(IMAGE_HEIGHT, IMAGE_WIDTH)
pickle.dump(data, f)
print i,
t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)
In a test run this script completed in 372 seconds, generating ~ 10 GB file.
Let's split the input images on a row-by-row basis -- we will have 600 files, where file N
contains row N
from each input image. We can store the row data in binary using numpy.ndarray.tofile (and later load those files using numpy.fromfile).
import numpy as np
import pickle
import time
# Increase open file limit
# See https://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles
import win32file
win32file._setmaxstdio(1024)
IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000
t1 = time.time()
outfiles = []
for i in range(IMAGE_HEIGHT):
outfilename = 'data/row_%03d.dat' % i
outfiles.append(open(outfilename, 'wb'))
with open('data/raw_data.pickle', 'rb') as f:
for i in range(FILE_COUNT):
data = pickle.load(f)
for j in range(IMAGE_HEIGHT):
data[j].tofile(outfiles[j])
print i,
for i in range(IMAGE_HEIGHT):
outfiles[i].close()
t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)
In a test run, this script completed in 134 seconds, generating 600 files, 6 million bytes each. It used ~30MB or RAM.
Simple, just load each array using numpy.fromfile, then use numpy.median to get per-column medians, reducing it back to a single row, and accumulate such rows in a list.
Finally, use numpy.vstack to reassemble a median image.
import numpy as np
import time
IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
t1 = time.time()
result_rows = []
for i in range(IMAGE_HEIGHT):
outfilename = 'data/row_%03d.dat' % i
data = np.fromfile(outfilename, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
median_row = np.median(data, axis=0)
result_rows.append(median_row)
print i,
result = np.vstack(result_rows)
print result
t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)
In a test run, this script completed in 74 seconds. You could even parallelize it quite easily, but it doesn't seem to be worth it. The script used ~40MB of RAM.
Given how both of those scripts are linear, the time used should scale linearly as well. For 50000 images, this is about 11 minutes for preprocessing and 6 minutes for the final processing. This is on i7-4930K @ 3.4GHz, using 32bit Python on purpose.