Question
I'm trying to build a basic character recognition model using the many classifiers that scikit provides. The dataset being used is a standard handwritten set of alphanumeric samples (Chars74K image dataset taken from this source: EnglishHnd.tgz).
There are 55 samples of each character (62 alphanumeric characters in all), each image being 900x1200 pixels. I'm converting each image to grayscale and flattening the matrix into a 1x1080000 array (each pixel representing one feature).
for sample in sample_images:  # sample_images is the list of the .png files
    img = imread(sample)
    img_gray = rgb2gray(img)
    if n == 0 and m == 0:  # n and m are global variables
        n, m = np.shape(img_gray)
    img_gray = np.reshape(img_gray, n*m)
    img_gray = np.append(img_gray, sample_id)  # sample_id stores the label of the training sample
    if len(samples) == 0:  # samples is the final numpy ndarray
        samples = np.append(samples, img_gray)
        samples = np.reshape(samples, [1, n*m + 1])
    else:
        samples = np.append(samples, [img_gray], axis=0)
So the final data structure should hold 55x62 = 3410 arrays, each with 1,080,000 feature elements plus one label element. Only the final structure is stored (the intermediate matrices are local in scope).
The amount of data being stored to train the model is pretty large (I guess), because the program stops progressing beyond a point, and one run crashed my system so badly that the BIOS had to be repaired!
Up to this point, the program is only gathering the data to send to the classifier; the classification hasn't even been introduced into the code yet.
Any suggestions as to what can be done to handle the data more efficiently?
Note: I'm using numpy to store the final structure of flattened matrices. Also, the system has 8 GB of RAM.
Answer 1:
This looks like a case of running out of memory rather than a literal stack overflow (numpy allocates array data on the heap). You have 3,682,800,000 array elements, if I understand your question. What is the element type? rgb2gray returns float64, so at 8 bytes per element that is roughly 29.5 GB of data, far more than your 8 GB of RAM. Even at one byte per element you would still be at about 3.4 GB, and since np.append copies the entire array on every call, peak usage is higher still. You will need to shrink the representation or process the data incrementally rather than holding it all at once.
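A quick back-of-the-envelope check makes the scale concrete (a minimal sketch; the variable names are illustrative):

# Rough memory estimate for holding the full design matrix at once.
n_samples = 55 * 62          # 3410 images in the Chars74K EnglishHnd set
n_features = 900 * 1200      # 1,080,000 pixels per flattened image
bytes_per_elem = 8           # float64, the dtype rgb2gray returns

total = n_samples * n_features * bytes_per_elem
print(total / 1024**3)       # ~27.4 GiB -- several times the 8 GB available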
Answer 2:
I was encouraged to post this as a solution, although the comments above are probably more enlightening.
The issue with the user's program is twofold: the entire dataset is being built up in RAM at once, and np.append copies the whole growing array on every call. Really it's just overwhelming memory.
A much more common approach, especially in image processing for computer graphics or computer vision, is to process the images one at a time. This works well with sklearn, where you can update your model as each image is read in (see the partial_fit sketch after the code below).
You could use this bit of code, found in another Stack Overflow article:
import os
import numpy as np
from skimage.io import imread
from skimage.color import rgb2gray

rootdir = '/path/to/my/pictures'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file[-3:] == 'png':  # or whatever your file type is / some check
            # do your training here
            img = imread(os.path.join(subdir, file))  # join the path; `file` alone is just the name
            img_gray = rgb2gray(img)
            if n == 0 and m == 0:  # n and m are global variables
                n, m = np.shape(img_gray)
            img_gray = np.reshape(img_gray, n*m)
            # sample_id stores the label of the training sample
            img_gray = np.append(img_gray, sample_id)
            # samples is the final numpy ndarray
            if len(samples) == 0:
                samples = np.append(samples, img_gray)
                samples = np.reshape(samples, [1, n*m + 1])
            else:
                samples = np.append(samples, [img_gray], axis=0)
This is more pseudocode than a drop-in solution, but the general flow should give the right idea. Let me know if there's anything else I can do! Also check out OpenCV if you're interested in some cool deep learning algorithms. There's a bunch of cool stuff there, and images make for great sample data.
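To make the "update your model as you read in the image" idea concrete, here is a minimal sketch using scikit-learn's SGDClassifier, which supports incremental training through partial_fit. The paths_and_labels iterable and the float32 downcast are illustrative assumptions, not part of the original post:

# Incremental training: only one flattened image is in memory at a time.
import numpy as np
from skimage.io import imread
from skimage.color import rgb2gray
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()        # any estimator exposing partial_fit would work
all_classes = np.arange(62)  # the 62 alphanumeric character labels

# paths_and_labels is assumed to yield (png_path, integer_label) pairs
for path, label in paths_and_labels:
    img_gray = rgb2gray(imread(path))
    x = img_gray.reshape(1, -1).astype(np.float32)  # one row; float32 halves memory
    clf.partial_fit(x, [label], classes=all_classes)

Because each 1,080,000-element row is only about 4 MB in float32, the full ~27 GiB matrix never has to exist at once.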
Hope this helps.
Source: https://stackoverflow.com/questions/42500010/python-numpy-crashes-system-with-large-number-of-array-elements