Question
I just started using dask because of the parallel-processing power it offers. I have around 40,000 images on disk which I am going to use to build a classifier with some DL library, say Keras or TF. I collected this meta-info (image path and corresponding label) in a pandas dataframe, which looks like this:
img_path labels
0 data/1.JPG 1
1 data/2.JPG 1
2 data/3.JPG 5
...
Now here is my simple task: use dask to read the images and their corresponding labels lazily, do some processing on the images, and pass them to the classifier in batches of 32.
Define functions for reading and preprocessing:
import cv2
import numpy as np
import dask
import dask.array as da

def read_data(idx):
    img = cv2.imread(df['img_path'].iloc[idx])
    label = df['labels'].iloc[idx]
    return img, label

def img_resize(img):
    return cv2.resize(img, (224, 224))

Get delayed dask arrays:
data = [dask.delayed(read_data)(idx) for idx in range(len(df))]
images = [d[0] for d in data]
labels = [d[1] for d in data]
resized_images = [dask.delayed(img_resize)(img) for img in images]
resized_images = [da.from_delayed(x, shape=(224, 224, 3), dtype=np.float32)
                  for x in resized_images]
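For reference, here is a runnable end-to-end sketch of what I have so far, with random arrays standing in for `cv2.imread` (the actual image files are not part of this post): the per-image delayed arrays are stacked into one dask array and rechunked so that each chunk along the first axis is one batch of 32.

```python
import numpy as np
import dask
import dask.array as da

# Stand-in for reading from disk: random "images" instead of cv2.imread
def read_img(idx):
    return np.random.rand(224, 224, 3).astype(np.float32)

n = 100  # pretend dataset size
delayed_imgs = [dask.delayed(read_img)(i) for i in range(n)]
arrays = [da.from_delayed(d, shape=(224, 224, 3), dtype=np.float32)
          for d in delayed_imgs]

stacked = da.stack(arrays, axis=0)            # lazy array, shape (100, 224, 224, 3)
batched = stacked.rechunk((32, 224, 224, 3))  # one chunk per batch of 32

print(stacked.shape)      # (100, 224, 224, 3)
print(batched.chunks[0])  # (32, 32, 32, 4) -- the last batch is a remainder
```

Nothing is read or resized until `.compute()` is called on a slice or chunk.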
Now here are my questions:
Q1. How do I get a batch of data with batch_size=32 from this array? Is it now equivalent to a lazy generator? If not, can it be made to behave like one?
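To make Q1 concrete, this is the kind of lazy batch generator I have in mind (a sketch over a random stand-in array; slicing the dask array and calling `.compute()` per slice is my assumption about how the batches would be materialized):

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for the real stacked array of resized images
x = da.from_array(np.random.rand(100, 224, 224, 3).astype(np.float32),
                  chunks=(32, 224, 224, 3))

def batch_generator(arr, batch_size=32):
    # Each yielded item is a concrete NumPy batch; only the slice being
    # computed is materialized, the rest of the graph stays lazy.
    for start in range(0, arr.shape[0], batch_size):
        yield arr[start:start + batch_size].compute()

batches = list(batch_generator(x))
print(len(batches))       # 4 batches: 32 + 32 + 32 + 4
print(batches[0].shape)   # (32, 224, 224, 3)
print(batches[-1].shape)  # (4, 224, 224, 3)
```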
Q2. How do I choose an effective chunksize for better batch generation? For example, if I have 4 cores and images of size (224, 224, 3), how can I make batch processing efficient?
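For concreteness on Q2, here is the memory arithmetic I would use to reason about chunk size (a sketch; the per-batch chunk layout of (32, 224, 224, 3) is my assumption from Q1):

```python
import numpy as np

batch_size = 32
img_shape = (224, 224, 3)

# float32 image: 224 * 224 * 3 elements * 4 bytes each
bytes_per_img = int(np.prod(img_shape)) * np.dtype(np.float32).itemsize
bytes_per_batch = batch_size * bytes_per_img

print(bytes_per_img)    # 602112 bytes (~0.6 MB per image)
print(bytes_per_batch)  # 19267584 bytes (~18 MiB per batch-sized chunk)
```

At ~18 MiB per chunk, 4 cores each holding a chunk or two in flight stays well within typical RAM, which is the kind of trade-off I am asking about.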
Source: https://stackoverflow.com/questions/56586748/generating-batches-of-images-in-dask