How to feed multiple NumPy arrays to a deep learning network in Keras?

Submitted by 只愿长相守 on 2020-12-06 06:36:35

Question


I have around 13 NumPy arrays stored as files that take around 24 gigabytes on disk. Each file is for a single subject and consists of two arrays: one containing the input data (a list of 2D matrices whose rows represent sequential time steps), and the other containing the labels of the data.

My final goal is to feed all the data to a deep learning network I've written in Keras to classify new data. But I don't know how to do it without running out of memory.

I've read about Keras's data generators, but cannot find a way to use them in my situation.

I've also looked into HDF5 and h5py, but I don't know how I can add all the data to a single array (a dataset in HDF5) without running out of memory.
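As an aside, one way to avoid materializing everything in memory is to keep the per-subject .npy files and memory-map them: NumPy's mmap_mode only reads the slices you actually index. A minimal sketch (the file name is hypothetical):

```python
import numpy as np

# Hypothetical per-subject file: save once, then memory-map on load.
data = np.arange(12, dtype=np.float32).reshape(3, 4)  # (timesteps, features)
np.save("subject_01_data.npy", data)

# mmap_mode='r' keeps the array on disk; slicing reads only what is needed.
lazy = np.load("subject_01_data.npy", mmap_mode="r")
batch = np.asarray(lazy[0:2])  # materialize just the first two rows
print(batch.shape)  # (2, 4)
```

This doesn't merge everything into one HDF5 dataset, but it gives the same effect for batched training: each batch touches only the rows it loads.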


Answer 1:


What you need to do is implement a generator that feeds the data little by little to your model. Keras does have a TimeseriesGenerator, but I don't think you can use it, as it requires you to first load the whole dataset into memory. Thankfully, Keras has a generator for images (called ImageDataGenerator), which we will use as the base for our custom generator.

First, a few words on how it works. There are two main classes: the ImageDataGenerator class (which mostly deals with any preprocessing you want to perform on each image) and the DirectoryIterator class, which actually does all the work. The latter is what we will modify to get what we want. What it essentially does is:

  • Inherits from keras.preprocessing.image.Iterator, which implements many methods that initialize and generate an array called index_array that contains the indices of the images that are used in each batch. This array is changed in each iteration, while the data it draws from are shuffled in each epoch. We will build our generator upon this, to maintain its functionality.
  • Searches for all images under a directory; the labels are deduced from the directory structure. It stores the path to each image and its label in two class variables called filenames and classes respectively. We will use these same variables to store the locations of the timeseries and their classes.
  • It has a method called _get_batches_of_transformed_samples() that accepts an index_array, loads the images whose indices correspond to those of the array, and returns a batch of these images along with an array containing their classes.

What I'd suggest you do is:

  1. Write a script that structures your timeseries data the way images are expected to be structured when using the ImageDataGenerator. This involves creating a directory for each class and placing each timeseries separately inside this directory. While this will probably require more storage than your current option, the data won't be loaded into memory while training the model.
  2. Get acquainted with how the DirectoryIterator works.
  3. Define your own generator class (e.g. MyTimeseriesGenerator). Make sure it inherits from the Iterator class mentioned above.
  4. Modify it so that it searches for the file format you want (e.g. HDF5, npy) instead of image formats (e.g. png, jpeg), as it currently does. This is done in lines 1733-1763. You don't need to make it work on multiple threads like Keras's DirectoryIterator does, as this procedure is done only once.
  5. Change the _get_batches_of_transformed_samples() method, so that it reads the file type that you want, instead of reading images (lines 1774-1788). Remove any other image-related functionality the DirectoryIterator has (transforming the images, standardizing them, saving them, etc.)
  6. Make sure that the array returned by the method above matches what you want your model to accept. I'm guessing it should be in the lines of (batch_size, n_timesteps) or (batch_size, n_timesteps, n_feature), for the data and (batch_size, n_classes) for the labels.

That's about all! It sounds more difficult than it actually is. Once you get acquainted with the DirectoryIterator class, everything else is trivial.
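A rough, runnable sketch of steps 3-6, using plain NumPy instead of subclassing Keras's Iterator so it stands alone (the class name is hypothetical; a real MyTimeseriesGenerator should inherit from keras.preprocessing.image.Iterator to keep its shuffling and multiprocessing support):

```python
import os
import numpy as np

class NpyDirectoryGenerator:
    """Minimal stand-in for MyTimeseriesGenerator: scans class
    subdirectories for .npy files and yields (data, labels) batches.
    Assumes all timeseries have the same length (pad them otherwise)."""

    def __init__(self, directory, batch_size=2):
        self.batch_size = batch_size
        self.filenames, self.classes = [], []
        class_names = sorted(os.listdir(directory))
        self.class_indices = {name: i for i, name in enumerate(class_names)}
        for name in class_names:
            subdir = os.path.join(directory, name)
            for fname in sorted(os.listdir(subdir)):
                if fname.endswith(".npy"):
                    self.filenames.append(os.path.join(subdir, fname))
                    self.classes.append(self.class_indices[name])
        self.n_classes = len(class_names)

    def _get_batches_of_transformed_samples(self, index_array):
        # Load only the files in this batch; nothing else touches memory.
        x = np.stack([np.load(self.filenames[i]) for i in index_array])
        y = np.zeros((len(index_array), self.n_classes), dtype=np.float32)
        rows = np.arange(len(index_array))
        y[rows, [self.classes[i] for i in index_array]] = 1.0  # one-hot labels
        return x, y  # (batch, timesteps, features), (batch, n_classes)

    def __iter__(self):
        order = np.random.permutation(len(self.filenames))
        for start in range(0, len(order), self.batch_size):
            yield self._get_batches_of_transformed_samples(
                order[start:start + self.batch_size])
```

The directory scan mirrors how DirectoryIterator fills filenames and classes, and _get_batches_of_transformed_samples loads files on demand, which is the whole point of the exercise.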

Intended use (after modifications to the code):

from custom_generator import MyTimeseriesGenerator  # assuming you named your class 
                                                    # MyTimeseriesGenerator and you
                                                    # wrote it in a python file 
                                                    # named custom_generator

train_dir = 'path/to/your/properly/structured/train/directory'
valid_dir = 'path/to/your/properly/structured/validation/directory'

train_gen = MyTimeseriesGenerator(train_dir, batch_size=..., ...)
valid_gen = MyTimeseriesGenerator(valid_dir, batch_size=..., ...)

# instantiate and compile model, define hyper-parameters, callbacks, etc.

model.fit_generator(train_gen, validation_data=valid_gen, epochs=..., ...) 


Source: https://stackoverflow.com/questions/51697727/how-to-feed-multiple-numpy-arrays-to-a-deep-learning-network-in-keras
