Question
I have a very long dataframe (25 million rows x 500 columns) which I can access as a CSV file or a Parquet file, but which I cannot load into the RAM of my PC.
The data needs to be shaped appropriately to serve as input to a Keras LSTM model (TensorFlow 2), given a desired number of timesteps per sample and a desired number of samples per batch.
This is my second post on this subject. I have already been advised to convert the data to the TFRecord format. Since my original environment is PySpark, the transformation would be:
myDataFrame.write.format("tfrecords").option("writeLocality", "local").save("/path")
(See: How to convert multiple parquet files into TFrecord files using SPARK?)
Assuming this has been done, and to keep things concrete and reproducible, let's consider a dataframe of 1,000 rows x 3 columns, where the first two columns are features and the third is the target, and each row corresponds to a timestamp.
For example, the first column is temperature, the second is wind_speed, and the third (the target) is energy_consumption. Each row corresponds to an hour, so the dataset contains observations of 1,000 consecutive hours. We assume that the energy consumption at any given hour is a function of the state of the atmosphere over the preceding hours, so we want to use an LSTM model to estimate energy consumption. We have decided to feed the LSTM with samples that each contain the data from the previous 5 hours (i.e. 5 rows per sample). For simplicity, assume the target has been shifted back one hour, so that the slice data[0:5, :-1] has data[4, -1] as its target. Assume batch_size = 32.
The data sit on our hard disk in .tfrecords format, and we cannot load them all into RAM. How would we go about this? Can you write the code for this toy example?
Answer 1:
I don't quite understand the question. This works out of the box with TFRecords:
import tensorflow as tf

# This will not load all the data into RAM; records are streamed lazily from disk.
dataset = tf.data.TFRecordDataset("./path_to_tfrecord.tfrecord")
for sample in dataset:
    print(sample.numpy())
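Each record streamed this way is still a serialized protobuf string, so it has to be parsed before it can feed a model. A minimal sketch, assuming each Example stores the question's three toy columns as scalar floats (the feature names mirror the question; the exact schema depends on how the Spark writer serialized the rows):

```python
import tensorflow as tf

# Assumed schema for the toy example: three scalar float features per row.
feature_description = {
    "temperature": tf.io.FixedLenFeature([], tf.float32),
    "wind_speed": tf.io.FixedLenFeature([], tf.float32),
    "energy_consumption": tf.io.FixedLenFeature([], tf.float32),
}

def parse_row(serialized):
    # Decode one serialized Example into a (features, target) pair.
    parsed = tf.io.parse_single_example(serialized, feature_description)
    features = tf.stack([parsed["temperature"], parsed["wind_speed"]])
    target = parsed["energy_consumption"]
    return features, target

# Parsing is applied lazily, record by record, so RAM usage stays small.
dataset = tf.data.TFRecordDataset("./path_to_tfrecord.tfrecord").map(parse_row)
```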
To train:

model.fit(dataset)
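To actually answer the windowing part of the question (5 rows per sample, target taken from the last row, batches of 32), the per-row stream can be turned into overlapping windows with `tf.data`'s `window` transformation. This is a sketch: the small in-memory demo data stands in for a parsed TFRecordDataset, and the layer sizes are arbitrary:

```python
import tensorflow as tf

TIMESTEPS = 5   # rows per sample, as in the question
BATCH_SIZE = 32

def make_windows(dataset):
    """Turn a per-row dataset of (features, target) pairs into overlapping
    windows: inputs of shape (TIMESTEPS, n_features), with the target taken
    from the last row of each window."""
    # window() yields a dataset of (sub-dataset, sub-dataset) pairs;
    # flat_map + batch(TIMESTEPS) materializes each window as tensors.
    windows = dataset.window(TIMESTEPS, shift=1, drop_remainder=True)
    windows = windows.flat_map(
        lambda f, t: tf.data.Dataset.zip((f.batch(TIMESTEPS), t.batch(TIMESTEPS)))
    )
    # Keep all TIMESTEPS feature rows, but only the window's last target.
    return windows.map(lambda f, t: (f, t[-1]))

# Demo with in-memory toy data; in practice `rows` would be a parsed
# TFRecordDataset, so the full data never needs to fit in RAM.
features = tf.random.uniform((100, 2))   # temperature, wind_speed
targets = tf.random.uniform((100,))      # energy_consumption
rows = tf.data.Dataset.from_tensor_slices((features, targets))

train_data = make_windows(rows).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(TIMESTEPS, 2)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(train_data, epochs=1, verbose=0)
```

With `shift=1` the windows overlap row by row, matching the question's sliding 5-hour samples; `drop_remainder=True` discards the trailing rows that cannot fill a complete window.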
Can you share a few samples of what gets printed? (With "..." to shorten things if necessary.)
Source: https://stackoverflow.com/questions/60126186/transforming-the-data-stored-in-tfrecord-format-to-become-inputs-to-a-lstm-keras