Split .tfrecords file into many .tfrecords files

长情又很酷 2020-12-09 20:37

Is there any way to split a .tfrecords file into many .tfrecords files directly, without writing back each Dataset example?

4 answers

一向 (original poster)

2020-12-09 21:23
Very efficient way for TensorFlow 2.x

As mentioned by @yongjieyongjie, you should use .batch() instead of .shard() to avoid iterating over the dataset more often than needed. But if your dataset is very large, too big for memory, that approach fails silently (no error is raised): you end up with only a few files containing a fraction of your original dataset.

First, batch your dataset, using the number of records you want per file as the batch size (I assume your dataset is already in serialized format; if not, serialize it first).
```
dataset = dataset.batch(ITEMS_PER_FILE)
```
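As a rough sanity check against the silent truncation described above, you can work out how many files to expect before writing anything. The sketch below is plain Python with made-up numbers; `expected_files`, `num_records` and `items_per_file` are illustrative names, not part of TensorFlow. It relies on `batch()` keeping the final, smaller batch by default (`drop_remainder=False`), so the last file may hold fewer records than the others.

```python
def expected_files(num_records, items_per_file):
    """Return (number of files, records in the last file) for a given split."""
    if items_per_file <= 0:
        raise ValueError("items_per_file must be positive")
    if num_records == 0:
        return 0, 0
    full, remainder = divmod(num_records, items_per_file)
    if remainder == 0:
        # Every file is completely full.
        return full, items_per_file
    # One extra, partially filled file holds the leftover records.
    return full + 1, remainder


# Hypothetical numbers, for illustration only.
num_files, last_size = expected_files(num_records=10_500, items_per_file=1_000)
print(num_files, last_size)
```

If the number of files actually written is lower than this, part of the dataset was probably dropped.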
Next, use a generator to avoid running out of memory.
```
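import os
import tensorflow as tf

def write_generator():
    """Yield (batch dataset, writer, file index), one output file at a time."""
    i = 0
    iterator = iter(dataset)
    optional = iterator.get_next_as_optional()
    while optional.has_value().numpy():
        ds = optional.get_value()  # one batch = the records for one file
        optional = iterator.get_next_as_optional()
        batch_ds = tf.data.Dataset.from_tensor_slices(ds)
        # save_to (output directory) and NAME_OF_FILES (file name prefix) must be defined beforehand
        path = os.path.join(save_to, NAME_OF_FILES + "-" + str(i) + ".tfrecord")
        writer = tf.data.experimental.TFRecordWriter(path, compression_type='GZIP')
        yield batch_ds, writer, i  # yield before incrementing so i matches the file name
        i += 1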
```
Now simply use the generator in a normal for-loop:
```
import time

for data, wri, i in write_generator():
    start_time = time.time()
    wri.write(data)  # writes one batch, i.e. one output file
    print("Time needed: ", time.time() - start_time, "s", "\t", NAME_OF_FILES + "-" + str(i) + ".tfrecord")
```
As long as a single file fits raw (uncompressed) in memory, this should work fine.
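Why the generator keeps memory bounded can be seen without TensorFlow. The following is a pure-Python analogy, not the TensorFlow code itself: `chunked` is a hypothetical helper that pulls one file's worth of records from an iterator at a time, so only one chunk is held in memory while it is being processed.

```python
from itertools import islice


def chunked(records, items_per_file):
    """Lazily yield (index, list of records), one file-sized chunk at a time."""
    iterator = iter(records)
    index = 0
    while True:
        # Pull at most items_per_file records; earlier chunks can be garbage-collected.
        chunk = list(islice(iterator, items_per_file))
        if not chunk:
            return
        yield index, chunk
        index += 1


# Stand-in for serialized records; a real pipeline would read them from disk.
fake_records = (f"record-{n}".encode() for n in range(7))
for index, chunk in chunked(fake_records, items_per_file=3):
    print(f"file {index}: {len(chunk)} records")
```

The last chunk is simply shorter when the record count is not a multiple of the chunk size, which mirrors the smaller final file produced by batching.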