Split .tfrecords file into many .tfrecords files

长情又很酷 2020-12-09 20:37

Is there a way to split a .tfrecords file into several .tfrecords files directly, without writing each Dataset example back out?

4 answers
  •  旧时难觅i
    2020-12-09 21:27

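    Use .batch() instead of .shard() to avoid iterating over the dataset multiple times

    A more performant approach than tf.data.Dataset.shard() is to batch the records. Producing N files with shard() takes N passes over the input, one per shard, whereas batching reads the input only once: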

    import tensorflow as tf

    ITEMS_PER_FILE = 100  # Number of records to store in each output .tfrecord file

    # Reading the file without parsing yields each record as a serialized byte string
    raw_dataset = tf.data.TFRecordDataset('in.tfrecord')

    for batch_idx, batch in enumerate(raw_dataset.batch(ITEMS_PER_FILE)):
        # `batch` is a 1-D string tensor of serialized records;
        # slice it back into a `Dataset` with one record per element
        batch_ds = tf.data.Dataset.from_tensor_slices(batch)
        filename = f'out.tfrecord.{batch_idx:03d}'

        writer = tf.data.experimental.TFRecordWriter(filename)
        writer.write(batch_ds)
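
In recent TensorFlow 2.x releases, tf.data.experimental.TFRecordWriter is marked as deprecated in favour of tf.io.TFRecordWriter. Below is a minimal sketch of the same batching idea using that writer, assuming eager execution (the TF 2.x default). The function name `split_tfrecord` and its parameters are hypothetical, chosen for illustration only. Records are copied as raw bytes and never parsed.

```python
import tensorflow as tf


def split_tfrecord(in_path, out_prefix, items_per_file=100):
    """Copy serialized records from in_path into numbered output files.

    Hypothetical helper: returns the list of filenames it wrote.
    """
    filenames = []
    raw_dataset = tf.data.TFRecordDataset(in_path)
    for idx, batch in enumerate(raw_dataset.batch(items_per_file)):
        filename = f'{out_prefix}.{idx:03d}'
        with tf.io.TFRecordWriter(filename) as writer:
            # batch.numpy() is an array of bytes objects, one per record
            for record in batch.numpy():
                writer.write(record)
        filenames.append(filename)
    return filenames
```

Calling `split_tfrecord('in.tfrecord', 'out.tfrecord')` would write out.tfrecord.000, out.tfrecord.001, and so on, with the last file holding whatever records remain.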
    
