How to split a dataset created by the TensorFlow Dataset API into train and test?

天命终不由人 2020-12-08 02:12

Does anyone know how to split a dataset created by the Dataset API (tf.data.Dataset) in TensorFlow into test and train sets?

8 answers
  • 2020-12-08 02:37

    You may use Dataset.take() and Dataset.skip():

    train_size = int(0.7 * DATASET_SIZE)
    val_size = int(0.15 * DATASET_SIZE)
    test_size = int(0.15 * DATASET_SIZE)
    
    full_dataset = tf.data.TFRecordDataset(FLAGS.input_file)
    full_dataset = full_dataset.shuffle(buffer_size=DATASET_SIZE)  # shuffle() requires a buffer size
    train_dataset = full_dataset.take(train_size)
    test_dataset = full_dataset.skip(train_size)
    val_dataset = test_dataset.skip(val_size)
    test_dataset = test_dataset.take(test_size)
    

    For generality, this example uses a 70/15/15 train/val/test split; if you don't need a test or a validation set, just ignore the last two lines.

    Take:

    Creates a Dataset with at most count elements from this dataset.

    Skip:

    Creates a Dataset that skips count elements from this dataset.

    You may also want to look into Dataset.shard():

    Creates a Dataset that includes only 1/num_shards of this dataset.


    Disclaimer: I stumbled upon this question after answering this one, so I thought I'd spread the love.
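
    One practical note: the snippet assumes DATASET_SIZE is already known. A TFRecordDataset generally has unknown cardinality, so one way to get it (a minimal sketch, assuming the file is small enough to scan once; the path is a placeholder) is to count the records:

    import tensorflow as tf

    input_file = "data.tfrecord"  # placeholder path; substitute your own file

    # TFRecordDataset cardinality is usually unknown ahead of time,
    # so count the records by iterating over the file once.
    DATASET_SIZE = sum(1 for _ in tf.data.TFRecordDataset(input_file))
    print(DATASET_SIZE)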

  • Can't comment, but the answer above has overlap and is incorrect. Set BUFFER_SIZE to DATASET_SIZE for a perfect shuffle, and try different val/test sizes to verify. The answer should be:

    DATASET_SIZE = tf.data.experimental.cardinality(full_dataset).numpy()
    train_size = int(0.7 * DATASET_SIZE)
    val_size = int(0.15 * DATASET_SIZE)
    test_size = int(0.15 * DATASET_SIZE)
    
    BUFFER_SIZE = DATASET_SIZE  # buffer the whole dataset for a perfect shuffle
    full_dataset = full_dataset.shuffle(BUFFER_SIZE)
    train_dataset = full_dataset.take(train_size)
    test_dataset = full_dataset.skip(train_size)
    val_dataset = test_dataset.take(val_size)
    test_dataset = test_dataset.skip(val_size)
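
    One more detail: by default shuffle() reshuffles on every iteration, so take() and skip() can draw from a different permutation each epoch and the splits can leak into each other. A minimal sketch (assuming a toy range dataset and a fixed seed) that pins the order down:

    import tensorflow as tf

    full_dataset = tf.data.Dataset.range(10)
    DATASET_SIZE = tf.data.experimental.cardinality(full_dataset).numpy()

    # reshuffle_each_iteration=False freezes the permutation, so take()
    # and skip() partition the same ordering on every pass.
    full_dataset = full_dataset.shuffle(
        buffer_size=DATASET_SIZE, seed=42, reshuffle_each_iteration=False)

    train_dataset = full_dataset.take(7)
    test_dataset = full_dataset.skip(7)

    print(sorted(int(x) for x in train_dataset))  # 7 elements
    print(sorted(int(x) for x in test_dataset))   # the remaining 3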
    
  • 2020-12-08 02:42

    @ted's answer will cause some overlap. Try this:

    train_ds_size = int(0.64 * full_ds_size)
    valid_ds_size = int(0.16 * full_ds_size)
    
    train_ds = full_ds.take(train_ds_size)
    remaining = full_ds.skip(train_ds_size)  
    valid_ds = remaining.take(valid_ds_size)
    test_ds = remaining.skip(valid_ds_size)
    

    Use the code below to test:

    tf.enable_eager_execution()  # TF 1.x only; eager execution is the default in TF 2.x
    
    dataset = tf.data.Dataset.range(100)
    
    train_size = 20
    valid_size = 30
    test_size = 50
    
    train = dataset.take(train_size)
    remaining = dataset.skip(train_size)
    valid = remaining.take(valid_size)
    test = remaining.skip(valid_size)
    
    for i in train:
        print(i)
    
    for i in valid:
        print(i)
    
    for i in test:
        print(i)
    
  • 2020-12-08 02:44

    Currently, TensorFlow doesn't include any built-in tools for that.
    You could use sklearn.model_selection.train_test_split to generate train/eval/test splits, then create a tf.data.Dataset for each, as sketched below.
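
    A minimal sketch of that approach, assuming in-memory NumPy arrays (the data here is a random placeholder):

    import numpy as np
    import tensorflow as tf
    from sklearn.model_selection import train_test_split

    # Placeholder features and labels; substitute your own arrays.
    X = np.random.rand(1000, 32).astype(np.float32)
    y = np.random.randint(0, 2, size=1000)

    # 70/15/15 split done with two passes of train_test_split.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.3, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=42)

    train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
    val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
    test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))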

  • 2020-12-08 02:48

    Assuming you have an all_dataset variable of type tf.data.Dataset:

    test_dataset = all_dataset.take(1000) 
    train_dataset = all_dataset.skip(1000)
    

    The test dataset now holds the first 1,000 elements, and the rest goes to training.

  • 2020-12-08 02:52

    You can use shard:

    dataset = dataset.shuffle(buffer_size=1000)  # optional; 1000 is a placeholder buffer size
    trainset = dataset.shard(2, 0)
    testset = dataset.shard(2, 1)
    

    See: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shard
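
    Note that shard(num_shards, index) keeps every num_shards-th element starting at index, so this produces an interleaved 50/50 split rather than a contiguous one. A quick sketch on a toy dataset:

    import tensorflow as tf

    dataset = tf.data.Dataset.range(10)
    trainset = dataset.shard(num_shards=2, index=0)
    testset = dataset.shard(num_shards=2, index=1)

    print([int(x) for x in trainset])  # [0, 2, 4, 6, 8]
    print([int(x) for x in testset])   # [1, 3, 5, 7, 9]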
