How do I split a custom dataset into training and test datasets?

遇见更好的自我 2020-12-07 09:50
import pandas as pd
import numpy as np
import cv2
from torch.utils.data.dataset import Dataset

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path,         


        
5 answers
  •  [愿得一人]
    2020-12-07 10:30

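    torch.utils.data.random_split returns Subset objects, each of which stores the original dataset together with the indices assigned to that split. SubsetRandomSampler relies on the same idea of drawing samples from a given list of indices.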
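    To make the relationship between random_split and Subset concrete, here is a minimal sketch on a toy dataset. The TensorDataset, the 8/2 split sizes, and the seed are assumptions made for illustration; a custom Dataset subclass would be split the same way.

```python
import torch
from torch.utils.data import Subset, TensorDataset, random_split

# Toy stand-in for a custom dataset: 10 samples with 3 features each.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
full_ds = TensorDataset(features, labels)

# A seeded generator makes the random split reproducible.
generator = torch.Generator().manual_seed(0)
train_ds, test_ds = random_split(full_ds, [8, 2], generator=generator)

# Each part is a Subset: a view over full_ds plus the indices it owns.
train_indices = [int(i) for i in train_ds.indices]
test_indices = [int(i) for i in test_ds.indices]
print(type(train_ds).__name__, len(train_ds), len(test_ds))
print(sorted(train_indices + test_indices))
```

    The two index lists never overlap and together cover the whole dataset, so no sample ends up in both splits.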

    For example, applying random_split to MNIST:

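    import torch
    import torchvision
    from torch.utils.data import DataLoader

    # Load the 60000-image MNIST training set with normalization applied.
    loader = DataLoader(
      torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                                 transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                     (0.5,), (0.5,))
                                 ])),
      batch_size=16, shuffle=False)
    
    print(loader.dataset.data.shape)
    # Randomly split the 60000 samples into subsets of 50000 and 10000.
    test_ds, valid_ds = torch.utils.data.random_split(loader.dataset, (50000, 10000))
    print(test_ds, valid_ds)
    print(test_ds.indices, valid_ds.indices)
    # Recent PyTorch versions store indices as a list, so convert before reading .shape.
    print(torch.as_tensor(test_ds.indices).shape, torch.as_tensor(valid_ds.indices).shape)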
    

    We get:

    torch.Size([60000, 28, 28])
     
    tensor([ 1520,  4155, 45472,  ..., 37969, 45782, 34080]) tensor([ 9133, 51600, 22067,  ...,  3950, 37306, 31400])
    torch.Size([50000]) torch.Size([10000])
    

    The values in test_ds.indices and valid_ds.indices are drawn at random from the range 0 to 59999. random_split cannot produce two contiguous blocks instead, such as indices 0 to 49999 and 50000 to 59999; for that, the subsets have to be built from explicit index ranges.

    A contiguous split is handy for benchmarks such as MNIST, where it is predefined which samples belong to the test set and which to the validation set.
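    One way to get such a contiguous split is to construct Subset objects directly from index ranges. The sketch below uses a 12-sample toy dataset in place of MNIST; the names train_ds and valid_ds, the split point n_train, and the batch size are assumptions chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy stand-in for the 60000 MNIST training images: 12 samples here.
inputs = torch.arange(12, dtype=torch.float32).unsqueeze(1)
targets = torch.arange(12)
full_ds = TensorDataset(inputs, targets)

# Hypothetical split point; for MNIST this would be 50000.
n_train = 10

# Subset accepts any sequence of indices, so ranges give contiguous blocks.
train_ds = Subset(full_ds, range(0, n_train))
valid_ds = Subset(full_ds, range(n_train, len(full_ds)))

# Shuffle only the training loader; validation order stays fixed.
train_loader = DataLoader(train_ds, batch_size=4, shuffle=True)
valid_loader = DataLoader(valid_ds, batch_size=4, shuffle=False)

print(list(train_ds.indices))
print(list(valid_ds.indices))
```

    Because the split is defined by fixed ranges rather than a random permutation, every run assigns the same samples to each subset.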
