How do I split a custom dataset into training and test datasets?

import pandas as pd
import numpy as np
import cv2
from torch.utils.data.dataset import Dataset

class CustomDatasetFromCSV(Dataset):
    def __init__(self, csv_path, transform=None):
        ...

5 Answers
  • 2020-12-07 10:12

    The current answers do random splits, which has the disadvantage that the number of samples per class is not guaranteed to be balanced. This is especially problematic when you want only a small number of samples per class. For example, MNIST has 60,000 examples, i.e. 6,000 per digit. Assume that you want only 30 examples per digit in your training set. A random split may then produce an imbalance between classes (one digit ending up with more training data than the others), so you want to make sure each digit has exactly 30 samples. This is called stratified sampling.

    One way to do this is to use the sampler interface in PyTorch.
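
    For example, here is a minimal sketch of that approach (not from the original answer, whose code link was lost). It combines scikit-learn's train_test_split with stratify and PyTorch's SubsetRandomSampler, and it assumes the dataset exposes its labels as ds.targets, as torchvision's MNIST does:

    import numpy as np
    import torch
    from torch.utils.data.sampler import SubsetRandomSampler
    from sklearn.model_selection import train_test_split

    def stratified_samplers(ds, test_fraction=0.2, seed=42):
        # torchvision's MNIST stores all labels in ds.targets.
        labels = ds.targets.numpy() if torch.is_tensor(ds.targets) else np.asarray(ds.targets)
        # stratify keeps the class proportions identical in both index sets.
        train_idx, test_idx = train_test_split(
            np.arange(len(ds)),
            test_size=test_fraction,
            stratify=labels,
            random_state=seed,
        )
        return SubsetRandomSampler(train_idx.tolist()), SubsetRandomSampler(test_idx.tolist())

    # The two samplers are then passed to two DataLoaders over the same dataset:
    # train_loader = DataLoader(ds, batch_size=64, sampler=train_sampler)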

    Another way to do this is to just hack your way through :). For example, below is a simple implementation for MNIST, where ds is the MNIST dataset and k is the number of samples needed per class.

    import torch
    from torch.utils.data import TensorDataset

    def sampleFromClass(ds, k):
        # Take the first k samples of each class for training; every
        # later sample of that class goes to the test set. Assumes ds
        # yields tensors (e.g. via transforms.ToTensor()).
        class_counts = {}
        train_data = []
        train_label = []
        test_data = []
        test_label = []
        for data, label in ds:
            # Labels may be plain ints or 0-dim tensors depending on
            # the torchvision version.
            c = label.item() if torch.is_tensor(label) else label
            class_counts[c] = class_counts.get(c, 0) + 1
            if class_counts[c] <= k:
                train_data.append(data)
                train_label.append(torch.tensor([c]))
            else:
                test_data.append(data)
                test_label.append(torch.tensor([c]))
        train_data = torch.cat(train_data)
        train_label = torch.cat(train_label)
        test_data = torch.cat(test_data)
        test_label = torch.cat(test_label)

        return (TensorDataset(train_data, train_label),
                TensorDataset(test_data, test_label))
    

    You can use this function like this:

    from torchvision import datasets, transforms

    def main():
        train_ds = datasets.MNIST('../data', train=True, download=True,
                                  transform=transforms.Compose([
                                      transforms.ToTensor()
                                  ]))
        train_ds, test_ds = sampleFromClass(train_ds, 3)
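
    The returned TensorDatasets can then be wrapped in DataLoaders as usual, for example (the batch size here is an arbitrary choice, not part of the original answer):

    from torch.utils.data import DataLoader

    train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_ds, batch_size=64, shuffle=False)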
    
  • 2020-12-07 10:23

    Bear in mind that most canonical datasets come already split. For instance, take MNIST. One common belief is that it has 60,000 images. Bang! Wrong! It has 70,000 images: 60,000 training and 10,000 validation (test) images.

    So for the canonical datasets, the PyTorch way is to provide you with already-split datasets:

    import torch
    import torchvision
    import torchvision.transforms as transforms
    from torch.utils.data import DataLoader

    bs = 512

    # Normalize expects a sequence of means and stds, one per channel.
    t = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.0,), std=(1.0,)),
    ])

    # train=True loads the 60,000-image training split,
    # train=False the 10,000-image test split.
    dl_train = DataLoader(
        torchvision.datasets.MNIST('/data/mnist', download=True, train=True, transform=t),
        batch_size=bs, drop_last=True, shuffle=True)
    dl_valid = DataLoader(
        torchvision.datasets.MNIST('/data/mnist', download=True, train=False, transform=t),
        batch_size=bs, drop_last=True, shuffle=True)
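
    A quick sanity check (a sketch, not part of the original answer) confirms the two split sizes:

    print(len(dl_train.dataset))  # 60000
    print(len(dl_valid.dataset))  # 10000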
    
  • 2020-12-07 10:24

    Using PyTorch's SubsetRandomSampler:

    import torch
    import numpy as np
    import pandas as pd
    import cv2
    from torch.utils.data import Dataset
    from torch.utils.data.sampler import SubsetRandomSampler
    
    class CustomDatasetFromCSV(Dataset):
        def __init__(self, csv_path, transform=None):
            self.data = pd.read_csv(csv_path)
            # One-hot encode the labels (as_matrix() is deprecated;
            # to_numpy() is its current replacement).
            self.labels = pd.get_dummies(self.data['emotion']).to_numpy()
            self.height = 48
            self.width = 48
            self.transform = transform

        def __getitem__(self, index):
            # This method should return only 1 sample and label
            # (according to "index"), not the whole dataset.
            pixel_sequence = self.data['pixels'][index]
            face = [int(pixel) for pixel in pixel_sequence.split(' ')]
            face = np.asarray(face).reshape(self.height, self.width)
            face = cv2.resize(face.astype('uint8'), (self.width, self.height))
            if self.transform is not None:
                face = self.transform(face)
            label = self.labels[index]

            return face, label

        def __len__(self):
            return len(self.labels)
    
    
    dataset = CustomDatasetFromCSV(my_path)
    batch_size = 16
    validation_split = .2
    shuffle_dataset = True
    random_seed = 42
    
    # Creating data indices for training and validation splits:
    dataset_size = len(dataset)
    indices = list(range(dataset_size))
    split = int(np.floor(validation_split * dataset_size))
    if shuffle_dataset:
        np.random.seed(random_seed)
        np.random.shuffle(indices)
    train_indices, val_indices = indices[split:], indices[:split]
    
    # Creating PT data samplers and loaders:
    train_sampler = SubsetRandomSampler(train_indices)
    valid_sampler = SubsetRandomSampler(val_indices)
    
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, 
                                               sampler=train_sampler)
    validation_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                                    sampler=valid_sampler)
    
    # Usage example:
    num_epochs = 10
    for epoch in range(num_epochs):
        # Train:
        for batch_index, (faces, labels) in enumerate(train_loader):
            # ... training step goes here
            pass
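
    A matching validation pass over validation_loader could look like this (a sketch, not part of the original answer):

    # Validate: same dataset, disjoint indices via valid_sampler.
    with torch.no_grad():
        for faces, labels in validation_loader:
            # ... evaluation step goes here
            pass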
    
  • 2020-12-07 10:26

    Starting in PyTorch 0.4.1 you can use random_split:

    train_size = int(0.8 * len(full_dataset))
    test_size = len(full_dataset) - train_size
    train_dataset, test_dataset = torch.utils.data.random_split(full_dataset, [train_size, test_size])
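
    If you need the split to be reproducible across runs, newer PyTorch versions accept a generator argument (a sketch; this argument is not part of the original answer and needs a PyTorch release more recent than 0.4.1):

    import torch

    train_dataset, test_dataset = torch.utils.data.random_split(
        full_dataset, [train_size, test_size],
        generator=torch.Generator().manual_seed(42))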
    
  • 2020-12-07 10:30

    Under the hood, random_split wraps the dataset in PyTorch Subset objects. Note that this mechanism is also the basis for SubsetRandomSampler.

    For MNIST, if we use random_split:

    import torch
    import torchvision
    from torch.utils.data import DataLoader

    loader = DataLoader(
        torchvision.datasets.MNIST('/data/mnist', train=True, download=True,
                                   transform=torchvision.transforms.Compose([
                                       torchvision.transforms.ToTensor(),
                                       torchvision.transforms.Normalize(
                                           (0.5,), (0.5,))
                                   ])),
        batch_size=16, shuffle=False)

    print(loader.dataset.data.shape)
    test_ds, valid_ds = torch.utils.data.random_split(loader.dataset, (50000, 10000))
    print(test_ds, valid_ds)
    # Note: in recent PyTorch versions .indices is a plain Python list
    # rather than a tensor, so the last two prints may differ.
    print(test_ds.indices, valid_ds.indices)
    print(test_ds.indices.shape, valid_ds.indices.shape)
    

    We get:

    torch.Size([60000, 28, 28])
    <torch.utils.data.dataset.Subset object at 0x0000020FD1880B00> <torch.utils.data.dataset.Subset object at 0x0000020FD1880C50>
    tensor([ 1520,  4155, 45472,  ..., 37969, 45782, 34080]) tensor([ 9133, 51600, 22067,  ...,  3950, 37306, 31400])
    torch.Size([50000]) torch.Size([10000])
    

    Our test_ds.indices and valid_ds.indices will be random indices from the range (0, 60000). But if I would like to get a deterministic sequence of indices, say 0-49999 and 50000-59999, random_split cannot do that at the moment; you have to build the subsets yourself.
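
    A minimal sketch of such a manual, contiguous split using torch.utils.data.Subset (the ranges mirror the 50,000/10,000 split above):

    from torch.utils.data import Subset

    # Deterministic split: first 50,000 samples vs. the remaining 10,000.
    first_ds = Subset(loader.dataset, range(0, 50000))
    second_ds = Subset(loader.dataset, range(50000, 60000))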

    This is handy when you run the MNIST benchmark, where it is predefined which samples belong to the test set and which to the validation set.
