Most efficient way to use a large data set for PyTorch?

爱一瞬间的悲伤 2021-02-09 08:18

Perhaps this question has been asked before, but I'm having trouble finding relevant info for my situation.

I\'m using PyTorch to create a CNN for regression with image

3 answers
  •  半阙折子戏
    2021-02-09 08:54

    In addition to the above answers, the following may be useful given some recent (2020) advances in the PyTorch world.

    Your question: Should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?

    You can leave the image files in their original format (.jpg, .png, etc.) on your local disk or in cloud storage, with one added step: pack the directory into a tar archive. Please read this for more details:

    Pytorch Blog (Aug 2020): Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs (https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/)

    The library described there (WebDataset) is designed for situations where the data files are too large to fit in memory during training. You give it the URL of the dataset location (local, cloud, etc.), and it streams the data in batches and in parallel.

    The only (current) requirement is that the dataset must be in a tar file format.
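    As a rough illustration, here is a minimal sketch of packing images into such a tar file with WebDataset's TarWriter. The directory name, file names, and target values are purely illustrative assumptions, not part of the original answer:

        import os
        import webdataset as wds

        # Hypothetical layout: one .jpg per sample in image_dir, plus a dict mapping
        # each file stem to its regression target -- adjust to your own data.
        image_dir = "images"
        targets = {"img_0001": 0.73, "img_0002": 1.25}  # illustrative values

        # One shard; in practice you would typically write several shards,
        # e.g. train-000000.tar .. train-000003.tar.
        with wds.TarWriter("train-000000.tar") as sink:
            for name, target in targets.items():
                with open(os.path.join(image_dir, name + ".jpg"), "rb") as f:
                    jpg_bytes = f.read()
                sink.write({
                    "__key__": name,                     # groups the entries below into one sample
                    "jpg": jpg_bytes,                    # stored as <name>.jpg inside the tar
                    "txt": str(target).encode("utf-8"),  # stored as <name>.txt (the regression target)
                })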

    The tar file can be on the local disk or in the cloud. With this, you don't have to load the entire dataset into memory every time; you can use torch.utils.data.DataLoader to load batches for stochastic gradient descent.
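
    Putting it together, a minimal sketch of streaming such shards through DataLoader might look like this. The shard pattern, per-sample keys, batch size, and transform are assumptions for illustration; a local path or an http(s)/cloud URL works the same way:

        import torch
        import webdataset as wds
        from torch.utils.data import DataLoader
        from torchvision import transforms

        preprocess = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

        dataset = (
            wds.WebDataset("train-{000000..000003}.tar")  # hypothetical shard pattern (local or remote)
            .shuffle(1000)                                # shuffle within a buffer of samples
            .decode("pil")                                # decode image bytes into PIL images
            .to_tuple("jpg", "txt")                       # pick (image, target) out of each sample
            .map_tuple(preprocess, lambda t: torch.tensor(float(t)))
        )

        # WebDataset is an IterableDataset, so it plugs straight into DataLoader;
        # samples are streamed from the tar shards rather than loaded up front.
        loader = DataLoader(dataset, batch_size=32, num_workers=4)

        for images, targets in loader:
            pass  # forward pass, loss, and optimizer step would go here

    Because the samples are read sequentially from the tar shards, this avoids the many random small-file reads that make plain image folders slow, especially on network or cloud storage.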
