CUDA runtime error (59) : device-side assert triggered

星月不相逢 2020-12-15 17:36

I have access to a Tesla K20c and am running ResNet50 on the CIFAR10 dataset... Then I get the following error:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_152458471046

4 answers
  • 2020-12-15 17:50

    I have run into this problem several times, and each time it turned out to be an indexing issue. For example, if your ground-truth labels start at 1, e.g. target = [1,2,3,4,5], subtract 1 from every label so that they become [0,1,2,3,4]. This has solved the problem for me every time.
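
    A minimal sketch of that shift, assuming the labels are integers in a plain Python list. The helper name `shift_labels_to_zero_based` is hypothetical and not part of PyTorch:

```python
def shift_labels_to_zero_based(labels):
    """Shift integer labels so that the smallest label becomes 0."""
    if not labels:
        return []
    offset = min(labels)
    return [label - offset for label in labels]


target = [1, 2, 3, 4, 5]
shifted = shift_labels_to_zero_based(target)
print(shifted)  # [0, 1, 2, 3, 4]
```

    This only removes a constant offset; it does not close gaps between label values.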

  • 2020-12-15 18:01

    I encountered this error when running BertModel.from_pretrained('bert-base-uncased'). I found the cause by moving to the CPU, where the error message changed to 'IndexError: index out of range in self', which led me to this post. The solution was to truncate sentences to length 512.
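
    A rough sketch of such a truncation on already-tokenized input, assuming each sentence is a list of token ids whose last element is an end-of-sequence marker. `truncate_ids` is a hypothetical helper and the example ids are illustrative only; with a Hugging Face tokenizer, passing `truncation=True, max_length=512` is the more usual route:

```python
MAX_LEN = 512  # length limit mentioned in the answer above


def truncate_ids(token_ids, max_len=MAX_LEN):
    """Keep at most max_len ids, preserving the final (end-of-sequence) id."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    # Keep the first max_len - 1 ids and re-attach the original last id.
    return list(token_ids[:max_len - 1]) + [token_ids[-1]]


# Illustrative input: a start marker, 600 made-up token ids, an end marker.
ids = [101] + list(range(1000, 1600)) + [102]
short = truncate_ids(ids)
print(len(ids), len(short))  # 602 512
```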

  • 2020-12-15 18:03

    This error becomes more informative if you switch to the CPU first. On the CPU you get the exact error, which is most likely an indexing problem; in my case it was IndexError: Target 2 is out of bounds, and yours could be similar. The question to ask is: how many classes are you using, and what is the shape of your output? You can inspect the range of your class labels like this:

    max(train_labels)
    min(train_labels)
    

    which in my case gave 2 and 0. The problem was that label 1 was missing, so the labels were not contiguous. A quick hack is to replace all 2s with 1s, which can be done with this code:

    # train is assumed to be a pandas DataFrame with a 'label' column
    train_ = train.copy()
    train_['label'] = train_['label'].replace(2, 1)
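
    The replacement above handles one specific gap. A more general sketch, assuming the labels are hashable values in a plain list, maps whatever distinct labels are present onto 0..K-1. `make_contiguous` is a hypothetical helper name:

```python
def make_contiguous(labels):
    """Map the distinct label values onto 0..K-1, preserving their sort order."""
    mapping = {old: new for new, old in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


train_labels = [0, 2, 2, 0, 2]
remapped, mapping = make_contiguous(train_labels)
print(remapped)  # [0, 1, 1, 0, 1]
print(mapping)   # {0: 0, 2: 1}
```

    Keep the returned mapping so that predictions can be translated back to the original label values.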
    

    Then run the same code again and check the results; it should work.

    import torch

    class NDataset(torch.utils.data.Dataset):
        """Pairs tokenizer encodings with their labels."""

        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __getitem__(self, idx):
            # One example: every encoding field at position idx, plus its label.
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx])
            return item

        def __len__(self):
            return len(self.labels)

    # The *_encodings and *_labels variables are assumed to be defined earlier (not shown here).
    train_dataset = NDataset(train_encodings, train_labels)
    val_dataset = NDataset(val_encodings, val_labels)
    test_dataset = NDataset(test_encodings, test_labels)
    
  • 2020-12-15 18:07

    In general, when encountering CUDA runtime errors, it is advisable to run your program again with the environment variable CUDA_LAUNCH_BLOCKING=1 set, to obtain an accurate stack trace.
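
    One way to set it from inside the script, sketched under the assumption that nothing has initialized CUDA yet (so it goes before the `torch` import). Setting it in the shell, e.g. `CUDA_LAUNCH_BLOCKING=1 python your_script.py` with your own script name, should have the same effect:

```python
import os

# Must be set before CUDA is initialized, so do it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and run the failing code after this point
print(os.environ["CUDA_LAUNCH_BLOCKING"])  # 1
```

    Kernel launches then run synchronously, which is slower, so it is best used only while debugging.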

    In your specific case, the targets in your data were too high (or too low) for the specified number of classes; they must lie in the range 0 to num_classes - 1.
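
    A small sanity check along those lines, which can be run on the CPU before training. It assumes integer class targets in a plain list; `find_out_of_range` and `check_targets` are hypothetical helper names:

```python
def find_out_of_range(targets, num_classes):
    """Return the sorted distinct targets that fall outside [0, num_classes - 1]."""
    return sorted({t for t in targets if not 0 <= t < num_classes})


def check_targets(targets, num_classes):
    """Raise ValueError if any target cannot index one of num_classes outputs."""
    bad = find_out_of_range(targets, num_classes)
    if bad:
        raise ValueError(f"labels {bad} are outside [0, {num_classes - 1}]")


check_targets([0, 1, 2], num_classes=3)             # passes silently
print(find_out_of_range([1, 2, 3], num_classes=3))  # [3]
```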
