I have access to a Tesla K20c and am running ResNet50 on the CIFAR10 dataset...
Then I get this error: THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_152458471046
I have encountered this problem several times, and each time it turned out to be a label indexing issue. Class labels must run from 0 to (number of classes - 1). For example, if your ground-truth labels start at 1, as in target = [1,2,3,4,5], subtract 1 from every label so they become [0,1,2,3,4]. This has solved the problem for me every time.
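The shift described above can be sketched in a few lines. This is only an illustration on a plain Python list; the helper name shift_to_zero_based is made up here, and it assumes the labels are consecutive integers that merely start at the wrong offset.

```python
def shift_to_zero_based(labels):
    """Shift integer labels so that the smallest one becomes 0."""
    offset = min(labels)
    return [label - offset for label in labels]


target = [1, 2, 3, 4, 5]
shifted = shift_to_zero_based(target)
print(shifted)  # [0, 1, 2, 3, 4]
```

If the labels have gaps rather than an offset, shifting alone will not help and they need to be remapped instead.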
I encountered this error when running BertModel.from_pretrained('bert-base-uncased'). Moving to the CPU changed the error message to 'IndexError: index out of range in self', which led me to this post. The solution was to truncate sentences to a length of 512 tokens.
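As a rough sketch of that truncation step, the snippet below cuts a list of token ids down to 512 while keeping the last token. The helper truncate_ids and the sample ids are made up for illustration, and it assumes the final id is the end-of-sequence marker you want to keep. If you use a Hugging Face tokenizer, passing truncation=True and max_length=512 when calling it should achieve the same thing, but check that against the version you have installed.

```python
MAX_LEN = 512  # the sequence length limit mentioned above


def truncate_ids(token_ids, max_len=MAX_LEN):
    """Cut a token-id list to max_len, keeping the final token (assumed to be [SEP])."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    return list(token_ids[:max_len - 1]) + [token_ids[-1]]


# Made-up ids: 101 and 102 stand in for [CLS] and [SEP], 7 is a dummy word id.
long_ids = [101] + [7] * 600 + [102]
short_ids = truncate_ids(long_ids)
print(len(long_ids), len(short_ids))  # 602 512
```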
You can get a more informative error message by switching to the CPU first. On the CPU you will see the exact error, which is most likely an indexing problem. In my case it was IndexError: Target 2 is out of bounds, and yours may be similar. The question to ask is "How many classes are you currently using, and what is the shape of your output?" You can find the range of your class labels like this:
max(train_labels)
min(train_labels)
In my case these gave 2 and 0. The problem is caused by the missing index 1 (the labels skip from 0 to 2), so a quick hack is to replace all 2s with 1s, which can be done with this code:
train_ = train.copy()
train_['label'] = train_['label'].replace(2, 1)
Then run the same code again and check the results; it should work.
import torch

class NDataset(torch.utils.data.Dataset):
    # Wraps a dict of encodings and a list of labels for use with a DataLoader.
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Build one example: every encoding field plus its label, as tensors.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NDataset(train_encodings, train_labels)
val_dataset = NDataset(val_encodings, val_labels)
test_dataset = NDataset(test_encodings, test_labels)
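Replacing 2 with 1 by hand works for this one case. A more general version of the same idea, sketched below, maps whatever distinct label values exist onto 0, 1, 2, and so on. The helper remap_to_contiguous and the sample labels are made up for illustration; it assumes the labels are hashable, sortable values held in a plain Python list.

```python
def remap_to_contiguous(labels):
    """Map the distinct label values, in sorted order, to 0..K-1."""
    mapping = {old: new for new, old in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


example_labels = [0, 2, 2, 0, 2]  # made-up labels with a gap at 1
new_labels, mapping = remap_to_contiguous(example_labels)
print(new_labels)  # [0, 1, 1, 0, 1]
print(mapping)     # {0: 0, 2: 1}
```

Keep the returned mapping around so that the same remapping can be applied to the validation and test labels.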
In general, when encountering CUDA runtime errors, it is advisable to run your program again with the CUDA_LAUNCH_BLOCKING=1 environment variable set in order to obtain an accurate stack trace.
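The variable is usually set on the command line when launching the script (for example CUDA_LAUNCH_BLOCKING=1 python train.py, where train.py is a placeholder name). As a sketch of an alternative, it can also be set from inside Python, assuming this happens before CUDA is initialized, which in practice means before importing torch:

```python
import os

# Must be set before CUDA is initialized, so do it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and run the failing code only after the line above
print(os.environ["CUDA_LAUNCH_BLOCKING"])  # 1
```

With blocking launches the stack trace should point at the call that actually failed, at the cost of slower execution, so it is best used only while debugging.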
In your specific case, the targets of your data were too high (or low) for the specified number of classes.
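A cheap way to catch this before the data ever reaches the GPU is to check the target range against the number of classes. The sketch below uses plain Python lists; check_targets is a hypothetical helper, and num_classes=10 is assumed because CIFAR10 has ten classes.

```python
def check_targets(targets, num_classes):
    """Raise ValueError if any target lies outside 0 .. num_classes - 1."""
    lo, hi = min(targets), max(targets)
    if lo < 0 or hi >= num_classes:
        raise ValueError(
            f"targets span [{lo}, {hi}] but valid class indices are "
            f"[0, {num_classes - 1}]"
        )


check_targets([0, 1, 2, 9], num_classes=10)  # in range, passes silently

try:
    check_targets([1, 2, 10], num_classes=10)  # 10 is one past the last class
except ValueError as err:
    message = str(err)
    print(message)
```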