Question
I'm trying to implement a particular loss function in PyTorch called SMAPE (commonly used in time series forecasting). I have two variables, model_outputs and target_outputs, and the formula for computing the element-wise SMAPE is straightforward:
numerator = torch.abs(model_outputs - target_outputs)
denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
elementwise_smape = torch.div(numerator, denominator)
nan_mask = torch.isnan(elementwise_smape)
loss = elementwise_smape[~nan_mask].mean()
assert ~torch.isnan(loss) # loss = 0.023207199
loss.backward()
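For reference, the element-wise quantity the code above computes can be written as (my notation: \hat{y}_i is model_outputs[i] and y_i is target_outputs[i]; note this variant omits the factor of 2 that some SMAPE definitions put in the denominator):

$$\mathrm{SMAPE}_i = \frac{\lvert \hat{y}_i - y_i \rvert}{\lvert \hat{y}_i \rvert + \lvert y_i \rvert}$$

which is 0/0 (NaN) when both values are zero, and NaN whenever y_i itself is NaN, which is why nan_mask is applied before taking the mean.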
But when I ran my code, I received the following error: RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.
I used autograd.detect_anomaly() to track down the error, which it traced to elementwise_smape = torch.div(numerator, denominator). I tried the suggested solutions I could find online (checking for NaN values, adding hooks, printing gradients) and nothing helped. So I decided to save the tensors to disk and test them in an independent script:
import torch
model_outputs = torch.load('model_outputs.pt')
target_outputs = torch.load('target_outputs.pt')
print(model_outputs)
print(target_outputs)
model_outputs.requires_grad = True
target_outputs.requires_grad = False
numerator = torch.abs(model_outputs - target_outputs)
denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
elementwise_smape = torch.div(numerator, denominator)
nan_mask = torch.isnan(elementwise_smape)
loss = elementwise_smape[~nan_mask].mean()
assert ~torch.isnan(loss)
print(loss.detach().numpy())
loss.backward()
print('Success!')
This worked completely fine! To check that the problem doesn't originate from any of the variables elementwise_smape depends on (model_outputs, numerator, denominator), I tried replacing the loss with each of those variables in turn, and all of them worked:
nan_mask = torch.isnan(elementwise_smape)
loss = model_outputs[~nan_mask].mean()
loss.backward() # succeeds
nan_mask = torch.isnan(elementwise_smape)
loss = numerator[~nan_mask].mean()
loss.backward() # succeeds
nan_mask = torch.isnan(elementwise_smape)
loss = denominator[~nan_mask].mean()
loss.backward() # succeeds
So what the hell is going on with the element-wise division that causes a problem when model_outputs are generated by the model, but not when model_outputs are loaded from disk?
Update #1: Adding the complete error:
Traceback of forward call that caused the error:
File "/home/rylan/tsml/main.py", line 116, in <module>
main(experiment_number)
File "/home/rylan/tsml/main.py", line 23, in main
loss_per_model_per_step=loss_per_model_per_step)
File "/home/rylan/tsml/main.py", line 47, in run
sequence_lengths)
File "/home/rylan/tsml/main.py", line 80, in step
sequence_lengths=sequence_lengths)
File "/home/rylan/tsml/models.py", line 153, in loss
model_output_log_probs=None)
File "/home/rylan/tsml/losses.py", line 96, in calculate_mean_smape
target_outputs=target_outputs)
File "/home/rylan/tsml/losses.py", line 117, in calculate_elementwise_smape
elementwise_smape = torch.div(numerator, denominator)
Traceback (most recent call last):
File "/home/rylan/tsml/main.py", line 116, in <module>
main(experiment_number)
File "/home/rylan/tsml/main.py", line 23, in main
loss_per_model_per_step=loss_per_model_per_step)
File "/home/rylan/tsml/main.py", line 47, in run
sequence_lengths)
File "/home/rylan/tsml/main.py", line 89, in step
total_loss.backward()
File "/home/rylan/.local/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/rylan/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.
Process finished with exit code 1
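For context, the two stacked tracebacks above (the 'Traceback of forward call that caused the error' followed by the ordinary backward traceback) are what PyTorch prints when autograd anomaly detection is enabled. I enable it roughly like this (a minimal sketch with dummy data, not my actual training loop):

import torch

# Anomaly mode records the forward-pass traceback of every autograd op and,
# if backward() encounters a NaN gradient, raises a RuntimeError that also
# points at the forward call that created the offending op.
torch.autograd.set_detect_anomaly(True)

x = torch.randn(4, requires_grad=True)
loss = (x * 2.0).mean()
loss.backward()  # a NaN gradient anywhere in this graph would raise here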
Update #2: Finomnis suggested writing the tensors to disk, reading them back, and comparing them to the originals:
torch.save(model_outputs, 'model_outputs.pt')
torch.save(target_outputs, 'target_outputs.pt')
model_outputs_2 = torch.load('model_outputs.pt')
target_outputs_2 = torch.load('target_outputs.pt')
(model_outputs == model_outputs_2).all() # tensor(1, dtype=torch.uint8)
(target_outputs == target_outputs_2).all() # tensor(0, dtype=torch.uint8)
I suspected that (target_outputs == target_outputs_2).all() evaluates to False due to NaNs, since NaN never compares equal to itself. This is correct:
(torch.isnan(target_outputs) == torch.isnan(target_outputs_2)).all() # tensor(1, dtype=torch.uint8)
(target_outputs[~torch.isnan(target_outputs)] == target_outputs_2[~torch.isnan(target_outputs_2)]).all() # tensor(1, dtype=torch.uint8)
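For what it's worth, assuming a PyTorch version whose torch.allclose supports the equal_nan flag, the same two checks can be collapsed into a single call:

# NaN never compares equal to itself, so plain == reports a mismatch at every
# NaN entry; equal_nan=True treats NaNs in matching positions as equal.
torch.allclose(target_outputs, target_outputs_2, equal_nan=True)  # expected: True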
Update #3: I tried the next suggestion of detaching model_outputs to ensure that the problem isn't originating from elsewhere:
def calculate_elementwise_smape(model_outputs, target_outputs):
    model_outputs = model_outputs.detach()
    model_outputs.requires_grad = True
    target_outputs = target_outputs.detach()
    numerator = torch.abs(model_outputs - target_outputs)
    denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
    elementwise_smape = torch.div(numerator, denominator)
    return elementwise_smape
but this produced the same error (RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.) at the same location, elementwise_smape = torch.div(numerator, denominator).
To reiterate what I said earlier, this problem does not occur if I load the tensors from disk and perform the exact same sequence of operations.
Source: https://stackoverflow.com/questions/57013705/runtimeerror-divbackward0-nan-values-in-its-0th-output-but-works-when-tensors