Question
I'm trying to implement a particular loss function in PyTorch called SMAPE (commonly used in time series forecasting). I have two variables, model_outputs and target_outputs, and the formula for computing the element-wise SMAPE is straightforward:
numerator = torch.abs(model_outputs - target_outputs)
denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
elementwise_smape = torch.div(numerator, denominator)
nan_mask = torch.isnan(elementwise_smape)
loss = elementwise_smape[~nan_mask].mean()
assert ~torch.isnan(loss) # loss = 0.023207199
loss.backward()
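For reference, the element-wise quantity the code above computes can be written as (my notation: \hat{y}_i is model_outputs[i] and y_i is target_outputs[i]; note this variant omits the factor of 2 that some SMAPE definitions put in the denominator):

$$\mathrm{SMAPE}_i = \frac{\lvert \hat{y}_i - y_i \rvert}{\lvert \hat{y}_i \rvert + \lvert y_i \rvert}$$

which is 0/0 (NaN) when both values are zero, and NaN whenever y_i itself is NaN, which is why nan_mask is applied before taking the mean.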
But when I ran my code, I received the following error: RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.
I used autograd.detect_anomaly() to track down the error, which it traced to elementwise_smape = torch.div(numerator, denominator). I tried the suggested solutions I could find online (checking for NaN values, adding hooks, printing gradients) and nothing helped. So I decided to save the tensors to disk and test them in an independent script:
import torch
model_outputs = torch.load('model_outputs.pt')
target_outputs = torch.load('target_outputs.pt')
print(model_outputs)
print(target_outputs)
model_outputs.requires_grad = True
target_outputs.requires_grad = False
numerator = torch.abs(model_outputs - target_outputs)
denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
elementwise_smape = torch.div(numerator, denominator)
nan_mask = torch.isnan(elementwise_smape)
loss = elementwise_smape[~nan_mask].mean()
assert ~torch.isnan(loss)
print(loss.detach().numpy())
loss.backward()
print('Success!')
This worked completely fine! To check that the problem doesn't originate from any of the variables elementwise_smape depends on (model_outputs, numerator, denominator), I tried replacing the loss with each of those variables in turn, and all of them worked:
nan_mask = torch.isnan(elementwise_smape)
loss = model_outputs[~nan_mask].mean()
loss.backward() # succeeds
nan_mask = torch.isnan(elementwise_smape)
loss = numerator[~nan_mask].mean()
loss.backward() # succeeds
nan_mask = torch.isnan(elementwise_smape)
loss = denominator[~nan_mask].mean()
loss.backward() # succeeds
So what the hell is going on with the element-wise division that causes a problem when model_outputs are generated by the model, but not when model_outputs are loaded from disk?
Update #1: Adding the complete error:
Traceback of forward call that caused the error:
File "/home/rylan/tsml/main.py", line 116, in <module>
main(experiment_number)
File "/home/rylan/tsml/main.py", line 23, in main
loss_per_model_per_step=loss_per_model_per_step)
File "/home/rylan/tsml/main.py", line 47, in run
sequence_lengths)
File "/home/rylan/tsml/main.py", line 80, in step
sequence_lengths=sequence_lengths)
File "/home/rylan/tsml/models.py", line 153, in loss
model_output_log_probs=None)
File "/home/rylan/tsml/losses.py", line 96, in calculate_mean_smape
target_outputs=target_outputs)
File "/home/rylan/tsml/losses.py", line 117, in calculate_elementwise_smape
elementwise_smape = torch.div(numerator, denominator)
Traceback (most recent call last):
File "/home/rylan/tsml/main.py", line 116, in <module>
main(experiment_number)
File "/home/rylan/tsml/main.py", line 23, in main
loss_per_model_per_step=loss_per_model_per_step)
File "/home/rylan/tsml/main.py", line 47, in run
sequence_lengths)
File "/home/rylan/tsml/main.py", line 89, in step
total_loss.backward()
File "/home/rylan/.local/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/rylan/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.
Process finished with exit code 1
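For context, the two stacked tracebacks above (the 'Traceback of forward call that caused the error' followed by the ordinary backward traceback) are what PyTorch prints when autograd anomaly detection is enabled. I enable it roughly like this (a minimal sketch with dummy data, not my actual training loop):

import torch

# Anomaly mode records the forward-pass traceback of every autograd op and,
# if backward() encounters a NaN gradient, raises a RuntimeError that also
# points at the forward call that created the offending op.
torch.autograd.set_detect_anomaly(True)

x = torch.randn(4, requires_grad=True)
loss = (x * 2.0).mean()
loss.backward()  # a NaN gradient anywhere in this graph would raise here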
Update #2: Finomnis suggested writing the tensors to disk, reading them back, and comparing them to the originals:
torch.save(model_outputs, 'model_outputs.pt')
torch.save(target_outputs, 'target_outputs.pt')
model_outputs_2 = torch.load('model_outputs.pt')
target_outputs_2 = torch.load('target_outputs.pt')
(model_outputs == model_outputs_2).all() # tensor(1, dtype=torch.uint8)
(target_outputs == target_outputs_2).all() # tensor(0, dtype=torch.uint8)
I suspected that (target_outputs == target_outputs_2).all() evaluates to False due to NaNs, since NaN never compares equal to itself. This is correct:
(torch.isnan(target_outputs) == torch.isnan(target_outputs_2)).all() # tensor(1, dtype=torch.uint8)
(target_outputs[~torch.isnan(target_outputs)] == target_outputs_2[~torch.isnan(target_outputs_2)]).all() # tensor(1, dtype=torch.uint8)
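For what it's worth, assuming a PyTorch version whose torch.allclose supports the equal_nan flag, the same two checks can be collapsed into a single call:

# NaN never compares equal to itself, so plain == reports a mismatch at every
# NaN entry; equal_nan=True treats NaNs in matching positions as equal.
torch.allclose(target_outputs, target_outputs_2, equal_nan=True)  # expected: True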
Update #3: I tried the next suggestion of detaching model_outputs to ensure that the problem isn't originating from elsewhere:
def calculate_elementwise_smape(model_outputs, target_outputs):
    model_outputs = model_outputs.detach()
    model_outputs.requires_grad = True
    target_outputs = target_outputs.detach()
    numerator = torch.abs(model_outputs - target_outputs)
    denominator = torch.abs(model_outputs) + torch.abs(target_outputs)
    elementwise_smape = torch.div(numerator, denominator)
    return elementwise_smape
but this produced the same error (RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.) at the same location, elementwise_smape = torch.div(numerator, denominator).
To reiterate what I said earlier, this problem does not occur if I load the tensors from disk and perform the exact same sequence of operations.
Source: https://stackoverflow.com/questions/57013705/runtimeerror-divbackward0-nan-values-in-its-0th-output-but-works-when-tensors