How to interpret caffe log with debug_info?


Question:

When facing difficulties during training (nans, loss does not converge, etc.) it is sometimes useful to look at more verbose training log by setting debug_info: true in the 'solver.prototxt' file.
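For reference, a minimal solver.prototxt sketch with the flag enabled could look like the following; the net path and the hyper-parameters are placeholders, not values taken from the question:

    net: "train_val.prototxt"   # placeholder path to your net definition
    base_lr: 0.01
    lr_policy: "fixed"
    max_iter: 10000
    display: 20
    debug_info: true            # print per-blob data/diff statistics every iteration
    solver_mode: GPU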

The training log then looks something like:

I1109 ...]     [Forward] Layer data, top blob data data: 0.343971
I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0
I1109 ...]     [Forward] Layer relu1, top blob conv1 data: 0.0337982
I1109 ...]     [Forward] Layer conv2, top blob conv2 data: 0.0249297
I1109 ...]     [Forward] Layer conv2, param blob 0 data: 0.00875855
I1109 ...]     [Forward] Layer conv2, param blob 1 data: 0
I1109 ...]     [Forward] Layer relu2, top blob conv2 data: 0.0128249
. . .
I1109 ...]     [Forward] Layer fc1, top blob fc1 data: 0.00728743
I1109 ...]     [Forward] Layer fc1, param blob 0 data: 0.00876866
I1109 ...]     [Forward] Layer fc1, param blob 1 data: 0
I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506
I1109 ...]     [Backward] Layer fc1, bottom blob conv6 diff: 0.00107067
I1109 ...]     [Backward] Layer fc1, param blob 0 diff: 0.483772
I1109 ...]     [Backward] Layer fc1, param blob 1 diff: 4079.72
. . .
I1109 ...]     [Backward] Layer conv2, bottom blob conv1 diff: 5.99449e-06
I1109 ...]     [Backward] Layer conv2, param blob 0 diff: 0.00661093
I1109 ...]     [Backward] Layer conv2, param blob 1 diff: 0.10995
I1109 ...]     [Backward] Layer relu1, bottom blob conv1 diff: 2.87345e-06
I1109 ...]     [Backward] Layer conv1, param blob 0 diff: 0.0220984
I1109 ...]     [Backward] Layer conv1, param blob 1 diff: 0.0429201
E1109 ...]     [Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)

What does it mean?

Answer 1:

At first glance you can see that this log section is divided into two parts: [Forward] and [Backward]. Recall that neural network training is done via forward-backward propagation:

  1. A training example (batch) is fed to the net, and a forward pass outputs the current prediction.

  2. Based on this prediction a loss is computed. The loss is then differentiated, and a gradient is estimated and propagated backward using the chain rule (a minimal pycaffe sketch of one such step follows below).
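For concreteness, here is a rough pycaffe sketch of a single forward-backward step; the prototxt file name is a placeholder, not something given in the question:

    import caffe

    caffe.set_mode_cpu()                                  # or caffe.set_mode_gpu()
    net = caffe.Net('train_val.prototxt', caffe.TRAIN)    # placeholder net definition

    net.forward()    # forward pass: fills the data parts (activations, loss)
    net.backward()   # backward pass: fills the diff parts (gradients)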

Caffe Blob data structure
Just a quick re-cap. Caffe uses the Blob data structure to store data/weights/parameters, etc. For this discussion it is important to note that a Blob has two "parts": data and diff. The values of the Blob are stored in the data part. The diff part is used to store element-wise gradients for the backpropagation step.
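Continuing the sketch above, both parts of a Blob can be inspected directly from pycaffe after the forward and backward passes; 'conv1' here is just an example layer/blob name:

    # data part: current values (activations / weights)
    top_data  = net.blobs['conv1'].data        # output ("top") of the layer
    filt_data = net.params['conv1'][0].data    # param blob 0: the filters
    bias_data = net.params['conv1'][1].data    # param blob 1: the bias

    # diff part: gradients written during the backward pass
    top_diff  = net.blobs['conv1'].diff
    filt_diff = net.params['conv1'][0].diff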

Forward pass

You will see all the layers from bottom to top listed in this part of the log. For each layer you'll see:

I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

Layer "conv1" is a convolution layer that has 2 param blobs: the filters and the bias. Consequently, the log has three lines. The filter blob (param blob 0) has data

 I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114 

That is, the current L2 norm of the convolution filter weights is 0.00899.
The current bias (param blob 1):

 I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0 

meaning that currently the bias is set to 0.

Last but not least, the "conv1" layer has an output, a "top", named "conv1" (how original...). The L2 norm of the output is

 I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037 

Note that all L2 values for the [Forward] pass are reported on the data part of the Blobs in question.
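If you want to cross-check a log line against the actual blob contents, a small numpy sketch like the one below prints a couple of magnitude summaries of the data part; the answer reads the logged number as an L2 norm, so treat which summary matches the printed value as an assumption. It continues the pycaffe sketch above:

    import numpy as np

    def data_summary(x):
        """Magnitude summaries of a blob's data part."""
        x = np.asarray(x, dtype=np.float64)
        return {'l2_norm': float(np.linalg.norm(x)),
                'mean_abs': float(np.abs(x).mean()),
                'has_nan': bool(np.isnan(x).any())}

    print(data_summary(net.blobs['conv1'].data))      # compare with "top blob conv1 data: ..."
    print(data_summary(net.params['conv1'][0].data))  # compare with "param blob 0 data: ..."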

Loss and gradient
At the end of the [Forward] pass comes the loss layer:

I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506

In this example the batch loss is 2031.85. The gradient of the loss w.r.t. fc1 is computed and passed to the diff part of the fc1 Blob. The L2 magnitude of the gradient is 0.1245.
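The same two quantities can be pulled out of the net directly, continuing the pycaffe sketch above; 'loss' and 'fc1' are the blob names appearing in this particular log:

    import numpy as np

    batch_loss = float(net.blobs['loss'].data)     # data part of the loss blob
    fc1_grad   = net.blobs['fc1'].diff             # diff part: gradient of the loss w.r.t. fc1
    print(batch_loss, np.linalg.norm(fc1_grad))    # loss value and gradient magnitude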

Backward pass
All the rest of the layers are listed in this part, from top to bottom. You can see that the L2 magnitudes reported now are of the diff part of the Blobs (params and layers' inputs).

Finally
The last log line of this iteration:

[Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07) 

reports the total L1 and L2 magnitudes of both data and gradients.
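These four numbers can be reproduced by accumulating over all parameter blobs, as in the sketch below (which exact set of parameters Caffe includes is an assumption here; net is the pycaffe net from the sketch above):

    import numpy as np

    asum_data = asum_diff = sumsq_data = sumsq_diff = 0.0
    for name, param_blobs in net.params.items():
        for b in param_blobs:
            asum_data  += np.abs(b.data).sum()     # L1 contribution of the weights
            asum_diff  += np.abs(b.diff).sum()     # L1 contribution of the gradients
            sumsq_data += np.square(b.data).sum()
            sumsq_diff += np.square(b.diff).sum()

    print('All net params (data, diff): L1 norm = (%g, %g); L2 norm = (%g, %g)'
          % (asum_data, asum_diff, np.sqrt(sumsq_data), np.sqrt(sumsq_diff)))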

What should I look for?

  1. If you have nans in your loss, look at what point your data or diff turns into nan: at which layer? At which iteration?

  2. Look at the gradient magnitudes; they should be reasonable. If you start to see values around e+8, your data/gradients are starting to blow up. Decrease your learning rate!

  3. Check that the diffs are not zero. Zero diffs mean no gradients = no updates = no learning. If you started from random weights, consider generating random weights with higher variance. (A small scripted version of these checks is sketched below.)
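If you prefer to script these checks rather than read the log by eye, a rough sketch along the same lines, using the pycaffe net from the sketches above (the thresholds are arbitrary placeholders):

    import numpy as np

    def check_net(net, explode_thresh=1e8, vanish_thresh=1e-12):
        """Flag NaNs, exploding gradients and all-zero gradients."""
        # activations and their gradients
        for name, blob in net.blobs.items():
            for part, arr in (('data', blob.data), ('diff', blob.diff)):
                if np.isnan(arr).any():
                    print('NaN in blob %s (%s)' % (name, part))
        # parameter gradients
        for name, param_blobs in net.params.items():
            for i, b in enumerate(param_blobs):
                g = np.abs(b.diff)
                if np.isnan(g).any():
                    print('NaN in param %s[%d] diff' % (name, i))
                elif g.max() > explode_thresh:
                    print('exploding diff in %s[%d]: max %g' % (name, i, g.max()))
                elif g.max() < vanish_thresh:
                    print('vanishing / zero diff in %s[%d]' % (name, i))

    check_net(net)   # run right after net.forward(); net.backward()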


