XOR neural network error stops decreasing during training


Yes, neural networks can get stuck in local minima, depending on the error surface. However, this abstract suggests that there are no local minima in the error surface of the XOR problem. I cannot access the full text, so I cannot verify what the authors did to prove this or how it applies to your problem.

There might also be other factors causing this problem. For example, if you descend a steep valley very quickly using plain first-order gradient descent, you can overshoot to the opposite slope and bounce back and forth indefinitely. You could also log the average change over all weights at each iteration, to test whether you really have a "stuck" network, or rather one that has merely run into a limit cycle.
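A minimal sketch of that diagnostic (the helper name mean_weight_change and the numpy-based setup are my own, not from the question):

    import numpy as np

    def mean_weight_change(weights_before, weights_after):
        """Average absolute change over all weights for one iteration."""
        diffs = [np.abs(after - before).mean()
                 for before, after in zip(weights_before, weights_after)]
        return float(np.mean(diffs))

    # Inside your own training loop (whatever your update step is):
    #   old = [w.copy() for w in weights]
    #   weights = update(weights, gradients)
    #   history.append(mean_weight_change(old, weights))
    # A genuinely stuck network shows values near zero; a limit cycle keeps
    # producing sizeable changes while the error oscillates instead of falling.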

You should first try adjusting your parameters (learning rate, momentum if you implemented it, etc.). If you can make the problem go away by changing parameters, your algorithm is probably OK.
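For reference, the usual first-order update with momentum looks roughly like this (a sketch with illustrative names, not your implementation):

    import numpy as np

    def momentum_step(w, grad, velocity, learning_rate=0.1, momentum=0.9):
        """One gradient-descent update; the velocity term damps the
        back-and-forth bouncing described above."""
        velocity = momentum * velocity - learning_rate * grad
        return w + velocity, velocity

    # Usage: keep one velocity array per weight matrix, initialised to zeros:
    #   w, v = momentum_step(w, grad, v)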

Poor gradient descent with excessively large steps, as described by LiKao, is one possible problem. Another is that there are very flat regions of the XOR error landscape, which means convergence can take a very long time; in fact, the gradient may be so weak that the descent algorithm doesn't pull you in the right direction.
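To see how weak the gradient actually is, you can evaluate its norm directly. Below is a small sketch for a 2-2-1 sigmoid network with mean-squared error; the network shape, the finite-difference check and all names are my own assumptions for illustration:

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0], dtype=float)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mse(params):
        W1, b1, w2, b2 = params
        hidden = sigmoid(X @ W1 + b1)        # shape (4, 2)
        output = sigmoid(hidden @ w2 + b2)   # shape (4,)
        return np.mean((output - y) ** 2)

    def gradient_norm(params, eps=1e-6):
        """Finite-difference gradient norm over all parameters."""
        total = 0.0
        for p in params:
            for idx in np.ndindex(p.shape):
                orig = p[idx]
                p[idx] = orig + eps
                up = mse(params)
                p[idx] = orig - eps
                down = mse(params)
                p[idx] = orig
                total += ((up - down) / (2 * eps)) ** 2
        return np.sqrt(total)

    params = [np.random.randn(2, 2), np.zeros(2), np.random.randn(2), np.zeros(1)]
    print("gradient norm:", gradient_norm(params))  # tiny values signal a flat region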

The two papers below look at 2-1-1 and 2-2-1 XOR landscapes. One uses a "cross entropy" error function I'm not familiar with. The first declares there are no local minima, but the second says there are local minima at infinity - essentially when the weights run off to very large values. So for the second case, their results suggest that if you don't start close enough to the true minima, you may get trapped at the infinite points. They also say that other analyses of 2-2-1 XOR networks showing no local minima are not contradicted by their results, because of how they define local minima.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.4770

http://www.ncbi.nlm.nih.gov/pubmed/12662806
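The "cross entropy" error function mentioned above is presumably the standard binary cross-entropy for a single sigmoid output (my assumption, not stated in the paper or the answer). A minimal sketch of that loss:

    import numpy as np

    def binary_cross_entropy(targets, predictions, eps=1e-12):
        """E = -mean(t*log(p) + (1 - t)*log(1 - p)); eps avoids log(0)."""
        predictions = np.clip(predictions, eps, 1 - eps)
        return -np.mean(targets * np.log(predictions)
                        + (1 - targets) * np.log(1 - predictions))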

I encountered the same issue and found that using the activation function 1.7159*tanh(2/3*x) described in LeCun's "Efficient Backprop" paper helps. This is presumably because that function does not saturate around the target values {-1, 1}, whereas regular tanh does.
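A sketch of that activation and its derivative for backprop (the derivative form is just the chain rule, not taken from the paper):

    import numpy as np

    def lecun_tanh(x):
        # f(x) = 1.7159 * tanh(2/3 * x); f(1) ~ 1 and f(-1) ~ -1, so the
        # targets {-1, 1} fall in the non-saturated part of the curve.
        return 1.7159 * np.tanh(2.0 / 3.0 * x)

    def lecun_tanh_derivative(x):
        return 1.7159 * (2.0 / 3.0) * (1.0 - np.tanh(2.0 / 3.0 * x) ** 2)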
