How does the back-propagation algorithm deal with non-differentiable activation functions?


To understand how backpropagation is even possible with functions like ReLU, you need to understand the most important property of the derivative that makes the backpropagation algorithm work so well. That property is the first-order approximation:

f(x) ≈ f(x0) + f'(x0)(x - x0)

If you treat x0 as the current value of your parameter, then, knowing the value of the cost function and its derivative, you can tell how the cost function will behave when you change the parameter a little bit. This is the most crucial thing in backpropagation.
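To make this concrete, here is a minimal sketch (plain Python, with an illustrative toy cost function, not anything from the answer itself) that checks the approximation numerically: for a small parameter change, the value predicted by f(x0) + f'(x0)(x - x0) is very close to the actual new cost.

```python
# Toy cost f(w) = (w - 3)^2 and its derivative f'(w) = 2(w - 3).
def f(w):
    return (w - 3.0) ** 2

def f_prime(w):
    return 2.0 * (w - 3.0)

w0 = 1.0      # current parameter value (x0 in the formula)
step = 0.01   # a small change to the parameter

actual = f(w0 + step)
predicted = f(w0) + f_prime(w0) * step   # f(x0) + f'(x0)(x - x0)

print(actual, predicted)  # the two values agree closely for a small step
```

This local prediction is exactly what gradient descent uses to decide which direction to move each parameter.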

Because backpropagation relies on this local approximation to compute gradients, you need your activation function to satisfy the property stated above. It's easy to check that ReLU satisfies it everywhere except in a small neighbourhood of 0. And this is the only problem with ReLU: we cannot use this property when we are exactly at 0.

To overcome that, you may simply define the value of the ReLU derivative at 0 to be either 1 or 0. On the other hand, most researchers don't treat this as a serious problem, simply because landing exactly at 0 during ReLU computations is relatively rare.
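A minimal sketch of that convention (NumPy, illustrative function names, not taken from any particular framework) might look like this, with the derivative at 0 set to 0; choosing 1 there instead would work just as well:

```python
import numpy as np

def relu(x):
    # Forward pass: max(0, x), applied element-wise.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Backward pass: the derivative is 1 for x > 0 and 0 for x < 0.
    # At exactly x == 0 the derivative is undefined; here we pick 0.
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]
```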

From the above it follows that, from a purely mathematical point of view, using ReLU with the backpropagation algorithm is not strictly justified. In practice, however, this odd behaviour around 0 usually doesn't make any difference.
