Intermediate layer makes TensorFlow optimizer stop working


TL;DR: the deeper the neural network becomes, the more attention you should pay to the gradient flow (see this discussion of "vanishing gradients"). One particular case is variable initialization.


Problem analysis

I've added tensorboard summaries for the variables and gradients into both of your scripts and got the following:

[Chart: 2-layer network]

[Chart: 3-layer network]

The charts show the distribution of the W:0 variable (the first layer) and how it changes from epoch 0 to epoch 1000. Indeed, we can see that the rate of change is much higher in the 2-layer network. But I'd like to draw attention to the gradient distribution, which is much closer to 0 in the 3-layer network (the first variance is around 0.005, the second around 0.000002, i.e. roughly 1000 times smaller). This is the vanishing gradient problem.

Here's the helper code if you're interested:

import numpy as np
import tensorflow as tf

# Attach a histogram summary to every trainable variable and to its gradient
for g, v in grads_and_vars:
  tf.summary.histogram(v.name, v)
  tf.summary.histogram(v.name + '_grad', g)

merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('train_log_layer2', tf.get_default_graph())

...

# Inside the training loop: evaluate the summaries together with the train op
_, summary = sess.run([train_op, merged], feed_dict={I: 2*np.random.rand(1, 1)-1})
if i % 10 == 0:
  writer.add_summary(summary, global_step=i)
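
The loop above assumes grads_and_vars already exists. If your training op comes straight from optimizer.minimize(loss), you can split that call into its two halves to expose the gradients; a minimal sketch, assuming a loss tensor named loss (the optimizer and learning rate here are placeholders, use whatever your script already has):

# Equivalent to optimizer.minimize(loss), but keeps the (gradient, variable) pairs around
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)  # placeholder choice
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars)

The histograms then show up after pointing TensorBoard at the log directory (tensorboard --logdir train_log_layer2).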

Solution

All deep networks suffer from this to some extent and there is no universal solution that will auto-magically fix any network. But there are some techniques that can push it in the right direction. Initialization is one of them.

I replaced your normal initialization with:

W_init = tf.contrib.layers.xavier_initializer()  # keeps activation variance roughly constant across layers
b_init = tf.constant_initializer(0.1)            # small positive bias so ReLU units start out active

There are lots of tutorials on Xavier initialization if you want the details. Note that I set the bias init to be slightly positive so that the ReLU outputs are positive for most of the neurons, at least in the beginning.
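
For reference, here is roughly how these initializers get wired into a layer. This is just a sketch: in_dim and n_hidden are placeholder names, not from your script, and I is the input placeholder from the original code.

# Hypothetical first layer using the initializers above (shapes are placeholders)
W1 = tf.get_variable('W', shape=[in_dim, n_hidden], initializer=W_init)
b1 = tf.get_variable('b', shape=[n_hidden], initializer=b_init)
h1 = tf.nn.relu(tf.matmul(I, W1) + b1)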

This changed the picture immediately:

The weights are still not moving quite as fast as before, but they are moving (note the scale of the W:0 values), and the gradient distribution is much less peaked at 0, which is much better.

Of course, this isn't the end. To improve it further, you should implement the full autoencoder, because currently the loss is driven only by the reconstruction of the [0,0] element, so most of the outputs aren't used in the optimization. You can also play with different optimizers (Adam would be my choice) and with the learning rate.
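
The optimizer swap is a one-line change; a sketch, keeping the compute_gradients/apply_gradients split from above so the gradient summaries still work (1e-3 is Adam's default learning rate):

# Adam adapts per-parameter step sizes, which often helps when gradients are small
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars)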
