Delayed echo of sin - cannot reproduce Tensorflow result in Keras


Question


I am experimenting with LSTMs in Keras with little to no luck. At some point I decided to scale back to the most basic problems in order to finally achieve a positive result.
However, even with the simplest problems I find that Keras is unable to converge, while the implementation of the same problem in Tensorflow gives a stable result.

I am unwilling to just switch to Tensorflow without understanding why Keras keeps diverging on any problem I attempt.

My problem is many-to-many sequence prediction of a delayed sin echo; an example is shown below:
The blue line is the network input sequence, the red dotted line is the expected output.
The experiment was inspired by this repo, and the workable Tensorflow solution was also derived from it. The relevant excerpts from my code are below, and the full version of my minimal reproducible example is available here.
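For reference, a minimal sketch of how such delayed-echo data could be generated (the helper name generate_delayed_echo and the delay/frequency ranges are illustrative assumptions; the actual generator is in the linked gist):

import numpy as np

def generate_delayed_echo(batch_size, n_steps, delay=5, seed=None):
    # Target is the input signal shifted right by `delay` steps,
    # with zeros filling the first `delay` positions.
    rng = np.random.RandomState(seed)
    phase = rng.uniform(0, 2 * np.pi, size=(batch_size, 1))   # random phase per sample
    freq = rng.uniform(0.5, 1.5, size=(batch_size, 1))        # random frequency per sample
    t = np.arange(n_steps)[None, :]
    x = np.sin(freq * t * 0.2 + phase)                        # (batch_size, n_steps)
    y = np.zeros_like(x)
    y[:, delay:] = x[:, :-delay]                              # delayed echo of the input
    return x[..., None], y                                    # x: (batch, n_steps, 1), y: (batch, n_steps)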

Keras model:

import keras
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model = Sequential()
model.add(LSTM(n_hidden,
               input_shape=(n_steps, n_input),
               return_sequences=True))          # emit one output per timestep
model.add(TimeDistributed(Dense(n_input, activation='linear')))
model.compile(loss=custom_loss,
              optimizer=keras.optimizers.Adam(lr=learning_rate),
              metrics=[])
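custom_loss is not shown in the excerpt; a plausible definition that mirrors the Tensorflow loss below (sum of squared errors per sample, averaged over the batch) could look like this - an assumption, not necessarily the exact code from the gist:

from keras import backend as K

def custom_loss(y_true, y_pred):
    # Sum squared error over timesteps, then average over the batch,
    # matching tf.reduce_mean(tf.reduce_sum(tf.squared_difference(pred, y), 1))
    return K.mean(K.sum(K.square(y_pred - y_true), axis=1))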

Tensorflow model:

import tensorflow as tf
from tensorflow.contrib import rnn

x = tf.placeholder(tf.float32, [None, n_steps, n_input])
y = tf.placeholder(tf.float32, [None, n_steps])

weights = {
    'out': tf.Variable(tf.random_normal([n_hidden, n_steps], seed=SEED))
}
biases = {
    'out': tf.Variable(tf.random_normal([n_steps], seed=SEED))
}
lstm = rnn.LSTMCell(n_hidden, forget_bias=1.0)
outputs, states = tf.nn.dynamic_rnn(lstm, inputs=x,
                                    dtype=tf.float32,
                                    time_major=False)

# only the last timestep's hidden state is used to predict the whole output sequence
h = tf.transpose(outputs, [1, 0, 2])
pred = tf.nn.bias_add(tf.matmul(h[-1], weights['out']), biases['out'])
individual_losses = tf.reduce_sum(tf.squared_difference(pred, y),
                                  reduction_indices=1)
loss = tf.reduce_mean(individual_losses)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate) \
  .minimize(loss)
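The training loop on the Tensorflow side is the usual session/feed_dict pattern; a sketch along these lines (n_batches, batch_size and the generate_delayed_echo helper from the sketch above are assumptions):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(n_batches):
        batch_x, batch_y = generate_delayed_echo(batch_size, n_steps)
        _, batch_loss = sess.run([optimizer, loss],
                                 feed_dict={x: batch_x, y: batch_y})
        if step % 100 == 0:
            print(step, batch_loss)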

I claim that the other parts of the code (data generation, training) are completely identical. But learning progress with Keras stalls early and yields unsatisfactory predictions. Graphs of logloss for both libraries and example predictions are attached below:

Logloss for Tensorflow-trained model:

Logloss for Keras-trained model: It's not easy to read from the graph, but Tensorflow reaches target_loss=0.15 and stops early after about 10k batches, while Keras uses up all 13k batches and reaches a loss of only about 1.5. In a separate experiment where Keras ran for 100k batches it went no further, stalling around 1.0.

The figures below contain: black line - model input signal; green dotted line - ground truth output; red line - the model's output.

Predictions of Tensorflow-trained model:
Predictions of Keras-trained model:

Thank you for any suggestions and insights, dear colleagues!


Answer 1:


OK, I have managed to solve this. The Keras implementation now converges steadily to a sensible solution as well:

The models were in fact not identical. If you inspect the Tensorflow model from the question carefully, you can verify for yourself that its actual Keras equivalent is the one listed below, not what is stated in the question:

model = Sequential()
model.add(LSTM(n_hidden,
               input_shape=(n_steps, n_input),
               return_sequences=False))   # only the final hidden state is returned
model.add(Dense(n_steps, input_shape=(n_hidden,), activation='linear'))  # maps it to all n_steps outputs at once
model.compile(loss=custom_loss,
              optimizer=keras.optimizers.Adam(lr=learning_rate),
              metrics=[])

Let me elaborate. The workable solution uses the last output of size n_hidden emitted by the LSTM as an intermediate activation, which is then fed to the Dense layer.
So, in a way, the actual prediction here is made by a regular perceptron.
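One quick way to see that the two Keras variants are genuinely different is to compare their output shapes (a toy check, assuming n_steps=20, n_input=1, n_hidden=32):

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

n_steps, n_input, n_hidden = 20, 1, 32

seq_model = Sequential([
    LSTM(n_hidden, input_shape=(n_steps, n_input), return_sequences=True),
    TimeDistributed(Dense(n_input))])
print(seq_model.output_shape)    # (None, 20, 1): one prediction per timestep, from each LSTM state

last_model = Sequential([
    LSTM(n_hidden, input_shape=(n_steps, n_input), return_sequences=False),
    Dense(n_steps)])
print(last_model.output_shape)   # (None, 20): all 20 outputs read off the final LSTM state, like the TF model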

One extra takeaway: the source of the mistake in the original Keras solution is already evident from the inference examples attached to the question. We can see there that the earlier timestamps fail utterly, while the later timestamps are near perfect. These earlier timestamps correspond to LSTM states that have just been initialized on a new window and have no context yet.



Source: https://stackoverflow.com/questions/46937898/delayed-echo-of-sin-cannot-reproduce-tensorflow-result-in-keras
