Question
I am working with Stanford's pointer-generator network repository for TensorFlow, available here:
https://github.com/abisee/pointer-generator
I asked for a demo checkpoint in the issue section of this repository, and a user named Joy replied with a checkpoint from his experiment. You can see it here:
https://github.com/abisee/pointer-generator/issues/12#issuecomment-320558080
Now, when I run the code with that checkpoint, I get the following errors:
INFO:tensorflow:Loading checkpoint log/directory/myexperiment/train/model.ckpt-44550
INFO:tensorflow:Restoring parameters from log/directory/myexperiment/train/model.ckpt-44550
2017-08-08 11:40:45.466505: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Key seq2seq/decoder/attention_decoder/lstm_cell/bias not found in checkpoint
2017-08-08 11:40:45.468174: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Key seq2seq/decoder/attention_decoder/lstm_cell/kernel not found in checkpoint
2017-08-08 11:40:45.475224: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Key seq2seq/encoder/bidirectional_rnn/bw/lstm_cell/kernel not found in checkpoint
2017-08-08 11:40:45.475255: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Key seq2seq/encoder/bidirectional_rnn/bw/lstm_cell/bias not found in checkpoint
2017-08-08 11:40:45.488580: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Key seq2seq/encoder/bidirectional_rnn/fw/lstm_cell/bias not found in checkpoint
2017-08-08 11:40:45.489057: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Key seq2seq/encoder/bidirectional_rnn/fw/lstm_cell/kernel not found in checkpoint
INFO:tensorflow:Failed to load checkpoint from log/directory/myexperiment/train. Sleeping for 10 secs...
INFO:tensorflow:Loading checkpoint log/directory/myexperiment/train/model.ckpt-44550
INFO:tensorflow:Restoring parameters from log/directory/myexperiment/train/model.ckpt-44550
2017-08-08 11:40:55.630779: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Key seq2seq/decoder/attention_decoder/lstm_cell/kernel not found in checkpoint
2017-08-08 11:40:55.631279: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Key seq2seq/decoder/attention_decoder/lstm_cell/bias not found in checkpoint
2017-08-08 11:40:55.645013: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Key seq2seq/encoder/bidirectional_rnn/bw/lstm_cell/bias not found in checkpoint
2017-08-08 11:40:55.651307: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Key seq2seq/encoder/bidirectional_rnn/fw/lstm_cell/bias not found in checkpoint
2017-08-08 11:40:55.654461: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Key seq2seq/encoder/bidirectional_rnn/bw/lstm_cell/kernel not found in checkpoint
2017-08-08 11:40:55.661814: W tensorflow/core/framework/op_kernel.cc:1158] Not found: Key seq2seq/encoder/bidirectional_rnn/fw/lstm_cell/kernel not found in checkpoint
I wondered whether the issue was a GPU/CPU mismatch, so I tried the solutions from the answers to these questions:
1) Tensorflow: Model trained(checkpoint files) on GPU can be converted to CPU running model?
2) Can a model trained on gpu used on cpu for inference and vice versa?
I tried making the following change in run_summarization.py:
def setup_training(model, batcher):
  """Does setup before starting training (run_training)"""
  train_dir = os.path.join(FLAGS.log_root, "train")
  if not os.path.exists(train_dir): os.makedirs(train_dir)

  default_device = tf.device('/gpu:0')  # tf.device('/cpu:0')
  with default_device:
    model.build_graph()  # build the graph
    if FLAGS.convert_to_coverage_model:
      assert FLAGS.coverage, "To convert your non-coverage model to a coverage model, run with convert_to_coverage_model=True and coverage=True"
      convert_to_coverage_model()
    saver = tf.train.Saver(max_to_keep=1)  # only keep 1 checkpoint at a time
But it didn't work. Please let me know what I need to do to make this TensorFlow checkpoint run on any system with the existing code, preferably without code changes.
Answer 1:
I was able to load the checkpoint into a session using the following code; I hope it helps. All the variables and their trained values are restored.
import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
  new_saver = tf.train.import_meta_graph('/path/to/model.ckpt-44550.meta')
  new_saver.restore(sess, '/path/to/model.ckpt-44550')
  g = tf.get_default_graph()
  print('model_loaded')
Answer 2:
You seem to be running eval mode, which needs to run in parallel with train mode. The line
INFO:tensorflow:Restoring parameters from log/directory/myexperiment/train/model.ckpt-44550
means that your checkpoint was found and the saver attempted to restore the session; see the load_ckpt function in utils.py for this. Are you running the same model for both train and eval? Using the CPU or GPU makes no difference here. If nothing is set in the pointer-generator flags, it runs the baseline seq2seq model.
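If the "Not found: Key ... not found in checkpoint" warnings persist, one common cause (not confirmed in this thread, so treat it as an assumption) is a checkpoint written by an older TensorFlow release, which stored LSTM parameters under different leaf names such as weights/biases rather than the kernel/bias names the newer graph expects. A minimal sketch of translating the graph's variable names back to the hypothetical old checkpoint keys; verify the actual keys first, e.g. with tf.train.NewCheckpointReader:

```python
# Assumed rename: newer TF uses "kernel"/"bias", older checkpoints may use
# "weights"/"biases". Confirm against the real checkpoint before relying on it.
RENAMES = {"kernel": "weights", "bias": "biases"}

def old_checkpoint_name(var_name):
    """Map a graph variable name to its likely pre-rename checkpoint key."""
    scope, _, leaf = var_name.rpartition("/")
    leaf = RENAMES.get(leaf, leaf)          # rename only known leaf names
    return "%s/%s" % (scope, leaf) if scope else leaf

# Names taken from the error log above:
for name in ["seq2seq/decoder/attention_decoder/lstm_cell/kernel",
             "seq2seq/encoder/bidirectional_rnn/fw/lstm_cell/bias"]:
    print(name, "->", old_checkpoint_name(name))
# seq2seq/decoder/attention_decoder/lstm_cell/kernel -> seq2seq/decoder/attention_decoder/lstm_cell/weights
# seq2seq/encoder/bidirectional_rnn/fw/lstm_cell/bias -> seq2seq/encoder/bidirectional_rnn/fw/lstm_cell/biases
```

If inspection confirms the old names, in TF 1.x such a map can be passed to the saver as tf.train.Saver({old_checkpoint_name(v.op.name): v for v in tf.global_variables()}), so restore looks up the old keys while filling the renamed variables.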
Source: https://stackoverflow.com/questions/45560776/tensorflow-checkpoint-not-giving-issues-while-used-on-another-system-python2-3