As stated in this question:
The TensorFlow documentation does not provide any example of how to perform a periodic evaluation of the model on an evaluation set.
The accepted answer suggested the use of Experiment (which is deprecated according to this README).
All I found online points towards using the train_and_evaluate method. However, I still do not see how to switch between the two processes (train and evaluate). I have tried the following:
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    params=hparams,
    model_dir=model_dir,
    config=tf.estimator.RunConfig(
        save_checkpoints_steps=2000,
        save_summary_steps=100,
        keep_checkpoint_max=5
    )
)

train_input_fn = lambda: input_fn(
    train_file,  # a .tfrecords file
    train=True,
    batch_size=70,
    num_epochs=100
)

eval_input_fn = lambda: input_fn(
    val_file,  # another .tfrecords file
    train=False,
    batch_size=70,
    num_epochs=1
)

train_spec = tf.estimator.TrainSpec(
    train_input_fn,
    max_steps=125
)

eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    steps=30,
    name='validation',
    start_delay_secs=150,
    throttle_secs=200
)

tf.logging.info("start experiment...")
tf.estimator.train_and_evaluate(
    estimator,
    train_spec,
    eval_spec
)
Here is what I think my code should be doing:
- Train the model for 100 epochs using a batch size of 70
- Save checkpoints every 2,000 batches
- Save summaries every 100 batches
- Keep at most 5 checkpoints
- After 150 batches on the training set, compute the validation error using 30 batches of validation data
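To make the numbers concrete, here is how the batch counts work out under that reading (plain Python; the dataset size is a hypothetical placeholder, since it is not stated above):

```python
import math

# Back-of-the-envelope numbers for the schedule above.
# The dataset size is hypothetical (say 60,000 examples); substitute your own.
num_examples = 60000
batch_size = 70
num_epochs = 100

steps_per_epoch = math.ceil(num_examples / batch_size)  # batches per epoch
total_batches = steps_per_epoch * num_epochs            # batches over the whole run
checkpoints = total_batches // 2000                     # one checkpoint every 2,000 batches

print(steps_per_epoch, total_batches, checkpoints)  # → 858 85800 42
```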
However, I get the following logs:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /output/model.ckpt.
INFO:tensorflow:loss = 39.55082, step = 1
INFO:tensorflow:global_step/sec: 178.622
INFO:tensorflow:loss = 1.0455043, step = 101 (0.560 sec)
INFO:tensorflow:Saving checkpoints for 150 into /output/model.ckpt.
INFO:tensorflow:Loss for final step: 0.8327793.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /projects/MNIST-GCP/output/model.ckpt-150
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [3/30]
INFO:tensorflow:Evaluation [6/30]
INFO:tensorflow:Evaluation [9/30]
INFO:tensorflow:Evaluation [12/30]
INFO:tensorflow:Evaluation [15/30]
INFO:tensorflow:Evaluation [18/30]
INFO:tensorflow:Evaluation [21/30]
INFO:tensorflow:Evaluation [24/30]
INFO:tensorflow:Evaluation [27/30]
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Finished evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Saving dict for global step 150: accuracy = 0.8552381, global_step = 150, loss = 0.95031387
From the logs, it seems that training stops after the first evaluation step. What am I missing from the documentation? Could you explain how I should have implemented what I think my code is doing?
I sincerely appreciate your help!
EDIT: after running some experiments, I realized that max_steps controls the number of steps of the whole training procedure, not just the number of steps before computing the metrics on the validation set. Reading the docs for tf.estimator.Estimator.train, I see it has a steps argument, which works incrementally and is bounded by max_steps; however, tf.estimator.TrainSpec does not have a steps argument, which means I cannot control the number of steps to take before computing metrics on the validation set.
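Given that, the workaround I am considering is to alternate the two calls manually, relying on train()'s incremental steps argument. Below is a sketch of the control flow; FakeEstimator is a stand-in I made up so the loop is self-contained, but a real tf.estimator.Estimator exposes the same train(input_fn, steps=...) and evaluate(input_fn, steps=...) methods:

```python
# Sketch of a manual train/evaluate loop. `FakeEstimator` is a hypothetical
# stand-in so the control flow is runnable here; with a real
# tf.estimator.Estimator the same train()/evaluate() calls apply, since
# train()'s `steps` argument is incremental and resumes from the last
# checkpoint.
class FakeEstimator:
    def __init__(self):
        self.global_step = 0

    def train(self, input_fn, steps):
        self.global_step += steps  # a real Estimator restores, then trains `steps` more

    def evaluate(self, input_fn, steps, name=None):
        return {"global_step": self.global_step, "loss": 0.0}  # placeholder metrics

estimator = FakeEstimator()
train_input_fn = lambda: None  # placeholders for the input_fn's defined above
eval_input_fn = lambda: None

eval_every_n_steps = 150  # train this many steps between evaluations
max_steps = 600           # total training budget (use your real value)

while estimator.global_step < max_steps:
    estimator.train(train_input_fn, steps=eval_every_n_steps)
    metrics = estimator.evaluate(eval_input_fn, steps=30, name='validation')
```

This trades the single train_and_evaluate() call for explicit control over how many training steps happen between evaluations, instead of relying on EvalSpec's time-based throttle_secs.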