How to control when to run evaluation vs. training using the TensorFlow Estimator API?


Question:

As stated in this question:

The tensorflow documentation does not provide any example of how to perform a periodic evaluation of the model on an evaluation set

The accepted answer suggested the use of Experiment (which is deprecated according to this README).

Everything I found online points towards using the train_and_evaluate method. However, I still do not see how to switch between the two processes (training and evaluation). I have tried the following:

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    params=hparams,
    model_dir=model_dir,
    config=tf.estimator.RunConfig(
        save_checkpoints_steps=2000,
        save_summary_steps=100,
        keep_checkpoint_max=5
    )
)

train_input_fn = lambda: input_fn(
    train_file,  # a .tfrecords file
    train=True,
    batch_size=70,
    num_epochs=100
)

eval_input_fn = lambda: input_fn(
    val_file,  # another .tfrecords file
    train=False,
    batch_size=70,
    num_epochs=1
)

train_spec = tf.estimator.TrainSpec(
    train_input_fn,
    max_steps=125
)

eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    steps=30,
    name='validation',
    start_delay_secs=150,
    throttle_secs=200
)

tf.logging.info("start experiment...")
tf.estimator.train_and_evaluate(
    estimator,
    train_spec,
    eval_spec
)

Here is what I think my code should be doing:

Train the model for 100 epochs using a batch size of 70; save checkpoints every 2000 batches; save summaries every 100 batches; keep at most 5 checkpoints; after 150 batches on the training set, compute the validation error using 30 batches of validation data

However, I get the following logs:

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /output/model.ckpt.
INFO:tensorflow:loss = 39.55082, step = 1
INFO:tensorflow:global_step/sec: 178.622
INFO:tensorflow:loss = 1.0455043, step = 101 (0.560 sec)
INFO:tensorflow:Saving checkpoints for 150 into /output/model.ckpt.
INFO:tensorflow:Loss for final step: 0.8327793.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /projects/MNIST-GCP/output/model.ckpt-150
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [3/30]
INFO:tensorflow:Evaluation [6/30]
INFO:tensorflow:Evaluation [9/30]
INFO:tensorflow:Evaluation [12/30]
INFO:tensorflow:Evaluation [15/30]
INFO:tensorflow:Evaluation [18/30]
INFO:tensorflow:Evaluation [21/30]
INFO:tensorflow:Evaluation [24/30]
INFO:tensorflow:Evaluation [27/30]
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Finished evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Saving dict for global step 150: accuracy = 0.8552381, global_step =150, loss = 0.95031387

From the logs, it seems that training stops after the first evaluation step. What am I missing from the documentation? Could you explain to me how I should have implemented what I think my code is doing?

I sincerely appreciate your help!

EDIT: after running experiments, I realized that max_steps controls the number of steps of the whole training procedure, not just the number of steps before computing the metrics on the validation set. Reading tf.estimator.Estimator.train, I see it has a steps argument, which works incrementally and is bounded by max_steps; however, tf.estimator.TrainSpec does not have a steps argument, which means I cannot control the number of steps to take before computing metrics on the validation set.
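A workaround consistent with this observation is to skip train_and_evaluate and alternate Estimator.train and Estimator.evaluate manually, since Estimator.train does accept a steps argument. A minimal sketch (the step counts here are arbitrary choices, and estimator, train_input_fn and eval_input_fn are the objects defined above):

total_steps = 125        # overall training budget, like max_steps above
steps_per_round = 25     # arbitrary: evaluate after every 25 training steps

for _ in range(total_steps // steps_per_round):
    # Train incrementally; each call resumes from the latest checkpoint.
    estimator.train(train_input_fn, steps=steps_per_round)
    # Evaluate on 30 batches of validation data, as in the EvalSpec above.
    metrics = estimator.evaluate(eval_input_fn, steps=30, name='validation')
    tf.logging.info("validation metrics: %s" % metrics)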

Answer 1:

In fact, every 200 seconds or when your training finishes, the estimator will switch from the training phase to the evaluation one.

However, we can see in your code that you reach the 125 steps before the evaluation starts, which means your training has finished. max_steps is the total number of training steps before stopping; it has no link with the number of epochs (the epoch count is not used by tf.estimator.train_and_evaluate). During training, your evaluation will occur every throttle_secs (=200 here).
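For illustration (a hypothetical configuration, not the code from the question), a TrainSpec with a much larger max_steps lets training outlast several throttle windows, so several evaluations happen along the way; note also that an evaluation only runs when a new checkpoint has been saved since the previous one, so save_checkpoints_steps in the RunConfig interacts with throttle_secs:

train_spec = tf.estimator.TrainSpec(
    train_input_fn,
    max_steps=100000   # hypothetical: large enough to span several throttle windows
)
eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    steps=30,
    start_delay_secs=120,  # wait before the first evaluation
    throttle_secs=200      # then at most one evaluation every 200 seconds
)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)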

As for the metrics, you can add them inside your model with:

predict = tf.nn.softmax(logits, name="softmax_tensor")
classes = tf.cast(tf.argmax(predict, 1), tf.uint8)

def conv_model_eval_metrics(classes, labels, mode):
    if mode == tf.estimator.ModeKeys.TRAIN or mode == tf.estimator.ModeKeys.EVAL:
        return {
            'accuracy': tf.metrics.accuracy(classes, labels),
            'precision': tf.metrics.precision(classes, labels),
            'recall': tf.metrics.recall(classes, labels),
        }
    else:
        return None

eval_metrics = conv_model_eval_metrics(classes, labels, mode)
with tf.variable_scope("performance_metrics"):
    # Accuracy is the most intuitive performance measure: the ratio of
    # correctly predicted observations to the total observations.
    tf.summary.scalar('accuracy', eval_metrics['accuracy'][1])
    # How many selected items are relevant: precision is the ratio of correctly
    # predicted positive observations to the total predicted positive observations.
    tf.summary.scalar('precision', eval_metrics['precision'][1])
    # How many relevant items are selected: recall is the ratio of correctly
    # predicted positive observations to all observations in the actual class.
    tf.summary.scalar('recall', eval_metrics['recall'][1])

This works pretty well for following precision, recall, and accuracy on TensorBoard during both training and evaluation.
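Note that for these metrics to also appear in the results dict printed at the end of each evaluation, they need to be returned from your model_fn as eval_metric_ops. A minimal sketch, assuming loss is already defined earlier in your model_fn:

if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=loss,                     # assumed defined earlier in model_fn
        eval_metric_ops=eval_metrics)  # the dict of (value, update_op) pairs from above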

PS: Sorry, this is my first answer, which is why it is a bit rough to read ^^



Answer 2:

One can control the repetitions via the tf.data.Dataset.repeat(num_epochs) that one sets in input_fn(). The training function will run until that number of epochs is consumed, then the evaluation function will run, then the training function will run again for the same number of epochs, and so on; finally, train_and_evaluate will stop when the max_steps defined in TrainSpec is reached.
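A minimal sketch of such an input_fn, matching the signature used in the question (TF 1.x style; parse_example is a placeholder for your own tf.parse_single_example logic):

def input_fn(filename, train, batch_size, num_epochs):
    dataset = tf.data.TFRecordDataset(filename)
    dataset = dataset.map(parse_example)  # placeholder: your parsing function
    if train:
        dataset = dataset.shuffle(buffer_size=10000)
    # When these epochs are consumed, training pauses and evaluation runs.
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()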

This is a conclusion I drew from a few experiments; corrections are welcome.


