I'm trying to understand the difference between using tf.Session and tf.train.MonitoredTrainingSession, and where I might prefer one over the other. It seems that when I use the latter, I can avoid many "chores" such as initializing variables, starting queue runners, or setting up file writers for summary operations. On the other hand, with a monitored training session, I cannot specify the computation graph I want to use explicitly. All of this seems rather mysterious to me. Is there some underlying philosophy behind how these classes were created that I'm not understanding?
问题:
回答1:
I can't give some insights on how these classes were created, but here are a few things which I think are relevants on how you could use them.
The tf.Session is a low level object in the python TensorFlow API while, as you said, the tf.train.MonitoredTrainingSession comes with a lot of handy features, especially useful in most of the common cases.
Before describing some of the benefits of tf.train.MonitoredTrainingSession, let me answer the question about the graph used by the session. You can specify the tf.Graph used by the MonitoredTrainingSession by using a context manager with your_graph.as_default():
from __future__ import print_function import tensorflow as tf def example(): g1 = tf.Graph() with g1.as_default(): # Define operations and tensors in `g`. c1 = tf.constant(42) assert c1.graph is g1 g2 = tf.Graph() with g2.as_default(): # Define operations and tensors in `g`. c2 = tf.constant(3.14) assert c2.graph is g2 # MonitoredTrainingSession example with g1.as_default(): with tf.train.MonitoredTrainingSession() as sess: print(c1.eval(session=sess)) # Next line raises # ValueError: Cannot use the given session to evaluate tensor: # the tensor's graph is different from the session's graph. try: print(c2.eval(session=sess)) except ValueError as e: print(e) # Session example with tf.Session(graph=g2) as sess: print(c2.eval(session=sess)) # Next line raises # ValueError: Cannot use the given session to evaluate tensor: # the tensor's graph is different from the session's graph. try: print(c1.eval(session=sess)) except ValueError as e: print(e) if __name__ == '__main__': example() So, as you said, the benefits of using MonitoredTrainingSession are that, this object takes care of
- initialising variables,
- starting queue runner as well as
- setting up the file writers,
but it has also the benefit of making your code easy to distribute as it also works differently depending if you specified the running process as a master or not.
For example you could run something like:
def run_my_model(train_op, session_args): with tf.train.MonitoredTrainingSession(**session_args) as sess: sess.run(train_op) that you would call in a non-distributed way:
run_my_model(train_op, {})` or in a distributed way (see the distributed doc for more information on the inputs):
run_my_model(train_op, {"master": server.target, "is_chief": (FLAGS.task_index == 0)}) On the other hand, the benefit of using the raw tf.Session object is that, you don't have the extra benefits of tf.train.MonitoredTrainingSession, which can be useful if you don't plan to use them or if you want to get more control (for example on how the queues are started).
EDIT (as per comment): For the op initialisation, you would have to do something like (cf. official doc:
# Define your graph and your ops init_op = tf.global_variables_initializer() with tf.Session() as sess: sess.run(init_p) sess.run(your_graph_ops,...) For the QueueRunner, I would refer you to the official doc where you will find more complete examples.
EDIT2:
The main concept to understand to get a sense on how tf.train.MonitoredTrainingSession works is the _WrappedSession class:
This wrapper is used as a base class for various session wrappers that provide additional functionality such as monitoring, coordination, and recovery.
The tf.train.MonitoredTrainingSession works (as of version 1.1) this way:
- It first checks if it is a chief or a worker (cf. the distributed doc for lexical question).
- It begins the hooks which have been provided (for example,
StopAtStepHookwould just retrieve theglobal_steptensor at this stage. - It creates a session which is a
Chief(orWorkersession) wrapped into a_HookedSessionwrapped into a_CoordinatedSessionwrapped into a_RecoverableSession.
TheChief/Workersessions are in charge of running initialising ops provided by theScaffold.scaffold: A `Scaffold` used for gathering or building supportive ops. If not specified a default one is created. It's used to finalize the graph. - The
chiefsession also takes care of all the checkpoint parts: e.g. restoring from checkpoints using theSaverfrom theScaffold. - The
_HookedSessionis basically there to decorate therunmethod: it calls the_call_hook_before_runandafter_runmethods when relevant. - At creation the
_CoordinatedSessionbuilds aCoordinatorwhich starts the queue runners and will be responsible of closing them. - The
_RecoverableSessionwill insures that there is retry in case oftf.errors.AbortedError.
In conclusion, the tf.train.MonitoredTrainingSession avoids a lot of boiler plate code while being easily extendable with the hooks mechanism.