TensorFlow Execution on a single (multi-core) CPU Device


Question


I have some questions regarding the execution model of TensorFlow in the specific case in which there is only a CPU device and the network is used only for inference, for instance using the Image Recognition C++ example (https://www.tensorflow.org/tutorials/image_recognition) on a multi-core platform.

In the following, I will try to summarize what I understood, while asking some questions.

Session->Run() (file direct_session.cc) calls ExecutorState::RunAsync, which initializes the TensorFlow ready queue with the root nodes.
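To make my mental model concrete, here is a toy sketch of what I believe that initialization amounts to (hypothetical simplified types, not the real executor code):

#include <deque>
#include <vector>

// Hypothetical simplified node: just an id and a pending-input count.
struct Node {
  int id;
  int pending_inputs;  // in-edges not yet satisfied
};

// Seed the ready queue with the root nodes, i.e. nodes whose
// pending-input count is already zero.
std::deque<const Node*> InitReadyQueue(const std::vector<Node>& graph) {
  std::deque<const Node*> ready;
  for (const Node& n : graph) {
    if (n.pending_inputs == 0) ready.push_back(&n);  // root node
  }
  return ready;
}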

Then, the instruction

runner_([=]() { Process(tagged_node, scheduled_usec); }); (executor.cc, function ScheduleReady, line 2088)

assigns the node (and hence the related operation) to a thread of the inter_op pool. However, I do not fully understand how it works. For instance, if ScheduleReady is trying to assign more operations than the size of the inter_op pool, how are operations enqueued? (FIFO order?) Does each thread of the pool have its own queue of operations, or is there a single shared queue? Where can I find this in the code? Where can I find the body of each thread of the pools?

Another question regards the nodes managed by inline_ready. How does the execution of these (inexpensive or dead) nodes differ from that of the other nodes?

Then (still, to my understanding) the execution flow continues from ExecutorState::Process, which executes the operation, distinguishing between synchronous and asynchronous operations. How do synchronous and asynchronous operations differ in terms of execution?

When the operation has been executed, PropagateOutputs (which calls ActivateNodes) adds to the ready queue every successor node that has become ready thanks to the execution of the current (predecessor) node.

Finally, NodeDone() calls ScheduleReady(), which processes the nodes currently in the TensorFlow ready queue.
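Put differently, I picture the propagation step roughly like this (again a toy sketch with made-up types, not the actual executor.cc):

#include <deque>
#include <vector>

// Hypothetical simplified node for the dataflow graph.
struct Node {
  std::vector<Node*> out_edges;  // successors in the graph
  int pending = 0;               // inputs not yet produced
};

// When a node finishes, decrement each successor's pending-input count
// and enqueue the successors that just became ready.
void PropagateOutputs(Node* finished, std::deque<Node*>* ready) {
  for (Node* succ : finished->out_edges) {
    if (--succ->pending == 0) ready->push_back(succ);  // became ready
  }
}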

Conversely, how the intra_op thread pool is managed depends on the specific kernel, right? Is it possible for a kernel to request more operations than the intra_op thread pool size? If so, in what order are they enqueued? (FIFO?)

Once operations are assigned to threads of the pool, is their scheduling left to the underlying operating system, or does TensorFlow enforce some kind of scheduling policy?

I'm asking here because I found almost nothing about this part of the execution model in the documentation; if I missed some documents, please point me to them.


Answer 1:


Re ThreadPool: When TensorFlow uses DirectSession (as it does in your case), it uses Eigen's ThreadPool. I could not get a web link to the official version of Eigen used in TensorFlow, but here is a link to the thread pool code. This thread pool uses the RunQueue queue implementation; there is one queue per thread.
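To illustrate the per-thread-queue design, here is a minimal work-stealing pool in the spirit of Eigen's ThreadPool/RunQueue. This is a didactic sketch using mutexes and yielding; the real RunQueue is lock-free, and the names here are invented:

#include <atomic>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// One task queue per worker thread; idle workers steal from the others.
class SimplePool {
 public:
  explicit SimplePool(int n) : n_(n), queues_(n), mutexes_(n) {
    for (int i = 0; i < n; ++i)
      threads_.emplace_back([this, i] { Work(i); });
  }
  ~SimplePool() {  // note: a real pool would drain pending tasks first
    done_ = true;
    for (auto& t : threads_) t.join();
  }
  void Schedule(std::function<void()> fn) {
    // Round-robin placement; Eigen instead prefers the current thread's
    // own queue when Schedule is called from inside the pool.
    int q = next_.fetch_add(1) % n_;
    std::lock_guard<std::mutex> l(mutexes_[q]);
    queues_[q].push_back(std::move(fn));
  }

 private:
  void Work(int me) {
    while (!done_) {
      std::function<void()> task;
      for (int k = 0; k < n_ && !task; ++k) {
        int q = (me + k) % n_;  // own queue first, then steal from others
        std::lock_guard<std::mutex> l(mutexes_[q]);
        if (!queues_[q].empty()) {
          task = std::move(queues_[q].front());
          queues_[q].pop_front();
        }
      }
      if (task) task();
      else std::this_thread::yield();
    }
  }
  const int n_;
  std::vector<std::deque<std::function<void()>>> queues_;
  std::vector<std::mutex> mutexes_;
  std::vector<std::thread> threads_;
  std::atomic<int> next_{0};
  std::atomic<bool> done_{false};
};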

Re inline_ready: Executor::Process is scheduled on some Eigen thread. When it runs, it executes some nodes. As these nodes finish, they make other nodes (TensorFlow operations) ready. Some of these nodes are not expensive; they are added to inline_ready and executed in the same thread, without yielding. Other nodes are expensive and are not executed "immediately" in the same thread; their execution is scheduled through the Eigen thread pool.
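Roughly, the logic looks like this (a sketch with hypothetical helper names, not the real Process):

#include <functional>
#include <vector>

// Hypothetical stand-ins for kernel execution and graph bookkeeping.
struct TaggedNode { int id; bool expensive; };
void ExecuteKernel(const TaggedNode&) { /* run the op */ }
std::vector<TaggedNode> ReadySuccessors(const TaggedNode&) { return {}; }

// Cheap (or dead) successors loop in the current thread via inline_ready;
// expensive ones are pushed to the inter-op pool through the runner.
void Process(TaggedNode first,
             std::function<void(std::function<void()>)> runner) {
  std::vector<TaggedNode> inline_ready{first};
  while (!inline_ready.empty()) {
    TaggedNode n = inline_ready.back();
    inline_ready.pop_back();
    ExecuteKernel(n);
    for (const TaggedNode& s : ReadySuccessors(n)) {
      if (!s.expensive)
        inline_ready.push_back(s);                    // stay in this thread
      else
        runner([s, runner] { Process(s, runner); });  // go to the pool
    }
  }
}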

Re sync/async kernels: TensorFlow operations can be backed by synchronous kernels (most CPU kernels) or asynchronous kernels (most GPU kernels). Synchronous kernels are executed in the thread running Process. Asynchronous kernels are dispatched to their device (usually the GPU) to be executed. When asynchronous kernels finish, they invoke the NodeDone method.
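A sketch of that dispatch, with a deliberately simplified kernel interface (the real classes are OpKernel, with Compute, and AsyncOpKernel, with ComputeAsync and a done callback):

#include <functional>
#include <utility>

// Hypothetical simplified kernel interface, just to show the two paths.
struct KernelSketch {
  virtual ~KernelSketch() = default;
  virtual bool IsAsync() const = 0;
  virtual void Compute() {}                                     // sync path
  virtual void ComputeAsync(std::function<void()> /*done*/) {}  // async path
};

void RunKernel(KernelSketch* k, std::function<void()> node_done) {
  if (k->IsAsync()) {
    // Async kernel (e.g. GPU): enqueue the work and return immediately;
    // node_done fires later, when the device signals completion.
    k->ComputeAsync(std::move(node_done));
  } else {
    k->Compute();  // blocks this inter-op thread until the op is finished
    node_done();
  }
}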

Re intra-op ThreadPool: The intra-op thread pool is made available to kernels so they can run their computation in parallel. Most CPU kernels don't use it (and GPU kernels just dispatch to the GPU) and run synchronously in the thread that called the Compute method. Depending on the configuration, there is either one intra-op thread pool shared by all devices (CPUs), or each device has its own. Kernels simply schedule their work on this thread pool; here is an example of one such kernel, and a sketch follows below. If there are more tasks than threads, they are scheduled and executed in unspecified order. Here is the ThreadPool interface exposed to kernels.
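For instance, a custom CPU kernel could shard its work over the intra-op pool roughly like this. The accessor chain and the ParallelFor signature follow the tensorflow::thread::ThreadPool interface mentioned above, but treat the exact names as assumptions that may vary across TensorFlow versions:

#include "tensorflow/core/framework/op_kernel.h"

// Sketch of a custom CPU kernel that shards work over the intra-op pool.
class MyParallelOp : public tensorflow::OpKernel {
 public:
  using tensorflow::OpKernel::OpKernel;

  void Compute(tensorflow::OpKernelContext* ctx) override {
    auto* workers =
        ctx->device()->tensorflow_cpu_worker_threads()->workers;
    const tensorflow::int64 total = 1 << 20;     // number of work units
    const tensorflow::int64 cost_per_unit = 50;  // rough per-unit cost hint
    // ParallelFor splits [0, total) into shards and runs them on the
    // intra-op threads; each shard receives a [begin, end) range.
    workers->ParallelFor(
        total, cost_per_unit,
        [](tensorflow::int64 begin, tensorflow::int64 end) {
          for (tensorflow::int64 i = begin; i < end; ++i) {
            // ... compute element i ...
          }
        });
  }
};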

I don't know of any way TensorFlow influences the scheduling of OS threads. You can ask it to do some spinning (i.e. not immediately yield the thread to the OS) to minimize latency from OS scheduling, but that is about it.

These internal details are not documented on purpose, as they are subject to change. If you are using TensorFlow through the Python API, all you should need to know is that your ops will execute when their inputs are ready. If you want to enforce some ordering beyond this, you should use:

with tf.control_dependencies(<tensors_that_you_want_computed_before_the_ops_inside_this_block>):
  tf.foo_bar(...) 

If you are writing a custom CPU kernel and want to do parallelism inside it (rarely needed, and usually only for very expensive kernels), the thread pool interface linked above is what you can rely on.



Source: https://stackoverflow.com/questions/47416445/tensorflow-execution-on-a-single-multi-core-cpu-device
