Why is the per sample prediction time on Tensorflow (and Keras) lower when predicting on batches than on individual samples?

Submitted by ∥☆過路亽.° on 2021-01-28 18:54:04

Question


I am using my trained model to make predictions (CPU only). I observe that, both on TensorFlow and on Keras with the TensorFlow backend, the prediction time per sample is much lower when a batch of samples is used than when predicting on an individual sample. Moreover, the per-sample time seems to go down with increasing batch size, up to the limits imposed by memory. As an example, on pure TensorFlow, prediction on a single sample takes ~1.5 seconds; on 100 samples it is ~17 seconds (~0.17 s per sample); on 1000 samples it is ~93 seconds (~0.093 s per sample).

Is this normal behavior? If so, is there an intuitive explanation for it? I guess it might have something to do with initializing the graph, but I need some clarification. Also, why does the per-sample time go down as we increase the number of samples in a prediction? In my use case, I have to predict on individual samples as and when they become available, so I would obviously be losing quite a bit of speed if this is the way things work.
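On the Keras side, my comparison boils down to something like the following sketch (the model path and the single placeholder input are just to keep it short; the real model is the multi-input one in the edit below):

import time
import numpy as np
from tensorflow import keras

# Placeholder model path and input shape, not my actual network
model = keras.models.load_model("model.h5")
x = np.zeros((1000, 64, 64, 1))

# One predict() call per sample
t0 = time.time()
for i in range(len(x)):
    model.predict(x[i:i + 1])
print("Per-sample calls:", (time.time() - t0) / len(x), "s per sample")

# One batched predict() call
t0 = time.time()
model.predict(x, batch_size=len(x))
print("Single batched call:", (time.time() - t0) / len(x), "s per sample")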

Thanks in advance for your help.

Edit: I am adding a minimal working example. My model has one image input and 4 vector inputs, and it produces 4 outputs. I am initializing all inputs to 0 for the speed test (I guess the actual values don't matter much for speed?). The initialization time and the inference time are measured separately. I find that the initialization time is only a fraction of the inference time (~0.1 s for 100 samples).

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import time
import numpy as np
import tensorflow as tf

t00 = time.time()

# Load the frozen graph from disk and import it into a new tf.Graph
graph = tf.Graph()
graph_def = tf.GraphDef()
with open("output_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())
with graph.as_default():
    tf.import_graph_def(graph_def)

# One image and 4 auxiliary scalar inputs
img_input_layer ="input"
qp4_input_layer ="qp4"
qp3_input_layer ="qp3"
qp2_input_layer ="qp2"
qp1_input_layer ="qp1"
input_name = "import/" + img_input_layer
qp4_input_name = "import/" + qp4_input_layer
qp3_input_name = "import/" + qp3_input_layer
qp2_input_name = "import/" + qp2_input_layer
qp1_input_name = "import/" + qp1_input_layer



input_operation_img = graph.get_operation_by_name(input_name)
input_operation_qp4 = graph.get_operation_by_name(qp4_input_name)
input_operation_qp3 = graph.get_operation_by_name(qp3_input_name)
input_operation_qp2 = graph.get_operation_by_name(qp2_input_name)
input_operation_qp1 = graph.get_operation_by_name(qp1_input_name)

# Collect the output tensors of the four model outputs
output_operation = []
for i in range(4):
    output_operation.append(graph.get_operation_by_name("import/output_" + str(i)).outputs)

# Initialize dummy inputs (all zeros)
n = 100  # number of samples for inference
img = np.zeros([n, 64, 64, 1])
qp4 = np.zeros([n, 1, 1, 1])
qp3 = np.zeros([n, 2, 2, 1])
qp2 = np.zeros([n, 4, 4, 1])
qp1 = np.zeros([n, 8, 8, 1])
t01 = time.time()
print("Initialization time", t01 - t00)

t0 = time.time()
with tf.Session(graph=graph) as sess:
    results = sess.run(output_operation,
                       {input_operation_img.outputs[0]: img,
                        input_operation_qp4.outputs[0]: qp4,
                        input_operation_qp3.outputs[0]: qp3,
                        input_operation_qp2.outputs[0]: qp2,
                        input_operation_qp1.outputs[0]: qp1})
    # print(results)
t1 = time.time()
print("Inference time", t1 - t0)

Answer 1:


This depends very much on the model instrumentation, deployment method, and interface, none of which you've provided or even described. In my experience, the common causes include:

  • Model initialization time: do you "wake up" the model in some way for each batch? If, as you've suggested, you reinitialize the model for each request, then I'm somewhat surprised that the overhead isn't a larger proportion of your time.
  • Interface overhead: how do the samples get to/from the model? Is this in an HTTP request, where you suffer communication costs per request, rather than per sample?
  • Simple model I/O time: if your model reads the entire batch at once, then the time needed to open and access the input channel could be the main factor in the overall lag.

You have some work to do to diagnose the root causes. You have only three data points so far; I recommend gathering a few more. What can you fit to a graph of the times? Are there any jumps that suggest a system limitation, such as input buffer size? Can you instrument your model with some profiling code to find out which lags are in your model and which are in the system overhead?
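For the TF 1.x graph in your example, one way to get that per-op breakdown is the built-in step tracing. A rough sketch, reusing the graph, operation handles, and feed values from your script (the output path is arbitrary):

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session(graph=graph) as sess:
    # Same run as in your script, but with tracing enabled
    sess.run(output_operation,
             {input_operation_img.outputs[0]: img,
              input_operation_qp4.outputs[0]: qp4,
              input_operation_qp3.outputs[0]: qp3,
              input_operation_qp2.outputs[0]: qp2,
              input_operation_qp1.outputs[0]: qp1},
             options=run_options, run_metadata=run_metadata)

# Write a Chrome-trace file; open it at chrome://tracing to inspect per-op timings
with open("timeline.json", "w") as f:
    f.write(timeline.Timeline(run_metadata.step_stats).generate_chrome_trace_format())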

Try deploying your model as a service; what happens to the times when it's already loaded into memory, initialized, and merely waiting for the next input? What lag time do you have in your request interface?
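Short of a full service deployment, you can approximate that by building the session once, warming it up, and then timing repeated single-sample runs against the already-live session. A sketch along these lines, again reusing the handles from your script:

import time
import numpy as np
import tensorflow as tf

# Single-sample feed; shapes taken from your example
feeds = {input_operation_img.outputs[0]: np.zeros([1, 64, 64, 1]),
         input_operation_qp4.outputs[0]: np.zeros([1, 1, 1, 1]),
         input_operation_qp3.outputs[0]: np.zeros([1, 2, 2, 1]),
         input_operation_qp2.outputs[0]: np.zeros([1, 4, 4, 1]),
         input_operation_qp1.outputs[0]: np.zeros([1, 8, 8, 1])}

with tf.Session(graph=graph) as sess:
    sess.run(output_operation, feeds)   # warm-up: graph setup and memory allocation
    t0 = time.time()
    for _ in range(100):                # 100 single-sample "requests" against a live session
        sess.run(output_operation, feeds)
    print("Per-sample time with a persistent session:", (time.time() - t0) / 100)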

The results of these investigations will show you where you might gain from a design change in your usage model.




Answer 2:


Yes, it's completely normal. When you do inference on a GPU (or even a multicore CPU), increasing the batch size allows better use of the hardware's parallel computation resources, which decreases the time per sample within the batch. If you use a small batch size, you are wasting computational resources that the device has available.

This paper describes the same effect: one of its figures contains a plot of inference time per image versus batch size, matching what you are seeing.
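You can reproduce that kind of curve for your own graph by sweeping the batch size inside a single session. A sketch, reusing the graph, operation handles, and input shapes from your question:

import time
import numpy as np
import tensorflow as tf

with tf.Session(graph=graph) as sess:
    for n in [1, 10, 100, 1000]:
        feeds = {input_operation_img.outputs[0]: np.zeros([n, 64, 64, 1]),
                 input_operation_qp4.outputs[0]: np.zeros([n, 1, 1, 1]),
                 input_operation_qp3.outputs[0]: np.zeros([n, 2, 2, 1]),
                 input_operation_qp2.outputs[0]: np.zeros([n, 4, 4, 1]),
                 input_operation_qp1.outputs[0]: np.zeros([n, 8, 8, 1])}
        sess.run(output_operation, feeds)          # warm-up run, excluded from timing
        t0 = time.time()
        sess.run(output_operation, feeds)
        print("batch size", n, "-> per-sample time", (time.time() - t0) / n)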



Source: https://stackoverflow.com/questions/52765373/why-is-the-per-sample-prediction-time-on-tensorflow-and-keras-lower-when-predi
