Question
Consider the following TensorFlow code snippet:
import time

import numpy as np
import tensorflow as tf


def fn(i):
    # do some junk work
    for _ in range(100):
        i ** 2
    return i


n = 1000
n_jobs = 8
stuff = np.arange(1, n + 1)

eager = False

t0 = time.time()
if eager:
    tf.enable_eager_execution()

res = tf.map_fn(fn, stuff, parallel_iterations=n_jobs)

if not eager:
    with tf.Session() as sess:
        res = sess.run(res)
        print(sum(res))
else:
    print(sum(res))

dt = time.time() - t0
print("(eager=%s) Took %ims" % (eager, dt * 1000))
If run with eager = True, it is 10x slower than when run with eager = False. I added some prints and found that in eager = True mode, the map_fn call runs sequentially instead of spawning 8 parallel threads.
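A minimal way to observe this (a sketch along the lines of the OP's prints, not their exact code) is to log the executing thread inside fn; in eager mode it prints the same main thread for every element, one element at a time:

import threading

def fn(i):
    # Eager mode: printed once per element, always from the same thread,
    # i.e. the elements are processed sequentially.
    print(threading.current_thread().name)
    for _ in range(100):
        i ** 2
    return i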
Question
So my question is: how do I use map_fn (with parallel_iterations > 1) in eager execution mode?
Answer 1:
More than an answer to the OP's question, this is an extension of it, showing why the other answers do not address the real problem: tf.function is not enough to force parallelism.

First, tf.function does not force parallelization. It forces tracing and the construction of a graph; this happens just once, so the time.sleep() used in other answers runs only the first time the tracing is necessary. That is why you see a speedup with tf.function, but you still don't see a difference when changing parallel_iterations.
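A minimal sketch of this trace-time behaviour (my example, assuming TF2.x): the Python-level sleep runs while the function is traced, so only the first call pays for it:

import time
import tensorflow as tf

@tf.function
def f(x):
    time.sleep(1.0)  # Python side effect: runs only during tracing
    return x + 1

t0 = time.time()
f(tf.constant(1))                  # first call traces: ~1 s
print("first:  %.2fs" % (time.time() - t0))

t0 = time.time()
f(tf.constant(2))                  # same signature, no retrace: fast
print("second: %.2fs" % (time.time() - t0))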
Let's use a py_function to see the difference:
def op(x):
    time.sleep(1)
    return 2 * x.numpy()

def op_tf(x):
    print('Tracing')
    return tf.py_function(op, [x], Tout=tf.int32)
Without the tf.function decorator (i.e. calling op_tf directly), every call will print "Tracing" (though in this case it is not actually tracing):
In [57]: op_tf(1)
Tracing
Out[57]: <tf.Tensor: shape=(), dtype=int32, numpy=2>
In [58]: op_tf(1)
Tracing
Out[58]: <tf.Tensor: shape=(), dtype=int32, numpy=2>
With tf.function we see "Tracing" just once (if we use the same arguments):
In [67]: @tf.function
    ...: def op_tf(x):
    ...:     print("Tracing")
    ...:     return tf.py_function(op, [x], Tout=tf.int32)
    ...:
In [68]: op_tf(1)
Tracing
Out[68]: <tf.Tensor: shape=(), dtype=int32, numpy=2>
In [69]: op_tf(2)
Tracing
Out[69]: <tf.Tensor: shape=(), dtype=int32, numpy=4>
In [70]: op_tf(3)
Tracing
Out[70]: <tf.Tensor: shape=(), dtype=int32, numpy=6>
In [71]: op_tf(3)
Out[71]: <tf.Tensor: shape=(), dtype=int32, numpy=6>
This happens because the function has to build a new graph for every new argument; if we pass an input signature directly, we avoid that:
In [73]: @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.int32)])
    ...: def op_tf(x):
    ...:     print("Tracing")
    ...:     return tf.py_function(op, [x], Tout=tf.int32)
    ...:
In [74]: op_tf(1)
Tracing
Out[74]: <tf.Tensor: shape=(), dtype=int32, numpy=2>
In [75]: op_tf(2)
Out[75]: <tf.Tensor: shape=(), dtype=int32, numpy=4>
In [76]: op_tf(3)
Out[76]: <tf.Tensor: shape=(), dtype=int32, numpy=6>
The same happens if we first call the method get_concrete_function:
In [79]: @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.int32)])
    ...: def op_tf(x):
    ...:     print("Tracing")
    ...:     return tf.py_function(op, [x], Tout=tf.int32)
    ...:
In [80]: op_tf = op_tf.get_concrete_function()
Tracing
In [81]: op_tf(1)
Out[81]: <tf.Tensor: shape=(), dtype=int32, numpy=2>
In [82]: op_tf(2)
Out[82]: <tf.Tensor: shape=(), dtype=int32, numpy=4>
So the answers claiming that just adding tf.function is enough to get parallel execution are not fully correct:
In [84]: def op(x):
    ...:     print("sleep")
    ...:     time.sleep(0.1)
    ...:     return 1.
    ...:

In [85]: x = tf.ones(shape=(10,))

In [86]: _ = tf.map_fn(op, x, parallel_iterations=10)
sleep
sleep
sleep
sleep
sleep
sleep
sleep
sleep
sleep
sleep
In [87]: @tf.function
    ...: def my_map(*args, **kwargs):
    ...:     return tf.map_fn(*args, **kwargs)
    ...:
In [88]: my_map(op, x, parallel_iterations=10)
sleep
Out[88]: <tf.Tensor: shape=(10,), dtype=float32, numpy=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32)>
In comparison, if the Python instructions for sleep and print are inside a py_function, they will always be called:
In [96]: x = tf.ones(shape=(10,), dtype=tf.int32)

In [97]: def op(x):
    ...:     print("sleep")
    ...:     time.sleep(0.1)
    ...:     return 2 * x.numpy()
    ...:

In [98]: @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.int32)])
    ...: def op_tf(x):
    ...:     print("Tracing")
    ...:     return tf.py_function(op, [x], Tout=tf.int32)
    ...:
In [99]: _ = my_map(op_tf, x, parallel_iterations=1)
Tracing
sleep
sleep
sleep
sleep
sleep
sleep
sleep
sleep
sleep
sleep
Now that it is clear that tracing is a source of confusion, let's remove the prints and time the calls:
In [106]: def op(x):
     ...:     time.sleep(0.1)
     ...:     return 2 * x.numpy()
     ...:

In [107]: @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.int32)])
     ...: def op_tf(x):
     ...:     return tf.py_function(op, [x], Tout=tf.int32)
     ...:
In [108]: %timeit tf.map_fn(op_tf, x, parallel_iterations=1)
1.02 s ± 554 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [109]: %timeit tf.map_fn(op_tf, x, parallel_iterations=10)
1.03 s ± 509 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
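Note that both timings land at roughly 10 × 0.1 s ≈ 1 s: the ten sleeps inside the py_function run back-to-back regardless of parallel_iterations, which is exactly the sequential behaviour the profile below confirms.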
Running the following script and inspecting it with TensorBoard, we can see exactly what's happening:
import tensorflow as tf
import time
from datetime import datetime

stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
logdir = 'logs/func/%s' % stamp

# Start tracing.
options = tf.profiler.experimental.ProfilerOptions(
    host_tracer_level=3, python_tracer_level=1, device_tracer_level=1, delay_ms=None
)
tf.profiler.experimental.start(logdir, options=options)

def op(x):
    x = x.numpy()
    start = time.time()
    while time.time() < start + x / 100:
        x = (2 * x) % 123
    return x

@tf.function(input_signature=[tf.TensorSpec([], tf.int32)])
def op_tf(x):
    return tf.py_function(op, [x], Tout=tf.int32, name='op')

@tf.function(input_signature=[tf.TensorSpec([None], tf.int32)])
def my_map(x):
    return tf.map_fn(op_tf, x, parallel_iterations=16)

x = tf.ones(100, tf.int32)
print(my_map(x))

tf.profiler.experimental.stop()
In the resulting TensorBoard trace, the py_function is effectively using several threads, but not in parallel. With parallel_iterations=1 we obtain something similar.
If we add the following at the beginning of the script

tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)

we force TF to use a single thread for all graph computations.
So, at this point we can only get some form of parallel execution if we set the inter/intra thread pools appropriately.
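For reference, a minimal sketch of setting those knobs (to my knowledge they must be set before TensorFlow initializes its runtime, i.e. before any op executes, otherwise the setters raise a RuntimeError):

import tensorflow as tf

# Configure the thread pools first thing, before any op runs.
tf.config.threading.set_inter_op_parallelism_threads(16)
tf.config.threading.set_intra_op_parallelism_threads(16)

print(tf.config.threading.get_inter_op_parallelism_threads())  # 16
print(tf.config.threading.get_intra_op_parallelism_threads())  # 16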
If we disable Eager execution completely:
import time
from datetime import datetime

import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

tf.config.threading.set_inter_op_parallelism_threads(128)
tf.config.threading.set_intra_op_parallelism_threads(128)

stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
logdir = f'logs/func/{stamp}'
tf.profiler.experimental.start(logdir)

def op(x):
    x = x.numpy()
    start = time.time()
    while time.time() < start + x / 100:
        x = (2 * x) % 123
    return x

@tf.function(input_signature=[tf.TensorSpec([], tf.int32)])
def op_tf(x):
    return tf.py_function(op, [x], Tout=tf.int32, name='op')

# Create a placeholder.
x = tf.compat.v1.placeholder(tf.int32, shape=[None])

with tf.compat.v1.Session() as sess:
    writer = tf.summary.create_file_writer(logdir)
    # tf.profiler.experimental.start(logdir, options=options)
    tf.summary.trace_on(graph=True, profiler=True)
    print(
        sess.run(
            [tf.map_fn(op_tf, x, parallel_iterations=16)],
            feed_dict={
                x: np.ones(4, dtype=np.int)
            }
        )
    )

tf.profiler.experimental.stop()
we can now see parallel execution in the TensorBoard trace.
And if we set the intra/inter threads and parallel_iterations to 1, we get the previous behaviour back.
I hope this helps to clarify the role of tf.function in achieving full parallelism.
Answer 2:
Crudely speaking, tf.map_fn(fn, data) is essentially shorthand for:

for e in data:
    fn(e)
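Slightly more precisely (my paraphrase of the tf.map_fn documentation), the results are stacked back into a single tensor:

# Roughly equivalent: unstack along axis 0, apply fn, restack.
res = tf.stack([fn(e) for e in tf.unstack(data)])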
When eager execution is enabled, operations are executed as the Python interpreter encounters them, and thus there are no opportunities for "whole program optimizations".
When executing TensorFlow graphs, the TensorFlow runtime sees the complete computation to be executed and can thus apply optimizations such as "execute operations in fn from multiple iterations of the loop in parallel". This is one of the benefits of expressing the computation as a graph.
When eager execution in TensorFlow is enabled, you can still selectively apply graph optimizations to portions of your program using tf.contrib.eager.defun.
For example (where most of the code is the same as yours above, with a one-line change to use tf.contrib.eager.defun to get graph optimization benefits):
import time

import numpy as np
import tensorflow as tf

tf.enable_eager_execution()


def fn(i):
    # do some junk work
    for _ in range(100):
        i ** 2
    return i


n = 1000
n_jobs = 8
stuff = np.arange(1, n + 1)


def my_computation(x):
    return tf.map_fn(fn, x, parallel_iterations=n_jobs)


t0 = time.time()
my_computation(stuff)
dt = time.time() - t0
print("my_computation took %ims" % (dt * 1000))

my_computation = tf.contrib.eager.defun(my_computation)
# On the very first call, a graph is constructed, so let's discount
# graph construction time
_ = my_computation(stuff)
# And then time it
t0 = time.time()
my_computation(stuff)
dt = time.time() - t0
print("my_computation took %ims" % (dt * 1000))
Some additional things of note:
- In the particular example you've provided above, the TensorFlow runtime would probably also detect that fn(i) reduces to return i and can optimize away the unnecessary loop of range(100), since that does not affect the output. So the contrast in performance is quite stark (when executing fn(i) eagerly, the Python interpreter has no way of knowing that the for loop in there is useless, so it will execute it). If you change the computation in fn() to be something more meaningful, say:

      def fn(i):
          for _ in range(2):
              i = i ** 2
          return i

  then you'll see a less stark difference.
- Note that not everything that can be expressed in Python can be "defun"ed. See the documentation for tf.contrib.eager.defun for some details; a more detailed spec and implementation is proposed for TensorFlow 2.0 (see the RFC). One consequence, illustrated in the sketch below, is that Python side effects run only while the function is being traced.
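As a hypothetical illustration of that limitation (my example, written with the TF2 tf.function API rather than defun): Python side effects inside a traced function run at trace time only, not on every call:

import tensorflow as tf

trace_log = []

@tf.function
def g(x):
    # Python side effect: executed only while tracing, not per call.
    trace_log.append(len(trace_log))
    return x * 2

g(tf.constant(1.0))
g(tf.constant(2.0))   # same input signature, so no retrace
print(trace_log)      # [0] -- appended once, during the single trace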
Hope that helps.
Answer 3:
An update here for TF2.0 users: you can parallelize the operator inside tf.map_fn by wrapping the call in the tf.function decorator:
import tensorflow as tf
import time

x = tf.ones(shape=(10,))

def op(x):
    time.sleep(0.1)
    return 1.

_ = tf.map_fn(op, x, parallel_iterations=10)  # will take 1 sec, along with the
                                              # warning message

# Now wrap tf.map_fn inside tf.function
@tf.function
def my_map(*args, **kwargs):
    return tf.map_fn(*args, **kwargs)

_ = my_map(op, x, parallel_iterations=10)  # will take 0.1 sec, with no
                                           # warning message
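A caveat worth checking yourself (this timing sketch is mine, not the original answerer's; as Answer 1 explains, the 0.1 sec above comes largely from op's Python body running once while the graph is traced):

t0 = time.time()
_ = my_map(op, x, parallel_iterations=10)         # first call traces; op's
print("first call:  %.2fs" % (time.time() - t0))  # sleep runs once, ~0.1 s

t0 = time.time()
_ = my_map(op, x, parallel_iterations=10)         # already traced: near-instant;
print("second call: %.2fs" % (time.time() - t0))  # the sleep is not in the graph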
Answer 4:
Let me update this question's answers for TF2.1.
Since TF2.x, graph-building statements are executed in eager mode by default, but some tf functions can still run in parallel, as they naturally do in session (graph) mode. One simple solution is to use tf.function to convert an eagerly running Python function to graph (session) mode, without changing the programming model of the whole program from eager to session mode.
@Rémy Dubois's solution works fine in TF2.1:
@tf.function
def my_map(*args, **kwargs):
    return tf.map_fn(*args, **kwargs)

_ = my_map(op, x, parallel_iterations=10)  # will take 0.1 sec, with no
                                           # warning message
We can also convert the my_map function dynamically with tf.function(), e.g.:
def my_map(*args, **kwargs):
    return tf.map_fn(*args, **kwargs)

my_map = tf.function(my_map)

_ = my_map(op, x, parallel_iterations=10)  # will take 0.1 sec, with no
                                           # warning message
Both of the above solutions work correctly in TF2.1. Note that the TF2.x warning emitted by tf.map_fn() about parallel_iterations is outdated:

Setting parallel_iterations > 1 has no effect when executing eagerly. Consider calling map_fn with tf.contrib.eager.defun to execute fn in parallel.

since tf.contrib.eager.defun has been replaced by tf.function.
Source: https://stackoverflow.com/questions/52774351/how-to-run-parallel-map-fn-when-eager-execution-enabled