TensorFlow with 2 GPUs slower than single GPU

Asked by 我在风中等你 on 2020-12-11 07:57

I have two GPUs (a Titan X (Pascal) and a GTX 1080). I am trying to run a single-threaded graph computation. The graph is two separate matrix multiplication chains, each assigned to one GPU.

2 Answers
  • 2020-12-11 08:12

    Isn't it because you need to transfer data between the GPUs when you compute C? Can you try putting C on the CPU?

    with tf.device('/cpu:0'):
      C = tf.matmul(B1, B2)
    
  • 2020-12-11 08:15

    There's a significant delay when launching a kernel for the first time on a GPU, possibly caused by PTXAS compilation. This delay can be on the order of seconds, and it accumulates when you use more than one GPU, so in your case the run is slower because the time is dominated by an extra "initial kernel launch" cost. One way to benchmark pure computation time is to "pre-warm" by executing each CUDA operation at least once on each GPU. I observed the same slowness when running your benchmark on two Titan X cards, but the delay disappeared once I pre-warmed the kernels.
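
    The warm-then-measure idea isn't TensorFlow-specific. Here's a toy, dependency-free Python sketch (the `FakeDevice` class and its costs are invented purely for illustration) showing how an untimed warm-up run separates a one-time launch cost from the steady-state time you actually want to measure:

    ```python
    import time

    class FakeDevice:
        """Toy stand-in for a GPU: the first kernel launch pays a one-time
        'compilation' cost; later launches are fast."""
        def __init__(self, setup_cost=0.05):
            self.setup_cost = setup_cost
            self.warmed = False

        def launch(self):
            if not self.warmed:
                time.sleep(self.setup_cost)  # one-time, PTXAS-like delay
                self.warmed = True
            return 42  # the "computation" result

    def benchmark(devices, warm=True):
        """Time one launch on every device, optionally pre-warming first."""
        if warm:
            for d in devices:
                d.launch()  # pre-warm: pay the one-time cost outside the timer
        start = time.perf_counter()
        results = [d.launch() for d in devices]
        return results, time.perf_counter() - start

    # Cold run: the timer includes two setup delays, one per device.
    cold_results, t_cold = benchmark([FakeDevice(), FakeDevice()], warm=False)
    # Warm run: the timer sees only steady-state launches.
    warm_results, t_warm = benchmark([FakeDevice(), FakeDevice()], warm=True)
    ```

    With more devices the cold-run overhead grows linearly, which is exactly why a 2-GPU graph can look slower than a 1-GPU one if you time the very first `sess.run`.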

    Here's the timeline before pre-warming: (timeline screenshot not preserved)

    Here's the timeline after pre-warming: (timeline screenshot not preserved)

    Below is your code modified to do pre-warming, and also to remove any TensorFlow<->Python transfers.

    import tensorflow as tf

    from tensorflow.python.client import timeline
    import logging, time

    def test():
        n = 5000

        # Chain 1 lives entirely on the first GPU.
        with tf.device('/gpu:0'):
            A1 = tf.Variable(tf.ones([n, n]), name='A1')
            B1 = A1
            for l in range(10):
                B1 = tf.matmul(A1, B1, name="chain1")

        # Chain 2 lives on the second GPU; C joins the two chains.
        with tf.device('/gpu:1'):
            A2 = tf.Variable(tf.ones([n, n]), name='A2')
            B2 = A2
            for l in range(10):
                B2 = tf.matmul(A2, B2, name="chain2")
            C = tf.matmul(B1, B2)

        run_metadata = tf.RunMetadata()
        sess = tf.InteractiveSession(config=tf.ConfigProto(
            allow_soft_placement=False, log_device_placement=True))
        sess.run(tf.global_variables_initializer())

        # Warm-up run: absorbs the one-time kernel-launch/compilation cost.
        # Running C.op (not C) avoids transferring the result back to Python.
        sess.run([C.op],
                 options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
                 run_metadata=run_metadata)

        # Timed run: measures steady-state computation only.
        run_metadata = tf.RunMetadata()
        start = time.time()
        sess.run([C.op],
                 options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
                 run_metadata=run_metadata)
        end = time.time()
        logging.info('computed in %f seconds', end - start)

        logging.info('writing trace')
        trace = timeline.Timeline(step_stats=run_metadata.step_stats)
        with open('timeline.ctf.json', 'w') as trace_file:
            trace_file.write(trace.generate_chrome_trace_format(show_memory=True))
        logging.info('trace written')


    if __name__ == "__main__":
        logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
        test()