Benchmark of HowTo: Reading Data

Submitted by 北战南征 on 2019-12-17 11:09:07

Question


I'm using TensorFlow 0.10 and I was benchmarking the examples found in the official HowTo on reading data. This HowTo illustrates different methods to move data into TensorFlow, using the same MNIST example.

I was surprised by the results and I was wondering if anyone has enough low-level understanding to explain what is happening.

In the HowTo there are basically three methods to read in data (a minimal sketch of the first, Feeding, follows the list):

  • Feeding: building the mini-batch in Python and passing it with sess.run(..., feed_dict={x: mini_batch})
  • Reading from files: use TF operations to open the files and create mini-batches (bypassing data handling in Python).
  • Preloaded data: load all the data into either a single TF variable or constant and use TF functions to break that up into mini-batches. The variable or constant is pinned to the CPU, not the GPU.
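
For concreteness, here is a minimal sketch of the feeding approach, with a stand-in linear model and random numpy arrays in place of the real MNIST batches (the variable names and the model are illustrative, not taken from the tutorial scripts):

import numpy as np
import tensorflow as tf

# Stand-in linear model; the real script builds a two-hidden-layer MLP.
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.int32, shape=[None])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x, W) + b
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y_))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for _ in range(100):
        # Build the mini-batch in Python (random data here stands in for
        # data_sets.train.next_batch(100) in fully_connected_feed.py).
        images = np.random.rand(100, 784).astype(np.float32)
        labels = np.random.randint(0, 10, size=100).astype(np.int32)
        sess.run(train_op, feed_dict={x: images, y_: labels})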

The scripts I used to run my benchmarks are found within tensorflow:

  • Feeding: examples/tutorials/mnist/fully_connected_feed.py
  • Reading from files: examples/how_tos/reading_data/convert_to_records.py and examples/how_tos/reading_data/fully_connected_reader.py
  • Preloaded data (constant): examples/how_tos/reading_data/fully_connected_preloaded.py
  • Preloaded data (variable): examples/how_tos/reading_data/fully_connected_preloaded_var.py

I ran those scripts unmodified, except for the last two, which crash (at least on version 0.10) unless I add an extra sess.run(tf.initialize_local_variables()).
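
For reference, the workaround is just one extra initialization call before the queue runners are started; roughly (a sketch of the session setup, mirroring what the preloaded scripts do):

import tensorflow as tf

sess = tf.Session()
sess.run(tf.initialize_all_variables())
# Extra call needed on 0.10, otherwise the preloaded examples crash:
sess.run(tf.initialize_local_variables())
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)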

Main Question

The time to execute 100 mini-batches of 100 examples running on a GTX1060:

  • Feeding: ~0.001 s
  • Reading from files: ~0.010 s
  • Preloaded data (constant): ~0.010 s
  • Preloaded data (variable): ~0.010 s

Those results are quite surprising to me. I would have expected Feeding to be the slowest, since it does almost everything in Python, while the other methods use lower-level TensorFlow/C++ to carry out the same operations. It is the complete opposite of what I expected. Does anyone understand what is going on?

Secondary question

I have access to another machine with a Titan X and older NVIDIA drivers. The relative results were roughly in line with the above, except for Preloaded data (constant), which was catastrophically slow, taking many seconds for a single mini-batch.

Is it a known issue that performance can vary so much across hardware/driver combinations?


Answer 1:


Update Oct 9: the slowness comes from the computation running too fast for Python to pre-empt the computation thread and schedule the pre-fetching threads. Computation in the main thread takes 2 ms, and apparently that's too little for a pre-fetching thread to grab the GIL. The pre-fetching threads have larger delays and hence can always be pre-empted by the computation thread. So the computation thread runs through all of the examples and then spends most of its time blocked on the GIL while some prefetching thread gets scheduled and enqueues a single example. The solution is to increase the number of Python threads, increase the queue size to fit the entire dataset, start the queue runners, and then pause the main thread for a couple of seconds to give the queue runners time to pre-populate the queue.
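
A rough sketch of that recipe, using a small in-memory stand-in for MNIST (the dataset, sizes, thread count and sleep duration below are illustrative, not tuned values):

import time
import numpy as np
import tensorflow as tf

# Toy stand-in dataset (the real scripts use the 55000 MNIST training images).
images = np.random.rand(1000, 784).astype(np.float32)
labels = np.random.randint(0, 10, size=1000).astype(np.int32)

# Queue capacities sized to hold the whole dataset so the runners can
# pre-populate it before the training loop starts.
image, label = tf.train.slice_input_producer(
    [tf.constant(images), tf.constant(labels)], capacity=1000)
batch_images, batch_labels = tf.train.batch(
    [image, label], batch_size=100, num_threads=5, capacity=1000)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
sess.run(tf.initialize_local_variables())
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

# Pause the main thread so the prefetching threads can fill the queue
# before the (much faster) computation thread starts competing for the GIL.
time.sleep(5)

for _ in range(10):
    sess.run([batch_images, batch_labels])

coord.request_stop()
coord.join(threads)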

Old stuff

That's surprisingly slow.

This looks like some kind of special case making the last 3 examples unnecessarily slow (most optimization effort went into large models like ImageNet, so MNIST didn't get as much attention).

You can diagnose the problem by getting timelines, as described here.
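
Collecting a timeline boils down to passing RunOptions with full tracing to session.run and dumping the step stats with the timeline module; a minimal sketch (the matmul here is just a stand-in for the training step):

import tensorflow as tf
from tensorflow.python.client import timeline

a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])
c = tf.matmul(a, b)

with tf.Session() as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(c, options=run_options, run_metadata=run_metadata)

    # Write a Chrome trace that chrome://tracing can load and display.
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())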

Here are 3 of those examples with timeline collection enabled.

Here's the timeline for the feed_dict implementation.

The important thing to notice is that matmul takes a good chunk of the time, so the reading overhead is not significant.

Now here's the timeline for the reader implementation.

You can see that the operation is bottlenecked on QueueDequeueMany, which takes a whopping 45 ms.

If you zoom in, you'll see a bunch of tiny MEMCPY and Cast operations, which is a sign of some op being CPU-only (parse_single_example), and of the dequeue having to schedule multiple independent CPU->GPU transfers.

For the var example below with the GPU disabled, I don't see the tiny little ops, but QueueDequeueMany still takes over 10 ms. The timing seems to scale linearly with batch size, so there is some fundamental slowness there. Filed #4740




Answer 2:


Yaroslav nails the problem well. With small models you'll need to speed up the data import. One way to do this is with the TensorFlow function tf.TFRecordReader.read_up_to, which reads multiple records in each session.run() call, thereby removing the excess overhead caused by multiple calls.

import tensorflow as tf

# filename_queue holds the input TFRecord paths; SOME_ENQUEUE_MANY_SIZE,
# batch_size, thread_number, capacity and min_after_dequeue are tuning
# parameters to set for your own data.
filename_queue = tf.train.string_input_producer(filenames)

enqueue_many_size = SOME_ENQUEUE_MANY_SIZE
reader = tf.TFRecordReader(
    options=tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB))
# read_up_to returns up to enqueue_many_size serialized records per call.
_, queue_batch = reader.read_up_to(filename_queue, enqueue_many_size)
batch_serialized_example = tf.train.shuffle_batch(
    [queue_batch],
    batch_size=batch_size,
    num_threads=thread_number,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue,
    enqueue_many=True)
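
Since shuffle_batch now emits a whole batch of serialized records, the per-example parsing turns into a single tf.parse_example call on the batch; a sketch, assuming the image_raw/label feature layout used by convert_to_records.py:

# Parse all serialized Examples in the batch at once (feature names follow
# the MNIST convert_to_records.py convention; adjust for your own records).
features = tf.parse_example(
    batch_serialized_example,
    features={
        'image_raw': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
images = tf.decode_raw(features['image_raw'], tf.uint8)
images = tf.cast(images, tf.float32) * (1.0 / 255) - 0.5
labels = tf.cast(features['label'], tf.int32)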

This was also addressed in this SO question.




Answer 3:


The main question is why the preloaded data (constant) example, examples/how_tos/reading_data/fully_connected_preloaded.py, is significantly slower than the other data loading examples when using a GPU.

I had the same problem: fully_connected_preloaded.py was unexpectedly slow on my Titan X. The problem was that the whole dataset was pre-loaded onto the CPU, not the GPU.

First, let me share my initial attempts. I applied the following performance tips by Yaroslav.

  • set capacity=55000 for tf.train.slice_input_producer (55000 is the size of the MNIST training set in my case).
  • set num_threads=5 for tf.train.batch.
  • set capacity=500 for tf.train.batch.
  • put time.sleep(10) after tf.train.start_queue_runners.

However, the average speed per batch stayed the same. I tried timeline visualization for profiling and still saw QueueDequeueManyV2 dominating.

The problem was line 65 of fully_connected_preloaded.py. The following code loads the entire dataset onto the CPU, which still leaves a bottleneck for CPU-to-GPU data transfer.

with tf.device('/cpu:0'):
    input_images = tf.constant(data_sets.train.images)
    input_labels = tf.constant(data_sets.train.labels)

Hence, I switched the device placement:

with tf.device('/gpu:0'):

Then I got a roughly 100x speed-up per batch.
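
In other words, the modified lines look roughly like this (only the device string changes; the rest of the script stays the same):

# Preload the dataset into GPU memory instead of pinning it to the CPU
# (feasible here because the whole MNIST training set fits on a Titan X).
with tf.device('/gpu:0'):
    input_images = tf.constant(data_sets.train.images)
    input_labels = tf.constant(data_sets.train.labels)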

Note:

  1. This was possible because the Titan X has enough memory to preload the entire dataset.
  2. In the original code (fully_connected_preloaded.py), the comment on line 64 says "rest of pipeline is CPU-only". I am not sure what this comment intended.


Source: https://stackoverflow.com/questions/39840323/benchmark-of-howto-reading-data
