Benchmark of HowTo: Reading Data

Backend · Open · 3 answers · 1342 views
庸人自扰 2020-12-03 08:41

I'm using TensorFlow 0.10 and I was benchmarking the examples found in the official HowTo on reading data. This HowTo illustrates different methods of moving data into TensorFlow.

3 Answers
  •  醉话见心
    2020-12-03 09:19

    Update Oct 9: the slowness comes from the computation running too fast for Python to pre-empt the computation thread and schedule the pre-fetching threads. Computation in the main thread takes 2ms, and apparently that's too little for a pre-fetching thread to grab the GIL. The pre-fetching threads have larger delays and hence can always be pre-empted by the computation thread. So the computation thread runs through all of the examples, and then spends most of its time blocked on the GIL as some prefetching thread gets scheduled and enqueues a single example. The solution is to increase the number of Python threads, increase the queue size to fit the entire dataset, start the queue runners, and then pause the main thread for a couple of seconds to give the queue runners time to pre-populate the queue.
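The recipe above (large queue, start the runner threads, pause, then consume) can be sketched with plain Python threads and a `queue.Queue` standing in for TensorFlow's queue ops; the `run` helper and its parameters are hypothetical names for illustration, not part of any TF API:

```python
import queue
import threading
import time

def run(prefetch_pause):
    """Consume 100 items produced by 2 background threads.

    With prefetch_pause=True, the main thread sleeps after starting the
    producers so they can pre-populate the queue before consumption begins
    (analogous to pausing after tf.train.start_queue_runners in TF 0.10).
    """
    q = queue.Queue(maxsize=100)  # large enough to hold the whole "dataset"

    def producer():
        for i in range(50):
            q.put(i)  # stands in for a TF enqueue op

    threads = [threading.Thread(target=producer) for _ in range(2)]
    for t in threads:
        t.start()  # analogous to starting the queue runners

    if prefetch_pause:
        time.sleep(0.5)  # give producers time to fill the queue

    consumed = [q.get() for _ in range(100)]
    for t in threads:
        t.join()
    return consumed

items = run(prefetch_pause=True)
print(len(items))  # 100
```

Without the pause, the consumer loop starts dequeuing immediately and (in the TF 0.10 case described above) tends to block on an almost-empty queue, waiting for a producer to win the GIL long enough to enqueue one example at a time.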

    Old stuff

    That's surprisingly slow.

    This looks like some kind of special case making the last 3 examples unnecessarily slow (most effort went into optimizing large models like ImageNet, so MNIST didn't get as much attention).

    You can diagnose the problems by getting timelines, as described here
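Since the "described here" link did not survive the scrape: in the graph-mode TF of that era (0.x/1.x), timelines were collected by passing `RunOptions`/`RunMetadata` into `sess.run` and dumping a Chrome trace. A fragment of that pattern is sketched below; `sess` and `train_op` are placeholders for your own session and op, and this will not run under TF 2.x eager mode:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# ... build your graph and create `sess`, `train_op` ...

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Write a Chrome-trace JSON; open it at chrome://tracing to inspect op timings
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())
```

The resulting trace is what makes ops like `QueueDequeueMany` and the tiny `MEMCPY`/`Cast` operations mentioned below visible with per-op durations.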

    Here are 3 of those examples with timeline collection enabled.

    Here's the timeline for the feed_dict implementation.

    The important thing to notice is that matmul takes a good chunk of the time, so the reading overhead is not significant.

    Now here's the timeline for the reader implementation.

    You can see that the operation is bottlenecked on QueueDequeueMany, which takes a whopping 45ms.

    If you zoom in, you'll see a bunch of tiny MEMCPY and Cast operations, which is a sign that some op is CPU-only (parse_single_example) and that the dequeue has to schedule multiple independent CPU->GPU transfers.

    For the var example below with GPU disabled, I don't see tiny little ops, but QueueDequeueMany still takes over 10ms. The timing seems to scale linearly with batch size, so there's some fundamental slowness there. Filed #4740
