What are the scenarios for which the various TensorFlow data loading idioms apply?


Yaroslav has already told you about feeding and queues and touched upon Datasets. Just a few of my own thoughts:

  • If you just want to learn TF or to quickly experiment with various models, feed_dict gives you a quick way to do that (a minimal sketch follows this list). There is a performance downside, and that is why queues exist.
  • Queues let you specify TF ops that bypass the Python -> native-TF -> Python round trip and the GIL. The big problem with queues is that they are hard to use (I always struggled a lot before I could consume my data correctly). Many other people struggled too, and you can see some examples of the problems here.
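
To make the feed_dict idiom concrete, here is a minimal sketch (the toy model, shapes, and random data are placeholders of my own, not anything from the answers above):

import numpy as np
import tensorflow as tf

# placeholders are the entry points for data pushed in from Python
x = tf.placeholder(tf.float32, shape=[None, 784])  # assumed input shape
y = tf.placeholder(tf.int64, shape=[None])         # assumed label shape

logits = tf.layers.dense(x, 10)                    # toy model
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # every step crosses the Python -> native-TF boundary to copy the batch in
    batch_x = np.random.rand(32, 784)
    batch_y = np.random.randint(10, size=32)
    sess.run(train_op, feed_dict={x: batch_x, y: batch_y})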

The newly introduced Datasets API (for some reason there is no link from the official website yet; one will probably be added with TF 1.3) solves many of these problems. It is very easy to use (check the examples at the end of the page) and the code is very simple and short. Here is an example:

import glob
import tensorflow as tf

batch_size, shuffle_num, repeat_num = 32, 10000, 10  # example values

def parser(record):
    # parse the record with tf.parse_single_example
    # (the feature spec below is a placeholder; adjust it to your records)
    features = tf.parse_single_example(record, {
        "image": tf.FixedLenFeature([], tf.string),
        "label": tf.FixedLenFeature([], tf.int64),
    })
    return features["image"], features["label"]

# shuffle individual examples before batching, then repeat for several epochs
iterator = tf.contrib.data.TFRecordDataset(
    glob.glob("data/tfrecords/training_shard_*.tfrecord")
).map(parser).shuffle(shuffle_num).repeat(repeat_num).batch(batch_size).make_initializable_iterator()
next_element = iterator.get_next()

...
with tf.Session() as sess:
    sess.run(iterator.initializer)

    for i in range(100000):
        sess.run(next_element)

These few lines replace roughly four times as many lines of queue-based code. Getting them to work is also easier than with queues (almost as easy as feed_dict). So my opinion now is that there is no place for queues any more: use either feed_dict or Datasets.

Feeding data as numpy arrays is part of the official API, so it is appropriate to rely on it. The official convolutional MNIST example feeds data as numpy arrays, and there is no speed benefit in moving it to queues. This is the first data-loading idiom added to TensorFlow.

The Python runtime has the GIL and other features that make it perform poorly in a multicore environment, and that becomes a bottleneck when there are large volumes of data to ingest. This is solved by moving the Python bits (i.e., opening files) into native TensorFlow ops, so those operations can be dispatched by the parallel TensorFlow runtime rather than by the Python runtime.

The pipeline approach moves all operations into TensorFlow ops, decoupled through "queue" stages, and uses Python threads to issue session.run calls that keep the queues filled. This is the second data-loading idiom added to TensorFlow (a sketch of the typical pattern follows).
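
For reference, here is a sketch of what such a queue pipeline typically looked like (the file pattern and feature spec are assumptions of mine; the pattern itself, string_input_producer plus a native reader op plus shuffle_batch, with queue runners supplying the Python threads, is the standard one):

import tensorflow as tf

# native TF ops open and read the files, so Python never touches the bytes
filename_queue = tf.train.string_input_producer(
    tf.train.match_filenames_once("data/tfrecords/training_shard_*.tfrecord"))
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

features = tf.parse_single_example(serialized, {
    "image": tf.FixedLenFeature([], tf.string),  # assumed feature spec
    "label": tf.FixedLenFeature([], tf.int64),
})

# shuffle_batch adds another queue stage plus the threads that fill it
images, labels = tf.train.shuffle_batch(
    [features["image"], features["label"]],
    batch_size=32, capacity=2000, min_after_dequeue=1000)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(),
              tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for _ in range(100):
        sess.run([images, labels])
    coord.request_stop()
    coord.join(threads)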

This removed a lot of the Python bits, but for high-performance applications the remaining Python parts were still a bottleneck (e.g., the examples here and here), so the next generation of ops (StageOp/Dataset) was introduced to remove the need for the extra Python threads. This is the latest data-loading idiom.
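
As a rough sketch of the StageOp idea (an illustration of my own, not the code behind the linked benchmarks): a StagingArea buffers batches so that each step consumes one batch while a plain TF op stages the next, with no extra Python threads involved:

import tensorflow as tf

# stand-ins for a real input pipeline (assumed shapes)
images = tf.random_uniform([32, 224, 224, 3])
labels = tf.random_uniform([32], maxval=10, dtype=tf.int32)

# the StagingArea buffers one batch between the input pipeline and the
# compute step; put and get are plain TF ops
area = tf.contrib.staging.StagingArea(
    dtypes=[tf.float32, tf.int32],
    shapes=[[32, 224, 224, 3], [32]])
stage = area.put([images, labels])         # prefetch the next batch
staged_images, staged_labels = area.get()  # consume the staged batch
train_op = tf.reduce_mean(staged_images)   # stand-in for the training step

with tf.Session() as sess:
    sess.run(stage)  # warm-up: stage the first batch
    for _ in range(100):
        # each step runs the model on one batch while staging the next
        sess.run([train_op, stage])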

As a concrete example, to reproduce the official 60x speed-up on 64 GPUs on ImageNet you would have to use the latest generation of input loading, but for less intensive tasks you could fall back to the second- or first-generation idioms.
