What are the scenarios for which the various TensorFlow data loading idioms apply?


Question


I have a TensorFlow deep learning workflow in which I have a fairly simple data reading and feeding pipeline built using regular NumPy; but I see that TensorFlow offers a large number of functions for loading data and building a data pipeline, and I wonder what scenarios these target. It seems there are two:

  1. learning that involves very large real world datasets, and
  2. networks built with the high-level TensorFlow API.

It seems that the benefits of using "reading" as opposed to "feeding" (e.g. functions such as tf.train.shuffle_batch, but even simple helpers like tf.one_hot) apply to the former, while much of the documentation for things like input functions seems targeted exclusively at the latter.

In my case, where I'm working with files that can easily be read using standard Python and can efficiently be loaded into memory all at once, feeding simply using np.arrays works fine, and the other approaches seem either far too complex (reading pipelines, which are actually quite slow, to the extent I've tried them) or inappropriate (the "high-level" API, given that I mostly use the "low-level" API).
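
For concreteness, here is a minimal sketch of what I mean by feeding (the file names, placeholder shapes, and toy model are illustrative stand-ins, not my actual code):

import numpy as np
import tensorflow as tf

# Toy model; the placeholder shapes are illustrative.
x = tf.placeholder(tf.float32, shape=[None, 784])
y = tf.placeholder(tf.int64, shape=[None])
logits = tf.layers.dense(x, 10)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# The whole dataset fits in memory; the .npy file names are hypothetical.
features = np.load("features.npy")
labels = np.load("labels.npy")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for start in range(0, len(features), 128):
        batch = slice(start, start + 128)
        sess.run(train_op, feed_dict={x: features[batch], y: labels[batch]})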

Am I correct in concluding that, given that I already use TensorFlow proper (not the high-level learning API) and that feeding NumPy arrays meets my needs, there's no good reason to bother with either of the alternative approaches? Is it fair to say that the intended targets for these approaches are different from mine?

Or is there another taxonomy that better factors the various TensorFlow data loading idioms and the scenarios to which they apply?


Answer 1:


Yaroslav has already told you about feeding and queues, and touched on datasets. Just a few thoughts of my own:

  • if you just want to learn TF or want to quickly experiment with your various models, feed_dict gives you a quick way to do this. There is a performance downside, and this is why queues exist
  • queues let you specify TF ops that bypass the python -> native_TF -> python round trip and the GIL. The big problem with queues is that they are hard to use (I always struggled a lot before being able to read my data correctly). Many other people struggled too, and you can see some examples of the problems here (a minimal queue pipeline is sketched right after this list)
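
To make that concrete, here is a minimal sketch of a queue-based reading pipeline (the shard pattern and feature spec are illustrative); note how much ceremony is needed just to get shuffled batches:

import glob
import tensorflow as tf

# Filename queue feeding a TFRecord reader; one example per read.
filename_queue = tf.train.string_input_producer(
    glob.glob("data/tfrecords/training_shard_*.tfrecord"))
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(serialized, features={
    "x": tf.FixedLenFeature([784], tf.float32),
    "y": tf.FixedLenFeature([], tf.int64)})
x_batch, y_batch = tf.train.shuffle_batch(
    [features["x"], features["y"]],
    batch_size=128, capacity=2000, min_after_dequeue=1000)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Queue runners are background Python threads that keep the queues full.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # ... run training steps that consume x_batch / y_batch ...
    coord.request_stop()
    coord.join(threads)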

The newly introduced Datasets API (for some reason there is no link from the official website; one will probably be added with TF 1.3) solves many of these problems. It is very easy to use (check the examples at the end of that page) and the code is very simple and short. Here is an example:

import glob
import tensorflow as tf

batch_size, shuffle_num, repeat_num = 128, 10000, 100  # illustrative values

def parser(record):
    # Parse the record with tf.parse_single_example (feature spec is illustrative).
    features = tf.parse_single_example(record, features={
        "x": tf.FixedLenFeature([784], tf.float32),
        "y": tf.FixedLenFeature([], tf.int64)})
    return features["x"], features["y"]

# Shuffle individual examples before batching, then repeat across epochs.
iterator = tf.contrib.data.TFRecordDataset(
    glob.glob("data/tfrecords/training_shard_*.tfrecord")
).map(parser).shuffle(shuffle_num).batch(batch_size).repeat(repeat_num).make_initializable_iterator()
next_element = iterator.get_next()

...
with tf.Session() as sess:
    sess.run(iterator.initializer)

    for i in range(100000):
        sess.run(next_element)

These few lines replace roughly four times as many lines of queue-based code. Getting them to work is also easier than with queues (almost as easy as feed_dict). So my opinion now is that there is no place for queues any more: either use feed_dict or datasets.




Answer 2:


Feeding data as NumPy arrays is part of the official API, so it is appropriate to rely on it. The official convolutional MNIST example feeds data as NumPy arrays, and there's no speed benefit in moving it to queues. This was the first data-loading idiom added to TensorFlow.

The Python runtime has a GIL and other features that make it perform poorly in multicore environments, and that becomes a bottleneck when there are large volumes of data to ingest. This is solved by moving the Python bits (e.g., opening files) into native TensorFlow ops, so those operations can be dispatched by the parallel TensorFlow runtime rather than by the Python runtime.

This pipeline approach moves all operations into TensorFlow ops, decoupled through queue stages, and uses Python threads to issue session.run calls that keep the queues filled. This was the second data-loading idiom added to TensorFlow.

This removed a lot of the Python bits, but for high-performance applications the remaining Python parts were still a bottleneck (e.g., the examples here and here), so to solve these problems the next generation of ops was introduced (StageOp/Dataset), which removed the need for the extra Python threads. This is the latest data-loading idiom.
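
As a rough sketch of that third generation (using the TF 1.2-era tf.contrib.data API; the in-memory arrays are stand-ins for real data), note that no Python feeder threads are involved; the TensorFlow runtime drives the pipeline itself:

import numpy as np
import tensorflow as tf

# Stand-in data; a real pipeline would read from files instead.
features = np.random.rand(1000, 784).astype(np.float32)
labels = np.random.randint(10, size=1000).astype(np.int64)

dataset = (tf.contrib.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(buffer_size=1000)
           .batch(128)
           .repeat())
x_batch, y_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    print(sess.run(y_batch).shape)  # (128,)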

As a concrete example, to reproduce the official 60x speed-up on 64 GPUs on ImageNet, you'd have to use the latest generation of input loading, but for less intensive tasks you could use the second- or first-generation idioms.



Source: https://stackoverflow.com/questions/45393594/what-are-the-scenarios-for-which-the-various-tensorflow-data-loading-idioms-appl
