tensorflow-datasets

TensorFlow - tf.data.Dataset reading large HDF5 files

混江龙づ霸主 submitted on 2019-11-30 04:50:51
I am setting up a TensorFlow pipeline for reading large HDF5 files as input for my deep learning models. Each HDF5 file contains 100 videos of variable length stored as collections of compressed JPG images (to keep the size on disk manageable). Using tf.data.Dataset and a map to tf.py_func, reading examples from the HDF5 file with custom Python logic is quite easy. For example:

def read_examples_hdf5(filename, label):
    with h5py.File(filename, 'r') as hf:
        # read frames from HDF5 and decode them from JPG
        return frames, label

filenames = glob.glob(os.path.join(hdf5_data_path, "*.h5"))
labels
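A minimal sketch of how such a pipeline could be wired up. The HDF5 dataset name "frames", the uint8 frame layout, and the placeholder labels are assumptions on my part, not details taken from the post:

import glob
import os

import h5py
import numpy as np
import tensorflow as tf

def read_examples_hdf5(filename, label):
    # tf.py_func hands the filename over as a numpy bytes object.
    with h5py.File(filename.decode(), 'r') as hf:
        frames = hf['frames'][()]  # assumed: frames stored as a uint8 array
    return frames.astype(np.uint8), np.int64(label)

filenames = glob.glob(os.path.join('/path/to/hdf5', '*.h5'))
labels = np.arange(len(filenames), dtype=np.int64)  # placeholder labels

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(
    lambda filename, label: tuple(tf.py_func(
        read_examples_hdf5, [filename, label], [tf.uint8, tf.int64])))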

Oversampling functionality in Tensorflow dataset API

别来无恙 submitted on 2019-11-30 04:22:22
Question: I would like to ask whether the current Dataset API allows for the implementation of an oversampling algorithm? I am dealing with a highly imbalanced class problem. I was thinking it would be nice to oversample specific classes during dataset parsing, i.e. online generation. I've seen the implementation of the rejection_resample function, however it removes samples instead of duplicating them, and it slows down batch generation (when the target distribution is very different from the initial one). The thing I would
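One way the duplication could be done online, sketched under the assumption that elements are (features, label) pairs and that label 1 is the minority class (neither of which comes from the post):

import tensorflow as tf

def oversample_minority(features, label, factor=4):
    # Emit the element `factor` times if it carries the (assumed) minority
    # label 1, otherwise emit it once.
    repeats = tf.cond(tf.equal(label, 1),
                      lambda: tf.constant(factor, dtype=tf.int64),
                      lambda: tf.constant(1, dtype=tf.int64))
    return tf.data.Dataset.from_tensors((features, label)).repeat(repeats)

features = [[0.1], [0.2], [0.3], [0.4]]
labels = [0, 0, 0, 1]
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.flat_map(oversample_minority)
dataset = dataset.shuffle(buffer_size=100)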

Tensorflow 1.10 TFRecordDataset - recovering TFRecords

♀尐吖头ヾ submitted on 2019-11-30 03:26:28
Question: Notes: this question extends a previous question of mine. In that question I ask about the best way to store some dummy data as Example and SequenceExample, seeking to know which is better for data similar to the dummy data provided. I provide explicit formulations of both the Example and SequenceExample construction, as well as, in the answers, a programmatic way to do so. Because this is still a lot of code, I am providing a Colab (an interactive Jupyter notebook hosted by Google) file where you
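For reference, recovering records from a TFRecord file in TF 1.10 generally looks like the sketch below; the file name and the single int64 feature are illustrative assumptions, not the question's actual schema:

import tensorflow as tf

def _parse(serialized):
    features = {"value": tf.FixedLenFeature([], tf.int64)}
    return tf.parse_single_example(serialized, features)

dataset = tf.data.TFRecordDataset("data.tfrecord").map(_parse)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))  # e.g. {'value': ...}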

Dataset API 'flat_map' method producing error for same code which works with 'map' method

孤人 submitted on 2019-11-29 23:23:55
Question: I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. However, using the flat_map method produces errors, whereas with the map method I am able to build the code and run it in a session. This is the code I am using. I already opened issue #17415 in the TensorFlow GitHub repository, but apparently it is not a bug and they asked me to post here.

folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data
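The usual distinction, sketched below with assumed file names and an assumed three-column layout rather than the poster's data, is that the function passed to flat_map must return a tf.data.Dataset, while the function passed to map may return plain tensors:

import pandas as pd
import tensorflow as tf

def _read_csv(filename):
    # tf.py_func passes the filename as bytes; pandas wants a str.
    return pd.read_csv(filename.decode(), header=None).values.astype('float32')

def file_to_dataset(filename):
    rows = tf.py_func(_read_csv, [filename], tf.float32)
    rows.set_shape([None, 3])  # assumed: three columns per row
    return tf.data.Dataset.from_tensor_slices(rows)

filenames = tf.data.Dataset.from_tensor_slices(['a.csv', 'b.csv'])
dataset = filenames.flat_map(file_to_dataset)  # one element per CSV row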

tf.contrib.data.Dataset does not seem to support SparseTensor

牧云@^-^@ submitted on 2019-11-29 21:08:10
Question: I generated a Pascal VOC 2007 tfrecords file using the code in the TensorFlow object detection API. I use the tf.contrib.data.Dataset API to read data from the tfrecords. I tried a method without the tf.contrib.data.Dataset API, and that code runs without any error, but when changed to the tf.contrib.data.Dataset API it does not work correctly. The code without tf.contrib.data.Dataset:

import tensorflow as tf

if __name__ == '__main__':
    slim_example_decoder = tf.contrib.slim.tfexample_decoder
    features = {"image
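One common workaround, sketched here with assumed feature keys rather than the exact object detection schema, is to parse variable-length fields as VarLenFeature and densify the resulting SparseTensor inside the map function, so that only dense tensors flow through the pipeline:

import tensorflow as tf

def _parse(serialized):
    features = {
        "image/encoded": tf.FixedLenFeature([], tf.string),
        "image/object/class/label": tf.VarLenFeature(tf.int64),
    }
    parsed = tf.parse_single_example(serialized, features)
    # VarLenFeature yields a SparseTensor; convert it to a dense tensor here.
    labels = tf.sparse_tensor_to_dense(parsed["image/object/class/label"])
    return parsed["image/encoded"], labels

dataset = tf.data.TFRecordDataset("pascal_voc_2007.tfrecord").map(_parse)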

TensorFlow: “Cannot capture a stateful node by value” in tf.contrib.data API

梦想与她 submitted on 2019-11-29 10:50:17
For transfer learning, one often uses a network as a feature extractor to create a dataset of features on which another classifier (e.g. an SVM) is trained. I want to implement this using the Dataset API (tf.contrib.data) and dataset.map():

# feature_extractor will create a CNN on top of the given tensor
def features(feature_extractor, ...):
    dataset = inputs(...)  # This creates a dataset of (image, label) pairs

    def map_example(image, label):
        features = feature_extractor(image, trainable=False)
        # Leaving out initialization from a checkpoint here...
        return features, label

    dataset = dataset
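One way around the "Cannot capture a stateful node by value" error, sketched under the assumption that feature_extractor and inputs are the poster's own functions, is to keep the CNN out of dataset.map() entirely and apply it to the iterator output instead, so its variables live in the main graph:

import tensorflow as tf

def features(feature_extractor, inputs_fn):
    dataset = inputs_fn()  # dataset of (image, label) pairs
    iterator = dataset.make_one_shot_iterator()
    image, label = iterator.get_next()
    # The CNN's variables created here are ordinary graph variables, not
    # values captured inside a dataset map function.
    feats = feature_extractor(image, trainable=False)
    return feats, label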

Feature Columns Embedding lookup

帅比萌擦擦* submitted on 2019-11-29 10:19:33
Question: I have been working with datasets and feature_columns in TensorFlow (https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html). I see they have categorical features and a way to create embedding features from categorical features. But when working on NLP tasks, how do we create a single embedding lookup? For example, consider a text classification task. Every data point would have a lot of textual columns, but they would not be separate categories. How do we create and
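For orientation, a single embedding over the words of a text column can be built with the feature-column API roughly as below; the vocabulary, column name, and dimension are illustrative assumptions:

import tensorflow as tf

words = tf.feature_column.categorical_column_with_vocabulary_list(
    key="text", vocabulary_list=["the", "cat", "sat", "on", "mat"])
text_embedding = tf.feature_column.embedding_column(words, dimension=8)

# A batch with one example containing three words; input_layer averages the
# word embeddings (the default "mean" combiner) into one vector per example.
features = {"text": tf.constant([["the", "cat", "sat"]])}
embedded = tf.feature_column.input_layer(features, [text_embedding])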

In Tensorflow's Dataset API how do you map one element into multiple elements?

≯℡__Kan透↙ submitted on 2019-11-29 09:36:20
In the TensorFlow Dataset pipeline I'd like to define a custom map function which takes a single input element (data sample) and returns multiple elements (data samples). The code below is my attempt, along with the desired results. I could not follow the documentation on tf.data.Dataset().flat_map() well enough to understand whether it was applicable here or not.

import tensorflow as tf

input = [10, 20, 30]

def my_map_func(i):
    return [[i, i+1, i+2]]  # FYI [[i], [i+1], [i+2]] throws an exception

ds = tf.data.Dataset.from_tensor_slices(input)
ds = ds.map(map_func=lambda input: tf.py_func(
    func=my
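A sketch of how flat_map can do this (not necessarily the accepted answer to the post): return a small Dataset from the map function and flat_map flattens it back into individual elements:

import tensorflow as tf

def expand(i):
    # Turn one scalar element into three consecutive scalar elements.
    return tf.data.Dataset.from_tensor_slices(tf.stack([i, i + 1, i + 2]))

ds = tf.data.Dataset.from_tensor_slices([10, 20, 30])
ds = ds.flat_map(expand)

iterator = ds.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    for _ in range(9):
        print(sess.run(next_element))  # 10, 11, 12, 20, 21, 22, 30, 31, 32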

Parallelism isn't reducing the time in dataset map

二次信任 submitted on 2019-11-28 11:14:20
The TF map function supports parallel calls. I'm seeing no improvement when passing num_parallel_calls to map. With num_parallel_calls=1 and num_parallel_calls=10 there is no difference in run time. Here is a simple code:

import time

def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False,
                                         batch_size=1, repeat=1,
                                         num_iterations=10):
    tf.reset_default_graph()
    start = time.time()
    dataset_x = tf.data.Dataset.range(1000).map(lambda x: tf.py_func(
        squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_x = dataset_x.batch(batch
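A possible explanation worth checking (stated as a hypothesis, not the accepted answer): tf.py_func executes Python code under the interpreter's GIL, so parallel map calls that spend their time in pure Python cannot actually overlap. Below is a comparison sketch with a native TensorFlow op in the map function; tf.square is too cheap to show a large speedup on its own, but the structure is the same for heavier native ops, which is where num_parallel_calls can make a difference:

import time

import tensorflow as tf

def benchmark(num_parallel_calls):
    tf.reset_default_graph()
    dataset = tf.data.Dataset.range(1000).map(
        lambda x: tf.square(x), num_parallel_calls=num_parallel_calls)
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()
    start = time.time()
    with tf.Session() as sess:
        try:
            while True:
                sess.run(next_element)
        except tf.errors.OutOfRangeError:
            pass
    return time.time() - start

print(benchmark(1), benchmark(10))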