tensorflow-datasets

TensorFlow - tf.data.Dataset reading large HDF5 files

混江龙づ霸主 submitted on 2019-11-30 04:50:51
I am setting up a TensorFlow pipeline for reading large HDF5 files as input for my deep learning models. Each HDF5 file contains 100 videos of variable length stored as collections of compressed JPG images (to keep the size on disk manageable). Using tf.data.Dataset and a map to tf.py_func, reading examples from the HDF5 file with custom Python logic is quite easy. For example:

def read_examples_hdf5(filename, label):
    with h5py.File(filename, 'r') as hf:
        # read frames from HDF5 and decode them from JPG
        return frames, label

filenames = glob.glob(os.path.join(hdf5_data_path, "*.h5"))
labels
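A minimal sketch of how such a pipeline could be wired up. The HDF5 dataset name "frames", the uint8 frame layout, and the placeholder labels are assumptions on my part, not details taken from the post:

import glob
import os

import h5py
import numpy as np
import tensorflow as tf

def read_examples_hdf5(filename, label):
    # tf.py_func hands the filename over as a numpy bytes object.
    with h5py.File(filename.decode(), 'r') as hf:
        frames = hf['frames'][()]  # assumed: frames stored as a uint8 array
    return frames.astype(np.uint8), np.int64(label)

filenames = glob.glob(os.path.join('/path/to/hdf5', '*.h5'))
labels = np.arange(len(filenames), dtype=np.int64)  # placeholder labels

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(
    lambda filename, label: tuple(tf.py_func(
        read_examples_hdf5, [filename, label], [tf.uint8, tf.int64])))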

Oversampling functionality in Tensorflow dataset API

别来无恙 submitted on 2019-11-30 04:22:22
Question: I would like to ask whether the current Dataset API allows for the implementation of an oversampling algorithm? I am dealing with a highly imbalanced class problem. I was thinking it would be nice to oversample specific classes during dataset parsing, i.e. online generation. I've seen the implementation of the rejection_resample function, however it removes samples instead of duplicating them, and it slows down batch generation (when the target distribution is very different from the initial one). The thing I would
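One way the duplication could be done online, sketched under the assumption that elements are (features, label) pairs and that label 1 is the minority class (neither of which comes from the post):

import tensorflow as tf

def oversample_minority(features, label, factor=4):
    # Emit the element `factor` times if it carries the (assumed) minority
    # label 1, otherwise emit it once.
    repeats = tf.cond(tf.equal(label, 1),
                      lambda: tf.constant(factor, dtype=tf.int64),
                      lambda: tf.constant(1, dtype=tf.int64))
    return tf.data.Dataset.from_tensors((features, label)).repeat(repeats)

features = [[0.1], [0.2], [0.3], [0.4]]
labels = [0, 0, 0, 1]
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.flat_map(oversample_minority)
dataset = dataset.shuffle(buffer_size=100)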

Tensorflow 1.10 TFRecordDataset - recovering TFRecords

♀尐吖头ヾ submitted on 2019-11-30 03:26:28
Question: Notes: this question extends a previous question of mine. In that question I ask about the best way to store some dummy data as Example and SequenceExample, seeking to know which is better for data similar to the dummy data provided. I provide explicit formulations of both the Example and SequenceExample construction, as well as, in the answers, a programmatic way to do so. Because this is still a lot of code, I am providing a Colab (an interactive Jupyter notebook hosted by Google) file where you
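For reference, recovering records from a TFRecord file in TF 1.10 generally looks like the sketch below; the file name and the single int64 feature are illustrative assumptions, not the question's actual schema:

import tensorflow as tf

def _parse(serialized):
    features = {"value": tf.FixedLenFeature([], tf.int64)}
    return tf.parse_single_example(serialized, features)

dataset = tf.data.TFRecordDataset("data.tfrecord").map(_parse)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))  # e.g. {'value': ...}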

Dataset API 'flat_map' method producing error for same code which works with 'map' method

孤人 submitted on 2019-11-29 23:23:55
Question: I am trying to create a pipeline to read multiple CSV files using the TensorFlow Dataset API and Pandas. However, using the flat_map method produces errors, whereas with the map method I am able to build the code and run it in a session. This is the code I am using. I already opened issue #17415 in the TensorFlow GitHub repository, but apparently it is not a bug and they asked me to post here.

folder_name = './data/power_data/'
file_names = os.listdir(folder_name)

def _get_data
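The usual distinction, sketched below with assumed file names and an assumed three-column layout rather than the poster's data, is that the function passed to flat_map must return a tf.data.Dataset, while the function passed to map may return plain tensors:

import pandas as pd
import tensorflow as tf

def _read_csv(filename):
    # tf.py_func passes the filename as bytes; pandas wants a str.
    return pd.read_csv(filename.decode(), header=None).values.astype('float32')

def file_to_dataset(filename):
    rows = tf.py_func(_read_csv, [filename], tf.float32)
    rows.set_shape([None, 3])  # assumed: three columns per row
    return tf.data.Dataset.from_tensor_slices(rows)

filenames = tf.data.Dataset.from_tensor_slices(['a.csv', 'b.csv'])
dataset = filenames.flat_map(file_to_dataset)  # one element per CSV row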

tf.contrib.data.Dataset does not seem to support SparseTensor

牧云@^-^@ submitted on 2019-11-29 21:08:10
Question: I generated a Pascal VOC 2007 tfrecords file using the code in the TensorFlow object detection API. I use the tf.contrib.data.Dataset API to read data from the tfrecords. I tried a method without the tf.contrib.data.Dataset API, and that code runs without any error, but when changed to the tf.contrib.data.Dataset API it does not work correctly. The code without tf.contrib.data.Dataset:

import tensorflow as tf

if __name__ == '__main__':
    slim_example_decoder = tf.contrib.slim.tfexample_decoder
    features = {"image
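One common workaround, sketched here with assumed feature keys rather than the exact object detection schema, is to parse variable-length fields as VarLenFeature and densify the resulting SparseTensor inside the map function, so that only dense tensors flow through the pipeline:

import tensorflow as tf

def _parse(serialized):
    features = {
        "image/encoded": tf.FixedLenFeature([], tf.string),
        "image/object/class/label": tf.VarLenFeature(tf.int64),
    }
    parsed = tf.parse_single_example(serialized, features)
    # VarLenFeature yields a SparseTensor; convert it to a dense tensor here.
    labels = tf.sparse_tensor_to_dense(parsed["image/object/class/label"])
    return parsed["image/encoded"], labels

dataset = tf.data.TFRecordDataset("pascal_voc_2007.tfrecord").map(_parse)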

TensorFlow: “Cannot capture a stateful node by value” in tf.contrib.data API

梦想与她 submitted on 2019-11-29 10:50:17
For transfer learning, one often uses a network as a feature extractor to create a dataset of features on which another classifier (e.g. an SVM) is trained. I want to implement this using the Dataset API (tf.contrib.data) and dataset.map():

# feature_extractor will create a CNN on top of the given tensor
def features(feature_extractor, ...):
    dataset = inputs(...)  # This creates a dataset of (image, label) pairs

    def map_example(image, label):
        features = feature_extractor(image, trainable=False)
        # Leaving out initialization from a checkpoint here...
        return features, label

    dataset = dataset
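One way around the "Cannot capture a stateful node by value" error, sketched under the assumption that feature_extractor and inputs are the poster's own functions, is to keep the CNN out of dataset.map() entirely and apply it to the iterator output instead, so its variables live in the main graph:

import tensorflow as tf

def features(feature_extractor, inputs_fn):
    dataset = inputs_fn()  # dataset of (image, label) pairs
    iterator = dataset.make_one_shot_iterator()
    image, label = iterator.get_next()
    # The CNN's variables created here are ordinary graph variables, not
    # values captured inside a dataset map function.
    feats = feature_extractor(image, trainable=False)
    return feats, label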

Feature Columns Embedding lookup

帅比萌擦擦* submitted on 2019-11-29 10:19:33
Question: I have been working with datasets and feature_columns in TensorFlow (https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html). I see they have categorical features and a way to create embedding features from categorical features. But when working on NLP tasks, how do we create a single embedding lookup? For example, consider a text classification task. Every data point would have a lot of textual columns, but they would not be separate categories. How do we create and
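For orientation, a single embedding over the words of a text column can be built with the feature-column API roughly as below; the vocabulary, column name, and dimension are illustrative assumptions:

import tensorflow as tf

words = tf.feature_column.categorical_column_with_vocabulary_list(
    key="text", vocabulary_list=["the", "cat", "sat", "on", "mat"])
text_embedding = tf.feature_column.embedding_column(words, dimension=8)

# A batch with one example containing three words; input_layer averages the
# word embeddings (the default "mean" combiner) into one vector per example.
features = {"text": tf.constant([["the", "cat", "sat"]])}
embedded = tf.feature_column.input_layer(features, [text_embedding])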

In Tensorflow's Dataset API how do you map one element into multiple elements?

≯℡__Kan透↙ submitted on 2019-11-29 09:36:20
In the TensorFlow Dataset pipeline I'd like to define a custom map function which takes a single input element (data sample) and returns multiple elements (data samples). The code below is my attempt, along with the desired results. I could not follow the documentation on tf.data.Dataset().flat_map() well enough to understand whether it was applicable here or not.

import tensorflow as tf

input = [10, 20, 30]

def my_map_func(i):
    return [[i, i+1, i+2]]  # FYI [[i], [i+1], [i+2]] throws an exception

ds = tf.data.Dataset.from_tensor_slices(input)
ds = ds.map(map_func=lambda input: tf.py_func(
    func=my
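A sketch of how flat_map can do this (not necessarily the accepted answer to the post): return a small Dataset from the map function and flat_map flattens it back into individual elements:

import tensorflow as tf

def expand(i):
    # Turn one scalar element into three consecutive scalar elements.
    return tf.data.Dataset.from_tensor_slices(tf.stack([i, i + 1, i + 2]))

ds = tf.data.Dataset.from_tensor_slices([10, 20, 30])
ds = ds.flat_map(expand)

iterator = ds.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    for _ in range(9):
        print(sess.run(next_element))  # 10, 11, 12, 20, 21, 22, 30, 31, 32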

Parallelism isn't reducing the time in dataset map

二次信任 submitted on 2019-11-28 11:14:20
The TF map function supports parallel calls. I'm seeing no improvement when passing num_parallel_calls to map. With num_parallel_calls=1 and num_parallel_calls=10 there is no difference in run time. Here is a simple code:

import time

def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False,
                                         batch_size=1, repeat=1,
                                         num_iterations=10):
    tf.reset_default_graph()
    start = time.time()
    dataset_x = tf.data.Dataset.range(1000).map(lambda x: tf.py_func(
        squarer, [x], [tf.int64]),
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_x = dataset_x.batch(batch
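A possible explanation worth checking (stated as a hypothesis, not the accepted answer): tf.py_func executes Python code under the interpreter's GIL, so parallel map calls that spend their time in pure Python cannot actually overlap. Below is a comparison sketch with a native TensorFlow op in the map function; tf.square is too cheap to show a large speedup on its own, but the structure is the same for heavier native ops, which is where num_parallel_calls can make a difference:

import time

import tensorflow as tf

def benchmark(num_parallel_calls):
    tf.reset_default_graph()
    dataset = tf.data.Dataset.range(1000).map(
        lambda x: tf.square(x), num_parallel_calls=num_parallel_calls)
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()
    start = time.time()
    with tf.Session() as sess:
        try:
            while True:
                sess.run(next_element)
        except tf.errors.OutOfRangeError:
            pass
    return time.time() - start

print(benchmark(1), benchmark(10))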