Number of examples in each tfrecord

Question

I am running the sample.sh script in Google Cloud Shell to call the preprocessing script below on a set of images, following the steps of the flowers example.

https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/flowers/trainer/preprocess.py

Preprocessing succeeded on both the eval set and the train set, but the generated .tfrecord.gz files do not seem to match the number of images in eval/train_set.csv.

For example, eval-00000-of-00157.tfrecord.gz seems to say there are 158 tfrecords, while there are 35227 rows in eval_set.csv. Each row includes a valid image_url (all of the images are uploaded to Storage) and a valid label.

I would like to know if there is a way to monitor and control the number of images per tfrecord in the preprocess.py config.

Thanks

Update: I got this to work:

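import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io

# The files are gzip-compressed, so the reader needs matching options.
options = tf.python_io.TFRecordOptions(
    compression_type=tf.python_io.TFRecordCompressionType.GZIP)

# 'url/path' is a placeholder for the directory that holds the output files.
# Count every record across all matching files.
print(sum(1 for f in file_io.get_matching_files(os.path.join('url/path', '*.tfrecord.gz'))
          for example in tf.python_io.tf_record_iterator(f, options=options)))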

Answer 1:

The filename eval-00000-of-00157.tfrecord.gz means that this is the first file (shard index 0) out of 157. There should be 156 other similarly named files, numbered up to 00156. Within each file, there can be any number of records.
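To make the naming convention concrete, here is a minimal sketch that splits a shard file name into its parts. It assumes the usual `<prefix>-SSSSS-of-NNNNN<suffix>` template, where SSSSS is a zero-based shard index and NNNNN is the total number of shards. The helper name `parse_shard_name` is made up for this illustration and is not part of the flowers sample.

```python
import re

# Assumed template: <prefix>-<index>-of-<total><suffix>, zero-based index.
SHARD_RE = re.compile(
    r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})(?P<suffix>\..*)?$")


def parse_shard_name(filename):
    """Return (prefix, shard_index, shard_total) for a sharded file name."""
    match = SHARD_RE.match(filename)
    if match is None:
        raise ValueError("not a sharded file name: %r" % filename)
    return (match.group("prefix"),
            int(match.group("index")),
            int(match.group("total")))


prefix, index, total = parse_shard_name("eval-00000-of-00157.tfrecord.gz")
print(prefix, index, total)  # eval 0 157
```

The numbers in the name describe the files only. They say nothing about how many records each file holds.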

If you want to manually count each record, try something like:

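import os
import tensorflow as tf
from tensorflow.python.lib.io import file_io

# The output files are gzip-compressed, so pass matching reader options.
options = tf.python_io.TFRecordOptions(
    compression_type=tf.python_io.TFRecordCompressionType.GZIP)
files = os.path.join('gs://my_bucket/my_dir', 'eval-*.tfrecord.gz')
# Count every record in every file that matches the pattern.
print(sum(1 for f in file_io.get_matching_files(files)
          for _ in tf.python_io.tf_record_iterator(f, options=options)))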

Note that Dataflow makes no guarantee about how the number of output files, or the ordering of records across and within those files, relates to the input. However, the total record counts should be the same.
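Since the records can be spread unevenly across files, it may help to look at the count per file rather than only the total. The following is a rough sketch that does this without TensorFlow, for files that have already been copied to local disk (it does not read gs:// paths). It relies on the TFRecord framing: an 8-byte little-endian length, a 4-byte checksum of the length, the payload, then a 4-byte checksum of the payload. The checksums are skipped, not verified, and the function names are hypothetical.

```python
import gzip
import struct


def count_tfrecords(path):
    """Count records in one local gzip-compressed TFRecord file.

    Walks the record framing only; checksums are not verified and the
    payloads are not parsed.
    """
    count = 0
    with gzip.open(path, "rb") as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 CRC of the length
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header in %s" % path)
            (length,) = struct.unpack("<Q", header[:8])
            body = f.read(length + 4)  # payload + uint32 CRC of the payload
            if len(body) < length + 4:
                raise ValueError("truncated record body in %s" % path)
            count += 1
    return count


def counts_per_file(paths):
    """Map each path to its record count, so uneven shards are visible."""
    return {path: count_tfrecords(path) for path in paths}
```

For example, `counts_per_file(sorted(glob.glob('eval-*.tfrecord.gz')))` (after `import glob`) would give one count per local shard, and summing the values should match the number of rows that were successfully preprocessed.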

Source: https://stackoverflow.com/questions/42799007/number-of-examples-in-each-tfrecord

Tags

google-cloud-ml