How to *actually* read CSV data in TensorFlow?

Asked by 感动是毒 on 2020-11-28 22:32

I'm relatively new to the world of TensorFlow, and pretty perplexed by how you'd actually read CSV data into usable example/label tensors in TensorFlow.

5 Answers
  • 2020-11-28 23:12

    I think you are mixing up imperative and graph-construction parts here. The operation tf.train.shuffle_batch creates a new queue node, and a single node can be used to process the entire dataset. So I think you are hanging because you created a bunch of shuffle_batch queues in your for loop and didn't start queue runners for them.

    Normal input pipeline usage looks like this:

    1. Add nodes like shuffle_batch to the input pipeline
    2. (optional, to prevent unintentional graph modification) finalize the graph

    --- end of graph construction, beginning of imperative programming ---

    3. tf.train.start_queue_runners
    4. while True: session.run()

    To be more scalable (i.e., to avoid the Python GIL), you could generate all of your data through the TensorFlow pipeline. However, if performance is not critical, you can hook up a numpy array to an input pipeline by using slice_input_producer. Here's an example with some Print nodes to see what's going on (messages from Print go to stdout when the node is run):

    import numpy as np
    import tensorflow as tf

    tf.reset_default_graph()

    num_examples = 5
    num_features = 2
    data = np.reshape(np.arange(num_examples * num_features),
                      (num_examples, num_features))
    print(data)

    # slice_input_producer, batch, and start_queue_runners live under tf.train
    (data_node,) = tf.train.slice_input_producer([tf.constant(data)],
                                                 num_epochs=1, shuffle=False)
    data_node_debug = tf.Print(data_node, [data_node], "Dequeueing from data_node ")
    data_batch = tf.train.batch([data_node_debug], batch_size=2)
    data_batch_debug = tf.Print(data_batch, [data_batch], "Dequeueing from data_batch ")

    sess = tf.InteractiveSession()
    sess.run(tf.initialize_all_variables())
    sess.run(tf.initialize_local_variables())  # num_epochs keeps its counter in a local variable
    tf.get_default_graph().finalize()
    tf.train.start_queue_runners()

    try:
      while True:
        print(sess.run(data_batch_debug))
    except tf.errors.OutOfRangeError:
      print("No more inputs.")
    

    You should see something like this:

    [[0 1]
     [2 3]
     [4 5]
     [6 7]
     [8 9]]
    [[0 1]
     [2 3]]
    [[4 5]
     [6 7]]
    No more inputs.
    

    The "8, 9" numbers didn't fill up a full batch, so they didn't get produced (see the snippet below if you want to keep them). Also, tf.Print messages go to sys.stdout, so they show up separately in the terminal for me.
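
    A hedged aside, not from the original answer: if you do want the trailing partial batch, tf.train.batch accepts a flag for it.

    # Emit the final partial batch ([8 9]) instead of dropping it
    data_batch = tf.train.batch([data_node_debug], batch_size=2,
                                allow_smaller_final_batch=True)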

    PS: a minimal example of connecting batch to a manually initialized queue is in GitHub issue 2193.

    Also, for debugging purposes you might want to set a timeout on your session so that your IPython notebook doesn't hang on empty queue dequeues. I use this helper function for my sessions:

    def create_session():
      config = tf.ConfigProto(log_device_placement=True)
      config.gpu_options.per_process_gpu_memory_fraction = 0.3  # don't hog all vRAM
      config.operation_timeout_in_ms = 60000  # terminate on long hangs
      # create an interactive session to register it as the default session
      sess = tf.InteractiveSession("", config=config)
      return sess
    

    Scalability Notes:

    1. tf.constant inlines a copy of your data into the graph. There's a fundamental 2 GB limit on the size of the graph definition, so that's an upper limit on the size of your data.
    2. You could get around that limit by using v = tf.Variable and saving the data into it by running an assign op with a tf.placeholder on the right-hand side, feeding the numpy array to the placeholder via feed_dict (see the sketch after this list).
    3. That still creates two copies of the data, so to save memory you could make your own version of slice_input_producer that operates on numpy arrays and uploads rows one at a time using feed_dict.
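
    A minimal sketch of note 2, assuming TF 1.x (the names data_ph and assign_op are mine, not from the answer):

    import numpy as np
    import tensorflow as tf

    data = np.random.rand(1000, 2).astype(np.float32)

    # collections=[] keeps the variable out of the global-variables list,
    # so generic initializers won't try to (re)fill it
    v = tf.Variable(tf.zeros(data.shape), trainable=False, collections=[])
    data_ph = tf.placeholder(tf.float32, shape=data.shape)
    assign_op = v.assign(data_ph)

    sess = tf.Session()
    sess.run(assign_op, feed_dict={data_ph: data})  # the only host-to-graph copy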
  • 2020-11-28 23:14

    Or you could try this: the code loads the Iris dataset into TensorFlow using pandas and numpy, and a simple one-neuron output is printed in the session. Hope it helps for a basic understanding... (I haven't added one-hot encoding of the labels; see the sketch after the code.)

    import tensorflow as tf
    import numpy
    import pandas as pd

    df = pd.read_csv('/home/nagarjun/Desktop/Iris.csv', usecols=[0, 1, 2, 3, 4],
                     skiprows=[0], header=None)
    d = df.values
    l = pd.read_csv('/home/nagarjun/Desktop/Iris.csv', usecols=[5], header=None)
    data = numpy.float32(d)
    labels = numpy.array(l, 'str')
    # print(data, labels)

    # TensorFlow: feed the features through the placeholder rather than
    # overwriting x with the numpy array
    x = tf.placeholder(tf.float32, shape=(150, 5))
    w = tf.random_normal([100, 150], mean=0.0, stddev=1.0, dtype=tf.float32)
    y = tf.nn.softmax(tf.matmul(w, x))

    with tf.Session() as sess:
        print(sess.run(y, feed_dict={x: data}))
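
    A hedged sketch of the missing one-hot step (my addition, not in the original answer), using pandas on the label frame l loaded above:

    one_hot = pd.get_dummies(l[5])            # one indicator column per species
    labels = numpy.float32(one_hot.values)    # (150, 3) one-hot matrix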
    
  • 2020-11-28 23:17

    2.0-Compatible Solution: Similar code may be given by others in this thread, but I will provide additional links which will help the community.

    # file_path and LABEL_COLUMN are placeholders for your CSV file and its
    # label column name
    dataset = tf.data.experimental.make_csv_dataset(
          file_path,
          batch_size=5,  # artificially small to make examples easier to show
          label_name=LABEL_COLUMN,
          na_value="?",
          num_epochs=1,
          ignore_errors=True)
    

    For more information, please refer to this TensorFlow tutorial.
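
    A quick usage sketch (my addition, assuming eager execution in TF 2.x): make_csv_dataset yields (features, label) batches, where features is a dict of column tensors.

    for batch, labels in dataset.take(1):
        for name, values in batch.items():
            print(name, values.numpy())
        print('labels:', labels.numpy())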

  • 2020-11-28 23:20

    You can use the latest tf.data API:

    # batch_size is a required argument of make_csv_dataset
    dataset = tf.contrib.data.make_csv_dataset(filepath, batch_size=32)
    iterator = dataset.make_initializable_iterator()
    columns = iterator.get_next()
    with tf.Session() as sess:
       sess.run(iterator.initializer)
       print(sess.run(columns))  # one batch, as a dict of column tensors
    
  • 2020-11-28 23:20

    If anyone came here searching for a simple way to read large, sharded CSV files with the tf.estimator API, please see my code below:

    CSV_COLUMNS = ['ID','text','class']
    LABEL_COLUMN = 'class'
    DEFAULTS = [['x'],['no'],[0]]  #Default values
    
    def read_dataset(filename, mode, batch_size = 512):
        def _input_fn(v_test=False):
    #         def decode_csv(value_column):
    #             columns = tf.decode_csv(value_column, record_defaults = DEFAULTS)
    #             features = dict(zip(CSV_COLUMNS, columns))
    #             label = features.pop(LABEL_COLUMN)
    #             return add_engineered(features), label
    
            # Create list of files that match pattern
            file_list = tf.gfile.Glob(filename)
    
            # Create dataset from file list
            #dataset = tf.data.TextLineDataset(file_list).map(decode_csv)
            # Determine epochs/shuffling first so they can be passed to
            # make_csv_dataset (num_epochs was previously computed but never
            # used, so evaluation would never see end-of-input)
            if mode == tf.estimator.ModeKeys.TRAIN:
                num_epochs = None # indefinitely
                shuffle = True
            else:
                num_epochs = 1 # end-of-input after this
                shuffle = False

            dataset = tf.contrib.data.make_csv_dataset(file_list,
                                                       batch_size=batch_size,
                                                       column_names=CSV_COLUMNS,
                                                       column_defaults=DEFAULTS,
                                                       label_name=LABEL_COLUMN,
                                                       num_epochs=num_epochs,
                                                       shuffle=shuffle)

            batch_features, batch_labels = dataset.make_one_shot_iterator().get_next()
    
            #Begins - Uncomment for testing only -----------------------------------------------------<
            if v_test == True:
                with tf.Session() as sess:
                    print(sess.run(batch_features))
            #End - Uncomment for testing only -----------------------------------------------------<
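            # add_engineered is assumed to be the user's own feature-engineering
            # function (referenced but not defined in this snippet)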
            return add_engineered(batch_features), batch_labels
        return _input_fn
    

    Example usage with tf.estimator:

    train_spec = tf.estimator.TrainSpec(input_fn = read_dataset(
                                                    filename = train_file,
                                                    mode = tf.estimator.ModeKeys.TRAIN,
                                                    batch_size = 128), 
                                          max_steps = num_train_steps)
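
    To complete the picture, a hedged sketch of the matching eval side and train_and_evaluate (eval_file and estimator are my assumptions, not from the original answer):

    eval_spec = tf.estimator.EvalSpec(input_fn = read_dataset(
                                                    filename = eval_file,
                                                    mode = tf.estimator.ModeKeys.EVAL,
                                                    batch_size = 128))
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)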
    