Streaming large training and test files into Tensorflow's DNNClassifier

半阙折子戏 2020-11-29 02:10

I have a huge training CSV file (709M) and a large testing CSV file (125M) that I want to send into a DNNClassifier in the context of using the high-level Tenso

2 Answers
  •  悲&欢浪女
    2020-11-29 02:39

    I agree with DomJack about using the Dataset API, except for the need to read the whole CSV file and then convert it to TFRecord. I propose instead to employ TextLineDataset, a subclass of the Dataset API, to load the data directly into a TensorFlow program. An intuitive tutorial can be found here.
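
    To make the streaming idea concrete without TensorFlow, here is a minimal plain-Python sketch of reading a CSV one row at a time instead of loading it whole. The function name stream_rows and the tiny demo file are assumptions made up for this illustration; TextLineDataset itself is not used here.

```python
import csv
import os
import tempfile


def stream_rows(file_path, skip_header=True):
    """Yield one parsed CSV row at a time instead of loading the whole file."""
    with open(file_path, newline="") as handle:
        reader = csv.reader(handle)
        if skip_header:
            next(reader, None)  # drop the header row, if there is one
        for row in reader:
            yield row


# Hypothetical three-row file, written to a temp directory for the demo.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="") as handle:
    handle.write("label,pixel0,pixel1\n7,0.0,0.5\n2,1.0,0.0\n9,0.25,0.75\n")

rows = stream_rows(demo_path)
first_row = next(rows)            # reads only as far as the first data line
remaining = sum(1 for _ in rows)  # consume the rest lazily
```

    Because rows are produced lazily, memory use depends on the row size and the read buffer, not on the size of the file.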

    The code below uses the MNIST classification problem as an illustration and will hopefully answer the OP's question. The CSV file has 784 feature columns preceded by a label column, and the number of classes is 10. The classifier in this example is a neural network with one hidden layer of 16 ReLU units.

    Firstly, load libraries and define some constants:

    # load libraries
    import tensorflow as tf
    import os
    
    # some constants
    n_x = 784
    n_h = 16
    n_y = 10
    
    # path to the folder containing the train and test csv files
    # You only need to change PATH, rest is platform independent
    PATH = os.getcwd() + '/' 
    
    # create a list of feature names
    feature_names = ['pixel' + str(i) for i in range(n_x)]
    

    Secondly, we create an input function that reads a file with the Dataset API and feeds the results to the Estimator API. It must return a two-element tuple: the first element is a dict that maps each feature name to that feature's values for the batch, and the second element holds the labels for the batch.

    def my_input_fn(file_path, batch_size=32, buffer_size=256,
                    perform_shuffle=False, repeat_count=1):
        '''
        Args:
            - file_path: the path of the input file
            - batch_size: the number of records per batch
            - buffer_size: the size of the shuffle buffer (used only if perform_shuffle is True)
            - perform_shuffle: whether the data is shuffled or not
            - repeat_count: The number of times to iterate over the records in the dataset.
                        For example, if we specify 1, then each record is read once.
                        If we specify None, iteration will continue forever.
        Output is a two-element tuple organized as follows:
            - The first element is a dict in which each input feature is a key
            and the value holds that feature's values for the batch.
            - The second element holds the labels for the batch.
        '''
        def decode_csv(line):
            # Column 0 is the integer label; the remaining n_x columns are float features
            record_defaults = [[0]] + [[0.]] * n_x
            parsed_line = tf.decode_csv(records=line,
                                        record_defaults=record_defaults)
            label = parsed_line[0]  # First element is the label
            features = parsed_line[1:]  # Everything after the first element is a feature
            # Return (dict of feature name -> value, label)
            return dict(zip(feature_names, features)), label
    
        dataset = (tf.data.TextLineDataset(file_path)  # Read text file
                   .skip(1)  # Skip header row
                   .map(decode_csv))  # Transform each elem by applying decode_csv fn
        if perform_shuffle:
            # Randomizes input using a buffer of buffer_size elements (held in memory)
            dataset = dataset.shuffle(buffer_size=buffer_size)
        dataset = dataset.repeat(repeat_count)  # Repeats dataset this # times
        dataset = dataset.batch(batch_size)  # Batch size to use
        iterator = dataset.make_one_shot_iterator()
        batch_features, batch_labels = iterator.get_next()
    
        return batch_features, batch_labels
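
    To show the shape of the tuple described above without running TensorFlow, here is a plain-Python sketch that parses lines and groups them into batches. The helpers parse_line and batches, the three-feature layout, and the sample lines are all hypothetical stand-ins, not part of the Dataset API.

```python
from itertools import islice

n_x = 3  # hypothetical: 3 features instead of 784, to keep the demo readable
feature_names = ['pixel' + str(i) for i in range(n_x)]


def parse_line(line):
    """Split one CSV line into (features dict, integer label)."""
    fields = line.strip().split(',')
    label = int(fields[0])  # the first column is the label
    features = dict(zip(feature_names, (float(v) for v in fields[1:])))
    return features, label


def batches(lines, batch_size):
    """Group parsed records into (dict of feature -> list, list of labels)."""
    records = (parse_line(line) for line in lines)
    while True:
        chunk = list(islice(records, batch_size))
        if not chunk:
            return
        batch_features = {name: [f[name] for f, _ in chunk]
                          for name in feature_names}
        batch_labels = [label for _, label in chunk]
        yield batch_features, batch_labels


lines = ["5,0.0,0.1,0.2", "0,0.3,0.4,0.5", "4,0.6,0.7,0.8"]
all_batches = list(batches(lines, batch_size=2))
```

    With three records and a batch size of 2, the second batch holds the single leftover record, which is also what a plain batch() call does when it is not told to drop the remainder.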
    

    Then, the mini-batch can be computed as

    next_batch = my_input_fn(file_path=PATH + 'train1.csv',
                             batch_size=512,
                             perform_shuffle=True)  # returns a batch of 512 shuffled elements
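
    A note on perform_shuffle: the shuffle only mixes records inside a buffer of buffer_size elements, so it is not a full shuffle of the file. The plain-Python sketch below imitates that behaviour as the Dataset.shuffle documentation describes it (fill a buffer, emit a random element, replace it with the next input). The function buffered_shuffle is a hypothetical illustration, not TensorFlow code.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in an order randomized only within a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and replace it with the new one
    rng.shuffle(buffer)          # input exhausted: drain what is left
    yield from buffer


data = list(range(1000))
shuffled = list(buffered_shuffle(data, buffer_size=256, seed=0))

# The element emitted at position i can come from at most
# buffer_size - 1 places further down the input.
max_lookahead = max(value - i for i, value in enumerate(shuffled))
```

    If the CSV happens to be sorted by label, a small buffer will barely mix the classes, so it may be worth raising buffer_size as far as memory allows.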
    

    Next, we declare every feature column as numeric:

    feature_columns = [tf.feature_column.numeric_column(k) for k in feature_names]
    

    Thirdly, we create an estimator DNNClassifier:

    classifier = tf.estimator.DNNClassifier(
        feature_columns=feature_columns,  # The input features to our model
        hidden_units=[n_h],  # One layer
        n_classes=n_y,
        model_dir=None)
    

    Finally, the DNN is trained using the training CSV file, while the evaluation is performed on the test file. Please adjust repeat_count and steps in your code so that training runs for the required number of epochs.

    # train the DNN
    classifier.train(
        input_fn=lambda: my_input_fn(file_path=PATH + 'train1.csv',
                                     perform_shuffle=True,
                                     repeat_count=1),
        steps=None)
    
    # evaluate using the test csv file
    evaluate_result = classifier.evaluate(
        input_fn=lambda: my_input_fn(file_path=PATH + 'test1.csv',
                                     perform_shuffle=False))
    print("Evaluation results")
    for key in evaluate_result:
        print("   {}, was: {}".format(key, evaluate_result[key]))
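
    On choosing repeat_count and steps: with steps=None, training runs until the input function is exhausted, so repeat_count sets the number of epochs. If you pass repeat_count=None instead, you need an explicit steps value. Because the input function repeats before it batches, the number of steps for a given number of epochs can be estimated as below. The helper steps_for_epochs and the row counts are hypothetical; substitute the real number of data rows in your CSV.

```python
def steps_for_epochs(n_rows, batch_size, epochs):
    """Batches needed to cover `epochs` passes when repeat() precedes batch()."""
    total_records = n_rows * epochs
    # Integer ceiling division: the last batch may be only partly full.
    return (total_records + batch_size - 1) // batch_size


# Hypothetical sizes: 60000 training rows, the default batch size of 32.
steps_10_epochs = steps_for_epochs(n_rows=60000, batch_size=32, epochs=10)
steps_small = steps_for_epochs(n_rows=1000, batch_size=32, epochs=3)
```

    For the hypothetical sizes above this gives 18750 and 94 steps respectively.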
    
