Read csv from Google Cloud storage to pandas dataframe

Backend · Unresolved · 7 answers · 1157 views
时光说笑 2020-11-28 03:00

I am trying to read a CSV file from a Google Cloud Storage bucket into a pandas dataframe.

import pandas as pd
import matplotlib.pyplot as plt


        
7 Answers
  •  悲哀的现实
    2020-11-28 03:31

    Another option is to use TensorFlow, which can do a streaming read from Google Cloud Storage:

    import pandas as pd
    from tensorflow.python.lib.io import file_io

    with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
      df = pd.read_csv(f)
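    As an aside that needs no TensorFlow at all: with the gcsfs package installed, recent pandas versions accept gs:// URLs directly in read_csv. A minimal sketch (the bucket path is a placeholder, and local paths go through the same call, which makes the helper easy to test offline):

    ```python
    import pandas as pd

    def read_gcs_csv(path):
        # With gcsfs installed, pandas dispatches gs:// URLs to Google Cloud
        # Storage transparently; plain local paths use the same code path.
        return pd.read_csv(path)

    # df = read_gcs_csv('gs://bucket/file.csv')  # placeholder bucket/object
    ```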
    

    Using TensorFlow also gives you a convenient way to handle wildcards in filenames. For example:

    Reading wildcard CSV into Pandas

    Here is code that will read all CSVs that match a specific pattern (e.g: gs://bucket/some/dir/train-*) into a Pandas dataframe:

    import tensorflow as tf
    from tensorflow.python.lib.io import file_io
    import pandas as pd

    def read_csv_file(filename):
      # file_io.FileIO streams the object, so this works for both local
      # paths and gs:// URLs
      with file_io.FileIO(filename, 'r') as f:
        return pd.read_csv(f, header=None, names=['col1', 'col2'])

    def read_csv_files(filename_pattern):
      # tf.gfile.Glob expands the wildcard against GCS
      # (in TensorFlow 2.x this API is tf.io.gfile.glob)
      filenames = tf.gfile.Glob(filename_pattern)
      dataframes = [read_csv_file(filename) for filename in filenames]
      return pd.concat(dataframes)


    Usage:

    import os

    DATADIR = 'gs://my-bucket/some/dir'
    traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
    evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))
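    The same glob-and-concat pattern can be sketched with only the standard library, which is handy for testing the logic offline before pointing it at a bucket. This is an assumption-laden stand-in, not the answer's code: stdlib glob replaces tf.gfile.Glob (so it only sees local files), and the column names are the placeholder ones from the answer above:

    ```python
    import glob
    import pandas as pd

    def read_csv_pattern(pattern):
        # Expand the wildcard, read each matching headerless CSV, and stack
        # the frames, mirroring read_csv_files; sorted() makes the row order
        # deterministic across filesystems.
        frames = [pd.read_csv(name, header=None, names=['col1', 'col2'])
                  for name in sorted(glob.glob(pattern))]
        return pd.concat(frames, ignore_index=True)
    ```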
    
