I am trying to read a CSV file from a Google Cloud Storage bucket into a pandas DataFrame.
import pandas as pd
import matplotlib.pyplot as plt
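One straightforward route, shown here as a minimal sketch that assumes the optional gcsfs dependency is installed (pip install gcsfs), is to hand pandas the gs:// URL directly:

import pandas as pd

# pandas delegates gs:// URLs to gcsfs, so the object is read
# from the bucket without an explicit download step
df = pd.read_csv('gs://bucket/file.csv')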
Another option is to use TensorFlow, which can stream reads directly from Google Cloud Storage:
import pandas as pd
from tensorflow.python.lib.io import file_io

# FileIO streams the object from GCS; pd.read_csv accepts any file-like object
with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
    df = pd.read_csv(f)
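If you are on TensorFlow 2.x, the public tf.io.gfile API provides the same streaming read; a sketch under that assumption:

import tensorflow as tf
import pandas as pd

# tf.io.gfile.GFile is the public file API in TF 2.x; like FileIO,
# it streams the object rather than downloading it to local disk first
with tf.io.gfile.GFile('gs://bucket/file.csv', 'r') as f:
    df = pd.read_csv(f)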
Using TensorFlow also gives you a convenient way to handle wildcards in the filename. Here is code that reads all CSVs matching a pattern (e.g. gs://bucket/some/dir/train-*) into a single pandas DataFrame:
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io
import pandas as pd

def read_csv_file(filename):
    # Stream one CSV from GCS straight into a DataFrame
    with file_io.FileIO(filename, 'r') as f:
        df = pd.read_csv(f, header=None, names=['col1', 'col2'])
    return df

def read_csv_files(filename_pattern):
    # Expand the wildcard against GCS, then read and concatenate every match
    filenames = tf.gfile.Glob(filename_pattern)
    dataframes = [read_csv_file(filename) for filename in filenames]
    return pd.concat(dataframes)

DATADIR = 'gs://my-bucket/some/dir'
traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))
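For the wildcard version on TensorFlow 2.x, where tf.gfile.Glob has moved to tf.io.gfile.glob, a sketch of the equivalent helper looks like this:

import tensorflow as tf
import pandas as pd

def read_csv_files_tf2(filename_pattern):
    # tf.io.gfile.glob replaces tf.gfile.Glob in TensorFlow 2.x
    filenames = tf.io.gfile.glob(filename_pattern)
    dataframes = []
    for filename in filenames:
        with tf.io.gfile.GFile(filename, 'r') as f:
            dataframes.append(pd.read_csv(f, header=None, names=['col1', 'col2']))
    # ignore_index avoids duplicate row labels across the concatenated shards
    return pd.concat(dataframes, ignore_index=True)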