How to read blob (pickle) files from GCS in a Google Cloud DataFlow job?


Question


I am trying to run a DataFlow pipeline remotely that uses a pickle file. Locally, I can use the code below to load the file.

with open(known_args.file_path, 'rb') as fp:
    file = pickle.load(fp)

However, this does not work when the path points to Cloud Storage (gs://...):

IOError: [Errno 2] No such file or directory: 'gs://.../.pkl'

I roughly understand why it is not working, but I cannot find the right way to do it.


Answer 1:


If the pickle files are in your GCS bucket, you can load them as blobs and process them further as in your code (using pickle.load()):

import apache_beam as beam


class ReadGcsBlobs(beam.DoFn):
    def process(self, element, *args, **kwargs):
        # Imported inside process() so the module is available on Dataflow workers.
        from apache_beam.io.gcp import gcsio
        gcs = gcsio.GcsIO()
        # Emit (path, raw bytes) for each gs:// path received.
        yield (element, gcs.open(element).read())


# usage example (p is the Pipeline object):
files = (p
         | "Initialize" >> beam.Create(["gs://your-bucket-name/pickle_file_path.pickle"])
         | "Read blobs" >> beam.ParDo(ReadGcsBlobs())
        )
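
The DoFn above emits (path, raw bytes) pairs, so the bytes still need to be unpickled downstream. A minimal sketch of that step, continuing from the files PCollection in the usage example (the "Unpickle" label and the deserialize helper are illustrative, not from the original answer):

import pickle

import apache_beam as beam


def deserialize(path_and_blob):
    # Turn (path, raw bytes) into (path, unpickled Python object).
    path, blob = path_and_blob
    return path, pickle.loads(blob)


# Continuation of the usage example above:
objects = files | "Unpickle" >> beam.Map(deserialize)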



Answer 2:


open() is the standard Python built-in and does not understand Google Cloud Storage paths. Use the Beam FileSystems API instead, which is aware of GCS and of the other filesystems supported by Beam.
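
For example, a minimal sketch using FileSystems to read and unpickle a single object (the gs:// path is a placeholder):

import pickle

from apache_beam.io.filesystems import FileSystems

# FileSystems.open() resolves the gs:// scheme (and any other filesystem
# registered with Beam) and returns a readable file-like object.
fp = FileSystems.open("gs://your-bucket-name/pickle_file_path.pickle")
try:
    file = pickle.load(fp)
finally:
    fp.close()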



Source: https://stackoverflow.com/questions/47306715/how-to-read-blob-pickle-files-from-gcs-in-a-google-cloud-dataflow-job
