Importing a large CSV from Cloud Storage into App Engine Datastore

Posted by 徘徊边缘 on 2019-12-23 13:34:06

Question


I have a large CSV file, on the order of 1 GB, and want to create Datastore entities from it, one entity per row.

That CSV file currently resides in Google Cloud Storage. Is there a clean way to do this? All the examples I can find online seem to rely on having the CSV file locally, or don't look like they would scale well. Ideally there's a streaming API that lets me read small enough pieces from Cloud Storage to make update calls to the Datastore, but I haven't been able to find anything like that.


Answer 1:


The buffer you receive when you open a GCS file is a streaming buffer, which can be pickled. But GCS does not support the iterator protocol for reading the lines of the CSV, so you have to write your own wrapper, like:

import csv
import logging
import cloudstorage as gcs

with gcs.open('/app_default_bucket/csv/example.csv', 'r') as f:
    csv_reader = csv.reader(iter(f.readline, ''))  # iter(f.readline, '') yields lines until EOF
    for row in csv_reader:
        logging.info(' - '.join(row))
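
Since the goal is one Datastore entity per row, the same readline wrapper can feed batched writes. Below is a minimal sketch of that idea, not from the original answer: the ndb model CsvRow, its two columns, and the batch size of 500 are assumptions for illustration.

import csv

import cloudstorage as gcs
from google.appengine.ext import ndb


class CsvRow(ndb.Model):
    # Hypothetical entity kind; replace the properties with your real columns.
    col_a = ndb.StringProperty()
    col_b = ndb.StringProperty()


def import_csv(path='/app_default_bucket/csv/example.csv', batch_size=500):
    batch = []
    with gcs.open(path, 'r') as f:
        for row in csv.reader(iter(f.readline, '')):
            batch.append(CsvRow(col_a=row[0], col_b=row[1]))
            if len(batch) >= batch_size:
                ndb.put_multi(batch)  # one RPC per batch instead of one per row
                batch = []
    if batch:
        ndb.put_multi(batch)  # flush the final partial batch

Batching the puts keeps the number of Datastore RPCs low, which matters for a file on the order of 1 GB.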

If you are familiar with the blobstore, you can use it to read large CSVs from GCS using blobstore.create_gs_key("/gs" + <gcs_file_name_here>). Example here
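
A minimal sketch of that blobstore route, assuming the file lives in the app's default bucket; BlobReader treats the encoded GCS key like a file object and buffers reads, so the whole file never has to fit in memory:

import csv
import logging

from google.appengine.ext import blobstore

# Map the GCS object to a blobstore-style key, then read it through BlobReader.
gs_key = blobstore.create_gs_key('/gs/app_default_bucket/csv/example.csv')
reader = blobstore.BlobReader(gs_key, buffer_size=1024 * 1024)
for row in csv.reader(reader):  # BlobReader is iterable line by line
    logging.info(' - '.join(row))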




Answer 2:


Your best bet is going to be a mapreduce job using the CloudStorageInputReader: https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/input_readers.py#L2189

More on MapReduce for Python here: https://github.com/GoogleCloudPlatform/appengine-mapreduce/wiki/3-MapReduce-for-Python
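
For the mapper itself, here is a minimal sketch of a handler, assuming the job is configured with that input reader (which hands the handler one readable file buffer per Cloud Storage object); the CsvRow model is the same hypothetical kind as above, and yielding op.db.Put lets the framework pool the writes into batches:

import csv

from google.appengine.ext import ndb
from mapreduce import operation as op


class CsvRow(ndb.Model):
    # Hypothetical entity kind for illustration.
    col_a = ndb.StringProperty()
    col_b = ndb.StringProperty()


def import_csv_file(file_buffer):
    """Mapper handler: parse one CSV object and emit one Datastore put per row."""
    for row in csv.reader(iter(file_buffer.readline, '')):
        yield op.db.Put(CsvRow(col_a=row[0], col_b=row[1]))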



Source: https://stackoverflow.com/questions/30949385/importing-a-large-csv-from-cloud-storage-into-app-engine-datastore
