Question
I have a large CSV file, on the order of 1 GB, and I want to create one Datastore entity per row.
That CSV file is currently residing in Google Cloud Storage. Is there a clean way to do this? All the examples I can find online seem to rely on having the CSV file locally, or don't look like they would scale very well. Ideally there's a streaming API that lets me read in small enough pieces from Cloud Storage to make update calls to the Datastore, but I haven't been able to find anything like that.
Answer 1:
The buffer you receive when you open a GCS file is a streaming buffer, which can be pickled. But GCS does not support the iterator protocol for reading the lines of the CSV, so you have to write your own wrapper, like this:
import csv
import logging
import cloudstorage as gcs

with gcs.open('/app_default_bucket/csv/example.csv', 'r') as f:
    csv_reader = csv.reader(iter(f.readline, ''))  # readline-based iterator stops at EOF
    for row in csv_reader:
        logging.info(' - '.join(row))
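To turn those rows into Datastore entities, the same read loop can feed batched writes. A minimal sketch, assuming a hypothetical ndb model CsvRow with two string columns and a batch size of 500; adjust the kind and column mapping to your data:

import csv
import cloudstorage as gcs
from google.appengine.ext import ndb

class CsvRow(ndb.Model):  # hypothetical kind, for illustration only
    name = ndb.StringProperty()
    value = ndb.StringProperty()

def import_csv(gcs_path, batch_size=500):
    batch = []
    with gcs.open(gcs_path, 'r') as f:
        for row in csv.reader(iter(f.readline, '')):
            batch.append(CsvRow(name=row[0], value=row[1]))
            if len(batch) >= batch_size:
                ndb.put_multi(batch)  # one RPC per batch keeps memory and call count low
                batch = []
    if batch:
        ndb.put_multi(batch)  # flush the final partial batch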
If you are familiar with the blobstore, you can also use it to read large CSVs from GCS via blobstore.create_gs_key("/gs" + <gcs_file_name_here>). Example here.
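A minimal sketch of that approach, reusing the bucket and object path from the example above; create_gs_key wraps the "/gs/..." path in a key that BlobReader can consume:

import csv
import logging
from google.appengine.ext import blobstore

# create_gs_key maps a "/gs/<bucket>/<object>" path to a key that BlobReader accepts
blob_key = blobstore.create_gs_key('/gs/app_default_bucket/csv/example.csv')
reader = blobstore.BlobReader(blob_key, buffer_size=1024 * 1024)  # read 1 MB at a time
for row in csv.reader(iter(reader.readline, '')):
    logging.info(' - '.join(row))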
Answer 2:
Your best bet is going to be a MapReduce job using the CloudStorageInputReader: https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/input_readers.py#L2189
More on MapReduce for Python here: https://github.com/GoogleCloudPlatform/appengine-mapreduce/wiki/3-MapReduce-for-Python
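For reference, the mapper side might look like the sketch below. It assumes the library's GoogleCloudStorageInputReader, which hands each map call an open, file-like buffer for one Cloud Storage object, and the operation.db.Put output operation; CsvRow is a hypothetical placeholder, and exact reader and parameter names can vary between versions of the mapreduce library:

import csv
from google.appengine.ext import db
from mapreduce import operation as op

class CsvRow(db.Model):  # hypothetical kind, for illustration only
    name = db.StringProperty()
    value = db.StringProperty()

def csv_row_mapper(file_buffer):
    # each call receives one readable buffer yielded by the input reader
    for row in csv.reader(iter(file_buffer.readline, '')):
        yield op.db.Put(CsvRow(name=row[0], value=row[1]))

The input reader is then pointed at the bucket and object names in mapreduce.yaml or a pipeline definition, and the framework batches the yielded puts for you.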
Source: https://stackoverflow.com/questions/30949385/importing-a-large-csv-from-cloud-storage-into-app-engine-datastore