Question
I have a large CSV file, on the order of 1 GB, and I want to create one Datastore entity per row.
That CSV file is currently residing in Google Cloud Storage. Is there a clean way to do this? All the examples I can find online seem to rely on having the CSV file locally, or don't look like they would scale very well. Ideally there's a streaming API that lets me read in small enough pieces from Cloud Storage to make update calls to the Datastore, but I haven't been able to find anything like that.
Answer 1:
The buffer you receive when you open a GCS file is a streaming buffer, which can be pickled. But GCS does not support the iterator protocol for reading the lines of the CSV, so you have to write your own wrapper, like this:
import csv
import logging
import cloudstorage as gcs

with gcs.open('/app_default_bucket/csv/example.csv', 'r') as f:
    csv_reader = csv.reader(iter(f.readline, ''))  # readline-based iterator stops at EOF
    for row in csv_reader:
        logging.info(' - '.join(row))
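To turn those rows into Datastore entities, the same read loop can feed batched writes. A minimal sketch, assuming a hypothetical ndb model CsvRow with two string columns and a batch size of 500; adjust the kind and column mapping to your data:

import csv
import cloudstorage as gcs
from google.appengine.ext import ndb

class CsvRow(ndb.Model):  # hypothetical kind, for illustration only
    name = ndb.StringProperty()
    value = ndb.StringProperty()

def import_csv(gcs_path, batch_size=500):
    batch = []
    with gcs.open(gcs_path, 'r') as f:
        for row in csv.reader(iter(f.readline, '')):
            batch.append(CsvRow(name=row[0], value=row[1]))
            if len(batch) >= batch_size:
                ndb.put_multi(batch)  # one RPC per batch keeps memory and call count low
                batch = []
    if batch:
        ndb.put_multi(batch)  # flush the final partial batch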
If you are familiar with the blobstore, you can also use it to read large CSVs from GCS via blobstore.create_gs_key("/gs" + <gcs_file_name_here>). Example here.
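A minimal sketch of that approach, reusing the bucket and object path from the example above; create_gs_key wraps the "/gs/..." path in a key that BlobReader can consume:

import csv
import logging
from google.appengine.ext import blobstore

# create_gs_key maps a "/gs/<bucket>/<object>" path to a key that BlobReader accepts
blob_key = blobstore.create_gs_key('/gs/app_default_bucket/csv/example.csv')
reader = blobstore.BlobReader(blob_key, buffer_size=1024 * 1024)  # read 1 MB at a time
for row in csv.reader(iter(reader.readline, '')):
    logging.info(' - '.join(row))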
Answer 2:
Your best bet is going to be a MapReduce job using the CloudStorageInputReader: https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/input_readers.py#L2189
More on MapReduce for Python here: https://github.com/GoogleCloudPlatform/appengine-mapreduce/wiki/3-MapReduce-for-Python
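For reference, the mapper side might look like the sketch below. It assumes the library's GoogleCloudStorageInputReader, which hands each map call an open, file-like buffer for one Cloud Storage object, and the operation.db.Put output operation; CsvRow is a hypothetical placeholder, and exact reader and parameter names can vary between versions of the mapreduce library:

import csv
from google.appengine.ext import db
from mapreduce import operation as op

class CsvRow(db.Model):  # hypothetical kind, for illustration only
    name = db.StringProperty()
    value = db.StringProperty()

def csv_row_mapper(file_buffer):
    # each call receives one readable buffer yielded by the input reader
    for row in csv.reader(iter(file_buffer.readline, '')):
        yield op.db.Put(CsvRow(name=row[0], value=row[1]))

The input reader is then pointed at the bucket and object names in mapreduce.yaml or a pipeline definition, and the framework batches the yielded puts for you.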
Source: https://stackoverflow.com/questions/30949385/importing-a-large-csv-from-cloud-storage-into-app-engine-datastore