Can you upload to S3 using a stream rather than a local file?

Submitted by 可紊 on 2019-11-26 15:48:08

Question


I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to the S3 bucket as it is being created, rather than writing the whole file locally and then uploading it at the end.

Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:

import csv
import io
import boto
from boto.s3.key import Key


conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
# Fails: writeheader() returns None, and boto raises
# "s3 does not support chunked transfer" for stream uploads
k.set_contents_from_stream(writer.writeheader())

I received this error: BotoClientError: s3 does not support chunked transfer

UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:

conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

testDict = [
    {"fieldA": "8", "fieldB": None, "fieldC": "888888888888"},
    {"fieldA": "9", "fieldB": None, "fieldC": "99999999999"},
]

f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())  # uploads just the header row

for row in testDict:
    writer.writerow(row)
    # each call replaces the whole S3 object with the buffer's current contents
    k.set_contents_from_string(f.getvalue())

f.close()

This writes three lines to the file; however, I'm unable to release the memory, which I would need to do to write a big file. If I add:

f.seek(0)
f.truncate(0)

to the loop, then only the last line of the file is written, because each set_contents_from_string() call replaces the entire object with whatever is currently in the buffer. Is there any way to release resources without deleting lines from the file?


Answer 1:


I did find a solution to my question, which I'll post here in case anyone else is interested. You can't stream directly to S3, so I decided to upload the file in parts as a multipart upload. There is also a package that converts a streaming write into a multipart upload for you, which is what I used: Smart Open.

import smart_open
import io
import csv

testDict = [
    {"fieldA": "8", "fieldB": None, "fieldC": "888888888888"},
    {"fieldA": "9", "fieldB": None, "fieldC": "99999999999"},
]

fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()
with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    fout.write(f.getvalue())  # stream the header to S3

    for row in testDict:
        # reset the buffer so it only ever holds the current row
        f.seek(0)
        f.truncate(0)
        writer.writerow(row)
        fout.write(f.getvalue())  # stream this row as part of the upload

f.close()
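
If you're on Python 3 with a more recent smart_open release, the same idea gets simpler: newer versions expose smart_open.open(), and opening the S3 object in text mode lets you hand the stream straight to csv.DictWriter, with no intermediate StringIO and no seek/truncate dance. A minimal sketch, assuming the same hypothetical bucket and key:

import csv
import smart_open

rows = [
    {"fieldA": "8", "fieldB": None, "fieldC": "888888888888"},
    {"fieldA": "9", "fieldB": None, "fieldC": "99999999999"},
]
fieldnames = ['fieldA', 'fieldB', 'fieldC']

# Text-mode open; smart_open buffers the writes and ships them to S3
# as a multipart upload behind the scenes.
with smart_open.open('s3://dev-test/bar/foo.csv', 'w') as fout:
    writer = csv.DictWriter(fout, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)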



Answer 2:


According to the docs, it's possible:

s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))

so put() accepts any file-like body, and we can use an in-memory StringIO buffer in the ordinary way.
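
For example, a minimal sketch (assuming boto3 is configured; the bucket name 'mybucket' and the key are hypothetical) that builds the CSV in an in-memory buffer and uploads it with a single put():

import csv
import io
import boto3

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['first_name', 'last_name'])
writer.writeheader()
writer.writerow({'first_name': 'Jane', 'last_name': 'Doe'})

s3 = boto3.resource('s3')
# Body accepts bytes (or a file-like object), so encode the text buffer
s3.Object('mybucket', 'hello.csv').put(Body=buf.getvalue().encode('utf-8'))

Note that this buffers the entire file in memory before uploading, so it avoids the local disk but not the memory cost; for genuinely large files the multipart approach above is a better fit.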

Update: the smart_open lib from @inquiring minds' answer is a better solution.



Source: https://stackoverflow.com/questions/31031463/can-you-upload-to-s3-using-a-stream-rather-than-a-local-file
