Memory-efficient large dataset streaming to S3

Posted by 孤人 on 2019-12-04 17:32:50

I'm assuming that by "make these parts work together" you mean you want a single file in S3 instead of just parts? All you need to do is create a file object that, when read, issues a query for the next batch and buffers it. We can make use of Python's generators:

import csv
from io import StringIO

def _generate_chunks(engine):
    with engine.begin() as conn:
        conn = conn.execution_options(stream_results=True)
        results = conn.execute("")  # your SELECT statement goes here
        while True:
            chunk = results.fetchmany(10000)
            if not chunk:
                break
            # Serialize this batch of rows to CSV and yield it as bytes
            csv_buffer = StringIO()
            csv_writer = csv.writer(csv_buffer, delimiter=';')
            csv_writer.writerows(chunk)
            yield csv_buffer.getvalue().encode("utf-8")

This gives us a stream of chunks of your file, so all we need to do is stitch them together (lazily, of course) into a file object:

import io

class CombinedFile(io.RawIOBase):
    def __init__(self, strings):
        self._buffer = b""
        self._strings = iter(strings)

    def readable(self):
        return True

    def read(self, size=-1):
        if size < 0:
            return self.readall()
        # Refill the buffer from the generator if it has run dry
        if not self._buffer:
            try:
                self._buffer = next(self._strings)
            except StopIteration:
                pass  # generator exhausted; an empty return signals EOF
        if len(self._buffer) > size:
            ret, self._buffer = self._buffer[:size], self._buffer[size:]
        else:
            ret, self._buffer = self._buffer, b""
        return ret

chunks = _generate_chunks(engine)
file = CombinedFile(chunks)
upload_file_object_to_s3(file)

Streaming the file object to S3 is left as an exercise for the reader. (You can probably use put_object.)
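
For completeness, here is a minimal sketch of what upload_file_object_to_s3 could look like, assuming boto3 is available; the bucket and key names below are placeholders. It uses upload_fileobj, which performs a managed transfer and only needs a readable file-like object; put_object, as noted above, is another option.

import boto3

def upload_file_object_to_s3(file_obj):
    # Placeholder bucket and key names -- substitute your own
    s3 = boto3.client("s3")
    s3.upload_fileobj(file_obj, "my-bucket", "exports/data.csv")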
