Upload and parse csv file with “universal newline” in python on Google App Engine


再見小時候

I'm uploading a CSV/TSV file from a form in GAE, and I'm trying to parse the file with the Python csv module.

As described here, uploaded files in GAE are strings.
So I


2 Answers

青春惊慌失措

2020-12-16 18:03

How about:

import csv
import StringIO  # Python 2 module

file = self.request.get('catalog')
file = '\n'.join(file.splitlines())
catalog = csv.reader(StringIO.StringIO(file), dialect=csv.excel_tab)

or as pointed out in the comments, csv.reader() supports input from a list, so:

file = self.request.get('catalog')
catalog = csv.reader(file.splitlines(),dialect=csv.excel_tab)

or, if request.get ever supports read modes in the future (hypothetical for now):

file = self.request.get('catalog', 'rU')
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)
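The list-based variant above can be tried outside App Engine. The following is a minimal sketch, written for Python 3 and assuming the upload is already available as a text string; the sample data and the name uploaded are made up for illustration and stand in for self.request.get('catalog').

```python
import csv

# Hypothetical upload body mixing "\r", "\r\n" and "\n" line endings.
uploaded = "name\tprice\rapple\t1.50\r\nbanana\t0.25\n"

# str.splitlines() splits on every common line ending and drops it,
# so csv.reader receives one clean record per list item.
rows = list(csv.reader(uploaded.splitlines(), dialect=csv.excel_tab))

for row in rows:
    print(row)
```

Note that splitlines() also splits a quoted field that itself contains a line break, so this approach is best suited to files without embedded newlines.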


Happy的楠姐

2020-12-16 18:05

The solution described here should work. Define an iterator class like the one below, which loads the blob 1 MB at a time, splits each block with .splitlines(), and feeds the lines to the CSV reader one at a time. This handles the newlines without loading the whole file into memory.

class BlobIterator:
    """The Python csv module does not handle unusual newline characters
    well, and the Google blob reader cannot be opened in universal-newline
    mode, so we read the blob in blocks and 'fix' the newlines as we go."""

    def __init__(self, blob_reader):
        self.blob_reader = blob_reader
        self.last_line = ""
        self.line_num = 0
        self.lines = []
        self.buffer = None

    def __iter__(self):
        return self

    def next(self):
        if not self.buffer or len(self.lines) == self.line_num + 1:
            self.buffer = self.blob_reader.read(1048576)  # 1MB buffer
            self.lines = self.buffer.splitlines()
            self.line_num = 0

            # Handle special case where our block just happens to end on a new line
            if self.buffer[-1:] == "\n" or self.buffer[-1:] == "\r":
                self.lines.append("")

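        if not self.buffer:
            # End of blob: emit a final line that had no trailing newline
            if self.last_line:
                result = self.last_line + "\n"
                self.last_line = ""
                return result
            raise StopIteration

        if len(self.lines) == 1:
            # Block contains no line break: keep accumulating the partial line
            self.last_line += self.lines[0]
            return self.next()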

        if self.line_num == 0 and len(self.last_line) > 0:
            result = self.last_line + self.lines[self.line_num] + "\n"
        else:
            result = self.lines[self.line_num] + "\n"

        self.last_line = self.lines[self.line_num + 1]
        self.line_num += 1

        return result

Then call this like so:

blob_reader = blobstore.BlobReader(blob_key)
blob_iterator = BlobIterator(blob_reader)
reader = csv.reader(blob_iterator)
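The same block-by-block idea can also be written as a generator. This is only a sketch under some assumptions, not the original author's code: it targets Python 3, the name iter_normalized_lines is made up, and the reader argument is assumed to be any object whose read(size) returns text (io.StringIO stands in for the blob reader here). It holds back a trailing "\r" so that a "\r\n" pair split across two blocks is not counted as two line breaks.

```python
import csv
import io


def iter_normalized_lines(reader, chunk_size=1024 * 1024):
    """Yield lines ending in "\n" from reader, read chunk_size chars at a time."""
    pending = ""  # text seen so far that is not yet a complete line
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        text = pending + chunk
        # A trailing "\r" may be the first half of "\r\n": hold it back.
        carry_cr = text.endswith("\r")
        if carry_cr:
            text = text[:-1]
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        parts = text.split("\n")
        pending = parts.pop()  # last piece has no line ending yet
        if carry_cr:
            pending += "\r"
        for part in parts:
            yield part + "\n"
    if pending:
        # Flush the final line, with or without a held-back "\r".
        if pending.endswith("\r"):
            pending = pending[:-1]
        yield pending + "\n"


# Stand-in for a blob: mixed line endings and no newline at the end.
sample = io.StringIO("a,b\r\nc,d\re,f\ng,h")
rows = list(csv.reader(iter_normalized_lines(sample, chunk_size=4)))
print(rows)
```

A tiny chunk_size is used in the demo only to force line endings to straddle block boundaries; on real data the 1 MB default keeps the number of reads low.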
