Removing BOM from gzip'ed CSV in Python

甜味超标 2020-12-06 07:44

I'm using the following code to unzip and save a CSV file:

with gzip.open(filename_gz) as f:
    file = open(filename, "w")
    output = csv.writer(file,


        
1 answer
  •  温柔的废话
    2020-12-06 08:32

    First, you need to decode the file contents, not encode them.

    Second, the csv module in Python 2.7 does not handle unicode strings, so after decoding the data you need to encode it back to UTF-8.

    Finally, csv.reader expects an iterable over the lines of the file, not one big string with line breaks in it.

    So:

    csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines())
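
If Python 3 is an option (an assumption, since the answer above targets Python 2.7), a rough equivalent is to open the gzip stream in text mode with the `utf-8-sig` codec, which drops a leading BOM while decoding. The helper name `read_gzipped_csv` and the `;` default delimiter are illustrative choices, not part of the original code.

```python
import csv
import gzip


def read_gzipped_csv(path, delimiter=";"):
    """Return all rows of a gzip-compressed CSV, dropping a leading UTF-8 BOM."""
    # "rt" opens the decompressed stream as text; "utf-8-sig" strips a BOM
    # if one is present and otherwise behaves like plain UTF-8.
    # newline="" lets the csv module handle line endings itself.
    with gzip.open(path, "rt", encoding="utf-8-sig", newline="") as f:
        return list(csv.reader(f, delimiter=delimiter))
```

Because the codec does the work, no manual decode/encode round trip is needed in this variant.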
    

    However, you might find it simpler and more efficient to remove the BOM manually:

    import codecs

    def remove_bom(line):
        # Drop a leading UTF-8 BOM (a byte string in Python 2.7) if present.
        return line[len(codecs.BOM_UTF8):] if line.startswith(codecs.BOM_UTF8) else line

    csv.reader((remove_bom(line) for line in f), dialect='excel', delimiter=';')
    

    That is subtly different, since it removes a BOM from any line that starts with one, instead of just the first line. If you don't need to keep other BOMs that's OK, otherwise you can fix it with:

    def remove_bom_from_first(iterable):
        # Strip the BOM from the first line only; later lines pass through unchanged.
        f = iter(iterable)
        firstline = next(f, None)
        if firstline is not None:
            yield remove_bom(firstline)
            for line in f:
                yield line
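
To make the difference between the two approaches concrete, here is a small self-contained sketch written for Python 3 byte strings. The sample `lines` list is invented for illustration, and the helpers are restated so the snippet runs on its own.

```python
import codecs


def remove_bom(line):
    # Strip a leading UTF-8 BOM from a byte string, if present.
    if line.startswith(codecs.BOM_UTF8):
        return line[len(codecs.BOM_UTF8):]
    return line


def remove_bom_from_first(iterable):
    # Strip the BOM from the first line only; pass later lines through unchanged.
    it = iter(iterable)
    first = next(it, None)
    if first is not None:
        yield remove_bom(first)
        for line in it:
            yield line


# Hypothetical input: both lines happen to start with a BOM.
lines = [codecs.BOM_UTF8 + b"a;b\n", codecs.BOM_UTF8 + b"c;d\n"]

every_line = [remove_bom(line) for line in lines]   # BOM removed from both lines
first_only = list(remove_bom_from_first(lines))     # BOM kept on the second line
```

With this input, `every_line` loses both BOMs, while `first_only` still carries the BOM on its second element.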
    
