Removing BOM from gzip'ed CSV in Python

十年热恋 提交于 2019-11-28 00:31:10

First, you need to decode the file contents, not encode them.

Second, the csv module doesn't like unicode strings in Python 2.7, so having decoded your data you need to convert back to utf-8.

Finally, csv.reader is passed an iteration over the lines of the file, not a big string with linebreaks in it.

So:

csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines())

However, you might consider it simpler / more efficent just to remove the BOM manually:

def remove_bom(line):
    return line[3:] if line.startswith(codecs.BOM_UTF8) else line

csv.reader((remove_bom(line) for line in f), dialect = 'excel', delimiter = ';')

That is subtly different, since it removes a BOM from any line that starts with one, instead of just the first line. If you don't need to keep other BOMs that's OK, otherwise you can fix it with:

def remove_bom_from_first(iterable):
    f = iter(iterable)
    firstline = next(f, None)
    if firstline is not None:
        yield remove_bom(firstline)
        for line in f:
            yield f
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!