How to open a Unicode text file inside a zip?

不知归路 2020-12-19 06:04

I tried

with zipfile.ZipFile("5.csv.zip", "r") as zfile:
    for name in zfile.namelist():
        with zfile.open(name, 'rU') as readFile:
                    


        
3 answers
  • 2020-12-19 06:34

    Edit: For Python 3, using io.TextIOWrapper, as another answer here describes, is the best choice. The answer below could still be helpful for 2.x. I don't think anything below is actually incorrect even for 3.x, but io.TextIOWrapper is still better.

    If the file is utf-8, this will work:

    # the rest of the code as above, then:
    with zfile.open(name, 'rU') as readFile:
        line = readFile.readline().decode('utf8')
        # etc
    

    If you're going to be iterating over the file you can use codecs.iterdecode, but that won't work with readline().

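    import codecs

    with zfile.open(name, 'rU') as readFile:
        # iterdecode lazily decodes each chunk yielded by the file iterator
        for line in codecs.iterdecode(readFile, 'utf8'):
            print(line)  # parenthesized print is valid on both 2.x and 3.x
            # etc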
    

    Note that neither approach is necessarily safe for multibyte encodings. For example, little-endian UTF-16 represents the newline character with the bytes b'\x0A\x00'. A tool that is not Unicode-aware and looks for newlines byte by byte will split that incorrectly, leaving the null byte at the start of the following line. In such a case you'd have to use something that doesn't try to split the input by newlines, such as ZipFile.read, and then decode the whole byte string at once. This is not a concern for utf-8.
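A rough sketch of the whole-file approach mentioned above, assuming Python 3 and a UTF-16 member. The archive is built in memory only so the snippet runs on its own; the member name sample.txt and the helper read_member_text are hypothetical and not part of the original question.

```python
import io
import zipfile


def read_member_text(zfile, name, encoding):
    """Read one archive member in full, then decode it in a single step."""
    raw = zfile.read(name)  # all bytes of the member, no line splitting yet
    return raw.decode(encoding)


# Build a small in-memory archive (illustrative data only).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("sample.txt", "first\nsecond\n".encode("utf-16-le"))
buf.seek(0)

with zipfile.ZipFile(buf) as zfile:
    text = read_member_text(zfile, "sample.txt", "utf-16-le")
    lines = text.splitlines()  # split only after the bytes are decoded
```

The trade-off is that the whole member is held in memory at once, which may not suit very large files.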

  • 2020-12-19 06:36

    To convert a byte stream into a Unicode stream, you could use io.TextIOWrapper():

    import io
    import zipfile

    encoding = 'utf-8'
    with zipfile.ZipFile("5.csv.zip") as zfile:
        for name in zfile.namelist():
            with zfile.open(name) as readfile:
                # wrap the binary member so iteration yields decoded str lines
                for line in io.TextIOWrapper(readfile, encoding):
                    print(repr(line))
    

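    Note: TextIOWrapper() uses universal newlines mode by default. The rU mode in zfile.open() has been deprecated since Python 3.4 and was removed in Python 3.6.

    TextIOWrapper also avoids the issues with multibyte encodings described in @Peter DeGlopper's answer, because it decodes the bytes before splitting them into lines.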
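To illustrate the multibyte point, here is a minimal sketch, assuming Python 3, that runs io.TextIOWrapper over a UTF-16 member. The in-memory archive, the member name sample.txt and the tab-separated sample data are all made up for the example.

```python
import io
import zipfile

# Build an in-memory archive with one UTF-16 member (illustrative data only).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("sample.txt", "a\tb\nc\td\n".encode("utf-16-le"))
buf.seek(0)

rows = []
with zipfile.ZipFile(buf) as zfile:
    with zfile.open("sample.txt") as readfile:
        # The wrapper decodes first and only then splits into lines,
        # so the two-byte newline is never cut in half.
        for line in io.TextIOWrapper(readfile, encoding="utf-16-le"):
            rows.append(line.rstrip("\n").split("\t"))
```

Each element of rows is a list of str fields, so the rest of the code never has to handle bytes.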

  • 2020-12-19 06:42

    You're seeing that error because you are mixing bytes with unicode. If the line is a byte string, the argument to split must also be a byte string:

    >>> line = b'$0.0\t1822\t1\t1\t1\n'
    >>> line.split(b'\t')
    [b'$0.0', b'1822', b'1', b'1', b'1\n']
    

    To get a unicode string, use decode:

    >>> line.decode('utf-8')
    '$0.0\t1822\t1\t1\t1\n'
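Put together as a script, the two options look roughly like this. It is a sketch that assumes Python 3, where bytes and str never mix implicitly; the variable names mixed_ok and fields are only for illustration.

```python
line = b'$0.0\t1822\t1\t1\t1\n'

# Mixing types fails on Python 3: bytes.split() needs a bytes separator.
try:
    line.split('\t')
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding first lets the rest of the code work with str only.
fields = line.decode('utf-8').rstrip('\n').split('\t')
```

Decoding once, right after reading, usually keeps the rest of the program simpler than carrying byte strings around.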
    