Reading utf-8 characters from a gzip file in python

执念已碎 2020-12-14 07:38

I am trying to read a gzipped file (.gz) in Python and am having some trouble.

I used the gzip module to read it but the file is encoded as a utf-8 text file so ev

5 Answers
  • 2020-12-14 07:50

    The other approaches produced tons of decoding errors for me. I used this, which skips any bytes that are not valid UTF-8:

    import gzip
    import io
    for line in io.TextIOWrapper(io.BufferedReader(gzip.open(filePath)), encoding='utf8', errors='ignore'):
        ...
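
To illustrate what the errors argument changes: 'ignore' silently drops bytes that are not valid UTF-8, while 'replace' substitutes U+FFFD so the damage stays visible. This is only a minimal sketch, assuming Python 3.3 or later; the helper name decode_gzip_bytes and the sample bytes are made up for the illustration.

```python
import gzip
import io


def decode_gzip_bytes(raw_bytes, errors):
    """Gzip raw_bytes in memory, then read them back as UTF-8 text.

    Hypothetical helper for illustration; `errors` is passed to the decoder.
    """
    compressed = io.BytesIO(gzip.compress(raw_bytes))
    with gzip.open(compressed, "rt", encoding="utf-8", errors=errors) as f:
        return f.read()


# 0xFF can never appear in valid UTF-8, so this input is deliberately broken.
broken = b"ok \xff bad\n"

ignored = decode_gzip_bytes(broken, "ignore")    # invalid byte is dropped
replaced = decode_gzip_bytes(broken, "replace")  # invalid byte becomes U+FFFD

try:
    decode_gzip_bytes(broken, "strict")          # the default behaviour
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

If losing data silently is a concern, 'replace' is usually the safer choice for diagnosing where the file is damaged.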
    
  • 2020-12-14 07:54

    I don't see why this should be so hard.

    What are you doing exactly? Please explain "eventually it reads an invalid character".

    It should be as simple as:

    import gzip
    fp = gzip.open('foo.gz')
    contents = fp.read() # contents now has the uncompressed bytes of foo.gz
    fp.close()
    u_str = contents.decode('utf-8') # u_str is now a unicode string
    

    EDITED

    This answer works for Python 2. For Python 3, please see @SeppoEnarvi's answer at https://stackoverflow.com/a/19794943/610569 (it uses the rt mode for gzip.open).

  • 2020-12-14 07:57

    Maybe

    import codecs
    import gzip

    zf = gzip.open(fname, 'rb')
    reader = codecs.getreader("utf-8")  # factory for a UTF-8 StreamReader
    contents = reader(zf)  # decodes the decompressed bytes as they are read
    for line in contents:
        pass
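
As a rough check that the codecs reader decodes line by line, the sketch below compares it with io.TextIOWrapper on a small in-memory gzip payload. It assumes Python 3; the sample text and variable names are invented for the example.

```python
import codecs
import gzip
import io

# Build a small gzip payload in memory (illustrative data, not from the question).
payload = gzip.compress("α line\nβ line\n".encode("utf-8"))

# Variant 1: codecs StreamReader wrapped around the binary gzip stream.
with gzip.open(io.BytesIO(payload), "rb") as zf:
    codecs_lines = list(codecs.getreader("utf-8")(zf))

# Variant 2: io.TextIOWrapper around the same kind of binary stream.
with gzip.open(io.BytesIO(payload), "rb") as zf:
    wrapper_lines = list(io.TextIOWrapper(zf, encoding="utf-8"))
```

Both variants yield decoded str lines with their trailing newlines kept. They can differ on unusual line separators, so treat this as a sanity check rather than proof of equivalence.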
    
  • 2020-12-14 07:59

    In Pythonic form (Python 2.5 or greater):

    from __future__ import with_statement # for 2.5, does nothing in 2.6
    from gzip import open as gzopen
    
    with gzopen('foo.gz') as gzfile:
        for line in gzfile:
            print(line.decode('utf-8'))  # parenthesized form runs on Python 2 and 3
    
  • 2020-12-14 08:08

    This is possible in Python 3.3 and later:

    import gzip
    gzip.open('file.gz', 'rt', encoding='utf-8')
    

    Notice that gzip.open() requires you to explicitly specify text mode ('t'); the default mode is binary ('rb'), which returns bytes.
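
A minimal round-trip sketch of the text-mode approach, assuming Python 3.3 or later. The function name read_gzip_text, the temporary file, and the sample text are placeholders for illustration, not part of the original answer.

```python
import gzip
import os
import tempfile


def read_gzip_text(path, encoding="utf-8"):
    """Return the full decoded text of a gzip file opened in text mode."""
    with gzip.open(path, "rt", encoding=encoding) as f:
        return f.read()


# Write a throwaway UTF-8 gzip file, then read it back.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt.gz")
sample_text = "naïve café\n日本語\n"

with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write(sample_text)

round_trip = read_gzip_text(sample_path)

# For large files, iterate over the handle instead of calling read().
with gzip.open(sample_path, "rt", encoding="utf-8") as f:
    line_count = sum(1 for _ in f)
```

Using the with statement closes the file even if decoding raises an exception partway through.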
