Reading utf-8 characters from a gzip file in python

假如想象 提交于 2019-12-29 04:11:32

问题


I am trying to read a gunzipped file (.gz) in python and am having some trouble.

I used the gzip module to read it but the file is encoded as a utf-8 text file so eventually it reads an invalid character and crashes.

Does anyone know how to read gzip files encoded as utf-8 files? I know that there's a codecs module that can help but I can't understand how to use it.

Thanks!

import string
import gzip
import codecs

f = gzip.open('file.gz','r')

engines = {}
line = f.readline()
while line:
    parsed = string.split(line, u'\u0001')

    #do some things...

    line = f.readline()
for en in engines:
  print(en)

回答1:


I don't see why this should be so hard.

What are you doing exactly? Please explain "eventually it reads an invalid character".

It should be as simple as:

import gzip
fp = gzip.open('foo.gz')
contents = fp.read() # contents now has the uncompressed bytes of foo.gz
fp.close()
u_str = contents.decode('utf-8') # u_str is now a unicode string

EDITED

This answer works for Python2 in Python3, please see @SeppoEnarvi 's answer at https://stackoverflow.com/a/19794943/610569 (it uses the rt mode for gzip.open.




回答2:


This is possible in Python 3.3:

import gzip
gzip.open('file.gz', 'rt', encoding='utf-8')

Notice that gzip.open() requires you to explicitly specify text mode ('t').




回答3:


Maybe

import codecs
zf = gzip.open(fname, 'rb')
reader = codecs.getreader("utf-8")
contents = reader( zf )
for line in contents:
    pass



回答4:


The above produced tons of decoding errors. I used this:

for line in io.TextIOWrapper(io.BufferedReader(gzip.open(filePath)), encoding='utf8', errors='ignore'):
    ...



回答5:


In pythonic form (2.5 or greater)

from __future__ import with_statement # for 2.5, does nothing in 2.6
from gzip import open as gzopen

with gzopen('foo.gz') as gzfile:
    for line in gzfile:
      print line.decode('utf-8')


来源:https://stackoverflow.com/questions/1883604/reading-utf-8-characters-from-a-gzip-file-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!