Question:
I downloaded a webpage in my Python script. In most cases, this works fine.
However, this one had a response header indicating gzip encoding, and when I tried to print the source code of this web page, my PuTTY terminal showed nothing but garbage symbols.
How do I decode this to regular text?
Answer 1:
I use zlib to decompress gzipped content from the web.
    import zlib
    ...
    # f = urllib2.urlopen(url)
    decompressed_data = zlib.decompress(f.read(), 16 + zlib.MAX_WBITS)
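As a minimal Python 3 sketch of the same call, the round trip below compresses a byte string with gzip and recovers it through zlib. The sample payload is made up for illustration and stands in for the bytes returned by `f.read()`.

```python
import gzip
import zlib

# Stand-in for the bytes returned by f.read(); the payload is illustrative only.
original = b"<html><body>Hello, gzip!</body></html>"
compressed = gzip.compress(original)

# 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer
# instead of a zlib one.
decompressed_data = zlib.decompress(compressed, 16 + zlib.MAX_WBITS)

print(decompressed_data.decode("utf-8"))
```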
Answer 2:
Decompress your byte stream using the built-in gzip module.
If you have any problems, do show the exact minimal code that you used, the exact error message and traceback, together with the result of print repr(your_byte_stream[:100])
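To make the diagnostic step concrete, here is a small Python 3 sketch. The helper name `looks_like_gzip` and the sample data are hypothetical. It prints the first bytes of the stream, checks them against the gzip magic bytes `1f 8b` defined in RFC 1952, and decompresses with the `gzip` module only if they match.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # ID1 and ID2 bytes from RFC 1952


def looks_like_gzip(your_byte_stream):
    """Return True if the data starts with the gzip magic bytes."""
    return your_byte_stream[:2] == GZIP_MAGIC


# Illustrative data; in practice this would be the downloaded response body.
your_byte_stream = gzip.compress(b"plain text body")

# The kind of output that is useful to include when asking for help.
print(repr(your_byte_stream[:100]))

if looks_like_gzip(your_byte_stream):
    # GzipFile expects a file-like object, so wrap the bytes in BytesIO.
    with gzip.GzipFile(fileobj=io.BytesIO(your_byte_stream)) as gz:
        body = gz.read()
else:
    body = your_byte_stream
```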
Further information
1. For an explanation of the gzip/zlib/deflate confusion, read the "Other uses" section of this Wikipedia article.
2. It can be easier to use the zlib module than the gzip module if you have a string rather than a file. Unfortunately the Python docs are incomplete/wrong:
""" zlib.decompress(string[, wbits[, bufsize]]) ... The absolute value of wbits is the base two logarithm of the size of the history buffer (the “window size”) used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. The default value is 15. When wbits is negative, the standard gzip header is suppressed; this is an undocumented feature of the zlib library, used for compatibility with unzip's compression file format. """
Firstly, 8 <= log2_window_size <= 15, per the range given in the quoted docs. The value of the argument then selects the expected format:
arg == log2_window_size means assume string is in zlib format (RFC 1950; what the HTTP 1.1 RFC 2616 confusingly calls "deflate").
arg == -log2_window_size means assume string is in deflate format (RFC 1951; what people who didn't read the HTTP 1.1 RFC carefully actually implemented).
arg == 16 + log2_window_size means assume string is in gzip format (RFC 1952). So you can use 31.
The above information is documented in the zlib C library manual ... Ctrl-F search for windowBits.
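The three wbits conventions listed above can be checked with a short Python 3 sketch. The payload and the helper `compress_with` are made up for illustration, and the window size is fixed at 15.

```python
import zlib

payload = b"the same payload in three containers " * 4


def compress_with(wbits):
    """Compress the payload using the container format selected by wbits."""
    compressor = zlib.compressobj(wbits=wbits)
    return compressor.compress(payload) + compressor.flush()


zlib_stream = compress_with(15)       # zlib format (RFC 1950)
raw_deflate = compress_with(-15)      # raw deflate, no header (RFC 1951)
gzip_stream = compress_with(16 + 15)  # gzip format (RFC 1952)

# Each stream decompresses only with the matching wbits value.
from_zlib = zlib.decompress(zlib_stream, 15)
from_deflate = zlib.decompress(raw_deflate, -15)
from_gzip = zlib.decompress(gzip_stream, 31)

# A mismatched wbits value fails: gzip data is not a valid zlib stream.
try:
    zlib.decompress(gzip_stream, 15)
    mismatch_failed = False
except zlib.error:
    mismatch_failed = True
```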
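Answer 3:
I use something like this:
    import urllib2
    from cStringIO import StringIO
    from gzip import GzipFile

    def fetch(request):  # function name is illustrative
        f = urllib2.urlopen(request)
        data = f.read()
        try:
            # Wrap the raw bytes in a file-like object so GzipFile can read them.
            data = GzipFile('', 'r', 0, StringIO(data)).read()
        except IOError:
            # Not gzip data; return the body unchanged.
            pass
        return data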
Answer 4:
For Python 3, try this:
    import gzip

    # opener and request are assumed to have been set up earlier
    fetch = opener.open(request)  # basically get a response object
    data = gzip.decompress(fetch.read())
    data = str(data, 'utf-8')
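Since not every response is compressed, it can help to look at the `Content-Encoding` response header before decompressing. The following is a sketch under assumptions: `decode_body` is a hypothetical helper, and with a real response object the header value would typically be read with something like `fetch.headers.get('Content-Encoding')`.

```python
import gzip


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decompress raw bytes if the header says gzip, then decode to text."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Illustrative bodies standing in for fetch.read().
compressed_body = gzip.compress("hello".encode("utf-8"))
plain_body = b"hello"

text_from_gzip = decode_body(compressed_body, "gzip")
text_from_plain = decode_body(plain_body, None)
```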
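Answer 5:
Similar to Shatu's answer for Python 3, but arranged a little differently:
    import gzip
    import json
    from urllib.request import Request, urlopen

    # headers is assumed to be a dict of request headers defined elsewhere
    s = Request("https://someplace.com", None, headers)
    r = urlopen(s, None, 180).read()
    try:
        r = gzip.decompress(r)
    except OSError:
        # The body was not gzip-compressed; use it as-is.
        pass
    result = json.loads(r.decode())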
This method wraps gzip.decompress() in a try/except that catches and ignores the OSError raised when the data is not actually compressed, which helps in situations where you may get a mix of compressed and uncompressed responses. Some small strings actually get bigger when they are compressed, so the plain data is sent instead.
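The fallback behaviour can be tried without a network connection. In this Python 3 sketch the helper name `loads_maybe_gzipped` and the sample JSON are hypothetical; it also shows that gzip output for a very short input is longer than the input itself.

```python
import gzip
import json


def loads_maybe_gzipped(raw):
    """Parse JSON from bytes that may or may not be gzip-compressed."""
    try:
        raw = gzip.decompress(raw)
    except OSError:
        # gzip rejects data without a gzip header; treat it as plain bytes.
        pass
    return json.loads(raw.decode("utf-8"))


plain = json.dumps({"ok": True}).encode("utf-8")
packed = gzip.compress(plain)

from_plain = loads_maybe_gzipped(plain)
from_packed = loads_maybe_gzipped(packed)

# For tiny inputs the gzip header and trailer outweigh any savings.
tiny = b"hi"
tiny_grew = len(gzip.compress(tiny)) > len(tiny)
```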
Answer 6:
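You can use urllib3 to easily decode gzip. Note that this helper may not be available in recent urllib3 releases, so check the version you have installed.
    urllib3.response.decode_gzip(response.data)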