Why urllib returns garbage from some wikipedia articles?

>>> import urllib2

>>> good_article = 'http://en.wikipedia.org/wiki/Wikipedia'
>>> bad_article = 'http://en.wikipedia.org/wiki/India'

>>> req1 = urllib2.Request(good_article)
>>> req2 = urllib2.Request(bad_article)
>>> req1.add_header('User-Agent', 'Mozilla/5.0')
>>> req2.add_header('User-Agent', 'Mozilla/5.0')
>>> result1 = urllib2.urlopen(req1)
>>> result2 = urllib2.urlopen(req2)

>>> result1.readline()
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
>>> result2.readline()
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec\xfdi\x8f$I\x96\x18\x08~\xee\xfe\x15BO\x06+\x82\xeefn\xa7\x9b[D\x855<<\x8e\x8c\xcc8\x9c\xe1\x9e\x99l6{\x15bf\xeaf\x1a\xae\xa6j\xa5\x87{x\x12\x1cT-\xb0 \xb1\xc0\x00\x0b4\x81\x01wg?\x10S\xe4\xee\x92\x98\x9d\x9ec\x01\x12\x8b]\x02\xdd5\x1f8\x1c\xf07\xd4\xd4\x1f\xd8\xbf\xb0\xef\x10\x11\x155\x15\xb5\xc3#\xb2"\xbaf\xea\x087\x95KEE\x9e<y\xf7\xfb\xf9\xdfz\xfa\xf6\xf4\xe2O\xcf\x9e\x89y\xb6\x08\xc5\xd9wO^\xbd<\x15{\x8d\xc3\xc3\x1f\xba\xa7\x87\x87O/\x9e\x8a\xbf\xff\xf5\xc5\xebW\xa2\xddl\x89\x8bDFi\x90\x05q$\xc3\xc3\xc3go\xf6\xc4\xde<\xcb\x96\x0f\x0f\x0fonn\x9a7\xddf\x9c\xcc\x0e/\xde\x1d~\xc0\xb1\xda\xd8Y\xfdldV\xcf\xe64\x9b\xee\x8d\xfe\xf8\xe7\xf4\xc2PF\xb3\xc7{~\xb4\'\xa6A\xf2x/\xcc\x92=\xf1a\x11F\xe9c\xc7\xd0\xed\xe1p\xc8#R\x7f_N\xe1O\x16d\xa1?z\x19M\x03)\x1a\xe2\x87\xe0*X\xfa\xf0\xfb@ds_\\&\xbe/\xfchr;\tc*\xfe\xf9!\xb7\xff\xe3\x9f/\xfcL\n'

Seems that the reason is not in headers because I tried exactly the same headers as my browser sends and urllib2 still returns this garbage.

Most of the pages returned normally

It's not an environment, locale, or encoding problem. The offending stream of bytes is gzip-compressed. The \x1f\x8B at the start is what you get at the start of a gzip stream with the default settings.

Looks as though the server is ignoring the fact that you didn't do

req2.add_header('Accept-encoding', 'gzip')

You should look at result.headers.getheader('Content-Encoding') and if necessary, decompress it yourself.

I think there is something else causing you a problem. That series of bytes looks some encoded content.

import urllib2
bad_article = 'http://en.wikipedia.org/wiki/India'
req = urllib2.Request(bad_article)
req.add_header('User-Agent', 'Mozilla/5.0')
result = urllib2.urlopen(req)
print result.readline()

resulted in this

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

which is correct.

Do a "curl -i" for both the links. If its coming fine, there is no environment problem.

来源：https://stackoverflow.com/questions/5131985/why-urllib-returns-garbage-from-some-wikipedia-articles

标签

python

urllib2